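Request methods: GET and POST
Fetching AJAX-loaded content: with POST, the form data goes in the data= argument of the Request
Some pages load their content through AJAX requests, so that data cannot be obtained by fetching the page URL itself. An AJAX request usually returns JSON to the page, though, so sending a GET or POST directly to the AJAX request address returns the JSON data.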
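As a rough illustration of the POST variant, the sketch below builds a Request whose form fields travel in data= rather than in the query string. The endpoint URL and the field names are placeholders assumed for illustration, not a real API, and no network call is made.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical AJAX endpoint and form fields, assumed for illustration only.
AJAX_URL = 'http://example.com/ajax/list'
form_fields = {'start': 0, 'limit': 20}


def build_post_request(url, fields):
    """Return a Request that urllib will send as POST with a form-encoded body."""
    # data= must be bytes, so the urlencoded string is encoded first.
    body = urllib.parse.urlencode(fields).encode('utf-8')
    return urllib.request.Request(url=url, data=body)


def parse_ajax_body(raw_bytes):
    """Decode a JSON response body into Python objects."""
    return json.loads(raw_bytes.decode('utf-8'))


request = build_post_request(AJAX_URL, form_fields)
print(request.get_method())   # POST, because data is not None
print(request.data)           # b'start=0&limit=20'
```

Passing the request to urllib.request.urlopen() would send it. Leaving out data= turns the same Request into a GET.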
Crawling popular movies on Douban
import json
import urllib.parse
import urllib.request

start = 0   # offset of the first record to request
page = 1    # page counter used in the output file name
url_base = 'http://movie.douban.com/j/chart/top_list?'
while True:
    url_kw = {
        'type': 11,
        'interval_id': '100:90',
        'action': '',
        'start': start,
        'limit': 20,
    }
    url_all = url_base + urllib.parse.urlencode(url_kw)
    print(url_all)
    request = urllib.request.Request(url=url_all)
    response = urllib.request.urlopen(url=request)
    context = response.read()
    # Decode the bytes into a string
    text = context.decode('utf-8')
    # JSON true/false are not Python's True/False, so parse with json.loads
    # instead of patching the text and calling eval()
    movies = json.loads(text)
    print(movies)
    print(len(movies))
    # An empty list means there are no more pages
    if not movies:
        break
    file_name = 'douban%s.html' % page
    with open(file_name, 'w', encoding='utf-8') as file:
        for movie in movies:
            file.write(str(movie) + '\n')
    start += 20
    page += 1
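Because the endpoint returns JSON, json.loads is the natural parser. The sketch below uses a made-up sample (not real Douban data) to show two cases where patching true/false by string replacement and then calling eval() goes wrong: text inside a value gets rewritten, and null is not a Python name.

```python
import json

# Made-up sample in the shape of a JSON array; not real Douban data.
sample = '[{"title": "A true story", "is_playable": false, "score": null}]'

# String replacement also rewrites text inside values and leaves null untouched.
patched = sample.replace('true', 'True').replace('false', 'False')
try:
    eval(patched)
    eval_outcome = 'parsed'
except NameError:
    # null is not defined in Python, so eval() fails here.
    eval_outcome = 'NameError'

# json.loads maps true/false/null to True/False/None and keeps values intact.
movies = json.loads(sample)
print(eval_outcome)
print(movies[0])
```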
Custom opener objects
# Custom URL opener object
import urllib.request

# Create an HTTP handler; debuglevel=1 prints the request and response headers
http_handler = urllib.request.HTTPHandler(debuglevel=1)
# Create an opener object that uses this handler
http_opener = urllib.request.build_opener(http_handler)
request = urllib.request.Request('http://www.sina.com')
# Send the request and get the response
response = http_opener.open(request)
content = response.read()
with open('./12_1.html', 'wb') as file:
    file.write(content)
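A custom opener can also carry default headers and be installed globally. The following is a minimal sketch of that idea, with an arbitrary example User-Agent value; it builds the opener only and sends nothing over the network.

```python
import urllib.request

http_handler = urllib.request.HTTPHandler(debuglevel=1)
opener = urllib.request.build_opener(http_handler)

# Headers in addheaders are sent with every request made through this opener.
# The User-Agent value is an arbitrary example.
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (example)')]

# After install_opener(), urllib.request.urlopen() uses this opener too.
urllib.request.install_opener(opener)

# build_opener() swaps the default HTTPHandler for the instance passed in.
http_handlers = [h for h in opener.handlers
                 if isinstance(h, urllib.request.HTTPHandler)]
print(len(http_handlers))
```

Calling opener.open(request) keeps the custom behaviour local to one opener, while install_opener() changes what every later urlopen() call does.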
Exception handling in urllib2 (urllib.request and urllib.error in Python 3)
import urllib.error
import urllib.request

# URLError: the server could not be reached (DNS failure, refused connection, ...)
request = urllib.request.Request(url='http://www.iloveyou.com/')
try:
    response = urllib.request.urlopen(url=request)
except urllib.error.URLError as ex:
    print(ex)
else:
    content = response.read()
    print(content)
    print('Done...')
print('*' * 100)
# HTTPError: the server answered, but with an error status code
# request = urllib.request.Request(url='http://www.douyu.com/Jack_Cui.html')
request = urllib.request.Request(url='https://err.taobao.com/error1.html?c=404&u=https://www.taobao.com/markddddddddddddddddddddddddets/nvzhuang/dddddddddddddddtaobaonvzhuang?spm=a21bo.2017.201867-main.1.1819dddddddddddddddsac8a9XRYCTP&r=')
try:
    response = urllib.request.urlopen(url=request)
except urllib.error.HTTPError as ex:
    print(ex)
    print(dir(ex))
    print(ex.code)
    print(ex.getcode())
    print(ex.info())
    print(ex.msg)
    print(ex.reason)
else:
    content = response.read()
    print(content)
    print('Done...')
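HTTPError is a subclass of URLError, so when both are handled in one try block the HTTPError clause has to come first. The sketch below shows that ordering; fetch_status and its opener parameter are hypothetical names introduced here so the logic can be exercised without a network connection.

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen):
    """Return a short description of how fetching url went.

    opener is any callable that takes a URL; it defaults to urlopen and
    exists only so the error paths can be tried with a stand-in.
    """
    try:
        response = opener(url)
    except urllib.error.HTTPError as ex:
        # The server replied with an error status (4xx or 5xx).
        return 'http error %s: %s' % (ex.code, ex.reason)
    except urllib.error.URLError as ex:
        # No usable reply at all, for example a DNS failure.
        return 'url error: %s' % ex.reason
    else:
        return 'ok %s' % response.status


# Confirms the inheritance that makes the clause order matter.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
```

With the two except clauses swapped, the URLError clause would catch HTTP errors as well and the status code would never be reported.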
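ProxyBasicAuthHandler (proxy authentication)
If the earlier code is used with a private proxy, it fails with HTTP 407, which means the request did not pass the proxy's authentication:
urllib.error.HTTPError: HTTP Error 407: Proxy Authentication Required
So the code needs to be rewritten as follows: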
import urllib.request

# 1. Build a ProxyHandler whose proxy address carries the credentials
proxyauth_handler = urllib.request.ProxyHandler({"http": "username:password@IP:PORT"})
# 2. Pass the proxy handler to build_opener() to create a custom opener
opener = urllib.request.build_opener(proxyauth_handler)
# 3. Build the Request
request = urllib.request.Request("http://www.baidu.com/")
# 4. Send the request with the custom opener
response = opener.open(request)
# 5. Print the response body
print(response.read())
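The listing above embeds the credentials in the proxy address. An alternative that uses the ProxyBasicAuthHandler class named in the heading keeps them in a password manager instead. This is a minimal sketch; the proxy address, user name and password are placeholders, and build_proxy_opener is a hypothetical helper name.

```python
import urllib.request


def build_proxy_opener(proxy_server, user, password):
    """Return (opener, password_mgr) for a proxy that asks for Basic auth."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None registers the credentials as the default for this proxy address.
    password_mgr.add_password(None, proxy_server, user, password)
    proxy_handler = urllib.request.ProxyHandler({'http': 'http://' + proxy_server})
    auth_handler = urllib.request.ProxyBasicAuthHandler(password_mgr)
    opener = urllib.request.build_opener(proxy_handler, auth_handler)
    return opener, password_mgr


# Placeholder proxy address and credentials, assumed for illustration.
opener, password_mgr = build_proxy_opener('127.0.0.1:8080', 'user', 'secret')
print(password_mgr.find_user_password(None, '127.0.0.1:8080'))
```

The opener is then used exactly as before, with opener.open(request); the auth handler only steps in when the proxy answers with 407.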