Differences between Python 2 and Python 3
- python2
import urllib2
response = urllib2.urlopen('http://www.weibo.com')
- python3
import urllib.request
response = urllib.request.urlopen('http://www.weibo.com')
urllib modules
- urllib.request: for opening and reading URLs
- urllib.error: containing the exceptions raised by urllib.request
- urllib.parse: for parsing URLs
- urllib.robotparser: for parsing robots.txt files (see the sketch after this list)
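urllib.robotparser is not demonstrated anywhere below, so here is a minimal sketch; the python.org URL is just an assumed example:
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.python.org/robots.txt')  # assumed example site
rp.read()                                       # fetch and parse robots.txt
# can_fetch() reports whether the given user agent may crawl the URL
print(rp.can_fetch('*', 'http://www.python.org/'))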
Using urlopen
- GET request
In [10]: response = urllib.request.urlopen('http://wapok.cn')
In [11]: response.status
Out[11]: 200
In [12]: response.read()
Out[12]: b'\xef\xbb\xbf<!DOCTYPE html> ...... </html>\r'
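read() returns bytes; decode to get text. The leading b'\xef\xbb\xbf' above is a UTF-8 byte-order mark, which the utf-8-sig codec strips on decode; a minimal sketch, assuming the page really is UTF-8:
import urllib.request
raw = urllib.request.urlopen('http://wapok.cn').read()
html = raw.decode('utf-8-sig')  # bytes -> str, dropping the BOM if present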
- POST request
# the parameters must be URL-encoded and then converted to bytes; Python 3 sends everything over the network as bytes
In [13]: data = bytes(urllib.parse.urlencode({'hello': 'python'}), encoding='utf8')
In [14]: data
Out[14]: b'hello=python'
In [16]: response = urllib.request.urlopen('http://httpbin.org/post', data=data)
In [17]: response.read()
Out[17]: b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "hello": "python"\n }, \n "headers": {
\n "Accept-Encoding": "identity", \n "Connection": "close", \n "Content-Length": "12", \n "Content-Type": "a
pplication/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.7"\n }, \n "json"
: null, \n "origin": "183.216.200.80", \n "url": "http://httpbin.org/post"\n}\n'
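urlopen also works as a context manager, which closes the connection when the block exits; a sketch of the same POST in that style:
import urllib.parse, urllib.request
data = bytes(urllib.parse.urlencode({'hello': 'python'}), encoding='utf8')
with urllib.request.urlopen('http://httpbin.org/post', data=data) as response:
    body = response.read()  # connection is released automatically afterwards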
- Setting a timeout
# if no data is returned within the given time, an exception is raised
In [18]: response = urllib.request.urlopen('http://www.google.com', timeout=1)
---------------------------------------------------------------------------
timeout Traceback (most recent call last)
C:\My Program Files\Anaconda3\lib\urllib\request.py in do_open(self, http_class, req, **http_conn_args)
1316 h.request(req.get_method(), req.selector, req.data, headers,
-> 1317 encode_chunked=req.has_header('Transfer-encoding'))
1318 except OSError as err: # timeout error
...
URLError: <urlopen error timed out>
- Exception handling
In [20]: import socket
In [21]: try:
...: response = urllib.request.urlopen('http://www.google.com', timeout=0.1)
...: except urllib.error.URLError as e:
...: if isinstance(e.reason, socket.timeout):
...: print("TIME OUT")
...:
TIME OUT
Response object
In [1]: import urllib.request
In [2]: response = urllib.request.urlopen('http://www.python.org')
In [3]: # response type
In [4]: type(response)
Out[4]: http.client.HTTPResponse
In [5]: # status code
In [6]: response.status
Out[6]: 200
In [7]: # response headers
In [8]: response.headers
Out[8]: <http.client.HTTPMessage at 0x1db21389d30>
In [9]: response.getheaders()
Out[9]:
[('Server', 'nginx'),
('Content-Type', 'text/html; charset=utf-8'),
('X-Frame-Options', 'SAMEORIGIN'),
('x-xss-protection', '1; mode=block'),
('X-Clacks-Overhead', 'GNU Terry Pratchett'),
('Via', '1.1 varnish'),
('Content-Length', '48863'),
('Accept-Ranges', 'bytes'),
('Date', 'Wed, 07 Nov 2018 15:05:30 GMT'),
('Via', '1.1 varnish'),
('Age', '389'),
('Connection', 'close'),
('X-Served-By', 'cache-iad2121-IAD, cache-lax8639-LAX'),
('X-Cache', 'MISS, HIT'),
('X-Cache-Hits', '0, 85'),
('X-Timer', 'S1541603131.666000,VS0,VE0'),
('Vary', 'Cookie'),
('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
In [10]: response.getheader('Server')
Out[10]: 'nginx'
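Because headers is an http.client.HTTPMessage, the charset declared in Content-Type can be read directly; a sketch that decodes the body with it, falling back to utf-8 when none is declared:
import urllib.request
response = urllib.request.urlopen('http://www.python.org')
charset = response.headers.get_content_charset() or 'utf-8'  # from Content-Type
text = response.read().decode(charset)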
Request
- A simple request using a Request object
In [11]: request = urllib.request.Request('http://httpbin.org/get')
In [12]: response = urllib.request.urlopen(request)
In [13]: response.read()
Out[13]: b'{\n "args": {}, \n "headers": {\n "Accept-Encoding": "identity", \n "Connection": "close", \n "Hos
t": "httpbin.org", \n "User-Agent": "Python-urllib/3.7"\n }, \n "origin": "183.216.200.80", \n "url": "http://http
bin.org/get"\n}\n'
- Adding headers, data, and other parameters to a request
In [14]: url = 'http://httpbin.org/post'
In [15]: headers = {
...: 'user-agent': 'urlib/python'
...: }
In [16]: dict_ = {'python': 'urllib'}
In [18]: data = bytes(urllib.parse.urlencode(dict_), encoding='utf8')
In [19]: request = urllib.request.Request('http://httpbin.org/post', method='POST', headers=headers, data=data)
# headers can also be added this way
In [23]: request.add_header('Connection', 'keep-alive')
In [20]: response = urllib.request.urlopen(request)
In [21]: response.status
Out[21]: 200
In [22]: response.read()
Out[22]: b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "python": "urllib"\n }, \n "headers":
{\n "Accept-Encoding": "identity", \n "Connection": "close", \n "Content-Length": "13", \n "Content-Type": "
application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "urlib/python"\n }, \n "json": nu
ll, \n "origin": "183.216.200.80", \n "url": "http://httpbin.org/post"\n}\n'
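Request is not limited to form data; a hedged sketch posting a JSON body with an explicit Content-Type header (httpbin.org echoes it back under the "json" key):
import json
import urllib.request
payload = json.dumps({'python': 'urllib'}).encode('utf8')  # JSON bytes body
request = urllib.request.Request(
    'http://httpbin.org/post',
    data=payload,
    headers={'Content-Type': 'application/json'},
    method='POST',
)
response = urllib.request.urlopen(request)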
handler
Proxy
In [25]: proxy_handler = urllib.request.ProxyHandler({
...: 'http': 'http://**.**.**.90:8888',
...: 'https': 'https://**.**.**.90:8888'
...: })
In [26]: opener = urllib.request.build_opener(proxy_handler)
In [28]: response = opener.open('http://httpbin.org/get')
In [29]: response.status
Out[29]: 200
In [30]: response.read()
Out[30]: b'{\n "args": {}, \n "headers": {\n "Accept-Encoding": "identity", \n "Connection": "close", \n
"Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.7"\n }, \n "origin": "**.**.**.90", \n
"url": "http://httpbin.org/get"\n}\n'
Cookie
- Getting cookies
# cookies accumulate in the jar as more sites are visited
In [1]: import http.cookiejar, urllib.request
In [2]: cookie = http.cookiejar.CookieJar()
In [4]: handler = urllib.request.HTTPCookieProcessor(cookie)
In [5]: opener = urllib.request.build_opener(handler)
In [6]: response = opener.open('http://taobao.com')
In [7]: for item in cookie:
...: print("{}={}".format(item.name, item.value))
...:
thw=cn
In [8]: response = opener.open('http://www.zhihu.com')
In [9]: for item in cookie:
...: print("{}={}".format(item.name, item.value))
...:
thw=cn
_xsrf=NhLFIoFzjOT584eHM4zFNgG2WiwVNkws
_zap=b2d1cad8-d20b-44e7-827a-ceb83c794974
tgw_l7_route=170010e948f1b2a2d4c7f3737c85e98c
In [10]: response = opener.open('http://www.weibo.com')
In [11]: for item in cookie:
...: print("{}={}".format(item.name, item.value))
...:
thw=cn
_xsrf=NhLFIoFzjOT584eHM4zFNgG2WiwVNkws
_zap=b2d1cad8-d20b-44e7-827a-ceb83c794974
SSO-DBL=1d143a736fdf93d35dc1b24d4482f559
YF-Ugrow-G0=ea90f703b7694b74b62d38420b5273df
tgw_l7_route=170010e948f1b2a2d4c7f3737c85e98c
- Saving cookies to a text file
- Mozilla format
In [1]: import http.cookiejar, urllib.request
In [2]: filename = 'cookie.txt'
In [4]: cookie = http.cookiejar.MozillaCookieJar(filename=filename)
In [5]: handler = urllib.request.HTTPCookieProcessor(cookie)
In [6]: opener = urllib.request.build_opener(handler)
In [7]: response = opener.open('http://www.baidu.com')
In [8]: cookie.save(ignore_discard=True, ignore_expires=True)
- LWP format
In [9]: cookie = http.cookiejar.LWPCookieJar(filename)
In [10]: handler = urllib.request.HTTPCookieProcessor(cookie)
In [11]: opener = urllib.request.build_opener(handler)
In [12]: response = opener.open('http://www.baidu.com')
In [13]: cookie.save(ignore_discard=True, ignore_expires=True)
- Reading cookies from a text file
A cookie file must be loaded with the same cookie jar class that saved it; each domain carries its own cookies.
In [35]: import http.cookiejar, urllib.request
In [36]: response = urllib.request.urlopen('http://httpbin.org/cookies')
# no cookies are sent yet
In [37]: response.read()
Out[37]: b'{\n "cookies": {}\n}\n'
In [38]: filename = 'cookie.txt'
In [39]: cookie = http.cookiejar.LWPCookieJar()
# load the cookies from the file
In [40]: cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
In [41]: handler = urllib.request.HTTPCookieProcessor(cookie)
In [42]: opener = urllib.request.build_opener(handler)
In [43]: response = opener.open('http://httpbin.org/cookies')
In [44]: response.read()
Out[44]: b'{\n "cookies": {\n "BAIDUID": "B96AEE288C3789B9BD378E358955F6FD:FG=1", \n "BIDUPSID": "B96AEE288C3789B
9BD378E358955F6FD", \n "H_PS_PSSID": "1454_21103_27401_27509", \n "PSTM": "1541609145"\n }\n}\n'
Exceptions
- URLError
In [45]: from urllib import request, error
In [46]: import socket
In [49]: try:
...: response = request.urlopen('http://httpbin.org/status/404')
...: except error.URLError as e:
...: print("{}\n{}".format(type(e.reason), e.reason))
...:
<class 'str'>
NOT FOUND
- HTTPError
HTTPError is a subclass of URLError, so catch it first
In [51]: try:
...: response = request.urlopen('http://httpbin.org/status/404')
...: except error.HTTPError as e:
...: print("{}-{}\n{}".format(e.code, e.reason, e.headers))
...: except error.URLError as e:
...: print(e.reason)
...: else:
...: print("Request Success")
...:
404-NOT FOUND
Connection: close
Server: gunicorn/19.9.0
Date: Wed, 07 Nov 2018 17:22:18 GMT
Content-Type: text/html; charset=utf-8
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Content-Length: 0
Via: 1.1 vegur
- error.URLError.reason
In [55]: try:
...: response = request.urlopen('http://httpbin.org/status/404', timeout=0.1)
...: except error.URLError as e:
...: print(type(e.reason))
...: if isinstance(e.reason, socket.timeout):
...: print("Time Out")
...: else:
...: print("Request Success")
...:
<class 'socket.timeout'>
Time Out
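Putting the exception hierarchy together, a small fetch helper might look like the sketch below; the function name and return convention are my own:
import socket
from urllib import request, error

def fetch(url, timeout=5):
    """Return the response body, or None on any request failure."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except error.HTTPError as e:  # server answered with an error status
        print("HTTP {} {}".format(e.code, e.reason))
    except error.URLError as e:   # network-level failure, incl. timeouts
        if isinstance(e.reason, socket.timeout):
            print("Time Out")
        else:
            print(e.reason)
    return None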
The parse module
urlparse
In [59]: url = "http://www.myweb.com/index.jsp?id=123#second"
In [61]: type(urllib.parse.urlparse(url))
Out[61]: urllib.parse.ParseResult
# parse a URL into its components
In [60]: urllib.parse.urlparse(url)
Out[60]: ParseResult(scheme='http', netloc='www.myweb.com', path='/index.jsp', params='', query='id=123', fragment='second')
# parse with scheme='https'; when the URL already specifies a scheme, the URL's own scheme wins
In [62]: urllib.parse.urlparse(url, scheme='https')
Out[62]: ParseResult(scheme='http', netloc='www.myweb.com', path='/index.jsp', params='', query='id=123', fragment='second')
# when the URL carries no scheme, the scheme argument is used
In [68]: url = "www.myweb.com/index.jsp?id=123#second"
In [69]: urllib.parse.urlparse(url, scheme='https')
Out[69]: ParseResult(scheme='https', netloc='', path='www.myweb.com/index.jsp', params='', query='id=123', fragment='second')
# don't split out the fragment
In [63]: urllib.parse.urlparse(url, allow_fragments=False)
Out[63]: ParseResult(scheme='http', netloc='www.myweb.com', path='/index.jsp', params='', query='id=123#second', fragment='')
# with no query string and allow_fragments=False, the fragment stays attached to the path
In [65]: url = 'http://www.baidu.com/index.php#second'
In [66]: urllib.parse.urlparse(url, allow_fragments=False)
Out[66]: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.php#second', params='', query='', fragment='')
# with fragment parsing on (the default)
In [67]: urllib.parse.urlparse(url)
Out[67]: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.php', params='', query='', fragment='second')
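ParseResult is a named tuple, so the six fields are also available as attributes, and geturl() reassembles the URL; a short sketch:
import urllib.parse
result = urllib.parse.urlparse('http://www.myweb.com/index.jsp?id=123#second')
print(result.netloc)    # 'www.myweb.com'
print(result.query)     # 'id=123'
print(result.geturl())  # rebuilds the original URL string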
urlunparse
Assemble a URL from its components
In [72]: data = ['https', 'www.baidu.com', '/user/login', 'not', 'user=8888', 'password']
In [73]: urllib.parse.urlunparse(data)
Out[73]: 'https://www.baidu.com/user/login;not?user=8888#password'
urljoin
Join two URLs
In [83]: url = 'https://www.baidu.com/index.php'
In [84]: new_url = 'http://www.myweb.com/index.jsp?id=123#second'
In [85]: urllib.parse.urljoin(url, new_url)
Out[85]: 'http://www.myweb.com/index.jsp?id=123#second'
A URL has six fields; fields present in the new URL take precedence, and missing ones are filled in from the base URL (see the sketch below).
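When the new URL is only partial, the missing fields come from the base; a sketch of two common cases:
from urllib.parse import urljoin
print(urljoin('https://www.baidu.com/index.php', 'user/login'))
# https://www.baidu.com/user/login -- a relative path replaces the last segment
print(urljoin('https://www.baidu.com/index.php', '?id=123'))
# https://www.baidu.com/index.php?id=123 -- only the query is new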
urlencode
URL-encode a dict of parameters
In [86]: dict_ = {
...: 'hee': 'honey'
...: }
In [87]: par = urllib.parse.urlencode(dict_)
In [88]: par
Out[88]: 'hee=honey'
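urlencode also accepts multiple keys and, with doseq=True, sequence values; urllib.parse.quote percent-encodes a single string. A short sketch:
from urllib.parse import urlencode, quote
params = urlencode({'name': 'bob', 'tags': ['a', 'b']}, doseq=True)
print('http://httpbin.org/get?' + params)  # name=bob&tags=a&tags=b appended
print(quote('hello python'))               # 'hello%20python'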