Basic usage of urllib in Python 3

Differences between Python 2 and Python 3

  • python2
import urllib2
response = urllib2.urlopen('http://www.weibo.com')
  • python3
import urllib
response = urllib.request.urlopen('http://www.weibo.com')

The urllib modules

In Python 3, urllib is a package of four modules: urllib.request (opening and reading URLs), urllib.error (the exceptions raised by request), urllib.parse (parsing URLs), and urllib.robotparser (parsing robots.txt).

Using urlopen

  1. GET request
In [10]: response = urllib.request.urlopen('http://wapok.cn')

In [11]: response.status
Out[11]: 200

In [12]: response.read()
Out[12]: b'\xef\xbb\xbf<!DOCTYPE html> ...... </html>\r'
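read() returns bytes; decode it to work with the page as text. A minimal sketch (the utf-8 assumption is mine; a real page may declare a different charset):

response = urllib.request.urlopen('http://wapok.cn')
html = response.read().decode('utf-8')  # bytes -> str; note a response body can only be read once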
  2. POST request
# The form data must be URL-encoded and then converted to bytes; Python 3 sends everything over the network as bytes
In [13]: data = bytes(urllib.parse.urlencode({'hello': 'python'}), encoding='utf8')

In [14]: data
Out[14]: b'hello=python'

In [16]: response = urllib.request.urlopen('http://httpbin.org/post', data=data)

In [17]: response.read()
Out[17]: b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "hello": "python"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Connection": "close", \n    "Content-Length": "12", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.7"\n  }, \n  "json": null, \n  "origin": "183.216.200.80", \n  "url": "http://httpbin.org/post"\n}\n'
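The same urlencode call also builds GET query strings, not just POST bodies; a small sketch with illustrative parameter names:

params = urllib.parse.urlencode({'page': 1, 'size': 10})  # 'page=1&size=10'
response = urllib.request.urlopen('http://httpbin.org/get?' + params)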
  3. Setting a timeout
# If no data comes back within the given time, an exception is raised
In [18]: response = urllib.request.urlopen('http://www.google.com', timeout=1)
---------------------------------------------------------------------------
timeout                                   Traceback (most recent call last)
C:\My Program Files\Anaconda3\lib\urllib\request.py in do_open(self, http_class, req, **http_conn_args)
   1316                 h.request(req.get_method(), req.selector, req.data, headers,
-> 1317                           encode_chunked=req.has_header('Transfer-encoding'))
   1318             except OSError as err: # timeout error
...
URLError: <urlopen error timed out>
  4. Exception handling
In [20]: import socket

In [21]: try:
    ...:     response = urllib.request.urlopen('http://www.google.com', timeout=0.1)
    ...: except urllib.error.URLError as e:
    ...:     if isinstance(e.reason, socket.timeout):
    ...:         print("TIME OUT")
    ...:
TIME OUT

The response object

In [1]: import urllib.request

In [2]: response = urllib.request.urlopen('http://www.python.org')

In [3]: # response type

In [4]: type(response)
Out[4]: http.client.HTTPResponse

In [5]: # status code

In [6]: response.status
Out[6]: 200

In [7]: # response headers

In [8]: response.headers
Out[8]: <http.client.HTTPMessage at 0x1db21389d30>

In [9]: response.getheaders()
Out[9]:
[('Server', 'nginx'),
 ('Content-Type', 'text/html; charset=utf-8'),
 ('X-Frame-Options', 'SAMEORIGIN'),
 ('x-xss-protection', '1; mode=block'),
 ('X-Clacks-Overhead', 'GNU Terry Pratchett'),
 ('Via', '1.1 varnish'),
 ('Content-Length', '48863'),
 ('Accept-Ranges', 'bytes'),
 ('Date', 'Wed, 07 Nov 2018 15:05:30 GMT'),
 ('Via', '1.1 varnish'),
 ('Age', '389'),
 ('Connection', 'close'),
 ('X-Served-By', 'cache-iad2121-IAD, cache-lax8639-LAX'),
 ('X-Cache', 'MISS, HIT'),
 ('X-Cache-Hits', '0, 85'),
 ('X-Timer', 'S1541603131.666000,VS0,VE0'),
 ('Vary', 'Cookie'),
 ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]

In [10]: response.getheader('Server')
Out[10]: 'nginx'
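Beyond status and the header accessors above, a few other attributes of the response object see regular use:

response.reason    # the reason phrase for the status code, e.g. 'OK'
response.getcode() # the status code again, equivalent to response.status
response.geturl()  # the final URL, useful after redirects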

Request

  1. A simple request using a Request object
In [11]: request = urllib.request.Request('http://httpbin.org/get')

In [12]: response = urllib.request.urlopen(request)

In [13]: response.read()
Out[13]: b'{\n  "args": {}, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Connection": "close", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.7"\n  }, \n  "origin": "183.216.200.80", \n  "url": "http://httpbin.org/get"\n}\n'
  2. A request with headers, data, and other parameters
In [14]: url = 'http://httpbin.org/post'

In [15]: headers = {
    ...:     'user-agent': 'urlib/python'
    ...: }

In [16]: dict_ = {'python': 'urllib'}

In [18]: data = bytes(urllib.parse.urlencode(dict_), encoding='utf8')

In [19]: request = urllib.request.Request('http://httpbin.org/post', method='POST', headers=headers, data=data)

# Headers can also be added this way
In [23]: request.add_header('Connection', 'keep-alive')

In [20]: response = urllib.request.urlopen(request)

In [21]: response.status
Out[21]: 200

In [22]: response.read()
Out[22]: b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "python": "urllib"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Connection": "close", \n    "Content-Length": "13", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "urlib/python"\n  }, \n  "json": null, \n  "origin": "183.216.200.80", \n  "url": "http://httpbin.org/post"\n}\n'
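The same POST put together as a standalone script instead of an interactive session (a minimal sketch of the steps above):

import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
headers = {'user-agent': 'urlib/python'}
data = bytes(urllib.parse.urlencode({'python': 'urllib'}), encoding='utf8')

request = urllib.request.Request(url, method='POST', headers=headers, data=data)
request.add_header('Connection', 'keep-alive')  # headers can also be added after construction
response = urllib.request.urlopen(request)
print(response.status, response.read())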

Handler

Proxy
In [25]: proxy_handler = urllib.request.ProxyHandler({
    ...: 'http': 'http://**.**.**.90:8888',
    ...: 'https': 'https://**.**.**.90:8888'
    ...: })

In [26]: opener = urllib.request.build_opener(proxy_handler)

In [28]: response = opener.open('http://httpbin.org/get')

In [29]: response.status
Out[29]: 200

In [30]: response.read()
Out[30]: b'{\n  "args": {}, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Connection": "close", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.7"\n  }, \n  "origin": "**.**.**.90", \n  "url": "http://httpbin.org/get"\n}\n'
Cookie
  1. Capturing cookies
# cookies accumulate in the jar as more sites are visited
In [1]: import http.cookiejar, urllib.request

In [2]: cookie = http.cookiejar.CookieJar()

In [4]: handler = urllib.request.HTTPCookieProcessor(cookie)

In [5]: opener = urllib.request.build_opener(handler)

In [6]: response = opener.open('http://taobao.com')

In [7]: for item in cookie:
   ...:     print("{}={}".format(item.name, item.value))
   ...:
thw=cn

In [8]: response = opener.open('http://www.zhihu.com')

In [9]: for item in cookie:
   ...:     print("{}={}".format(item.name, item.value))
   ...:
thw=cn
_xsrf=NhLFIoFzjOT584eHM4zFNgG2WiwVNkws
_zap=b2d1cad8-d20b-44e7-827a-ceb83c794974
tgw_l7_route=170010e948f1b2a2d4c7f3737c85e98c

In [10]: response = opener.open('http://www.weibo.com')

In [11]: for item in cookie:
    ...:     print("{}={}".format(item.name, item.value))
    ...:
thw=cn
_xsrf=NhLFIoFzjOT584eHM4zFNgG2WiwVNkws
_zap=b2d1cad8-d20b-44e7-827a-ceb83c794974
SSO-DBL=1d143a736fdf93d35dc1b24d4482f559
YF-Ugrow-G0=ea90f703b7694b74b62d38420b5273df
tgw_l7_route=170010e948f1b2a2d4c7f3737c85e98c
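Each cookie in the jar also records the domain that set it, which is how the jar keeps the taobao, zhihu, and weibo cookies apart:

for item in cookie:
    print(item.domain, item.name)  # e.g. '.taobao.com thw'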
  2. Saving cookies to a text file
  • Mozilla format
In [1]: import http.cookiejar, urllib.request

In [2]: filename = 'cookie.txt'

In [4]: cookie = http.cookiejar.MozillaCookieJar(filename=filename)

In [5]: handler = urllib.request.HTTPCookieProcessor(cookie)

In [6]: opener = urllib.request.build_opener(handler)

In [7]: response = opener.open('http://www.baidu.com')

In [8]: cookie.save(ignore_discard=True, ignore_expires=True)
  • LWP format
In [9]: cookie = http.cookiejar.LWPCookieJar(filename)

In [10]: handler = urllib.request.HTTPCookieProcessor(cookie)

In [11]: opener = urllib.request.build_opener(handler)

In [12]: response = opener.open('http://www.baidu.com')

In [13]: cookie.save(ignore_discard=True, ignore_expires=True)
  3. Reading cookies from a text file
    Load with whichever cookiejar class was used to save the file (here, the LWP-format cookie.txt saved above).

    Each domain has its own cookies in the file.

In [35]: import http.cookiejar, urllib.request

In [36]: response = urllib.request.urlopen('http://httpbin.org/cookies')
# no cookies are sent yet
In [37]: response.read()
Out[37]: b'{\n  "cookies": {}\n}\n'

In [38]: filename = 'cookie.txt'

In [39]: cookie = http.cookiejar.LWPCookieJar()
# load the saved cookies
In [40]: cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)

In [41]: handler = urllib.request.HTTPCookieProcessor(cookie)

In [42]: opener = urllib.request.build_opener(handler)

In [43]: response = opener.open('http://httpbin.org/cookies')

In [44]: response.read()
Out[44]: b'{\n  "cookies": {\n    "BAIDUID": "B96AEE288C3789B9BD378E358955F6FD:FG=1", \n    "BIDUPSID": "B96AEE288C3789B9BD378E358955F6FD", \n    "H_PS_PSSID": "1454_21103_27401_27509", \n    "PSTM": "1541609145"\n  }\n}\n'
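As noted above, the file must be loaded with the same cookiejar class that saved it; a Mozilla-format file would be loaded like this (a sketch assuming cookie.txt was saved by MozillaCookieJar):

cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)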

Exceptions

  • URLError
In [45]: from urllib import request, error

In [49]: try:
    ...:     response = request.urlopen('http://httpbin.org/status/404')
    ...: except error.URLError as e:
    ...:     print("{}\n{}".format(type(e.reason), e.reason))
    ...:
<class 'str'>
NOT FOUND
  • HTTPError
    HTTPError is a subclass of URLError
In [51]: try:
    ...:     response = request.urlopen('http://httpbin.org/status/404')
    ...: except error.HTTPError as e:
    ...:     print("{}-{}\n{}".format(e.code, e.reason, e.headers))
    ...: except error.URLError as e:
    ...:     print(e.reason)
    ...: else:
    ...:     print("Request Success")
    ...:
404-NOT FOUND
Connection: close
Server: gunicorn/19.9.0
Date: Wed, 07 Nov 2018 17:22:18 GMT
Content-Type: text/html; charset=utf-8
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Content-Length: 0
Via: 1.1 vegur
  • error.URLError.reason
In [54]: import socket

In [55]: try:
    ...:     response = request.urlopen('http://httpbin.org/status/404', timeout=0.1)
    ...: except error.URLError as e:
    ...:     print(type(e.reason))
    ...:     if isinstance(e.reason, socket.timeout):
    ...:         print("Time Out")
    ...: else:
    ...:     print("Request Success")
    ...:
<class 'socket.timeout'>
Time Out

The parse module

urlparse

In [59]: url = "http://www.myweb.com/index.jsp?id=123#second"

In [61]: type(urllib.parse.urlparse(url))
Out[61]: urllib.parse.ParseResult
# parse the URL into its six components
In [60]: urllib.parse.urlparse(url)
Out[60]: ParseResult(scheme='http', netloc='www.myweb.com', path='/index.jsp', params='', query='id=123', fragment='second')

# scheme='https' is only a fallback; when the URL itself specifies a scheme, the URL's scheme is used
In [62]: urllib.parse.urlparse(url, scheme='https')
Out[62]: ParseResult(scheme='http', netloc='www.myweb.com', path='/index.jsp', params='', query='id=123', fragment='second')
# when the URL does not specify a scheme, the scheme argument is used
In [68]: url = "www.myweb.com/index.jsp?id=123#second"
In [69]: urllib.parse.urlparse(url, scheme='https')
Out[69]: ParseResult(scheme='https', netloc='', path='www.myweb.com/index.jsp', params='', query='id=123', fragment='second')

# do not parse out the fragment
In [63]: urllib.parse.urlparse(url, allow_fragments=False)
Out[63]: ParseResult(scheme='http', netloc='www.myweb.com', path='/index.jsp', params='', query='id=123#second', fragment='')
# with no query string and allow_fragments=False, the fragment stays attached to the path
In [65]: url = 'http://www.baidu.com/index.php#second'
In [66]: urllib.parse.urlparse(url, allow_fragments=False)
Out[66]: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.php#second', params='', query='', fragment='')
# with fragment parsing enabled (the default)
In [67]: urllib.parse.urlparse(url)
Out[67]: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.php', params='', query='', fragment='second')
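urllib.parse.urlsplit is a close relative of urlparse; it returns five fields instead of six because it does not split out the rarely used params field:

urllib.parse.urlsplit('http://www.myweb.com/index.jsp?id=123#second')
# SplitResult(scheme='http', netloc='www.myweb.com', path='/index.jsp', query='id=123', fragment='second')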

urlunparse

Builds a URL from its six components.

In [72]: data = ['https', 'www.baidu.com', '/user/login', 'not', 'user=8888', 'password']

In [73]: urllib.parse.urlunparse(data)
Out[73]: 'https://www.baidu.com/user/login;not?user=8888#password'
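Since urlunparse is the inverse of urlparse, a parse/unparse round trip reproduces the original URL:

url = 'http://www.myweb.com/index.jsp?id=123#second'
urllib.parse.urlunparse(urllib.parse.urlparse(url)) == url  # True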

urljoin

Resolves one URL against another.

In [83]: url = 'https://www.baidu.com/index.php'

In [84]: new_url = 'http://www.myweb.com/index.jsp?id=123#second'

In [85]: urllib.parse.urljoin(url, new_url)
Out[85]: 'http://www.myweb.com/index.jsp?id=123#second'

A URL has six fields; urljoin fills in whatever fields new_url is missing from the base url. Here new_url is absolute, so it replaces the base entirely (see the relative examples below).
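When new_url is relative, only the fields it is missing are taken from the base URL:

urllib.parse.urljoin('https://www.baidu.com/index.php', 'about.html')
# 'https://www.baidu.com/about.html'
urllib.parse.urljoin('https://www.baidu.com/index.php', '?id=1')
# 'https://www.baidu.com/index.php?id=1'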

urlencode

Encodes a dict into a URL query string.

In [86]: dict_ = {
    ...: 'hee': 'honey'
    ...: }

In [87]: par = urllib.parse.urlencode(dict_)

In [88]: par
Out[88]: 'hee=honey'
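urlencode works on key/value pairs; to percent-encode a single string, the parse module also provides quote and unquote:

urllib.parse.quote('hello world')      # 'hello%20world'
urllib.parse.unquote('hello%20world')  # 'hello world'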

Reference: https://docs.python.org/3/library/urllib.html
