Differences between Python 2 and Python 3
- python2
import urllib2
response = urllib2.urlopen('http://www.weibo.com')
- python3
import urllib.request
response = urllib.request.urlopen('http://www.weibo.com')
urllib modules
- urllib.request: for opening and reading URLs
- urllib.error: containing the exceptions raised by urllib.request
- urllib.parse: for parsing URLs
- urllib.robotparser: for parsing robots.txt files (see the sketch after this list)
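urllib.robotparser is not demonstrated anywhere below, so here is a minimal sketch; the python.org URL is just an assumed example:
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.python.org/robots.txt')  # assumed example site
rp.read()                                       # fetch and parse robots.txt
# can_fetch() reports whether the given user agent may crawl the URL
print(rp.can_fetch('*', 'http://www.python.org/'))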
Using urlopen
- GET request
In [10]: response = urllib.request.urlopen('http://wapok.cn')
In [11]: response.status
Out[11]: 200
In [12]: response.read()
Out[12]: b'\xef\xbb\xbf<!DOCTYPE html> ...... </html>\r'
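read() returns bytes; decode to get text. The leading b'\xef\xbb\xbf' above is a UTF-8 byte-order mark, which the utf-8-sig codec strips on decode; a minimal sketch, assuming the page really is UTF-8:
import urllib.request
raw = urllib.request.urlopen('http://wapok.cn').read()
html = raw.decode('utf-8-sig')  # bytes -> str, dropping the BOM if present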
- POST request
# the parameters must be URL-encoded and then converted to bytes; Python 3 sends everything over the network as bytes
In [13]: data = bytes(urllib.parse.urlencode({'hello': 'python'}), encoding='utf8')
In [14]: data
Out[14]: b'hello=python'
In [16]: response = urllib.request.urlopen('http://httpbin.org/post', data=data)
In [17]: response.read()
Out[17]: b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "hello": "python"\n }, \n "headers": {
\n "Accept-Encoding": "identity", \n "Connection": "close", \n "Content-Length": "12", \n "Content-Type": "a
pplication/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.7"\n }, \n "json"
: null, \n "origin": "183.216.200.80", \n "url": "http://httpbin.org/post"\n}\n'
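urlopen also works as a context manager, which closes the connection when the block exits; a sketch of the same POST in that style:
import urllib.parse, urllib.request
data = bytes(urllib.parse.urlencode({'hello': 'python'}), encoding='utf8')
with urllib.request.urlopen('http://httpbin.org/post', data=data) as response:
    body = response.read()  # connection is released automatically afterwards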
- Setting a timeout
# if no data is returned within the given time, an exception is raised
In [18]: response = urllib.request.urlopen('http://www.google.com', timeout=1)
---------------------------------------------------------------------------
timeout Traceback (most recent call last)
C:\My Program Files\Anaconda3\lib\urllib\request.py in do_open(self, http_class, req, **http_conn_args)
1316 h.request(req.get_method(), req.selector, req.data, headers,
-> 1317 encode_chunked=req.has_header('Transfer-encoding'))
1318 except OSError as err: # timeout error
...
URLError: <urlopen error timed out>
- Exception handling
In [20]: import socket
In [21]: try:
...: response = urllib.request.urlopen('http://www.google.com', timeout=0.1)
...: except urllib.error.URLError as e:
...: if isinstance(e.reason, socket.timeout):
...: print("TIME OUT")
...:
TIME OUT
Response object
In [1]: import urllib.request
In [2]: response = urllib.request.urlopen('http://www.python.org')
In [3]: # response type
In [4]: type(response)
Out[4]: http.client.HTTPResponse
In [5]: # status code
In [6]: response.status
Out[6]: 200
In [7]: # response headers
In [8]: response.headers
Out[8]: <http.client.HTTPMessage at 0x1db21389d30>
In [9]: response.getheaders()
Out[9]:
[('Server', 'nginx'),
('Content-Type', 'text/html; charset=utf-8'),
('X-Frame-Options', 'SAMEORIGIN'),
('x-xss-protection', '1; mode=block'),
('X-Clacks-Overhead', 'GNU Terry Pratchett'),
('Via', '1.1 varnish'),
('Content-Length', '48863'),
('Accept-Ranges', 'bytes'),
('Date', 'Wed, 07 Nov 2018 15:05:30 GMT'),
('Via', '1.1 varnish'),
('Age', '389'),
('Connection', 'close'),
('X-Served-By', 'cache-iad2121-IAD, cache-lax8639-LAX'),
('X-Cache', 'MISS, HIT'),
('X-Cache-Hits', '0, 85'),
('X-Timer', 'S1541603131.666000,VS0,VE0'),
('Vary', 'Cookie'),
('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
In [10]: response.getheader('Server')
Out[10]: 'nginx'
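Because headers is an http.client.HTTPMessage, the charset declared in Content-Type can be read directly; a sketch that decodes the body with it, falling back to utf-8 when none is declared:
import urllib.request
response = urllib.request.urlopen('http://www.python.org')
charset = response.headers.get_content_charset() or 'utf-8'  # from Content-Type
text = response.read().decode(charset)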
Request
- A simple request using a Request object
In [11]: request = urllib.request.Request('http://httpbin.org/get')
In [12]: response = urllib.request.urlopen(request)
In [13]: response.read()
Out[13]: b'{\n "args": {}, \n "headers": {\n "Accept-Encoding": "identity", \n "Connection": "close", \n "Hos
t": "httpbin.org", \n "User-Agent": "Python-urllib/3.7"\n }, \n "origin": "183.216.200.80", \n "url": "http://http
bin.org/get"\n}\n'
- Adding headers, data, and other parameters to a request
In [14]: url = 'http://httpbin.org/post'
In [15]: headers = {
...: 'user-agent': 'urlib/python'
...: }
In [16]: dict_ = {'python': 'urllib'}
In [18]: data = bytes(urllib.parse.urlencode(dict_), encoding='utf8')
In [19]: request = urllib.request.Request('http://httpbin.org/post', method='POST', headers=headers, data=data)
# headers can also be added this way
In [23]: request.add_header('Connection', 'keep-alive')
In [20]: response = urllib.request.urlopen(request)
In [21]: response.status
Out[21]: 200
In [22]: response.read()
Out[22]: b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "python": "urllib"\n }, \n "headers":
{\n "Accept-Encoding": "identity", \n "Connection": "close", \n "Content-Length": "13", \n "Content-Type": "
application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "urlib/python"\n }, \n "json": nu
ll, \n "origin": "183.216.200.80", \n "url": "http://httpbin.org/post"\n}\n'
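Request is not limited to form data; a hedged sketch posting a JSON body with an explicit Content-Type header (httpbin.org echoes it back under the "json" key):
import json
import urllib.request
payload = json.dumps({'python': 'urllib'}).encode('utf8')  # JSON bytes body
request = urllib.request.Request(
    'http://httpbin.org/post',
    data=payload,
    headers={'Content-Type': 'application/json'},
    method='POST',
)
response = urllib.request.urlopen(request)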
handler
Proxy
In [25]: proxy_handler = urllib.request.ProxyHandler({
...: 'http': 'http://**.**.**.90:8888',
...: 'https': 'https://**.**.**.90:8888'
...: })
In [26]: opener = urllib.request.build_opener(proxy_handler)
In [28]: response = opener.open('http://httpbin.org/get')
In [29]: response.status
Out[29]: 200
In [30]: response.read()
Out[30]: b'{\n "args": {}, \n "headers": {\n "Accept-Encoding": "identity", \n "Connection": "close", \n
"Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.7"\n }, \n "origin": "**.**.**.90", \n
"url": "http://httpbin.org/get"\n}\n'
Cookie
- Getting cookies
# cookies accumulate in the jar as more sites are visited
In [1]: import http.cookiejar, urllib.request
In [2]: cookie = http.cookiejar.CookieJar()
In [4]: handler = urllib.request.HTTPCookieProcessor(cookie)
In [5]: opener = urllib.request.build_opener(handler)
In [6]: response = opener.open('http://taobao.com')
In [7]: for item in cookie:
...: print("{}={}".format(item.name, item.value))
...:
thw=cn
In [8]: response = opener.open('http://www.zhihu.com')
In [9]: for item in cookie:
...: print("{}={}".format(item.name, item.value))
...:
thw=cn
_xsrf=NhLFIoFzjOT584eHM4zFNgG2WiwVNkws
_zap=b2d1cad8-d20b-44e7-827a-ceb83c794974
tgw_l7_route=170010e948f1b2a2d4c7f3737c85e98c
In [10]: response = opener.open('http://www.weibo.com')
In [11]: for item in cookie:
...: print("{}={}".format(item.name, item.value))
...:
thw=cn
_xsrf=NhLFIoFzjOT584eHM4zFNgG2WiwVNkws
_zap=b2d1cad8-d20b-44e7-827a-ceb83c794974
SSO-DBL=1d143a736fdf93d35dc1b24d4482f559
YF-Ugrow-G0=ea90f703b7694b74b62d38420b5273df
tgw_l7_route=170010e948f1b2a2d4c7f3737c85e98c
- Saving cookies to a text file
- Mozilla format
In [1]: import http.cookiejar, urllib.request
In [2]: filename = 'cookie.txt'
In [4]: cookie = http.cookiejar.MozillaCookieJar(filename=filename)
In [5]: handler = urllib.request.HTTPCookieProcessor(cookie)
In [6]: opener = urllib.request.build_opener(handler)
In [7]: response = opener.open('http://www.baidu.com')
In [8]: cookie.save(ignore_discard=True, ignore_expires=True)
- LWP format
In [9]: cookie = http.cookiejar.LWPCookieJar(filename)
In [10]: handler = urllib.request.HTTPCookieProcessor(cookie)
In [11]: opener = urllib.request.build_opener(handler)
In [12]: response = opener.open('http://www.baidu.com')
In [13]: cookie.save(ignore_discard=True, ignore_expires=True)
- Reading cookies from a text file
A cookie file must be loaded with the same cookie jar class that saved it; each domain carries its own cookies.
In [35]: import http.cookiejar, urllib.request
In [36]: response = urllib.request.urlopen('http://httpbin.org/cookies')
# no cookies are sent yet
In [37]: response.read()
Out[37]: b'{\n "cookies": {}\n}\n'
In [38]: filename = 'cookie.txt'
In [39]: cookie = http.cookiejar.LWPCookieJar()
# load the cookies from the file
In [40]: cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
In [41]: handler = urllib.request.HTTPCookieProcessor(cookie)
In [42]: opener = urllib.request.build_opener(handler)
In [43]: response = opener.open('http://httpbin.org/cookies')
In [44]: response.read()
Out[44]: b'{\n "cookies": {\n "BAIDUID": "B96AEE288C3789B9BD378E358955F6FD:FG=1", \n "BIDUPSID": "B96AEE288C3789B
9BD378E358955F6FD", \n "H_PS_PSSID": "1454_21103_27401_27509", \n "PSTM": "1541609145"\n }\n}\n'
Exceptions
- URLError
In [45]: from urllib import request, error
In [46]: import socket
In [49]: try:
...: response = request.urlopen('http://httpbin.org/status/404')
...: except error.URLError as e:
...: print("{}\n{}".format(type(e.reason), e.reason))
...:
<class 'str'>
NOT FOUND
- HTTPError
HTTPError is a subclass of URLError, so catch it first
In [51]: try:
...: response = request.urlopen('http://httpbin.org/status/404')
...: except error.HTTPError as e:
...: print("{}-{}\n{}".format(e.code, e.reason, e.headers))
...: except error.URLError as e:
...: print(e.reason)
...: else:
...: print("Request Success")
...:
404-NOT FOUND
Connection: close
Server: gunicorn/19.9.0
Date: Wed, 07 Nov 2018 17:22:18 GMT
Content-Type: text/html; charset=utf-8
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Content-Length: 0
Via: 1.1 vegur
- error.URLError.reason
In [55]: try:
...: response = request.urlopen('http://httpbin.org/status/404', timeout=0.1)
...: except error.URLError as e:
...: print(type(e.reason))
...: if isinstance(e.reason, socket.timeout):
...: print("Time Out")
...: else:
...: print("Request Success")
...:
<class 'socket.timeout'>
Time Out
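Putting the exception hierarchy together, a small fetch helper might look like the sketch below; the function name and return convention are my own:
import socket
from urllib import request, error

def fetch(url, timeout=5):
    """Return the response body, or None on any request failure."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except error.HTTPError as e:  # server answered with an error status
        print("HTTP {} {}".format(e.code, e.reason))
    except error.URLError as e:   # network-level failure, incl. timeouts
        if isinstance(e.reason, socket.timeout):
            print("Time Out")
        else:
            print(e.reason)
    return None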
The parse module
urlparse
In [59]: url = "http://www.myweb.com/index.jsp?id=123#second"
In [61]: type(urllib.parse.urlparse(url))
Out[61]: urllib.parse.ParseResult
# parse a URL into its components
In [60]: urllib.parse.urlparse(url)
Out[60]: ParseResult(scheme='http', netloc='www.myweb.com', path='/index.jsp', params='', query='id=123', fragment='second')
# parse with scheme='https'; when the URL already specifies a scheme, the URL's own scheme wins
In [62]: urllib.parse.urlparse(url, scheme='https')
Out[62]: ParseResult(scheme='http', netloc='www.myweb.com', path='/index.jsp', params='', query='id=123', fragment='second')
# when the URL carries no scheme, the scheme argument is used
In [68]: url = "www.myweb.com/index.jsp?id=123#second"
In [69]: urllib.parse.urlparse(url, scheme='https')
Out[69]: ParseResult(scheme='https', netloc='', path='www.myweb.com/index.jsp', params='', query='id=123', fragment='second')
# don't split out the fragment
In [63]: urllib.parse.urlparse(url, allow_fragments=False)
Out[63]: ParseResult(scheme='http', netloc='www.myweb.com', path='/index.jsp', params='', query='id=123#second', fragment='')
# with no query string and allow_fragments=False, the fragment stays attached to the path
In [65]: url = 'http://www.baidu.com/index.php#second'
In [66]: urllib.parse.urlparse(url, allow_fragments=False)
Out[66]: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.php#second', params='', query='', fragment='')
# with fragment parsing on (the default)
In [67]: urllib.parse.urlparse(url)
Out[67]: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.php', params='', query='', fragment='second')
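ParseResult is a named tuple, so the six fields are also available as attributes, and geturl() reassembles the URL; a short sketch:
import urllib.parse
result = urllib.parse.urlparse('http://www.myweb.com/index.jsp?id=123#second')
print(result.netloc)    # 'www.myweb.com'
print(result.query)     # 'id=123'
print(result.geturl())  # rebuilds the original URL string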
urlunparse
Assemble a URL from its components
In [72]: data = ['https', 'www.baidu.com', '/user/login', 'not', 'user=8888', 'password']
In [73]: urllib.parse.urlunparse(data)
Out[73]: 'https://www.baidu.com/user/login;not?user=8888#password'
urljoin
Join two URLs
In [83]: url = 'https://www.baidu.com/index.php'
In [84]: new_url = 'http://www.myweb.com/index.jsp?id=123#second'
In [85]: urllib.parse.urljoin(url, new_url)
Out[85]: 'http://www.myweb.com/index.jsp?id=123#second'
A URL has six fields; fields present in the new URL take precedence, and missing ones are filled in from the base URL (see the sketch below).
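When the new URL is only partial, the missing fields come from the base; a sketch of two common cases:
from urllib.parse import urljoin
print(urljoin('https://www.baidu.com/index.php', 'user/login'))
# https://www.baidu.com/user/login -- a relative path replaces the last segment
print(urljoin('https://www.baidu.com/index.php', '?id=123'))
# https://www.baidu.com/index.php?id=123 -- only the query is new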
urlencode
URL-encode a dict of parameters
In [86]: dict_ = {
...: 'hee': 'honey'
...: }
In [87]: par = urllib.parse.urlencode(dict_)
In [88]: par
Out[88]: 'hee=honey'
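urlencode also accepts multiple keys and, with doseq=True, sequence values; urllib.parse.quote percent-encodes a single string. A short sketch:
from urllib.parse import urlencode, quote
params = urlencode({'name': 'bob', 'tags': ['a', 'b']}, doseq=True)
print('http://httpbin.org/get?' + params)  # name=bob&tags=a&tags=b appended
print(quote('hello python'))               # 'hello%20python'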