urllib 01

关于urllib库的使用request 发送请求urlopen()request类高级用法

关于urllib库的使用

urllib库是python内置的HTTP请求库,也就是说我们是不需要进行安装的。 其中urllib包括以下四个模块 1. request:用以模拟发送请求,就像我们浏览器的输入框,其中可以加入一些额外参数用以模拟浏览器 2.error 3.parse 4.robotparser

request 发送请求

urlopen()

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="markdowm" cid="n7" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">urllib.request()可以实现最基本的HTTP访问。
下面以一个实例爬取python官网进行展示</pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n8" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import urllib.request
response = urllib.request.urlopen('http://www.python.org')
print(response.read().decode('utf-8'))</pre>

可以看到python程序已经成功获取了网页源代码,但是发现在获取的response中的类型需要通过 response.read().decode('utf-8') 才能进行展示,检查一下response的数据类型为

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n11" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">type(response)</pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n14" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http.client.HTTPResponse</pre>

发现response的数据类型为http.client.HTTPResponse 该类型的对象主要有read() readinto() getheader(name) getheaders() fileno() 等方法 以及msg/version/status/reason/debuglecel/closed等属性

如 通过response.read()方法可以获取读取结果,通过response.status属性可以获取response的状态

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n19" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">response.status</pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n22" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">200</pre>

目前status属性为200,即是联通状态 可以对其他方式进行调用测试

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n26" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">print(response.getheaders())</pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n27" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">[('Connection', 'close'), ('Content-Length', '50897'), ('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur, 1.1 varnish, 1.1 varnish'), ('Accept-Ranges', 'bytes'), ('Date', 'Fri, 05 Mar 2021 11:22:01 GMT'), ('Age', '1919'), ('X-Served-By', 'cache-bwi5125-BWI, cache-hnd18721-HND'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '1, 3182'), ('X-Timer', 'S1614943322.979393,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n29" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">print(response.getheader('Content-length'))</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n30" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">50897</pre>

我觉得我有必要区别以下getheader()和getheaders(): 所以很明显的是,getheaders()方法用于输出全部的响应头的信息,而getheader(name) 则对应的输出响应值

urlopen()还有许多的参数,首先看一下urlopen的API

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n33" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">urllib.request.urlopen(
url,
data=None,
timeout=<object object at 0x000001F385D1F160>,
*,
cafile=None,
capath=None,
cadefault=False,
context=None,
) </pre>

下面分别从不同参数理解整个函数

  • data

下面通过一个实例来介绍data的用法 这里urlopen()方法中,data的类型必须是byte类型,所以输入的字典{'word':'hello'},必须通过bytes()方法转换成byte类型,而bytes()方法只接受string类型,通过urllib.parse.urlencode()方法将字典转换成string,再转换成byte类型 这里我用的是jupyter lab,所以在后面response.read().decode('utf-8')让换行符消失

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n41" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read().decode('utf-8'))</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n42" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">{
"args": {},
"data": "",
"files": {},
"form": {
"word": "hello"
},
"headers": {
"Accept-Encoding": "identity",
"Content-Length": "10",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "Python-urllib/3.7",
"X-Amzn-Trace-Id": "Root=1-60423d57-04ac587026fe33de2ccee34f"
},
"json": null,
"origin": "183.246.30.52",
"url": "http://httpbin.org/post"
}</pre>

form中,出现了输入的字典,则表示url的请求方式为post方式,这里简单介绍一下post方式和get方式的差异:

  1. 在安全性方面

    • get方式会让信息暴露在url中,可能导致密码泄露

    • post方式以表单的形式上传,能更好地保护信息安全

  2. 在传输数据量方面

    • get请求提交最多只有1024个字节

    • post请求无限制

  • timeout

timeout参数用于设置限定时间,即在限定的时间内如果没有获得服务器响应即报错。 通过下面的实例进行演示

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n65" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import socket
import urllib.request
import urllib.error
try:
response = urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)
except urllib.error.URLError as e:

print(response.read().decode('utf-8'))

if isinstance(e.reason,socket.timeout):
    print('TIME OUT')
    </pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n66" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">TIME OUT</pre>

由于时间超时,输出的错误类型是URLerror是属于urllib.error类型,所以导入相关库进行抓取错误。 timeout 类型错误是属于超市错误属于socket类型的错误

request类

介绍以下request的API内容

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n70" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">urllib.request.Request(
url,
data=None,
headers={},
origin_req_host=None,
unverifiable=False,
method=None,
) </pre>

urllib.request.Request( url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None, )

  1. 其中url是必选参数,其他都是可选参数

  2. data必须传bytes类型

  3. headers传入的内容是字典,作为请求头,可通过修改User-Agent伪装浏览器

  4. origin_req_host是指请求方的host地址或IP地址

  5. unverifiable表示的是这个请求是否是无法验证的??

  6. method传入的书请求方法,有get和post两种

以下通过一个实例来展示request类

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n87" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib import request, parse
url = 'http://httpbin.org/post'

构造请求头

headers = {
'User-Agent':'Mozilla/4.0(compatible; MSIE 5.5; Windows NT)',
'Host':'httpbin.org'
}

构造输入的表单数据

dic ={
'name':'Atu'
}

构造data数据

data = bytes(parse.urlencode(dic),encoding='utf-8')

构造了request类,即是一个伪装后的url

req = request.Request(
url=url,
data=None,
headers=headers,
origin_req_host=None,
unverifiable=False,
method='POST'
)
response = request.urlopen(req)
print(response.read().decode('utf-8'))
</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="xml" cid="n88" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">{
"args": {},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept-Encoding": "identity",
"Content-Length": "0",
"Host": "httpbin.org",
"User-Agent": "Mozilla/4.0(compatible; MSIE 5.5; Windows NT)",
"X-Amzn-Trace-Id": "Root=1-60430595-3161f57a0ee7913f60b742a4"
},
"json": null,
"origin": "223.104.247.132",
"url": "http://httpbin.org/post"
}</pre>

高级用法

对于更加高级的如cookie处理,代理设置等,我们可以选择强大的handler工具。 首先介绍urllib.request模块里面的BaseHandler父类,各个子类都继承了BaseHandler的属性:

  • HTTPDefaultErrorHandler: 用于处理HTTP相应错误,抛出的HTTPError错误

  • HTTPRedirectHandler:处于处理重定向[mark,重定向是指什么]

  • HTTPCookieProcessor:用于处理Cookies 还有其他handler类可以参考官方文档

另一个比较重要的类是OpenerDirector,我们称之为Opener,前面使用的urlopen()方法实际上就是一个urllib配置好的opener,为了实现更加高级的方法,这里使用Opener进行底层配置: 下面通过实例来检验相关用法:

  • 验证

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n104" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.request import HTTPPasswordMgrWithDefaultRealm,HTTPBasicAuthHandler,build_opener
from urllib.error import URLError

username = 'username'
password = 'password'

url='http://localhost:5000/'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None,url,username,password)
auth_handler =HTTPBasicAuthHandler(p) # auth_handler是由HTTPBasicAtuthHandler实例化的Handler对象,其输入额参数为 HTTPPasswordMgrWithDefaultRealm对象
opener = build_opener(auth_handler)

try:
result = opener.open(url) #使用open打开的对象就是需要的结果
html = result.read().decode('utf-8')
print(html)
except URLError as e:
print(e.reason)</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="xml" cid="n105" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">[WinError 10061] 由于目标计算机积极拒绝,无法连接。</pre>

  • 代理 做爬虫时间,大多数时候为了隐藏自身的真实ip都会选择代理,以以下实例进行展示: proxyHandler是用来做代理类,其参数是字典

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n111" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.error import URLError
from urllib.request import ProxyHandler,build_opener

proxy_handler = ProxyHandler({
'http':'http://127.0.0.1:9743',
'http':'http://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
response = opener.open('https://www.baidu.com')
print(response.read().decode('utf-8'))
except URLError as e:
print(e.reason)</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="xml" cid="n112" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"><html>
<head>
<script>
location.replace(location.href.replace("https://","http://"));
</script>
</head>
<body>
<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html></pre>

  • Cookie 下面通过实例介绍,保存cookie的Handler
  1. 声明CookieJar对象,把CookieJar对象进行实例化

  2. 使用HTTPCookieProcessor构建handler,其输入的参数为CookieJar

  3. 利用build_opener构建出Opener

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n125" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import http.cookiejar,urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')

for item in cookie:
print(item.name+"="+item.value)</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n126" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">BAIDUID=9871F8DE5C415A182773408756D8CCB7:FG=1
BIDUPSID=9871F8DE5C415A1867D1295E303CB06F
H_PS_PSSID=33517_33637_33261_33273_31660_33594_33605_33590_26350_22158
PSTM=1615017612
BDSVRTM=0
BD_HOME=1</pre>

考虑将Cookie内容本地保存,可以使用MozillaCookieJar方法. 保存文件的时候应当保存cookie文件,cookie文件的格式是[http.cookiejar.MozillaCookieJar],而response的格式是[http.client.HTTPResponse],

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n129" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">filename='cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename=filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)</pre>

也可以采用LWPCookieJar的格式保存,即声明: cookie = http.cookiejar.LWPCookieJar(),生成了文件格式有所不同

在看一下怎么读取文件中的Cookie

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n133" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">cookie = http.cookiejar.MozillaCookieJar()
cookie.load(filename="cookie.txt",ignore_discard=True,ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8')</pre>

以上代码可以输出百度首页的源代码,urllib的request用法到这里介绍为止,其他内容可以参考官网文档,然后后面继续是urllib的其他问题

希望能跟你沟通学习,能够一直成长,加油!

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 216,287评论 6 498
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 92,346评论 3 392
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 162,277评论 0 353
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,132评论 1 292
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,147评论 6 388
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,106评论 1 295
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,019评论 3 417
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,862评论 0 274
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,301评论 1 310
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,521评论 2 332
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,682评论 1 348
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,405评论 5 343
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,996评论 3 325
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,651评论 0 22
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,803评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,674评论 2 368
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,563评论 2 352

推荐阅读更多精彩内容