wallhaven-kwpprm.jpg

处理异常

urllib的异常处理模块中包含，URLError和HTTPError两种类型

URLError

URLError继承自OSerror,具有一个属性即reason 下面通过实例进行展示：

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n6" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib import request,error
try:
response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
print(e.reason)</pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n7" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">Not Found</pre>

HTTPError

是URLError的子类，用于处理HTTP请求错误有三个参数分别是：

code 返回状态码
reason 返回原因
headers 返回请求头下面通过实例进行介绍

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n19" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib import request,error
try:
response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
print(e.code,e.reason,e.headers,sep="\n")</pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n20" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">404
Not Found
Server: GitHub.com
Date: Sat, 06 Mar 2021 11:21:28 GMT
Content-Type: text/html; charset=utf-8
X-NWS-UUID-VERIFY: 8e28a376520626e0b40a8367b1c3ef01
Access-Control-Allow-Origin: *
ETag: "603c6eb8-c62c"
x-proxy-cache: MISS
X-GitHub-Request-Id: 2048:3FA9:10723D:12979D:60436434
Accept-Ranges: bytes
Age: 388
Via: 1.1 varnish
X-Served-By: cache-tyo11971-TYO
X-Cache: HIT
X-Cache-Hits: 0
X-Timer: S1615029689.514317,VS0,VE0
Vary: Accept-Encoding
X-Fastly-Request-ID: e918aae0b13b8d2e0a544247f6886f793856e5f1
X-Daa-Tunnel: hop_count=2
X-Cache-Lookup: Hit From Upstream
X-Cache-Lookup: Hit From Inner Cluster
Content-Length: 50732
X-NWS-LOG-UUID: 10709706064540277416
Connection: close
X-Cache-Lookup: Cache Miss</pre>

由于URLError是HTTPError的父类，所以可以先捕捉子类错误再捕捉父类错误，这是更加效率的写法由于子类错误一定会在父类错误中出现，所以先小后大可以更加精准地捕捉错误

解析链接

url的解析与构造办法

urlparse() 通过解析后返回的结果是：ParseResult(scheme='https', netloc='www.zhihu.com', path='/question/443763645/answer/1728725938', params='', query='', fragment='') 其中scheme代表通信协议，netloc代表域名，path代表访问路径，params代表参数，query代表查询条件，fragment代表的是锚点

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n29" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlparse
result = urlparse('https://www.zhihu.com/question/443763645/answer/1728725938')
print(type(result),result,sep="\n")</pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n30" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"><class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.zhihu.com', path='/question/443763645/answer/1728725938', params='', query='', fragment='')</pre>

通过urlparse解析的内容有三个参数，分别是：

import urllib urllib.parse.urlparse(url, scheme='', allow_fragments=True) scheme即为协议，者url中无协议时，可以再scheme中指定

urlunparse() 与urlparse功能是相对的，即组成url，但是接受的参数长度必须为6，如例子data也可以是元组或其他类型。

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n37" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlunparse
data = ['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))</pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n38" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com/index.html;user?a=6#comment</pre>

urlsplit() 与urlparse()方法相似，但是不会解析params的内容

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n44" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlsplit
result = urlsplit('https://www.zhihu.com/people/a-tu-14-28')
print(type(result),result,sep='\n')</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n45" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"><class 'urllib.parse.SplitResult'>
SplitResult(scheme='https', netloc='www.zhihu.com', path='/people/a-tu-14-28', query='', fragment='')</pre>

urlunsplit() 与urlsplit()方法对立，但是参数长度必须是5

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n51" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlunsplit
data = ['http','www.baidu.com','index.html','a=6','comment']
urlunsplit(data)</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n54" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">'http://www.baidu.com/index.html?a=6#comment'</pre>

urljoin() API:urllib.parse.urljoin(base, url, allow_fragments=True) 先看以下给出的API，以base作为基础链接，url作为新链接参数，根据scheme，netloc和path对缺失内容自动进行补充

下面以实例进行展示：

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n61" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urljoin
print(urljoin('http://www.baidu.com','FAQ.html'))
print(urljoin('http://www.baidu.com','https://www.zhihu.com/question/28358499/answer/278409457'))</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n62" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com/FAQ.html
https://www.zhihu.com/question/28358499/answer/278409457</pre>

实际上，base_url提供的三项内容scheme、netloc和path，如果这三项在新链接中都不存在则进行补充，如果在新链接中存在就用新链接的内容

urlencode() 在构造get请求参数时较为有用，可以将字典转换成get请求参数下面以实例进行展示：

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n68" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlencode

params = {
'name':'Atul',
'age':'22'
}
base_url = 'http://www.baidu.com?'
url = base_url+urlencode(params)
print(url)</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n69" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com?name=Atul&age=22</pre>

quote() 该方法可以把内容转换成URL编码的格式。在链接中有中文时，可以通过quote()转换成URL编码避免乱码。

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n75" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import quote
keyword ='壁纸'
url = 'http://www.baidu.com/s?wd'+quote(keyword)
print(url)</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n76" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com/s?wd%E5%A3%81%E7%BA%B8</pre>

unquote() 与quote方法功能相反，将url编码转回

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n82" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import unquote
print(unquote(url))</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n83" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com/s?wd壁纸</pre>

分析robots协议

robots协议

Robots协议全称为网络爬虫排除标准(Robots Exclusion Protocol),用以告诉爬虫和搜索引擎那部分内容是可以爬取的，那部分是不可以爬取的，通常会以Robots.txt的文件格式在网站的根目录下，以实例进行展示： User-agent：* Disallow：/ 表示禁止所有爬虫访问任何目录，其中*就是表示针对所有爬虫 User-agent：* Disallow：则是表示允许所有爬虫访问网站还有禁止访问部分网站，或仅允许部分爬虫访问的办法具体可参考官方文档

robotparser

使用robotparser解析robots.txt文件可以解析网站是否可以被爬取

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n92" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">urllib.robotparser.RobotFileParser(url='')</pre>

下面列出几个常用的类方法：

set_url(): 用以设置robots.txt文件的链接。如果前面设置了url=''，这里就不需要设置。
read():用以读取robots文件并进行分析，这部分是必须做的，不会返回任何内容但是执行了读取操作，如果没有这个步骤，后面的操作都会为False
parse():用以解析输入的需要解析的内容，会按照robots.txt的规则进行解析
can_fetch():该方法输入两个参数，分别是User-agent和url，会输出是否可以爬取该网页
mtime():返回的是分析和抓取的时间
modified():将当前时间设置为上次分析和抓取的时间下面通过对知乎的爬取进行演示

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n108" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.zhihu.com/robots.txt')
rp.read()
print(rp.can_fetch('','https://www.zhihu.com/question/388419751/answer/1724872519'))
print(rp.can_fetch('','https://www.zhihu.com/hot'))</pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n109" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">False
False</pre>

004416-1612197856dbf8.jpg

看到都是不让爬的，但是可以通过后面的requests的学习，干翻他希望和你一起沟通学习。

urllib 02