urllib 02

wallhaven-kwpprm.jpg

处理异常

urllib的异常处理模块中包含,URLError和HTTPError两种类型

URLError

URLError继承自OSerror,具有一个属性即reason 下面通过实例进行展示:

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n6" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib import request,error
try:
response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
print(e.reason)</pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n7" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">Not Found</pre>

HTTPError

是URLError的子类,用于处理HTTP请求错误 有三个参数分别是:

  • code 返回状态码

  • reason 返回原因

  • headers 返回请求头 下面通过实例进行介绍

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n19" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib import request,error
try:
response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
print(e.code,e.reason,e.headers,sep="\n")</pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n20" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">404
Not Found
Server: GitHub.com
Date: Sat, 06 Mar 2021 11:21:28 GMT
Content-Type: text/html; charset=utf-8
X-NWS-UUID-VERIFY: 8e28a376520626e0b40a8367b1c3ef01
Access-Control-Allow-Origin: *
ETag: "603c6eb8-c62c"
x-proxy-cache: MISS
X-GitHub-Request-Id: 2048:3FA9:10723D:12979D:60436434
Accept-Ranges: bytes
Age: 388
Via: 1.1 varnish
X-Served-By: cache-tyo11971-TYO
X-Cache: HIT
X-Cache-Hits: 0
X-Timer: S1615029689.514317,VS0,VE0
Vary: Accept-Encoding
X-Fastly-Request-ID: e918aae0b13b8d2e0a544247f6886f793856e5f1
X-Daa-Tunnel: hop_count=2
X-Cache-Lookup: Hit From Upstream
X-Cache-Lookup: Hit From Inner Cluster
Content-Length: 50732
X-NWS-LOG-UUID: 10709706064540277416
Connection: close
X-Cache-Lookup: Cache Miss</pre>

由于URLError是HTTPError的父类,所以可以先捕捉子类错误再捕捉父类错误,这是更加效率的写法 由于子类错误一定会在父类错误中出现,所以先小后大可以更加精准地捕捉错误

解析链接

url的解析与构造办法

  1. urlparse() 通过解析后返回的结果是:ParseResult(scheme='https', netloc='www.zhihu.com', path='/question/443763645/answer/1728725938', params='', query='', fragment='') 其中scheme代表通信协议,netloc代表域名,path代表访问路径,params代表参数,query代表查询条件,fragment代表的是锚点

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n29" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlparse
result = urlparse('https://www.zhihu.com/question/443763645/answer/1728725938')
print(type(result),result,sep="\n")</pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n30" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"><class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.zhihu.com', path='/question/443763645/answer/1728725938', params='', query='', fragment='')</pre>

通过urlparse解析的内容有三个参数,分别是:

import urllib urllib.parse.urlparse(url, scheme='', allow_fragments=True) scheme即为协议,者url中无协议时,可以再scheme中指定

  1. urlunparse() 与urlparse功能是相对的,即组成url,但是接受的参数长度必须为6,如例子data也可以是元组或其他类型。

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n37" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlunparse
data = ['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))</pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n38" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com/index.html;user?a=6#comment</pre>

  1. urlsplit() 与urlparse()方法相似,但是不会解析params的内容

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n44" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlsplit
result = urlsplit('https://www.zhihu.com/people/a-tu-14-28')
print(type(result),result,sep='\n')</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n45" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"><class 'urllib.parse.SplitResult'>
SplitResult(scheme='https', netloc='www.zhihu.com', path='/people/a-tu-14-28', query='', fragment='')</pre>

  1. urlunsplit() 与urlsplit()方法对立,但是参数长度必须是5

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n51" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlunsplit
data = ['http','www.baidu.com','index.html','a=6','comment']
urlunsplit(data)</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n54" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">'http://www.baidu.com/index.html?a=6#comment'</pre>

  1. urljoin() API:urllib.parse.urljoin(base, url, allow_fragments=True) 先看以下给出的API,以base作为基础链接,url作为新链接参数,根据scheme,netloc和path对缺失内容自动进行补充

下面以实例进行展示:

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n61" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urljoin
print(urljoin('http://www.baidu.com','FAQ.html'))
print(urljoin('http://www.baidu.com','https://www.zhihu.com/question/28358499/answer/278409457'))</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n62" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com/FAQ.html
https://www.zhihu.com/question/28358499/answer/278409457</pre>

实际上,base_url提供的三项内容scheme、netloc和path,如果这三项在新链接中都不存在则进行补充,如果在新链接中存在就用新链接的内容

  1. urlencode() 在构造get请求参数时较为有用,可以将字典转换成get请求参数 下面以实例进行展示:

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n68" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlencode

params = {
'name':'Atul',
'age':'22'
}
base_url = 'http://www.baidu.com?'
url = base_url+urlencode(params)
print(url)</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n69" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com?name=Atul&age=22</pre>

  1. quote() 该方法可以把内容转换成URL编码的格式。在链接中有中文时,可以通过quote()转换成URL编码避免乱码。

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n75" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import quote
keyword ='壁纸'
url = 'http://www.baidu.com/s?wd'+quote(keyword)
print(url)</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n76" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com/s?wd%E5%A3%81%E7%BA%B8</pre>

  1. unquote() 与quote方法功能相反,将url编码转回

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n82" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import unquote
print(unquote(url))</pre>

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n83" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com/s?wd壁纸</pre>

分析robots协议

robots协议

Robots协议全称为网络爬虫排除标准(Robots Exclusion Protocol),用以告诉爬虫和搜索引擎那部分内容是可以爬取的,那部分是不可以爬取的,通常会以Robots.txt的文件格式在网站的根目录下,以实例进行展示: User-agent:* Disallow:/ 表示禁止所有爬虫访问任何目录,其中*就是表示针对所有爬虫 User-agent:* Disallow: 则是表示允许所有爬虫访问网站 还有禁止访问部分网站,或仅允许部分爬虫访问的办法具体可参考官方文档

robotparser

使用robotparser解析robots.txt文件可以解析网站是否可以被爬取

<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n92" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">urllib.robotparser.RobotFileParser(url='')</pre>

下面列出几个常用的类方法:

  • set_url(): 用以设置robots.txt文件的链接。如果前面设置了url='',这里就不需要设置。

  • read():用以读取robots文件并进行分析,这部分是必须做的,不会返回任何内容但是执行了读取操作,如果没有这个步骤,后面的操作都会为False

  • parse():用以解析输入的需要解析的内容,会按照robots.txt的规则进行解析

  • can_fetch():该方法输入两个参数,分别是User-agent和url,会输出是否可以爬取该网页

  • mtime():返回的是分析和抓取的时间

  • modified():将当前时间设置为上次分析和抓取的时间 下面通过对知乎的爬取进行演示

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n108" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.zhihu.com/robots.txt')
rp.read()
print(rp.can_fetch('','https://www.zhihu.com/question/388419751/answer/1724872519'))
print(rp.can_fetch('
','https://www.zhihu.com/hot'))</pre>

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n109" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">False
False</pre>


004416-1612197856dbf8.jpg

看到都是不让爬的,但是可以通过后面的requests的学习,干翻他 希望和你一起沟通学习。

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 204,684评论 6 478
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 87,143评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 151,214评论 0 337
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,788评论 1 277
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,796评论 5 368
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,665评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,027评论 3 399
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,679评论 0 258
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 41,346评论 1 299
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,664评论 2 321
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,766评论 1 331
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,412评论 4 321
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,015评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,974评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,203评论 1 260
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,073评论 2 350
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,501评论 2 343

推荐阅读更多精彩内容