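Handling exceptions
urllib's exception-handling module, urllib.error, defines two exception types covered here: URLError and HTTPError.
URLError
URLError inherits from OSError and has one attribute, reason, which describes why the request failed. The example below demonstrates it: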
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n6" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib import request,error
try:
    # This page does not exist, so urlopen() raises an error
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    # reason explains why the request failed
    print(e.reason)</pre>
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n7" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">Not Found</pre>
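The reason attribute is not always a string; it can also be another exception instance, for example a timeout. The following is a minimal sketch of checking its type. The helper name describe_url_error is hypothetical, and the errors are constructed by hand so that the code runs without network access.

```python
import socket
from urllib import error


def describe_url_error(e):
    """Return a short message based on the type of e.reason (hypothetical helper)."""
    if isinstance(e.reason, socket.timeout):
        return 'TIME OUT'
    return str(e.reason)


# Errors built by hand instead of coming from a real request
print(describe_url_error(error.URLError(socket.timeout('timed out'))))
print(describe_url_error(error.URLError('Not Found')))
```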
HTTPError
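HTTPError is a subclass of URLError used for HTTP request errors. It has three attributes:
code: the HTTP status code
reason: the reason for the error
headers: the response headers
The example below demonstrates them: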
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n19" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib import request,error
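try:
    # This page does not exist, so the server answers with an HTTP error
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    # Print the status code, the reason and the response headers
    print(e.code, e.reason, e.headers, sep="\n")</pre>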
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n20" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">404
Not Found
Server: GitHub.com
Date: Sat, 06 Mar 2021 11:21:28 GMT
Content-Type: text/html; charset=utf-8
X-NWS-UUID-VERIFY: 8e28a376520626e0b40a8367b1c3ef01
Access-Control-Allow-Origin: *
ETag: "603c6eb8-c62c"
x-proxy-cache: MISS
X-GitHub-Request-Id: 2048:3FA9:10723D:12979D:60436434
Accept-Ranges: bytes
Age: 388
Via: 1.1 varnish
X-Served-By: cache-tyo11971-TYO
X-Cache: HIT
X-Cache-Hits: 0
X-Timer: S1615029689.514317,VS0,VE0
Vary: Accept-Encoding
X-Fastly-Request-ID: e918aae0b13b8d2e0a544247f6886f793856e5f1
X-Daa-Tunnel: hop_count=2
X-Cache-Lookup: Hit From Upstream
X-Cache-Lookup: Hit From Inner Cluster
Content-Length: 50732
X-NWS-LOG-UUID: 10709706064540277416
Connection: close
X-Cache-Lookup: Cache Miss</pre>
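Because URLError is the parent class of HTTPError, it is better to catch the subclass first and the parent class second. Every HTTPError is also a URLError, so an except clause for URLError placed first would also catch HTTP errors and hide their status code and headers. Going from the specific class to the general one lets each kind of error be handled precisely.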
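The ordering described above can be sketched as follows. The function name try_open and its opener parameter are assumptions made for illustration; the opener defaults to request.urlopen and exists only so that the function can be exercised without a real network request.

```python
from urllib import error, request


def try_open(url, opener=request.urlopen):
    """Open url and report the outcome (hypothetical helper)."""
    try:
        response = opener(url)
        response.close()
    except error.HTTPError as e:
        # Subclass first: the server answered with an error status
        return 'HTTPError: {} {}'.format(e.code, e.reason)
    except error.URLError as e:
        # Parent class second: any other failure, e.g. no connection
        return 'URLError: {}'.format(e.reason)
    return 'Request succeeded'
```

Calling try_open('http://cuiqingcai.com/index.htm') would perform a real request; with the page from the earlier examples it would be expected to take the HTTPError branch.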
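Parsing URLs
Methods for parsing and constructing URLs
- urlparse() The parsed result looks like ParseResult(scheme='https', netloc='www.zhihu.com', path='/question/443763645/answer/1728725938', params='', query='', fragment=''), where scheme is the protocol, netloc the domain name, path the access path, params the parameters, query the query string, and fragment the anchor.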
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n29" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlparse
result = urlparse('https://www.zhihu.com/question/443763645/answer/1728725938')
print(type(result),result,sep="\n")</pre>
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n30" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"><class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.zhihu.com', path='/question/443763645/answer/1728725938', params='', query='', fragment='')</pre>
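urlparse() accepts three parameters:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True) Here scheme is the default protocol: when the URL itself carries no scheme, one can be supplied through scheme. If allow_fragments is False, the fragment is not split out and stays part of the preceding component.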
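A short sketch of how the scheme and allow_fragments parameters behave. The URLs are only sample inputs; nothing is fetched. Note that the default scheme is used only when the URL has none, and that the domain is recognized as netloc only when the URL starts with //.

```python
from urllib.parse import urlparse

# No scheme in the URL: the default passed via scheme= is used
with_slashes = urlparse('//www.zhihu.com/hot', scheme='https')
print(with_slashes.scheme, with_slashes.netloc, with_slashes.path)

# Without the leading //, the whole string is treated as a path
without_slashes = urlparse('www.zhihu.com/hot', scheme='https')
print(repr(without_slashes.netloc), without_slashes.path)

# A scheme in the URL itself takes precedence over the default
explicit = urlparse('http://www.zhihu.com/hot', scheme='https')
print(explicit.scheme)

# allow_fragments=False keeps '#top' inside the path
no_fragment = urlparse('https://www.zhihu.com/hot#top', allow_fragments=False)
print(no_fragment.path, repr(no_fragment.fragment))
```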
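- urlunparse() The inverse of urlparse(): it assembles a URL from its components. The argument must contain exactly six items; data in the example is a list, but it could equally be a tuple or another iterable.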
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n37" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlunparse
data = ['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))</pre>
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n38" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com/index.html;user?a=6#comment</pre>
- urlsplit() Similar to urlparse(), but it does not split out params. They stay part of path, so the result has five fields instead of six.
<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n44" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlsplit
result = urlsplit('https://www.zhihu.com/people/a-tu-14-28')
print(type(result),result,sep='\n')</pre>
<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n45" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"><class 'urllib.parse.SplitResult'>
SplitResult(scheme='https', netloc='www.zhihu.com', path='/people/a-tu-14-28', query='', fragment='')</pre>
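To make the difference visible, the sketch below runs both functions on a URL that actually contains params. The URL is the one assembled in the urlunparse() example.

```python
from urllib.parse import urlparse, urlsplit

url = 'http://www.baidu.com/index.html;user?a=6#comment'

parsed = urlparse(url)
split = urlsplit(url)

# urlparse() separates ';user' into params
print(parsed.path, parsed.params)
# urlsplit() leaves ';user' attached to the path
print(split.path)
# Six fields versus five
print(len(parsed), len(split))
```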
- urlunsplit() The inverse of urlsplit(): it assembles a URL, and the argument must contain exactly five items.
<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n51" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlunsplit
data = ['http','www.baidu.com','index.html','a=6','comment']
urlunsplit(data)</pre>
<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n54" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">'http://www.baidu.com/index.html?a=6#comment'</pre>
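Because the result of urlsplit() is a named tuple, splitting and reassembling can be combined to change a single component. This is a small sketch of that idea, not something the original example does; _replace is the standard named-tuple method.

```python
from urllib.parse import urlsplit, urlunsplit

url = 'http://www.baidu.com/index.html?a=6#comment'

# Splitting and reassembling gives back the original URL
parts = urlsplit(url)
print(urlunsplit(parts) == url)

# Replace only the query and rebuild the URL
changed = urlunsplit(parts._replace(query='a=7'))
print(changed)
```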
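- urljoin() API: urllib.parse.urljoin(base, url, allow_fragments=True) As the signature shows, base is the base link and url is the new link. Components missing from the new link are filled in automatically from the scheme, netloc and path of base.
An example: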
<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n61" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urljoin
print(urljoin('http://www.baidu.com','FAQ.html'))
print(urljoin('http://www.baidu.com','https://www.zhihu.com/question/28358499/answer/278409457'))</pre>
<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n62" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com/FAQ.html
https://www.zhihu.com/question/28358499/answer/278409457</pre>
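In effect, the base URL supplies three components: scheme, netloc and path. Whatever the new link lacks is filled in from the base, and whatever the new link provides takes precedence.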
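A few more cases illustrate which parts come from the base. The paths used here (docs/index.html, FAQ.html) are made-up sample values.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/docs/index.html'

# Relative path: resolved against the directory of the base path
relative = urljoin(base, 'FAQ.html')
# Path starting with /: replaces the base path, keeps scheme and netloc
absolute_path = urljoin(base, '/FAQ.html')
# Link starting with //: brings its own netloc, only the scheme is inherited
other_host = urljoin(base, '//www.zhihu.com/hot')

print(relative, absolute_path, other_host, sep='\n')
```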
- urlencode() Useful when building the parameters of a GET request: it converts a dictionary into a query string. An example:
<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n68" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import urlencode
params = {
    'name': 'Atul',
    'age': '22'
}
base_url = 'http://www.baidu.com?'
url = base_url+urlencode(params)
print(url)</pre>
<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n69" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com?name=Atul&age=22</pre>
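urlencode() also percent-encodes values that are not plain ASCII, and parse_qs() from the same module turns a query string back into a dictionary. A brief sketch follows; the parameter names wd and pn are only sample keys.

```python
from urllib.parse import parse_qs, urlencode

# Non-ASCII values are percent-encoded as UTF-8; numbers are converted to text
query = urlencode({'wd': '壁纸', 'pn': 10})
print(query)

# parse_qs() reverses the process; each value comes back as a list
decoded = parse_qs('name=Atul&age=22')
print(decoded)
```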
- quote() Converts text into URL-encoded (percent-encoded) form. When a link contains Chinese characters, passing them through quote() avoids garbled text.
<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n75" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import quote
keyword ='壁纸'
url = 'http://www.baidu.com/s?wd'+quote(keyword)
print(url)</pre>
<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n76" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com/s?wd%E5%A3%81%E7%BA%B8</pre>
- unquote() The opposite of quote(): it decodes URL-encoded text back to its original form.
<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n82" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.parse import unquote
print(unquote(url))</pre>
<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="" cid="n83" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">http://www.baidu.com/s?wd壁纸</pre>
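The quote() example above appends the encoded keyword directly after wd. A query string normally separates key and value with =, so a search URL would more likely be built as in the sketch below, which also shows the round trip through unquote() and the effect of the safe parameter of quote().

```python
from urllib.parse import quote, unquote

keyword = '壁纸'
url = 'http://www.baidu.com/s?wd=' + quote(keyword)
print(url)

# unquote() restores the original text
print(unquote(url))

# '/' is left alone by default; safe='' encodes it as well
default_safe = quote('a b/c')
nothing_safe = quote('a b/c', safe='')
print(default_safe, nothing_safe)
```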
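Analyzing the Robots protocol
The Robots protocol
The Robots protocol, formally the Robots Exclusion Protocol, tells crawlers and search engines which parts of a site may be crawled and which may not. It usually takes the form of a robots.txt file placed in the root directory of the site. For example:
User-agent: *
Disallow: /
forbids all crawlers from accessing any directory; the * means the rule applies to every crawler.
User-agent: *
Disallow:
allows all crawlers to access the whole site. It is also possible to block only part of a site or to admit only certain crawlers; see the official documentation for details.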
robotparser
The robotparser module parses a robots.txt file and determines whether a given page of the site may be crawled.
<pre spellcheck="false" class="md-fences mock-cm md-end-block" lang="python" cid="n92" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">urllib.robotparser.RobotFileParser(url='')</pre>
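Some commonly used methods of the class:
set_url(): sets the URL of the robots.txt file. If the URL was already passed when creating the RobotFileParser, this call is not needed.
read(): fetches the robots.txt file and feeds it to the parser. This step is required. It returns nothing, but without it every later can_fetch() call returns False.
parse(): parses the robots.txt content passed in, given as a list of lines, according to the rules of the protocol.
can_fetch(): takes two arguments, the User-agent and the URL, and returns whether that crawler may fetch that page.
mtime(): returns the time at which robots.txt was last fetched and analyzed.
modified(): sets the time of the last fetch and analysis to the current time.
The following demonstrates this against Zhihu: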
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n108" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url('https://www.zhihu.com/robots.txt')
rp.read()
print(rp.can_fetch('','https://www.zhihu.com/question/388419751/answer/1724872519'))
print(rp.can_fetch('','https://www.zhihu.com/hot'))</pre>
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n109" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Consolas, Menlo, Monaco, monospace, serif; font-size: 0.9rem; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(254, 254, 254); display: block; break-inside: avoid; text-align: left; white-space: normal; position: relative !important; margin-left: 1em; padding-left: 1em; border: 1px solid rgb(221, 221, 221); padding-bottom: 8px; padding-top: 6px; margin-bottom: 1.5em; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">False
False</pre>
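Rules can also be checked without any network access by handing them to parse() directly. The sketch below does this for the two rule sets described earlier plus a partial block. The helper name allowed, the /private/ path and the example.com URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, url, agent='*'):
    """Check url against robots.txt content given as text (hypothetical helper)."""
    rp = RobotFileParser()
    # parse() expects a list of lines, so read() is not needed here
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)


block_all = 'User-agent: *\nDisallow: /'
allow_all = 'User-agent: *\nDisallow:'
block_private = 'User-agent: *\nDisallow: /private/'

print(allowed(block_all, 'https://example.com/index.html'))
print(allowed(allow_all, 'https://example.com/index.html'))
print(allowed(block_private, 'https://example.com/private/data.html'))
print(allowed(block_private, 'https://example.com/index.html'))
```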
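As the output shows, neither page may be crawled according to robots.txt, though the requests library covered later offers ways to deal with that. I hope we can keep exchanging ideas and learning together.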