urllib provides a collection of functions for working with URLs.
1. GET
The request module of urllib makes it very easy to fetch the content of a URL: it sends a GET request to the specified page and returns the HTTP response.
1.1 urlopen
First, a simple example: send a GET request to http://httpbin.org. This site exists specifically for debugging HTTP clients and offers many endpoints, which makes it very friendly to anyone learning about network requests. For example, to test its GET endpoint you only need to visit http://httpbin.org/get. Opening that URL in a browser returns:
{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Host": "httpbin.org",
    "Referer": "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=http%3A%2F%2Fhttpbin.org%2Fget&oq=urllib&rsv_pq=a446a4660001fadd&rsv_t=7779tZ0%2FkXsGJ0POjiJkzATQhQOSaTHyme0dr%2B2aLc2X8zKxTRziPIRgk4A&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_n=2&rsv_sug3=1&rsv_sug2=0&inputT=818&rsv_sug4=819",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5e6a24ca-46637abaca0ee78e2a74e076"
  },
  "origin": "36.63.28.30",
  "url": "http://httpbin.org/get"
}
Now let's issue the same request with Python's urllib:
from urllib import request

with request.urlopen("http://httpbin.org/get") as f:
    data = f.read()
    print("status:", f.status, f.reason)
    for k, v in f.getheaders():
        print("%s : %s" % (k, v))
    print("data:", data.decode("utf-8"))
Note on the with statement: it implements the context-management protocol, whose purpose is to remove the try, except and finally keywords and the resource allocation/release code from the control flow, simplifying the try…except…finally pattern. Its syntax is:

with context as var:
    with_suite
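For intuition, the with block in the example above behaves roughly like the following try/finally version (a minimal sketch; the response object returned by urlopen is closed for us on exit):

from urllib import request

# Roughly what the with statement does for us:
f = request.urlopen("http://httpbin.org/get")
try:
    data = f.read()
finally:
    f.close()  # close() runs even if read() raises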
Run the script and watch the console output: it shows the HTTP response status, the headers, and the JSON body.
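Abridged output from one run (header values will vary; note the default User-Agent, which we discuss next):

status: 200 OK
Content-Type : application/json
...
data: {
  "args": {},
  "headers": {
    ...
    "User-Agent": "Python-urllib/3.8",
    ...
  },
  ...
}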
1.2 Request
Notice that in the JSON data the User-Agent field is Python-urllib/3.8, while a browser sends something like Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36. The former clearly identifies the request as coming from Python. So how can Python imitate a browser when making a request? This is where Request comes in.
Here is the code from 1.1, revised accordingly:
from urllib import request

# Build a Request object for the target URL
req = request.Request("http://httpbin.org/get")
# Add a browser-style User-Agent header
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/80.0.3987.116 Safari/537.36')
# Pass the Request object to urlopen instead of a plain URL
with request.urlopen(req) as f:
    data = f.read()
    print("status:", f.status, f.reason)
    for k, v in f.getheaders():
        print("%s : %s" % (k, v))
    print("data:", data.decode("utf-8"))
The changes are marked with comments in the code: all we did was set the User-Agent header by hand to the value a browser sends. Besides reading it off httpbin.org, you can also obtain this value from requests captured with the browser's F12 developer tools or any other packet-capture tool. Inspecting the output again:
You can see that User-Agent now carries the value we set.
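As a side note, the header does not have to be added after construction; Request also accepts a headers dict directly. A minimal sketch with the same Chrome string:

from urllib import request

# Equivalent: supply the headers when building the Request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/80.0.3987.116 Safari/537.36'}
req = request.Request("http://httpbin.org/get", headers=headers)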
The example above proves the point, but it is not very visual. Switching to a Shenma Search (神马搜索) URL makes the effect more tangible. Its URL is https://m.sm.cn/, and the site is intended for mobile browsers only; visited from a PC it serves only an ad page.
Take the User-Agent-modified code above and change only the url argument of Request, i.e. still access the site with the faked desktop User-Agent. The result is:
......
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
<link rel="shortcut icon" type="image/x-icon" href="//sm01.alicdn.com/L1/272/1990/favicon/favicon.ico" />
<title>神马搜索</title>
<meta name="description" content="神马是全球第一款完全基于移动互联网的搜索引擎。神马为移动而生,专注于移动搜索用户刚需满足和痛点解决,致力于创造有用、有趣的全新移动搜索体验。" />
<meta name="keywords" content="神马搜索,sm.cn,sm搜索,手机移动搜索,移动搜索" />
.......
</body></html>
...
As you can see, the data returned is the source of the whole page. Now change the User-Agent to an iPhone browser string: Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25
The returned data becomes:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"><meta content="width=device-width,maximum-scale=1.0,minimum-scale=1.0,initial-scale=1.0,user-scalable=no" name="viewport">
<meta name="format-detection" content="telephone=no"><meta content="always" name="referrer"><meta name="data-spm" data-spm-protocol="i"><link rel="shortcut icon" href="//sm01.alicdn.com/L1/272/1990/favicon/favicon.ico" type="image/x-icon" />
<title>神马搜索</title>
<script id="ls[hJP]">!function(e,t){"use strict";var n=e.sm||{},o=e.encodeURIComponent,r=e.decodeURIComponent,i=[],a=[],s={}
.......
</script>
</html>
The page content has clearly changed a great deal: it is no longer the ad page from before, but the actual mobile search page. And what if User-Agent is set back to the original Python value? In my tests, Shenma Search still refuses to let plain python-urllib clients search.
In short: the User-Agent header is what allows us to impersonate a browser.
2. POST
To send a request with POST, just pass the parameters in bytes form as data.
Again using http://httpbin.org as the example:
from urllib import request, parse

# Define a dict of form data
info_dict = {'username': '123', 'password': '456'}
# urlencode serializes the dict into a query string,
# e.g. 'username=123&password=456'
info_encoded = parse.urlencode(info_dict)
# Convert the serialized string to bytes
info = bytes(info_encoded, 'utf-8')

req = request.Request("http://httpbin.org/post")
req.add_header("User-Agent", "NiuBai Browser")
with request.urlopen(req, data=info) as f:
    print("status:", f.status, f.reason)
    for k, v in f.getheaders():
        print("%s : %s" % (k, v))
    print("data: ", f.read().decode('utf-8'))
Run it and look at the JSON that httpbin.org echoes back: the submitted values show up under its form field, confirming that the parameters were transmitted.
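Form encoding is not the only option. If the server expects a JSON body instead, you can serialize the dict with json and set the Content-Type header yourself; a minimal sketch, again against httpbin.org (which echoes a JSON body back under its json field):

import json
from urllib import request

# Serialize the dict to JSON and encode it to bytes
payload = json.dumps({'username': '123', 'password': '456'}).encode('utf-8')
req = request.Request("http://httpbin.org/post", data=payload)
# Tell the server the body is JSON rather than form data
req.add_header('Content-Type', 'application/json')
with request.urlopen(req) as f:
    print(f.read().decode('utf-8'))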
3. The structure of urllib
The two parts above tested urllib from the GET and POST angles, using mainly the urllib.request and urllib.parse modules. What other modules does urllib have? The documentation on the official Python website describes it as follows:
`urllib` is a package that collects several modules for working with URLs:
* [`urllib.request`](https://docs.python.org/3/library/urllib.request.html#module-urllib.request "urllib.request: Extensible library for opening URLs.") for opening and reading URLs
* [`urllib.error`](https://docs.python.org/3/library/urllib.error.html#module-urllib.error "urllib.error: Exception classes raised by urllib.request.") containing the exceptions raised by [`urllib.request`](https://docs.python.org/3/library/urllib.request.html#module-urllib.request "urllib.request: Extensible library for opening URLs.")
* [`urllib.parse`](https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse "urllib.parse: Parse URLs into or assemble them from components.") for parsing URLs
* [`urllib.robotparser`](https://docs.python.org/3/library/urllib.robotparser.html#module-urllib.robotparser "urllib.robotparser: Load a robots.txt file and answer questions about fetchability of other URLs.") for parsing `robots.txt` files
That is: besides the request and parse modules used above, urllib also bundles error, which holds the exceptions raised by request, and robotparser for handling robots.txt files.
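Of these, urllib.error is worth a quick demonstration, since request raises its exceptions whenever a fetch fails. A minimal sketch using httpbin.org's status endpoint, which replies with whatever status code appears in the URL:

from urllib import request, error

try:
    request.urlopen("http://httpbin.org/status/404")
except error.HTTPError as e:
    # The server answered, but with an error status code
    print("HTTP error:", e.code, e.reason)
except error.URLError as e:
    # No valid response at all (DNS failure, connection refused, ...)
    print("URL error:", e.reason)

Note that HTTPError is a subclass of URLError, so it must be caught first.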
4. More advanced requests
If you need more complex control, such as accessing a site through a proxy, you can use a ProxyHandler. Sample code:
import urllib.request

# Route HTTP traffic through the proxy and authenticate against it
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')

# build_opener chains the handlers into a reusable opener
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
with opener.open('http://www.example.com/login.html') as f:
    pass
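If every subsequent urlopen call in the process should also go through this opener, it can be registered globally with urllib.request.install_opener(opener); after that, plain request.urlopen() uses the proxy as well.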
5. Exercise
Use urllib to read JSON, then parse the JSON into a Python object (using http://httpbin.org/headers as the example).
Reference solution:
from urllib import request
import json

def fetch_data(url):
    req = request.Request(url)
    with request.urlopen(req) as f:
        data_obj = f.read().decode("utf-8")
        # print(data_obj)
        return json.loads(data_obj)

url = "http://httpbin.org/headers"
data = fetch_data(url)
print(data)
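Since json.loads returns an ordinary Python dict here, its fields can be read directly; for example, data["headers"]["Host"] should evaluate to "httpbin.org".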