urllib provides a collection of functions for working with URLs.
1. GET
The request module of urllib makes it very easy to fetch the content of a URL: it sends a GET request to the specified page and returns the HTTP response.
1.1 urlopen
First, a simple example: send a GET request to http://httpbin.org. This site exists specifically for debugging HTTP clients and offers many endpoints, which makes it very friendly to anyone learning about network requests. For example, to test its GET endpoint you only need to visit http://httpbin.org/get. Opening that URL in a browser returns:
{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Host": "httpbin.org",
    "Referer": "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=http%3A%2F%2Fhttpbin.org%2Fget&oq=urllib&rsv_pq=a446a4660001fadd&rsv_t=7779tZ0%2FkXsGJ0POjiJkzATQhQOSaTHyme0dr%2B2aLc2X8zKxTRziPIRgk4A&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_n=2&rsv_sug3=1&rsv_sug2=0&inputT=818&rsv_sug4=819",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5e6a24ca-46637abaca0ee78e2a74e076"
  },
  "origin": "36.63.28.30",
  "url": "http://httpbin.org/get"
}
Now let's issue the same request with Python's urllib:
from urllib import request

with request.urlopen("http://httpbin.org/get") as f:
    data = f.read()
    print("status:", f.status, f.reason)
    for k, v in f.getheaders():
        print("%s : %s" % (k, v))
    print("data:", data.decode("utf-8"))
Note on the with statement: it implements the context-management protocol, whose purpose is to remove the try, except and finally keywords and the resource allocation/release code from the control flow, simplifying the try…except…finally pattern. Its syntax is:

with context as var:
    with_suite
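For intuition, the with block in the example above behaves roughly like the following try/finally version (a minimal sketch; the response object returned by urlopen is closed for us on exit):

from urllib import request

# Roughly what the with statement does for us:
f = request.urlopen("http://httpbin.org/get")
try:
    data = f.read()
finally:
    f.close()  # close() runs even if read() raises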
Run the script and watch the console output: it shows the HTTP response status, the headers, and the JSON body.
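Abridged output from one run (header values will vary; note the default User-Agent, which we discuss next):

status: 200 OK
Content-Type : application/json
...
data: {
  "args": {},
  "headers": {
    ...
    "User-Agent": "Python-urllib/3.8",
    ...
  },
  ...
}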
1.2 Request
Notice that in the JSON data the User-Agent field is Python-urllib/3.8, while a browser sends something like Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36. The former clearly identifies the request as coming from Python. So how can Python imitate a browser when making a request? This is where Request comes in.
Here is the code from 1.1, revised accordingly:
from urllib import request

# Build a Request object for the target URL
req = request.Request("http://httpbin.org/get")
# Add a browser-style User-Agent header
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/80.0.3987.116 Safari/537.36')
# Pass the Request object to urlopen instead of a plain URL
with request.urlopen(req) as f:
    data = f.read()
    print("status:", f.status, f.reason)
    for k, v in f.getheaders():
        print("%s : %s" % (k, v))
    print("data:", data.decode("utf-8"))
The changes are marked with comments in the code: all we did was set the User-Agent header by hand to the value a browser sends. Besides reading it off httpbin.org, you can also obtain this value from requests captured with the browser's F12 developer tools or any other packet-capture tool. Inspecting the output again:
You can see that User-Agent now carries the value we set.
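As a side note, the header does not have to be added after construction; Request also accepts a headers dict directly. A minimal sketch with the same Chrome string:

from urllib import request

# Equivalent: supply the headers when building the Request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/80.0.3987.116 Safari/537.36'}
req = request.Request("http://httpbin.org/get", headers=headers)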
The example above proves the point, but it is not very visual. Switching to a Shenma Search (神马搜索) URL makes the effect more tangible. Its URL is https://m.sm.cn/, and the site is intended for mobile browsers only; visited from a PC it serves only an ad page.
Take the User-Agent-modified code above and change only the url argument of Request, i.e. still access the site with the faked desktop User-Agent. The result is:
......
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
<link rel="shortcut icon" type="image/x-icon" href="//sm01.alicdn.com/L1/272/1990/favicon/favicon.ico" />
<title>神马搜索</title>
<meta name="description" content="神马是全球第一款完全基于移动互联网的搜索引擎。神马为移动而生,专注于移动搜索用户刚需满足和痛点解决,致力于创造有用、有趣的全新移动搜索体验。" />
<meta name="keywords" content="神马搜索,sm.cn,sm搜索,手机移动搜索,移动搜索" />
.......
</body></html>
...
As you can see, the data returned is the source of the whole page. Now change the User-Agent to an iPhone browser string: Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25
The returned data becomes:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"><meta content="width=device-width,maximum-scale=1.0,minimum-scale=1.0,initial-scale=1.0,user-scalable=no" name="viewport">
<meta name="format-detection" content="telephone=no"><meta content="always" name="referrer"><meta name="data-spm" data-spm-protocol="i"><link rel="shortcut icon" href="//sm01.alicdn.com/L1/272/1990/favicon/favicon.ico" type="image/x-icon" />
<title>神马搜索</title>
<script id="ls[hJP]">!function(e,t){"use strict";var n=e.sm||{},o=e.encodeURIComponent,r=e.decodeURIComponent,i=[],a=[],s={}
.......
</script>
</html>
The page content has clearly changed a great deal: it is no longer the ad page from before, but the actual mobile search page. And what if User-Agent is set back to the original Python value? In my tests, Shenma Search still refuses to let plain python-urllib clients search.
In short: the User-Agent header is what allows us to impersonate a browser.
2. POST
To send a request with POST, just pass the parameters in bytes form as data.
Again using http://httpbin.org as the example:
from urllib import request, parse

# Define a dict of form data
info_dict = {'username': '123', 'password': '456'}
# urlencode serializes the dict into a query string,
# e.g. 'username=123&password=456'
info_encoded = parse.urlencode(info_dict)
# Convert the serialized string to bytes
info = bytes(info_encoded, 'utf-8')

req = request.Request("http://httpbin.org/post")
req.add_header("User-Agent", "NiuBai Browser")
with request.urlopen(req, data=info) as f:
    print("status:", f.status, f.reason)
    for k, v in f.getheaders():
        print("%s : %s" % (k, v))
    print("data: ", f.read().decode('utf-8'))
Run it and look at the JSON that httpbin.org echoes back: the submitted values show up under its form field, confirming that the parameters were transmitted.
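Form encoding is not the only option. If the server expects a JSON body instead, you can serialize the dict with json and set the Content-Type header yourself; a minimal sketch, again against httpbin.org (which echoes a JSON body back under its json field):

import json
from urllib import request

# Serialize the dict to JSON and encode it to bytes
payload = json.dumps({'username': '123', 'password': '456'}).encode('utf-8')
req = request.Request("http://httpbin.org/post", data=payload)
# Tell the server the body is JSON rather than form data
req.add_header('Content-Type', 'application/json')
with request.urlopen(req) as f:
    print(f.read().decode('utf-8'))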
3. The structure of urllib
The two parts above tested urllib from the GET and POST angles, using mainly the urllib.request and urllib.parse modules. What other modules does urllib have? The documentation on the official Python website describes it as follows:
`urllib` is a package that collects several modules for working with URLs:
* [`urllib.request`](https://docs.python.org/3/library/urllib.request.html#module-urllib.request "urllib.request: Extensible library for opening URLs.") for opening and reading URLs
* [`urllib.error`](https://docs.python.org/3/library/urllib.error.html#module-urllib.error "urllib.error: Exception classes raised by urllib.request.") containing the exceptions raised by [`urllib.request`](https://docs.python.org/3/library/urllib.request.html#module-urllib.request "urllib.request: Extensible library for opening URLs.")
* [`urllib.parse`](https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse "urllib.parse: Parse URLs into or assemble them from components.") for parsing URLs
* [`urllib.robotparser`](https://docs.python.org/3/library/urllib.robotparser.html#module-urllib.robotparser "urllib.robotparser: Load a robots.txt file and answer questions about fetchability of other URLs.") for parsing `robots.txt` files
That is: besides the request and parse modules used above, urllib also bundles error, which holds the exceptions raised by request, and robotparser for handling robots.txt files.
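Of these, urllib.error is worth a quick demonstration, since request raises its exceptions whenever a fetch fails. A minimal sketch using httpbin.org's status endpoint, which replies with whatever status code appears in the URL:

from urllib import request, error

try:
    request.urlopen("http://httpbin.org/status/404")
except error.HTTPError as e:
    # The server answered, but with an error status code
    print("HTTP error:", e.code, e.reason)
except error.URLError as e:
    # No valid response at all (DNS failure, connection refused, ...)
    print("URL error:", e.reason)

Note that HTTPError is a subclass of URLError, so it must be caught first.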
4. More advanced requests
If you need more complex control, such as accessing a site through a proxy, you can use a ProxyHandler. Sample code:
import urllib.request

# Route HTTP traffic through the proxy and authenticate against it
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')

# build_opener chains the handlers into a reusable opener
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
with opener.open('http://www.example.com/login.html') as f:
    pass
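If every subsequent urlopen call in the process should also go through this opener, it can be registered globally with urllib.request.install_opener(opener); after that, plain request.urlopen() uses the proxy as well.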
5. Exercise
Use urllib to read JSON, then parse the JSON into a Python object (using http://httpbin.org/headers as the example).
Reference solution:
from urllib import request
import json

def fetch_data(url):
    req = request.Request(url)
    with request.urlopen(req) as f:
        data_obj = f.read().decode("utf-8")
        # print(data_obj)
        return json.loads(data_obj)

url = "http://httpbin.org/headers"
data = fetch_data(url)
print(data)
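Since json.loads returns an ordinary Python dict here, its fields can be read directly; for example, data["headers"]["Host"] should evaluate to "httpbin.org".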