requests is the most widely used module for scraping data. Compared with urllib, urllib2 and urllib3, whose names alone are enough to make your head spin, requests is not only powerful but also has a simple, friendly API; using it feels as smooth as silk.
The following examples demonstrate how to use requests.
Making a GET request
In [12]: r = requests.get('http://httpbin.org/get')
In [13]: print(r.text)
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.19.1"
  },
  "origin": "183.63.188.162",
  "url": "http://httpbin.org/get"
}
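Besides r.text, the Response object exposes a few other attributes that are handy for checking a request, such as r.status_code, r.headers and r.json() (all standard requests APIs). A quick sketch:
import requests

r = requests.get('http://httpbin.org/get')
print(r.status_code)               # 200 when the request succeeded
print(r.headers['Content-Type'])   # httpbin replies with application/json
print(r.json()['url'])             # parse the JSON body straight into a dict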
Adding parameters to a GET request
# Concatenating the parameters straight into the URL works, but is not recommended
In [14]: r = requests.get('http://httpbin.org/get?name=saiyan_cat&age=3')
# Better: put the parameters into a separate dict
In [15]: data = {
...: 'name': 'saiyan_cat',
...: 'age': 3
...: }
In [16]: r = requests.get('http://httpbin.org/get', params=data)
In [17]: print(r.text)
{
  "args": {
    "age": "3",
    "name": "saiyan_cat"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.19.1"
  },
  "origin": "183.63.188.162",
  "url": "http://httpbin.org/get?name=saiyan_cat&age=3"
}
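requests handles the URL encoding of params for you; r.url on the Response shows the final URL that was actually requested (continuing the session above):
In [18]: print(r.url)
http://httpbin.org/get?name=saiyan_cat&age=3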
Fetching binary data
Downloading an image is just a matter of fetching the binary data and saving it.
import requests
r = requests.get('https://github.com/favicon.ico')
# Save the image locally
with open('favicon.ico', 'wb') as f:
    f.write(r.content)
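For larger files it is usually better not to hold the whole body in memory; stream=True together with iter_content (both standard requests features) writes the file chunk by chunk. A minimal sketch, saving to a hypothetical favicon_stream.ico:
import requests

# stream=True defers downloading the body until it is iterated over
r = requests.get('https://github.com/favicon.ico', stream=True)
with open('favicon_stream.ico', 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):
        if chunk:  # skip keep-alive chunks
            f.write(chunk)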
Making a POST request
import requests
data = {'name': '塞亚猫', 'skill': '卖萌'}
r = requests.post('http://httpbin.org/post', data=data)
print(r.text)
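If the server expects a JSON body rather than form data, the json= parameter (available in current requests versions) serializes the dict and sets the Content-Type header for you. A minimal sketch:
import requests

data = {'name': '塞亚猫', 'skill': '卖萌'}
# json= sends the dict as a JSON body instead of form fields
r = requests.post('http://httpbin.org/post', json=data)
print(r.text)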
Uploading files
This is still just POSTing data; the data simply happens to be a file.
import requests
files = {'file': open('favicon.ico', 'rb')}
r = requests.post('http://httpbin.org/post', files=files)
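files= also accepts a (filename, file object, content type) tuple when you want to control how the uploaded part is described. A minimal sketch:
import requests

# Explicitly name the uploaded part and declare its content type
files = {'file': ('favicon.ico', open('favicon.ico', 'rb'), 'image/x-icon')}
r = requests.post('http://httpbin.org/post', files=files)
print(r.status_code)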
Handling cookies
Getting cookies
import requests
r = requests.get('https://www.taobao.com')
print(r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)
Output:
<RequestsCookieJar[<Cookie thw=cn for .taobao.com/>]>
thw=cn
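Cookies can also be sent with a request by passing a plain dict (or a RequestsCookieJar) to the cookies= parameter; httpbin simply echoes them back:
import requests

# requests builds the Cookie header from the dict for you
r = requests.get('http://httpbin.org/cookies', cookies={'number': '123456789'})
print(r.text)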
Logging in with cookies
Take Zhihu as an example: copy the cookie from your browser's developer tools.
import requests
# Replace this with your own cookie
cookie = '__DAYU_PP=EEJz2QFnjbMArAFzvJr7297f1f25fc0f; _zap=ace3...........'
# cookie = ''
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
'Cookie': cookie,
'Host': 'www.zhihu.com',
}
r = requests.get('https://www.zhihu.com/people/edit', headers=headers)
print(type(r.text))
flag = '一句话介绍' in r.text
# When logged in, the page contains '一句话介绍' and this prints True
print(flag)
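Instead of stuffing the whole string into the Cookie header, you can split it into a dict and pass it via cookies=; a sketch under the same assumption that cookie holds the string copied from your browser:
import requests

cookie = '__DAYU_PP=EEJz2QFnjbMArAFzvJr7297f1f25fc0f; _zap=ace3...........'
# Turn 'k1=v1; k2=v2' into a dict (assumes no '; ' inside the values)
cookies = dict(item.split('=', 1) for item in cookie.split('; '))
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}
r = requests.get('https://www.zhihu.com/people/edit', headers=headers, cookies=cookies)
print('一句话介绍' in r.text)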
Maintaining a session with Session
Each request sent with requests is like opening a brand-new browser: it does not remember the cookies from the previous request.
import requests
# Set a cookie
requests.get('http://httpbin.org/cookies/set/number/123456789')
# Read the cookies back
r = requests.get('http://httpbin.org/cookies')
print(r.text)
Output:
{
  "cookies": {}
}
To keep a session alive you could of course attach the same cookies to every request yourself, but that quickly gets clumsy.
A Session object maintains the session for you.
import requests
# Send the requests through a session instead
s = requests.Session()
# Set a cookie
s.get('http://httpbin.org/cookies/set/number/123456789')
# Read the cookies back
r = s.get('http://httpbin.org/cookies')
print(r.text)
Output:
{
  "cookies": {
    "number": "123456789"
  }
}
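A Session is also handy for setting defaults that apply to every request it sends, for example a common User-Agent; a minimal sketch (the User-Agent value below is just a placeholder):
import requests

s = requests.Session()
# Headers set on the session are merged into every request made through it
s.headers.update({'User-Agent': 'my-crawler/0.1'})
r = s.get('http://httpbin.org/headers')
print(r.text)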
Setting a proxy
This requires a proxy IP.
import requests
proxies = {
'http': 'http://127.0.0.1:1087',
'https': 'http://127.0.0.1:1087',
}
r = requests.get('https://www.google.com', proxies=proxies)
print(r.status_code)
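If the proxy itself requires a username and password, they can be embedded in the proxy URL (host, port and credentials below are placeholders):
import requests

proxies = {
    'http': 'http://user:password@10.10.1.10:3128',
    'https': 'http://user:password@10.10.1.10:3128',
}
r = requests.get('http://httpbin.org/ip', proxies=proxies)
print(r.text)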
Timeout settings
When the network connection is poor, set a timeout; otherwise the program will sit there waiting pointlessly.
import requests
r = requests.get('https://www.baidu.com', timeout=0.01)
print(r.status_code)
A timeout raises an exception like the following:
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.01)
An unhandled exception aborts the program, so catch it and decide how to handle it yourself.
import requests
try:
    r = requests.get('https://www.baidu.com', timeout=0.01)
    print(r.status_code)
except requests.exceptions.ReadTimeout as e:
    print('Timed out')
print('The rest of the program keeps running...')
Output:
Timed out
The rest of the program keeps running...
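timeout can also be a (connect, read) tuple if you want separate limits for establishing the connection and for waiting on the response; catching the base requests.exceptions.Timeout covers both ConnectTimeout and ReadTimeout. A quick sketch:
import requests

try:
    # 3 seconds to connect, 10 seconds to wait for the response body
    r = requests.get('https://www.baidu.com', timeout=(3, 10))
    print(r.status_code)
except requests.exceptions.Timeout:
    print('Timed out')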
If you need to retry after a timeout, see the separate article on retrying requests in Python with the retrying library.
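Independently of that article, requests can also retry failed connections at the transport level by mounting an HTTPAdapter configured with urllib3's Retry; a minimal sketch:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

s = requests.Session()
# Retry up to 3 times with a short exponential backoff between attempts
adapter = HTTPAdapter(max_retries=Retry(total=3, backoff_factor=0.5))
s.mount('http://', adapter)
s.mount('https://', adapter)
r = s.get('https://www.baidu.com', timeout=5)
print(r.status_code)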
nginx authentication
When nginx is protected with a username and password (see the separate article on configuring password access for a site in nginx), you can send the credentials with the request.
import requests
from requests.auth import HTTPBasicAuth
r = requests.get('http://127.0.0.1:8001/', auth=HTTPBasicAuth('username', 'password'))
print(r.status_code)
# Returns 200 if authentication succeeds, 401 otherwise
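HTTPBasicAuth is the default auth class, so a plain (username, password) tuple works as shorthand:
import requests

# A tuple passed to auth= is shorthand for HTTPBasicAuth
r = requests.get('http://127.0.0.1:8001/', auth=('username', 'password'))
print(r.status_code)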