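What requests does
Sends HTTP requests and returns the response data
requests documentation (Chinese)
Sending a GET request
Sending a request with headers
Sending a request with parameters
Sending a GET request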
[demo01] Fetch the Baidu home page
import requests
# Target URL
url = 'https://www.baidu.com'
# Send a GET request to the target URL
response = requests.get(url)
# Print the response body
print(response.text)
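Commonly used response attributes:
- response.text: the response body as str
- response.content: the response body as bytes
- response.status_code: the HTTP status code
- response.request.headers: the headers of the request that produced this response
- response.headers: the response headers
- response.request._cookies: the cookies sent with that request (note the leading underscore; the prepared request has no public cookies attribute)
- response.cookies: the cookies set by the response (after Set-Cookie has been applied)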
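As a small illustration of the attributes listed above, the sketch below gathers a few of them into a dict. The helper name summarize is hypothetical and not part of requests; it only reads attributes that a requests.Response object exposes.

```python
import requests


def summarize(response):
    """Collect a few commonly inspected attributes of a requests.Response.

    `summarize` is a hypothetical helper written for this example.
    """
    return {
        'status_code': response.status_code,                  # e.g. 200
        'encoding': response.encoding,                        # codec used by response.text
        'content_type': response.headers.get('Content-Type'),
        'size_bytes': len(response.content),                  # length of the raw body
    }
```

It could be called as summarize(requests.get(url, timeout=10)); passing a timeout is a good habit so that a stalled server does not block the script forever.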
Common ways to get the page source:
- response.content.decode() (decodes the bytes as UTF-8 by default)
- response.content.decode("GBK")
- response.text
[demo02] Save an image from the web
import requests
# URL of the image
url = 'https://www.baidu.com/img/bd_logo1.png'
# The response body is the image itself, as binary data
response = requests.get(url)
# print(response.content)
# Open the file for writing in binary mode
with open('baidu.png', 'wb') as f:
    # Write response.content, which is bytes
    f.write(response.content)
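Sending a request with headers
Why send headers:
To mimic a browser, so that the server returns the same content a browser would receive
Header format:
A dict
Usage:
requests.get(url, headers=headers)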
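To see what the headers argument changes, the sketch below prepares two requests without sending them and compares the User-Agent each would carry. The shortened User-Agent string is a placeholder chosen for this example; Session.prepare_request is used here only to inspect the merged headers offline.

```python
import requests

url = 'https://www.baidu.com'
# Placeholder value for illustration; a real browser string is much longer.
headers = {'User-Agent': 'Mozilla/5.0 (example browser string)'}

session = requests.Session()

# Prepare (but do not send) a request with the library defaults.
default_prepared = session.prepare_request(requests.Request('GET', url))

# Prepare the same request with a custom User-Agent.
custom_prepared = session.prepare_request(
    requests.Request('GET', url, headers=headers)
)

# The default identifies the client as python-requests/<version>.
print(default_prepared.headers['User-Agent'])
# The custom value replaces the default.
print(custom_prepared.headers['User-Agent'])
```

A server that filters on User-Agent can tell the first request apart from a browser, which is why the demos in this section pass a browser-like value.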
[demo03] Fetch the Baidu home page while mimicking a browser
# Fetch the Baidu home page
import requests
url = 'https://www.baidu.com'
# Include a User-Agent in the request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'}
response = requests.get(url, headers=headers)
# Print the headers that were sent with the request
print(response.request.headers)
Sending a request with parameters
Parameter format:
A dict
kw = {'wd': '长城'}
Usage:
requests.get(url, params=kw)
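The params dict ends up as the query string of the final URL. The sketch below builds the request without sending it, so the encoded URL can be inspected offline; the use of requests.Request(...).prepare() is a choice made for this example, not something the demos in this section rely on.

```python
import requests

kw = {'wd': '长城'}

# Build and prepare the request without sending it.
prepared = requests.Request('GET', 'https://www.baidu.com/s', params=kw).prepare()

# Non-ASCII values are UTF-8 encoded and then percent-escaped.
print(prepared.url)
```

After a real requests.get call, the same final URL is available as response.url.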
[demo04] Send a request with parameters
# Send a request with parameters
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'}
url = 'https://www.baidu.com/s?'
kw = {'wd': 'python'}
# Send the request with the parameters attached
response = requests.get(url, headers=headers, params=kw)
print(response.content)
[Exercise] Fetch the Sina home page and compare response.text with response.content.decode()
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'}
url = 'https://www.sina.com.cn/'
response = requests.get(url, headers=headers)
print(response.text)
print(response.content.decode())
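Result:
response.text comes back garbled
response.content.decode() comes back readable
Conclusion:
response.text decodes the body using response.encoding, which requests infers from the charset in the Content-Type response header. When a text response declares no charset, requests falls back to ISO-8859-1, which is consistent with the garbled output here. response.content.decode() decodes the raw bytes as UTF-8 by default, which matches the page's actual encoding.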
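The effect can be reproduced without any network access. The sketch below takes UTF-8 bytes (a made-up sample string, standing in for a response body) and decodes them once with the wrong codec and once with the right one.

```python
# Bytes as a server might send them: UTF-8 encoded Chinese text.
# The sample string is an assumption made for this example.
raw = "新浪首页".encode("utf-8")

# Decoding with the wrong codec does not raise, because every byte is a
# valid ISO-8859-1 character; it silently produces garbled text instead.
wrong = raw.decode("iso-8859-1")

# Decoding with the codec the bytes were written in recovers the text.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

With requests, the equivalent fix is to set response.encoding = 'utf-8' (or response.encoding = response.apparent_encoding) before reading response.text.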
[Exercise] Write a crawler for any Tieba forum and save the pages locally
import requests
class Tieba(object):
    def __init__(self, name, pn):
        self.name = name
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'
        }
        self.url = 'http://tieba.baidu.com/f?kw={}&pn='.format(self.name)
        # The pn parameter is an offset that advances by 50 per page
        self.url_list = [self.url + str(i * 50) for i in range(pn)]
    def get_data(self, url):
        # Fetch one page and return the raw bytes
        response = requests.get(url, headers=self.headers)
        return response.content
    def save_data(self, data, index):
        # Save the page as <name>_<index>.html
        filename = self.name + "_{}.html".format(index)
        with open(filename, 'wb') as f:
            f.write(data)
    def run(self):
        # Walk the URL list, keeping track of the page index
        for index, url in enumerate(self.url_list):
            # Send the request
            data = self.get_data(url)
            # Save the page
            self.save_data(data, index)
if __name__ == '__main__':
    name = input("Tieba name: ")
    pn = input("Number of pages: ")
    tieba = Tieba(name, int(pn))
    tieba.run()
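The paging logic in the crawler can be checked on its own, without sending any requests. The sketch below is one possible way to generate the page URLs; the function name page_urls and the page_size parameter are hypothetical, and urllib.parse.urlencode is swapped in for the string formatting used above so that the forum name is percent-encoded explicitly.

```python
from urllib.parse import urlencode


def page_urls(name, pages, page_size=50):
    """Return one list-page URL per page for the given forum name.

    `page_urls` and `page_size` are names invented for this sketch.
    """
    base = 'http://tieba.baidu.com/f'
    return [
        # pn is the offset of the first item on page i
        base + '?' + urlencode({'kw': name, 'pn': i * page_size})
        for i in range(pages)
    ]


for url in page_urls('python', 3):
    print(url)
```

Keeping URL generation in a pure function like this makes it easy to verify the offsets before running the crawler against the live site.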