A Summary of Problems Encountered While Web Scraping

1. Chinese character encoding problems in the fetched HTML

import requests
from bs4 import BeautifulSoup

newsurl = "http://news.sina.com.cn/china"
res = requests.get(newsurl)
soup = BeautifulSoup(res.text, "lxml")       # res.text uses the encoding requests guessed
news_item = soup.select(".news-item")
print(news_item[0].select("h2")[0].text)

Result:

����������� �止�

Solution: re-encode `res.text` with the encoding requests guessed, then decode the bytes as UTF-8:

import requests
from bs4 import BeautifulSoup

newsurl = "http://news.sina.com.cn/china"
res = requests.get(newsurl)
soup = BeautifulSoup(res.text.encode(res.encoding).decode('utf-8'), "lxml")  # re-encode with the guessed encoding, decode as UTF-8
news_item = soup.select(".news-item")
print(news_item[0].select("h2")[0].text)

Result:

半月谈:政务公开渠道多干货少 各地无统一标准
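The mojibake arises because, when the server's Content-Type header does not declare a charset, requests falls back to ISO-8859-1 even though the page body is actually UTF-8. The round trip performed by the fix can be reproduced without any network access (using the headline word 半月谈 as a sample string):

```python
# The server sends UTF-8 bytes, but requests may guess ISO-8859-1 from the
# headers; res.text is then those UTF-8 bytes mis-decoded as Latin-1.
raw = "半月谈".encode("utf-8")                      # what the server actually sends
mojibake = raw.decode("ISO-8859-1")                 # what res.text looks like
fixed = mojibake.encode("ISO-8859-1").decode("utf-8")  # the fix's round trip
print(fixed)  # 半月谈
```

An equivalent one-line fix is to set `res.encoding = 'utf-8'` (or `res.encoding = res.apparent_encoding`) before reading `res.text`.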

2. The scraper raises errors after running for a long time

urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))

Solution 1: set a User-Agent request header so the scraper looks like a normal browser:

headers = requests.utils.default_headers()
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
#headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36'
r = requests.get('https://academic.oup.com/journals', headers=headers)
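Building on this, a long-running scraper can pick a different User-Agent for each request so its traffic looks less uniform. A small sketch, reusing the two UA strings from the snippet above (any browser-like strings would do):

```python
import random

import requests

# Illustrative pool of User-Agent strings; extend with more as needed.
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36',
]

def get_with_random_ua(url):
    """Fetch url with a randomly chosen User-Agent header."""
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)
```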

Solution 2: switch to a different IP address (e.g. route requests through a proxy)
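A minimal sketch of the proxy approach using the `proxies` argument of requests; the proxy addresses below are placeholders, not real servers, so substitute proxies you actually control:

```python
import requests

# Hypothetical proxy endpoints -- replace with real, working proxies.
PROXIES = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

def get_via_proxy(url, proxies=PROXIES):
    """Fetch url through a proxy so the target site sees the proxy's IP
    instead of ours -- useful once our own IP starts getting reset."""
    return requests.get(url, proxies=proxies, timeout=10)
```

Rotating through a pool of such proxies spreads the requests across several source IPs.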
