Scraping the 3dmgame site with requests; the code is as follows:
import requests

url = "https://www.3dmgame.com/"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
}
# Send the request with a browser User-Agent and save the page to a local file.
r = requests.get(url, headers=headers)
with open("3dGame.html", "w", encoding="UTF-8") as fp:
    fp.write(r.text)
The saved page came out garbled (mojibake).
Seeing the garbled text, my first thought was that it had to be an encoding/decoding problem, so I added print(r.text) right below r = requests.get(url) to check.
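For reference, a minimal diagnostic sketch of that check (printing r.encoding as well is my own addition here, to show which charset requests guessed from the HTTP headers):

import requests

r = requests.get("https://www.3dmgame.com/")
print(r.encoding)    # the charset requests guessed from the HTTP headers
print(r.text[:200])  # a slice of the decoded body; mojibake if the guess is wrong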
The printed output confirmed it: the body was being decoded with the wrong charset and had to be set by hand, so I explicitly specified utf-8 as the character encoding after the request.
After manually specifying the character encoding, the saved page displayed correctly!
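Putting it together, the corrected script looks roughly like this; the only change from the first version is the r.encoding assignment before r.text is read:

import requests

url = "https://www.3dmgame.com/"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
}
r = requests.get(url, headers=headers)
# Override the charset guessed from the HTTP headers so the body is decoded as utf-8.
r.encoding = "utf-8"
with open("3dGame.html", "w", encoding="UTF-8") as fp:
    fp.write(r.text)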
Summary
When scraping, if the HTML obtained through the text attribute comes out garbled, consider setting r.encoding = "utf-8" (or whatever charset the page actually uses) before reading text.
The docstring of the text property in the requests source reads:
Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using chardet.
The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.
In other words, the encoding of the response content is determined solely from the HTTP headers (for a text/* response with no declared charset, requests falls back to ISO-8859-1), so HTML read through the text attribute can easily come out garbled. Setting the correct encoding by hand before accessing text gives us the content we actually want.
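If the target site's charset is not known in advance, a more general sketch (going beyond the fix above) is to let requests detect it from the response body via apparent_encoding, which is backed by chardet as the docstring mentions:

import requests

r = requests.get("https://www.3dmgame.com/")
# With no charset in the Content-Type header, requests falls back to ISO-8859-1;
# replace that guess with the encoding detected from the response body.
if r.encoding is None or r.encoding.lower() == "iso-8859-1":
    r.encoding = r.apparent_encoding
print(r.encoding)
html = r.text  # now decoded with the detected charset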