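1 Detecting the encoding
Fetch the page with
response = requests.get(url)
and then use chardet to detect the encoding of the returned bytes:
chardet.detect(response.content)
For the page that came out garbled for me, response.encoding returned ISO-8859-1,
while chardet.detect(response.content) returned {'confidence': 0.99, 'encoding': 'GB2312'}.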
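The mismatch can be reproduced without any network access. requests falls back to ISO-8859-1 when a text response declares no charset in its Content-Type header, which is presumably what happened with this page. The sketch below uses only the standard library; the sample string and the variable names are illustrative and do not come from the original page.

```python
# Illustrative sample text; it stands in for the body of a GB2312 page.
text = "中文网页"
raw = text.encode("gb2312")        # the bytes the server actually sends

wrong = raw.decode("iso-8859-1")   # never raises, but yields mojibake
right = raw.decode("gb2312")       # the encoding chardet reported

print(repr(wrong))
print(right)
```

ISO-8859-1 maps every byte to exactly one character, so decoding with it never fails. That is why the garbling shows up silently instead of as an exception.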
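2 Converting the bytes to str and writing them to a file
Next, use the detected encoding to convert the response content to str:
f.write(response.content.decode("GB2312","ignore"))
Without the "ignore" argument, decode() raises a UnicodeDecodeError as soon as the page contains bytes that are not valid in that encoding. With "ignore", those bytes are silently dropped.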
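The effect of the error-handling argument can be seen on a small made-up byte string. This is a minimal sketch using only the standard library; the stray 0xFF byte is an assumption chosen to simulate a byte that GB2312 cannot decode.

```python
# A stray 0xFF byte is not valid in GB2312; the sample bytes are made up.
raw = "中".encode("gb2312") + b"\xff" + "文".encode("gb2312")

try:
    raw.decode("gb2312")               # "strict" is the default handler
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict:", exc.reason)

ignored = raw.decode("gb2312", "ignore")     # drops the bad byte silently
replaced = raw.decode("gb2312", "replace")   # marks it with U+FFFD

print(ignored)
print(replaced)
```

"replace" can be handy while debugging because it shows where data was lost. If the failures come from characters outside GB2312, decoding with the superset codec "gb18030" may recover them without dropping anything.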
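3 Specifying the encoding when reading
BeautifulSoup only applies from_encoding to bytes input, so open the file in binary mode. The value should name the encoding the file was saved in.
soup=BeautifulSoup(open("test.html","rb"),"lxml",from_encoding="GB2312")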
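The same principle applies to plain file I/O: the encoding used for reading has to match the one used for writing. The following sketch uses only the standard library and a temporary file; the sample HTML and the variable names are illustrative assumptions.

```python
import os
import tempfile

text = "<html><body>中文</body></html>"   # illustrative content

# Save the page as GB2312 bytes in a temporary file.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w", encoding="gb2312") as f:
    f.write(text)

# Reading with the matching encoding returns the original text.
with open(path, encoding="gb2312") as f:
    read_back = f.read()

# Reading with a mismatched encoding fails here; in other cases it
# could instead succeed and produce garbled text.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True

os.remove(path)
print(read_back == text, mismatch_failed)
```

Passing encoding explicitly to open() also avoids depending on the platform default, which differs between systems.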
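4 Complete example
import chardet
import requests
from bs4 import BeautifulSoup

p = print  # short alias used throughout

def saveUrl(url, fileName):
    response = requests.get(url)
    p("status_code:" + str(response.status_code))
    p("encoding:" + str(response.encoding))
    detected = chardet.detect(response.content)
    p("chardet result:")
    p(detected)
    # Fall back to utf-8 if chardet cannot identify the encoding
    encoding = detected["encoding"] or "utf-8"
    # Write with the same encoding so the file can be read back with it
    with open(fileName + ".html", "w", encoding=encoding) as f:
        f.write(response.content.decode(encoding, "ignore"))
    return encoding

def mainF(fileName, encoding):
    # from_encoding comes from chardet.detect(response.content);
    # it only applies to bytes, so the file is opened in binary mode
    with open(fileName + ".html", "rb") as f:
        soup = BeautifulSoup(f, "lxml", from_encoding=encoding)
    p(soup.prettify())

url = "enter the target URL"      # placeholder: replace before running
fileName = "desired file name"    # placeholder: saved as <fileName>.html
encoding = saveUrl(url, fileName)
mainF(fileName, encoding)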