Fixing garbled text when scraping web pages with Python

I'm planting a flag here: time to sit down and solve this problem properly.

For example, when you fetch a page with the requests library, the text often comes out garbled, especially on larger sites such as Baidu or Sina News.

# coding: utf-8
import requests
# import urllib.request

# Note on the garbled output: running this file produces garbled text,
# but running Html_download2 does not.
# Seriously.
class HtmlDownload(object):
    def download(self, url):
        if url is None:
            return None
        response = requests.get(url)
        if response.status_code != 200:
            return None
        # The full HTML of the page is now available.

        print(response.encoding)
        # response.encoding = 'utf8'
        # print(response.encoding)

        return response.text


hd = HtmlDownload()
url = 'https://baike.baidu.com/'
html_content = hd.download(url)
print(html_content)

Printing the fetched page produces garbled text.

Why does this happen? When I was first learning Python, encoding problems gave me a lasting dread of anything encoding-related.

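After reading the requests source code, I more or less found the cause. When the response headers do not declare a charset, requests cannot know the page's encoding and has to fall back on a guess. For text responses that fallback is ISO-8859-1, and response.text is decoded with it, so a page that is really UTF-8 comes out garbled.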
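The mechanism is easy to reproduce without any network access. The sketch below is a minimal illustration using only built-in string methods; the sample string is an arbitrary assumption, not text taken from the Baidu page. It encodes some Chinese text as UTF-8, decodes the bytes as ISO-8859-1 to produce the same kind of garbage, then reverses the mistake.

```python
# Sample text chosen only for illustration (an assumption, not from the page).
original = "百度百科"

# What the server sends: the text encoded as UTF-8 bytes.
raw_bytes = original.encode("utf-8")

# What happens when those bytes are decoded with the wrong codec.
# ISO-8859-1 maps every byte to a character, so this never raises;
# it silently produces garbage instead.
garbled = raw_bytes.decode("iso-8859-1")

# The damage is reversible as long as no bytes were lost:
# re-encode with the wrong codec, then decode with the right one.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(recovered)
```

Because the wrong decode never raises an error, the only symptom is the garbled output itself.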

# coding: utf-8
import requests
# import urllib.request

class HtmlDownload(object):
    def download(self, url):
        if url is None:
            return None
        response = requests.get(url)
        if response.status_code != 200:
            return None
        # The full HTML of the page is now available.
        print("ok")
        print(">>test")
        # Print both encodings that requests reports for this response.
        # Encoding derived from the HTTP response headers
        print('encoding:', response.encoding)
        # Encoding guessed from the bytes of the response body
        print('apparent_encoding:', response.apparent_encoding)
        return response.text

hd = HtmlDownload()
url = 'https://baike.baidu.com/'
html_content = hd.download(url)
# print(html_content)

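Running

print('encoding:', response.encoding)
print('apparent_encoding:', response.apparent_encoding)

gives ISO-8859-1 for the first and utf-8 for the second, and that mismatch is the problem: the page is UTF-8 but is being decoded as ISO-8859-1.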
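Where the header-derived value comes from can be sketched with the standard library alone. The helper below, `declared_encoding`, is a hypothetical and simplified model of that choice (the real implementation in requests differs in detail): use the charset from the Content-Type value if there is one, otherwise fall back to ISO-8859-1 for text responses.

```python
from email.message import Message


def declared_encoding(content_type):
    """Simplified model of choosing an encoding from a Content-Type value.

    Hypothetical helper for illustration; not the code requests itself runs.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()
    if charset is not None:
        # The server declared a charset explicitly, so use it.
        return charset
    if msg.get_content_maintype() == "text":
        # Text response with no charset: fall back to ISO-8859-1.
        return "ISO-8859-1"
    # Non-text response with no charset: nothing can be inferred.
    return None


print(declared_encoding("text/html; charset=utf-8"))
print(declared_encoding("text/html"))
print(declared_encoding("image/png"))
```

A server that sends a bare `text/html` for a UTF-8 page therefore ends up labelled ISO-8859-1, while detection from the body still reports utf-8.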

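So the fix is simple: set the response's encoding to UTF-8 explicitly.

Insert the following line after the request is made and before response.text is read:

response.encoding = 'utf8'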
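Hard-coding UTF-8 works for this site, but it would garble a page that really is in another encoding such as GBK. A slightly more general variant, sketched below on the assumption that the detected encoding can be trusted, overrides the header-derived value only when it is missing or is the ISO-8859-1 fallback. `pick_encoding` is a hypothetical helper, not part of requests.

```python
def pick_encoding(declared, detected, default="utf-8"):
    """Choose the encoding to decode a page with.

    declared: encoding taken from the HTTP headers (may be None).
    detected: encoding guessed from the body (may be None).
    Hypothetical helper for illustration only.
    """
    fallback_names = {"iso-8859-1", "latin-1", "latin1"}
    if declared and declared.lower() not in fallback_names:
        # The server named a specific charset; trust it.
        return declared
    # Header missing or only the default fallback: prefer the detected value.
    return detected or default


print(pick_encoding("ISO-8859-1", "utf-8"))
print(pick_encoding("gbk", "utf-8"))
print(pick_encoding(None, None))
```

With requests this could be applied as `response.encoding = pick_encoding(response.encoding, response.apparent_encoding)` before reading `response.text`. The trade-off is that a page genuinely served as ISO-8859-1 is then decoded with whatever the detection reports.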

Fetching the page again now prints readable text. Problem solved, woohoo!

Full source:

# coding: utf-8
import requests
# import urllib.request

# Note on the garbled output: running this file produces garbled text,
# but running Html_download2 does not.
# Seriously.
class HtmlDownload(object):
    def download(self, url):
        if url is None:
            return None
        response = requests.get(url)
        if response.status_code != 200:
            return None
        # The full HTML of the page is now available.

        print("ok")
        print(">>test")
        print('encoding:', response.encoding)
        print('apparent_encoding:', response.apparent_encoding)
        # Override the header-derived encoding before reading response.text.
        response.encoding = 'utf8'
        print('encoding :', response.encoding)
        return response.text


hd = HtmlDownload()
url = 'https://baike.baidu.com/'
html_content = hd.download(url)
print(html_content)

References on character encoding:
http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html
http://blog.chinaunix.net/uid-13869856-id-5747417.html
http://blog.csdn.net/wyb199026/article/details/52562538
