Weibo Crawler Series: Scraping Comments on a Post

Preface

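I never finished the Weibo crawler series I started earlier, so I am picking it up again. I recently looked into how to scrape articles from WeChat official accounts and wrote some code for that too; I may turn it into a blog post later.
This post covers how to scrape the comments under a specific Weibo post.
Planned contents of the series: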

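Scraping posts from Weibo hot content and ranking lists (Weibo Crawler Series: Scraping Ranking-List Posts)
Scraping posts by keyword and by user (Weibo Crawler Series: Scraping Posts by Keyword and by User)
Scraping post comments (Weibo Crawler Series: Scraping Comments on a Post)
Scraping Weibo user information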

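Comment scraping still goes through the web version of Weibo, https://weibo.cn, and still requires cookies. See Weibo Crawler Series: Scraping Posts by Keyword and by User for how to obtain them.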

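Here I pick an arbitrary post from 人民日报 (People's Daily) and scrape its comments. The first step is to find out how many comment pages the post has. Because of Weibo's anti-crawling measures, not every page can be fetched. To collect as many comments as possible, both the hot-comment pages and the default comment pages are crawled.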

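import re
from urllib import request

import requests

cookies = 'your cookies'
# First comment page of the target post; the code below assumes this URL
# contains 'comment' and 'page=1'
url = 'comment page URL of the post'
headers = {
    # get_random_ua() is a helper that returns a User-Agent string
    "user-agent": get_random_ua(),
    "Cookie": cookies,
}
page_res = requests.get(url, headers=headers)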
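
The headers above call get_random_ua(), a helper whose body this post does not give. A minimal sketch, assuming it only needs to return a random browser User-Agent string; the USER_AGENTS list and its entries are illustrative placeholders, not the list actually used for this crawler:

```python
import random

# Hypothetical pool of User-Agent strings; swap in the ones you want to use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def get_random_ua():
    """Return one User-Agent string chosen at random from USER_AGENTS."""
    return random.choice(USER_AGENTS)
```

Picking a different string per request makes the traffic look a little less uniform, but it does not replace valid cookies or pauses between requests.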

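The page source contains both the "查看更多热门" (view more hot comments) link and the text that states how many comment pages there are, so both can be matched with regular expressions: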

page_urls = [url]  # page 1 of the default ordering
hot_rank = re.search(r'查看更多热门', page_res.text)
all_page = re.search(r'/>&nbsp;1/(\d+)页</div>', page_res.text)

if all_page:
    # Total number of pages, capped at 50
    all_page = min(int(all_page.group(1)), 50)
    if hot_rank:
        # Hot-comment pages use the same URL with comment/hot in the path
        hot_url = url.replace('comment', 'comment/hot', 1)
        page_urls.append(hot_url)
        for page_num in range(2, all_page + 1):
            page_url = hot_url.replace('page=1', 'page={}'.format(page_num))
            page_urls.append(page_url)

    # Remaining pages of the default ordering
    for page_num in range(2, all_page + 1):
        page_url = url.replace('page=1', 'page={}'.format(page_num))
        page_urls.append(page_url)

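This collects as many comments as possible. It also produces duplicate comments, and it still cannot reach every comment. The best approach I have found so far is to crawl as much as possible and deduplicate afterwards.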
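
The deduplication step can be sketched as follows. This assumes each scraped comment is stored as a dict with a comment_id key holding the id attribute of the comment's div, and that the same comment carries the same id on the hot pages and the default pages; the function name and the sample data are made up for illustration:

```python
def dedupe_comments(comments):
    """Keep the first occurrence of each comment id, preserving order."""
    seen = set()
    unique = []
    for item in comments:
        cid = item["comment_id"]
        if cid not in seen:
            seen.add(cid)
            unique.append(item)
    return unique


# Made-up sample: one comment was seen on both a hot page and a default page.
sample_comments = [
    {"comment_id": "C_1", "content": "a"},
    {"comment_id": "C_2", "content": "b"},
    {"comment_id": "C_1", "content": "a"},
]
unique_comments = dedupe_comments(sample_comments)
print(len(sample_comments), "->", len(unique_comments))
```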

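With all page URLs collected, the next step is to scrape the comments on each page. Inspecting the page source shows that every comment is wrapped in a <div class="c" id=""></div> element, where the id is specific to the comment.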

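Loop over the URLs collected above to fetch each comment page. The example below uses the first page and parses it with lxml.etree to locate the comments.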

from lxml import etree
res = requests.get(url, headers=headers)
tree_node = etree.HTML(res.text.encode('utf-8'))
# Comments are the <div class="c"> elements whose id contains "C_"
comment_nodes = tree_node.xpath('//div[@class="c" and contains(@id,"C_")]')

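This locates the comment section. The returned comment_nodes is a list in which each element is the node of one comment. The contents of a single comment can be read from the page source.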

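Suppose we only want the commenter's user id, the commenter's user name, the comment id, the comment text, and the like count. Other fields, such as the comment source (in the example comment this is "来自网页", i.e. posted from the web), can be extracted the same way.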

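comment_items = []
for comment_node in comment_nodes:
    # Commenter's user id
    comment_user_id = None
    comment_user_url = comment_node.xpath('.//a[contains(@href,"/u/")]/@href')
    if comment_user_url:
        comment_user_id = re.search(r'/u/(\d+)', comment_user_url[0]).group(1)
    else:
        # Some profile links have no /u/ prefix, and some ids are not numeric
        comment_user_url = comment_node.xpath('.//a[contains(@href,"/")]/@href')
        if comment_user_url:
            comment_user_id = re.search(r'/(.*)', comment_user_url[0]).group(1)

    # Commenter's user name
    comment_user_name = str(comment_node.xpath('./a/text()')[0])
    # Comment id
    comment_id = str(comment_node.xpath('./@id')[0])
    # Comment text; extract_comment_content() is a helper that turns the
    # serialized comment node into plain text
    content = extract_comment_content(etree.tostring(comment_node, encoding='unicode'))
    # Like count, read from the "赞[N]" link
    like_num = comment_node.xpath('.//a[contains(text(),"赞[")]/text()')[-1]

    comment_item = {
        'user_id': comment_user_id,
        'user_name': comment_user_name,
        'comment_id': comment_id,
        'content': content,
        'like_num': int(re.search(r'\d+', like_num).group()),
    }
    comment_items.append(comment_item)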
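
The code above relies on extract_comment_content(), which this post does not define. One possible sketch, assuming the comment text sits between the class="ctt" span and the "举报" (report) link, which is the same boundary the picture-extraction code further down splits on; the sample HTML is made up and only imitates that layout:

```python
import html
import re


def extract_comment_content(comment_html):
    """Return the plain text of a comment from its serialized HTML."""
    if 'class="ctt">' not in comment_html:
        return ""
    # Keep what lies between the content span and the "举报" (report) link.
    body = comment_html.split('class="ctt">', maxsplit=1)[1]
    body = body.split("举报", maxsplit=1)[0]
    # Drop tags, then decode entities such as &amp; and &nbsp;.
    text = html.unescape(re.sub(r"<[^>]*>", "", body))
    return text.replace("\xa0", " ").strip()


# Made-up sample imitating the layout assumed above.
sample_html = (
    '<div class="c" id="C_1"><a href="/u/1">name</a>:'
    '<span class="ctt">hello &amp; hi<a href="/pic">评论配图</a></span>&nbsp;'
    '<a href="/spam">举报</a></div>'
)
sample_text = extract_comment_content(sample_html)
```

Link texts such as "评论配图" are kept in the result on purpose, so a later check for that marker still works.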

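Some comments come with an attached picture, which can be downloaded as well. When a comment has a picture, its text contains the words "评论配图" (comment picture):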

import os

# Save a picture to disk
def savePics(picUrl, filename, path):
    headers = {
        "user-agent": get_random_ua(),
    }

    # Create the target directory if it does not exist
    os.makedirs(path, exist_ok=True)
    # File extension: last URL segment, without the query string
    suffix = picUrl.split('/')[-1].split('?')[0].split('.')[-1]
    pic_path = os.path.join(path, '{}.{}'.format(filename, suffix))
    req = request.Request(picUrl, headers=headers)
    data = request.urlopen(req, timeout=30).read()
    with open(pic_path, 'wb') as f:
        f.write(data)

# Run this for each comment, i.e. for each comment_node in comment_nodes
if '评论配图' in content:
    # HTML between the start of the comment text and the "举报" (report) link
    comment_pic = etree.tostring(comment_node, encoding='unicode').split('class="ctt">', maxsplit=1)[1].split('举报', maxsplit=1)[0]
    pic_match = re.search(r'<a href="([^"]*)">评论配图', comment_pic)
    if pic_match:
        # tostring() escapes "&" inside attribute values as "&amp;"
        comment_pic_url = pic_match.group(1).replace('&amp;', '&')
        savePics(comment_pic_url, filename='filename', path='your_path')
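
savePics derives the file extension with chained split calls. The same step can be sketched with urllib.parse, assuming the picture link ends in a file name such as abc.jpg; the function name split_pic_url and the example.com URLs are made up for illustration:

```python
import posixpath
from urllib.parse import urlsplit


def split_pic_url(pic_url):
    """Return (name, suffix) of the file that a picture URL points to."""
    path = urlsplit(pic_url).path            # drops the query string and fragment
    base = posixpath.basename(path)          # last path segment
    name, ext = posixpath.splitext(base)
    return name, ext.lstrip(".")


name_and_suffix = split_pic_url("https://example.com/large/abc123.jpg?x=1")
no_suffix = split_pic_url("https://example.com/pic/abc123")
```

Unlike split('.')[-1], this returns an empty suffix when the URL has no extension, so the caller can fall back to a default such as jpg.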

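That wraps up scraping the comments of a Weibo post. Fill in the rest of the code according to your own needs. Some pages may fail to load during the crawl; use conditional checks to skip them, and pause briefly between requests, otherwise the account may get banned. Next I will write the last post in this Weibo crawler series.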
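
The skip-and-pause advice can be sketched as a small loop. This is only an illustration: crawl_pages, its parameters, and the 1 to 3 second pause range are assumptions, and fetch stands for any callable that takes a URL and returns the page text or raises on failure:

```python
import random
import time


def crawl_pages(page_urls, fetch, min_pause=1.0, max_pause=3.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests and skipping failures."""
    pages = {}
    failed = []
    for page_url in page_urls:
        try:
            pages[page_url] = fetch(page_url)
        except Exception as exc:
            # Skip pages that cannot be fetched, but remember them for a retry.
            failed.append((page_url, repr(exc)))
        # Random pause so requests are not sent back to back.
        sleep(random.uniform(min_pause, max_pause))
    return pages, failed
```

With the variables from this post, fetch could be lambda u: requests.get(u, headers=headers, timeout=30).text, and the failed list can be crawled again later.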