A few days ago I came across an article on a WeChat public account that scraped reviews of Wolf Warrior (《战狼》), so today I decided to try it myself.
The film I chose to scrape is 《杀破狼》.
Open the film's short-review page and look at the page source: each review sits in a div with the class comment-item.
That gives us what we want, but only for the first page, and there are more than six thousand reviews in total. How do we get the rest? Scroll to the bottom and click 后页 (next page), and the address bar changes to a URL of the form https://movie.douban.com/subject/26826398/comments?start=20&limit=20&sort=new_score&status=P.
So limit is the number of records per page and start is the offset of the first record on that page. Knowing this, we can generate the URLs for every page:
url_list = ['https://movie.douban.com/subject/26826398/comments?'
            'start={}&limit=20&sort=new_score&status=P'.format(x) for x in range(0, 6317, 20)]
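A quick sanity check of the list (not in the original article, but the numbers follow directly from range(0, 6317, 20)):

print(len(url_list))   # 316 pages in total
print(url_list[0])     # ...?start=0&limit=20&sort=new_score&status=P
print(url_list[-1])    # ...?start=6300&limit=20&sort=new_score&status=P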
The scraping itself is just a matter of using bs4 (BeautifulSoup) to pull out what we want:
response = requests.get(url=url, headers=header)
response.encoding = 'utf-8'
html = BeautifulSoup(response.text, 'html.parser')
comment_items = html.select('div.comment-item')  # one div per review
for item in comment_items:
    comment = item.find('p')  # the review text is inside a <p> tag
The scraped text is then written to a txt file so it can be analysed afterwards.
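The write step is just a file handle wrapped around that loop; this mirrors the get_comments() function in the full script at the end:

with open('/Users/mocokoo/Documents/shapolang.txt', mode='w', encoding='utf-8') as f:
    for item in comment_items:
        comment = item.find('p')
        f.write(comment.get_text().strip() + '\n')  # one review per line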
For the analysis, first find a stop-word list online, then use jieba to extract keywords. The code is as follows (I also referred to 罗罗攀's article: http://www.jianshu.com/p/b277199346ae):
import jieba.analyse as analyse

def fenci():
    path = '/Users/mocokoo/Documents/shapolang.txt'
    with open(path, mode='r', encoding='utf-8') as f:
        content = f.read()
        analyse.set_stop_words('/Users/mocokoo/Documents/tycibiao.txt')
        # extract_tags with withWeight=True returns (keyword, TF-IDF weight) pairs;
        # multiplying the weight by 1000 gives an integer size for the word cloud
        tags = analyse.extract_tags(content, topK=100, withWeight=True)
        for item in tags:
            print(item[0] + '\t' + str(int(item[1] * 1000)))
Finally, use this site to turn the output into a word cloud: https://wordart.com/create
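If you would rather stay in Python, the third-party wordcloud package can render the same keyword weights directly. This is a minimal sketch of my own, not part of the original workflow; the font path is an assumption (any CJK font will do):

from wordcloud import WordCloud
import jieba.analyse as analyse

with open('/Users/mocokoo/Documents/shapolang.txt', encoding='utf-8') as f:
    content = f.read()
analyse.set_stop_words('/Users/mocokoo/Documents/tycibiao.txt')
tags = analyse.extract_tags(content, topK=100, withWeight=True)

wc = WordCloud(font_path='/System/Library/Fonts/PingFang.ttc',  # assumed macOS CJK font
               width=800, height=600, background_color='white')
wc.generate_from_frequencies(dict(tags))   # tags is a list of (word, weight) pairs
wc.to_file('shapolang_wordcloud.png')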
To wrap up, here is the complete code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import jieba.analyse as analyse
header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

url_list = ['https://movie.douban.com/subject/26826398/comments?'
            'start={}&limit=20&sort=new_score&status=P'.format(x) for x in range(0, 6317, 20)]
# Scrape every short review and write it to a file
def get_comments():
    with open(file='/Users/mocokoo/Documents/shapolang.txt', mode='w', encoding='utf-8') as f:
        i = 1
        for url in url_list:
            print('Scraping 杀破狼 reviews, page %d' % i)
            response = requests.get(url=url, headers=header)
            response.encoding = 'utf-8'
            html = BeautifulSoup(response.text, 'html.parser')
            comment_items = html.select('div.comment-item')
            for item in comment_items:
                comment = item.find('p')
                f.write(comment.get_text().strip() + '\n')
            print('Page %d done' % i)
            i += 1
# Word segmentation and keyword extraction
def fenci():
    path = '/Users/mocokoo/Documents/shapolang.txt'
    with open(path, mode='r', encoding='utf-8') as f:
        content = f.read()
        analyse.set_stop_words('/Users/mocokoo/Documents/tycibiao.txt')
        tags = analyse.extract_tags(content, topK=100, withWeight=True)
        for item in tags:
            print(item[0] + '\t' + str(int(item[1] * 1000)))
if __name__ == '__main__':
    get_comments()  # write the reviews to the file
    # fenci()
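One caveat: requesting 316 pages in a tight loop may get rate-limited or blocked by Douban. The script above doesn't guard against this; a variation of get_comments() (my own addition, not part of the original) pauses between pages and stops on a non-200 response:

import time

def get_comments_politely():
    # Same as get_comments(), but pauses between pages and checks the HTTP status
    with open(file='/Users/mocokoo/Documents/shapolang.txt', mode='w', encoding='utf-8') as f:
        for i, url in enumerate(url_list, start=1):
            response = requests.get(url=url, headers=header)
            if response.status_code != 200:
                print('Got HTTP %d on page %d, stopping' % (response.status_code, i))
                break
            response.encoding = 'utf-8'
            html = BeautifulSoup(response.text, 'html.parser')
            for item in html.select('div.comment-item'):
                comment = item.find('p')
                f.write(comment.get_text().strip() + '\n')
            print('Page %d done' % i)
            time.sleep(1)  # roughly one request per second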