30. Simulated Login in Practice - Scraping Weibo Data

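Target URL: http://m.weibo.cn → Search → Weibo hot search list
Data collected: contents of the hot search list
Method: JSON data
Storage: txt file, with the results presented as a word cloud.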

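The goal is to scrape the contents of the Weibo hot search list. First log in to the web version of Weibo at http://m.weibo.cn. After logging in, click the "Search" icon in the upper right, then choose "Weibo Hot Search" (微博热搜) to open the hot search list.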


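Weibo currently loads this page via Ajax. With Chrome's developer tools open, the address of the data request can be read from its Request URL. To reproduce the request, it is enough to add the User-Agent and Cookie request headers.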


The Preview tab shows the structure of the returned data clearly.
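As a rough illustration of that structure, the sketch below walks the nesting that the script in this section relies on (`data` → `cards` → `card_group` → `desc`). The `SAMPLE` payload and the `extract_descs` helper are hypothetical and heavily trimmed; the real response carries many more fields, and its exact layout may differ.

```python
import json

# Hypothetical, trimmed-down payload mirroring the nesting used in this section;
# the real response contains many more fields per entry.
SAMPLE = '''
{"data": {"cards": [
    {"card_group": [{"desc": "topic A"}, {"desc": "topic B"}]},
    {"card_group": [{"desc": "topic C"}, {"pic": "no desc here"}]}
]}}
'''

def extract_descs(payload):
    """Collect the 'desc' text of every entry in every card's card_group."""
    descs = []
    for card in payload.get("data", {}).get("cards", []):
        for item in card.get("card_group", []):
            desc = item.get("desc")
            if desc:  # skip entries that carry no description
                descs.append(desc)
    return descs

descs = extract_descs(json.loads(SAMPLE))
print(descs)
```

Using `.get()` with defaults means a missing key yields an empty list instead of a `KeyError`, which can help when the response layout changes or the cookie has expired.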


The code is:

import requests
import json

url = "https://m.weibo.cn/api/container/getIndex?containerid=106003type%253D25%2526t%253D3%2526disable_hot%253D1%2526filter_type%253Drealtimehot&title=%25E5%25BE%25AE%25E5%258D%259A%25E7%2583%25AD%25E6%2590%259C&hidemenu=1&extparam=filter_type%3Drealtimehot%26mi_cid%3D%26pos%3D9%26c_type%3D30%26source%3Dranklist%26flag%3D1%26display_time%3D1519704766&luicode=10000011&lfid=106003type%3D1&featurecode=20000320"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36',
           'Cookie':'xxxx'}
r = requests.get(url, headers=headers)

json_data = json.loads(r.text)
hot_groups = json_data['data']['cards'][0]['card_group']    # hot search terms
realtime_groups = json_data['data']['cards'][1]['card_group']    # rising real-time topics
print(len(hot_groups), len(realtime_groups))

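# Append the description of every entry to the text file, one per line.
# The encoding is set explicitly so the file is read back the same way.
with open("F:/weibo.txt", 'a+', encoding='utf-8') as f:
    for hot_group in hot_groups:
        f.write(hot_group['desc'] + "\n")

    for realtime_group in realtime_groups:
        f.write(realtime_group['desc'] + "\n")

Keywords are then extracted with jieba. Note that extract_tags ranks words by TF-IDF weight rather than by raw frequency. The code is as follows:

from jieba import analyse

with open("F:/weibo.txt", 'r', encoding='utf-8') as f:
    sentence = f.read()

# Load the stop-word list; these words are excluded from the statistics.
analyse.set_stop_words("F:/中文停用词表.txt")
tags = analyse.extract_tags(sentence, topK=100, withWeight=True)
for word, weight in tags:
    # Scale the weight to an integer for easier reading.
    print(word, int(weight * 1000))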

Part of the printed output:

四六级 169
艺考 113
张杰 109
章丘 102
女友 88
苹果 71
女孩 70
猫咪 65
夏清 65
爸爸 64
...
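The numbers above are scaled TF-IDF weights from jieba's `extract_tags`, not occurrence counts. For comparison, a plain frequency count could be sketched as follows. The token list and stop-word set here are hypothetical; in practice the tokens might come from `jieba.lcut(sentence)`.

```python
from collections import Counter

# Hypothetical, already-segmented tokens; in practice they could come from
# jieba.lcut(sentence) applied to the text read from the file.
tokens = ["四六级", "成绩", "艺考", "四六级", "的", "苹果", "艺考", "四六级"]
stop_words = {"的"}  # hypothetical stop-word set

# Count every token that is not a stop word.
counts = Counter(t for t in tokens if t not in stop_words)
for word, n in counts.most_common(3):
    print(word, n)
```

Raw counts favor whatever appears most often, while TF-IDF additionally down-weights words that are common in general text, so the two rankings need not agree.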

The resulting word cloud:

[Image: word cloud of the hot search terms]
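The source does not say how the word cloud was produced. One possible approach, assuming the third-party `wordcloud` package is installed, is to feed the word weights to `WordCloud.generate_from_frequencies`. The helper names `parse_weights` and `render`, the font path, and the output path are all placeholders introduced for this sketch.

```python
def parse_weights(lines):
    """Turn 'word weight' lines (as printed earlier) into a {word: weight} dict."""
    weights = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            weights[parts[0]] = int(parts[1])
    return weights

def render(weights, font_path, out_path):
    """Draw a word cloud image from a {word: weight} dict."""
    # Assumption: the third-party `wordcloud` package is available.
    from wordcloud import WordCloud
    # A font containing CJK glyphs is needed, otherwise Chinese words
    # are drawn as empty boxes.
    wc = WordCloud(font_path=font_path, width=800, height=600,
                   background_color="white")
    wc.generate_from_frequencies(weights)
    wc.to_file(out_path)

printed = ["四六级 169", "艺考 113", "张杰 109", "..."]
weights = parse_weights(printed)

# Example call (both paths are placeholders):
# render(weights, "path/to/cjk-font.ttf", "weibo_wordcloud.png")
```

The `(word, weight)` pairs returned by `extract_tags` could also be passed directly as `dict(tags)`, which avoids parsing printed text.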
