In the previous Machine Learning post, I covered how to do some simple web scraping. This post covers processing the scraped data, i.e., Data Processing.
2. Data Processing
a. Chinese word frequency statistics and word cloud visualization
Tools: the jieba module for Chinese word segmentation (an excellent, simple, and open-source Chinese tokenizer), and the Python WordCloud module, which is full-featured and offers plenty of room for customization and presentation.
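If you have not used jieba before, the two calls that matter below are jieba.lcut (plain segmentation) and jieba.analyse.extract_tags (TF-IDF keyword extraction). A minimal sketch with a made-up sample sentence:

import jieba
import jieba.analyse

sample = '韦小宝是鹿鼎记里的主角'  # made-up sample sentence, just for illustration
print(jieba.lcut(sample))  # segment the sentence into a list of words
for word, weight in jieba.analyse.extract_tags(sample, topK=5, withWeight=True):
    print(word, weight)  # top keywords with their TF-IDF weights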
The relevant code is as follows:
import numpy as np
from PIL import Image  # scipy.misc.imread was removed in SciPy 1.2+, so load the mask with Pillow
import jieba
import jieba.analyse  # keyword extraction
from wordcloud import WordCloud
import matplotlib.pyplot as plt
file = open(r'./art/鹿鼎记.txt', encoding='utf-8', errors='ignore')
url = r'./art/stop_word.txt'
content = file.read()
file.close()
file_one = []
try:
    jieba.analyse.set_stop_words(url)  # drop common Chinese stop words
    tags = jieba.analyse.extract_tags(content, topK=160, withWeight=True)  # top 160 keywords with weights
    for tag in tags:
        file_one.append([tag[0], tag[1] * 1000])  # collect (word, scaled weight) pairs in file_one
        print(tag[0] + '\t' + str(tag[1] * 1000))  # show each keyword and its scaled weight
finally:
    print('OK')
patch = r'C:\Users\22109\Downloads\字体-方正兰亭黑体.ttf'  # a font that can render Chinese in the cloud
dictionary = dict(file_one)  # list -> dict, the type generate_from_frequencies expects
bg_pic = np.array(Image.open(r'./art/img.png'))  # mask image as an array
wc = WordCloud(
    # font
    font_path=patch,
    # background color
    background_color='white',
    # maximum number of words
    max_words=200,
    # word-cloud shape
    mask=bg_pic,
    # largest font size
    max_font_size=200,
)
wc.generate_from_frequencies(dictionary)  # build the cloud from the frequency dict
plt.figure()
plt.imshow(wc)  # display with matplotlib
plt.axis('off')
plt.show()
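To keep the rendered image rather than only displaying it, WordCloud can also write it to disk directly (the output path below is just an example):

wc.to_file('./art/cloud.png')  # save the cloud as a PNG; the path is illustrative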
The resulting word cloud:
(Figure: Figure_1.png, word cloud generated from 鹿鼎记)
b. Statistical visualization
b-1. Bar chart comparison
import jieba
import jieba.analyse
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
def file_read(url, n):
    matplotlib.rcParams['font.family'] = 'sans-serif'
    matplotlib.rcParams['font.sans-serif'] = [u'SimHei']  # a font that can render Chinese labels
    matplotlib.rcParams['font.size'] = '10'
    content = open(url, 'r').read()  # read from the url parameter (the original read a global by mistake)
    file_one = []
    try:
        jieba.analyse.set_stop_words(r'D:\My Documents\Downloads\jinyong\停用词.txt')  # stop-word list
        jieba.load_userdict(r'D:\My Documents\Downloads\jinyong\dict.txt')  # custom user dictionary
        tags = jieba.analyse.extract_tags(content, topK=120, withWeight=True)
        for i in range(len(tags)):
            file_one.append(tags[i])
    finally:
        print('OK')
    dictionary = pd.DataFrame(file_one).iloc[0:n, :]  # keep only the top n keywords
    return dictionary
width, n = 0.4, 21
links = r'D:\My Documents\Downloads\jinyong\诛仙.txt'
dictionary = file_read(links, n)
plt.bar(range(len(dictionary[0])), dictionary[1] * 100, width=width, color=['r', 'g', 'y'])
for i in range(len(dictionary)):
    plt.text(i - width / 4 * 3, dictionary[1][i] * 100, dictionary[0][i])  # label each bar with its keyword
plt.show()
The result (using 诛仙 as an example):
(Figure: 诛仙 word-frequency bar chart)
You can see that the protagonist halo is very strong. Fittingly, 张小凡 and 鬼厉 appear with almost equal frequency, which matches the plot of 诛仙 (Xiaofan's turn to the dark side). Interestingly, the data shows 陆雪琪 appearing more often than 碧瑶; anyone who hasn't read the novel would probably conclude from this that 陆雪琪 is the female lead. But if we scraped only the text from before 碧瑶's battle at 青云, 碧瑶 would presumably be far in the lead...
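That last guess is easy to sanity-check without re-scraping: split the text near the 青云之战 chapters and count each name on both sides. A minimal sketch, assuming content holds the full text of 诛仙 and using a hypothetical split marker:

split_at = content.find('青云之战')  # hypothetical chapter marker; adjust to the actual text
before, after = content[:split_at], content[split_at:]  # assumes the marker was found
for name in ('碧瑶', '陆雪琪'):
    print(name, before.count(name), after.count(name))  # raw occurrence counts in each half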
Comparing the wuxia novel 鹿鼎记 with 诛仙:
The code is as follows:
import jieba
import jieba.analyse
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
def file_read(url, n, width):
    matplotlib.rcParams['font.family'] = 'sans-serif'
    matplotlib.rcParams['font.sans-serif'] = [u'SimHei']
    matplotlib.rcParams['font.size'] = '8'
    long = len(url)  # number of files to compare
    for j in range(long):
        content = open(url[j], 'r').read()
        file_one = []
        try:
            jieba.analyse.set_stop_words(r'D:\My Documents\Downloads\jinyong\停用词.txt')
            jieba.load_userdict(r'D:\My Documents\Downloads\jinyong\dict.txt')
            tags = jieba.analyse.extract_tags(content, topK=120, withWeight=True)
            for i in range(len(tags)):
                file_one.append(tags[i])
        finally:
            print('ok')
        dictionary = pd.DataFrame(file_one).iloc[0:n, :]
        # shift each file's bars to opposite sides of the tick so the two series sit side by side
        plt.bar(np.arange(len(dictionary[0])) - width / long * pow(-1, j), dictionary[1] * 100, width=width / long)
        for i in range(len(dictionary)):
            plt.text(i - width / long * pow(-1, j) - width / 4 * 3, dictionary[1][i] * 100, dictionary[0][i])
    plt.show()
width, n = 0.4, 21
links = r'D:\My Documents\Downloads\jinyong\诛仙.txt'
url = r'D:\My Documents\Downloads\jinyong\鹿鼎记.txt'
urls = [links, url]
file_read(urls, n, width)
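With two overlapping bar series, a legend makes the chart much easier to read. A small optional tweak inside file_read (passing the file path as the label is just an assumption; any short name works):

plt.bar(np.arange(len(dictionary[0])) - width / long * pow(-1, j),
        dictionary[1] * 100, width=width / long, label=url[j])  # tag each series with its source file
plt.legend()  # draw once, just before plt.show()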
Visualization:
(Figure: comparison chart of 诛仙 vs 鹿鼎记 word frequencies)