In the previous Machine Learning post, I covered how to do some simple web scraping. This post covers processing the scraped data, i.e., Data Processing.
2. Data Processing
a. Chinese word frequency statistics and word cloud visualization
Tools: the jieba module for Chinese word segmentation (an excellent, simple, and open-source Chinese tokenizer), and the Python WordCloud module, which is full-featured and offers plenty of room for customization and presentation.
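If you have not used jieba before, the two calls that matter below are jieba.lcut (plain segmentation) and jieba.analyse.extract_tags (TF-IDF keyword extraction). A minimal sketch with a made-up sample sentence:

import jieba
import jieba.analyse

sample = '韦小宝是鹿鼎记里的主角'  # made-up sample sentence, just for illustration
print(jieba.lcut(sample))  # segment the sentence into a list of words
for word, weight in jieba.analyse.extract_tags(sample, topK=5, withWeight=True):
    print(word, weight)  # top keywords with their TF-IDF weights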
The relevant code is as follows:
import numpy as np
from PIL import Image  # scipy.misc.imread was removed in SciPy 1.2+, so load the mask with Pillow
import jieba
import jieba.analyse  # keyword extraction
from wordcloud import WordCloud
import matplotlib.pyplot as plt
file = open(r'./art/鹿鼎记.txt', encoding='utf-8', errors='ignore')
url = r'./art/stop_word.txt'
content = file.read()
file.close()
file_one = []
try:
    jieba.analyse.set_stop_words(url)  # drop common Chinese stop words
    tags = jieba.analyse.extract_tags(content, topK=160, withWeight=True)  # top 160 keywords with weights
    for tag in tags:
        file_one.append([tag[0], tag[1] * 1000])  # collect (word, scaled weight) pairs in file_one
        print(tag[0] + '\t' + str(tag[1] * 1000))  # show each keyword and its scaled weight
finally:
    print('OK')
patch = r'C:\Users\22109\Downloads\字体-方正兰亭黑体.ttf'  # a font that can render Chinese in the cloud
dictionary = dict(file_one)  # list -> dict, the type generate_from_frequencies expects
bg_pic = np.array(Image.open(r'./art/img.png'))  # mask image as an array
wc = WordCloud(
    # font
    font_path=patch,
    # background color
    background_color='white',
    # maximum number of words
    max_words=200,
    # word-cloud shape
    mask=bg_pic,
    # largest font size
    max_font_size=200,
)
wc.generate_from_frequencies(dictionary)  # build the cloud from the frequency dict
plt.figure()
plt.imshow(wc)  # display with matplotlib
plt.axis('off')
plt.show()
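To keep the rendered image rather than only displaying it, WordCloud can also write it to disk directly (the output path below is just an example):

wc.to_file('./art/cloud.png')  # save the cloud as a PNG; the path is illustrative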
The resulting word cloud:
(Figure: Figure_1.png, word cloud generated from 鹿鼎记)
b. Statistical visualization
b-1. Bar chart comparison
import jieba
import jieba.analyse
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
def file_read(url, n):
    matplotlib.rcParams['font.family'] = 'sans-serif'
    matplotlib.rcParams['font.sans-serif'] = [u'SimHei']  # a font that can render Chinese labels
    matplotlib.rcParams['font.size'] = '10'
    content = open(url, 'r').read()  # read from the url parameter (the original read a global by mistake)
    file_one = []
    try:
        jieba.analyse.set_stop_words(r'D:\My Documents\Downloads\jinyong\停用词.txt')  # stop-word list
        jieba.load_userdict(r'D:\My Documents\Downloads\jinyong\dict.txt')  # custom user dictionary
        tags = jieba.analyse.extract_tags(content, topK=120, withWeight=True)
        for i in range(len(tags)):
            file_one.append(tags[i])
    finally:
        print('OK')
    dictionary = pd.DataFrame(file_one).iloc[0:n, :]  # keep only the top n keywords
    return dictionary
width, n = 0.4, 21
links = r'D:\My Documents\Downloads\jinyong\诛仙.txt'
dictionary = file_read(links, n)
plt.bar(range(len(dictionary[0])), dictionary[1] * 100, width=width, color=['r', 'g', 'y'])
for i in range(len(dictionary)):
    plt.text(i - width / 4 * 3, dictionary[1][i] * 100, dictionary[0][i])  # label each bar with its keyword
plt.show()
The result (using 诛仙 as an example):
(Figure: 诛仙 word-frequency bar chart)
You can see that the protagonist halo is very strong. Fittingly, 张小凡 and 鬼厉 appear with almost equal frequency, which matches the plot of 诛仙 (Xiaofan's turn to the dark side). Interestingly, the data shows 陆雪琪 appearing more often than 碧瑶; anyone who hasn't read the novel would probably conclude from this that 陆雪琪 is the female lead. But if we scraped only the text from before 碧瑶's battle at 青云, 碧瑶 would presumably be far in the lead...
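That last guess is easy to sanity-check without re-scraping: split the text near the 青云之战 chapters and count each name on both sides. A minimal sketch, assuming content holds the full text of 诛仙 and using a hypothetical split marker:

split_at = content.find('青云之战')  # hypothetical chapter marker; adjust to the actual text
before, after = content[:split_at], content[split_at:]  # assumes the marker was found
for name in ('碧瑶', '陆雪琪'):
    print(name, before.count(name), after.count(name))  # raw occurrence counts in each half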
Comparing the wuxia novel 鹿鼎记 with 诛仙:
The code is as follows:
import jieba
import jieba.analyse
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
def file_read(url, n, width):
    matplotlib.rcParams['font.family'] = 'sans-serif'
    matplotlib.rcParams['font.sans-serif'] = [u'SimHei']
    matplotlib.rcParams['font.size'] = '8'
    long = len(url)  # number of files to compare
    for j in range(long):
        content = open(url[j], 'r').read()
        file_one = []
        try:
            jieba.analyse.set_stop_words(r'D:\My Documents\Downloads\jinyong\停用词.txt')
            jieba.load_userdict(r'D:\My Documents\Downloads\jinyong\dict.txt')
            tags = jieba.analyse.extract_tags(content, topK=120, withWeight=True)
            for i in range(len(tags)):
                file_one.append(tags[i])
        finally:
            print('ok')
        dictionary = pd.DataFrame(file_one).iloc[0:n, :]
        # shift each file's bars to opposite sides of the tick so the two series sit side by side
        plt.bar(np.arange(len(dictionary[0])) - width / long * pow(-1, j), dictionary[1] * 100, width=width / long)
        for i in range(len(dictionary)):
            plt.text(i - width / long * pow(-1, j) - width / 4 * 3, dictionary[1][i] * 100, dictionary[0][i])
    plt.show()
width, n = 0.4, 21
links = r'D:\My Documents\Downloads\jinyong\诛仙.txt'
url = r'D:\My Documents\Downloads\jinyong\鹿鼎记.txt'
urls = [links, url]
file_read(urls, n, width)
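With two overlapping bar series, a legend makes the chart much easier to read. A small optional tweak inside file_read (passing the file path as the label is just an assumption; any short name works):

plt.bar(np.arange(len(dictionary[0])) - width / long * pow(-1, j),
        dictionary[1] * 100, width=width / long, label=url[j])  # tag each series with its source file
plt.legend()  # draw once, just before plt.show()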
Visualization:
(Figure: comparison chart of 诛仙 vs 鹿鼎记 word frequencies)