This article walks through text analysis with NLTK in Python. Let's start with the overall workflow:
Raw text - tokenization - POS tagging - word-form normalization - stop word removal - special character removal - case conversion - text analysis
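Before running the examples below, the NLTK data packages they rely on need to be downloaded once. A minimal setup sketch (exact package names can vary slightly between NLTK versions):
import nltk
nltk.download('punkt')                        # tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')   # tagger model used by pos_tag
nltk.download('wordnet')                      # WordNet data used by WordNetLemmatizer
nltk.download('omw-1.4')                      # extra WordNet data needed by newer NLTK versions
nltk.download('stopwords')                    # English stop word list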
I. Tokenization
We will use an English description of the DBSCAN clustering algorithm as the example text:
from nltk import word_tokenize
sentence = "DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density "
token_words = word_tokenize(sentence)
print(token_words)
Tokenization output:
['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'samples', 'of', 'high', 'density', 'and', 'expands', 'clusters', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contains', 'clusters', 'of', 'similar', 'density']
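For comparison, a plain str.split() would keep punctuation attached to the neighbouring word (e.g. 'Noise.'), which is why word_tokenize is used here; a quick sketch:
print(sentence.split()[:10])
# ['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise.', 'Finds']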
II. Part-of-Speech (POS) Tagging
Why do we need POS tagging? Let's first see what happens if we skip it and normalize word forms directly on the tokenization output from step one.
There are two common approaches to word-form normalization: stemming and lemmatization.
1. Stemming
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
words_stemmer = [lancaster_stemmer.stem(token_word) for token_word in token_words]
print(words_stemmer)
Output:
['dbscan', '-', 'density-based', 'spat', 'clust', 'of', 'apply', 'with', 'nois', '.', 'find', 'cor', 'sampl', 'of', 'high', 'dens', 'and', 'expand', 'clust', 'from', 'them', '.', 'good', 'for', 'dat', 'which', 'contain', 'clust', 'of', 'simil', 'dens']
Note: stemming reduces each word to its stem, which often yields tokens that are not real words, e.g. "Spatial" becomes "spat" and "Noise" becomes "nois". This is of little use in ordinary text analysis, but the technique is a reasonable fit for information retrieval.
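LancasterStemmer is one of the more aggressive stemmers; NLTK also ships PorterStemmer, which is generally more conservative. A comparison sketch:
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()
words_porter = [porter_stemmer.stem(token_word) for token_word in token_words]
print(words_porter)   # typically stays closer to the original words than the Lancaster output above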
2. Lemmatization (restoring inflected forms)
from nltk.stem import WordNetLemmatizer
wordnet_lematizer = WordNetLemmatizer()
words_lematizer = [wordnet_lematizer.lemmatize(token_word) for token_word in token_words]
print(words_lematizer)
Output:
['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expands', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contains', 'cluster', 'of', 'similar', 'density']
Note: lemmatization maps inflected forms (past tense, plurals, third-person verbs, and so on) back to the base word, so it does not produce meaningless stems. Some forms are still not restored here, though: "Finds", "expands" and "contains" remain in the third person. The reason is that WordNetLemmatizer.lemmatize treats every word as a noun by default and therefore assumes these are already base forms; if we pass the verb POS explicitly, "contains" is reduced to "contain". That is why we first need POS tagging to obtain each word's part of speech (details below).
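A quick check confirms this: passing pos='v' restores the verb forms, while the default (noun) leaves them unchanged:
print(wordnet_lematizer.lemmatize('contains', pos='v'))   # contain
print(wordnet_lematizer.lemmatize('expands', pos='v'))    # expand
print(wordnet_lematizer.lemmatize('contains'))            # contains (treated as a noun by default)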
3. POS tagging
Tokenize first, then tag:
from nltk import word_tokenize,pos_tag
sentence = "DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density"
token_word = word_tokenize(sentence)   # tokenization
token_words = pos_tag(token_word)      # POS tagging
print(token_words)
Output:
[('DBSCAN', 'NNP'), ('-', ':'), ('Density-Based', 'JJ'), ('Spatial', 'NNP'), ('Clustering', 'NNP'), ('of', 'IN'), ('Applications', 'NNP'), ('with', 'IN'), ('Noise', 'NNP'), ('.', '.'), ('Finds', 'NNP'), ('core', 'NN'), ('samples', 'NNS'), ('of', 'IN'), ('high', 'JJ'), ('density', 'NN'), ('and', 'CC'), ('expands', 'VBZ'), ('clusters', 'NNS'), ('from', 'IN'), ('them', 'PRP'), ('.', '.'), ('Good', 'JJ'), ('for', 'IN'), ('data', 'NNS'), ('which', 'WDT'), ('contains', 'VBZ'), ('clusters', 'NNS'), ('of', 'IN'), ('similar', 'JJ'), ('density', 'NN')]
Note: the second element of each tuple is the word's POS tag. To see what each tag means, run nltk.help.upenn_tagset() or consult the POS tag reference in the documentation.
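For example, to look up a single tag (the 'tagsets' help data may need to be downloaded first):
import nltk
nltk.download('tagsets')        # help texts for the tag set (skip if already installed)
nltk.help.upenn_tagset('VBZ')   # VBZ: verb, present tense, 3rd person singular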
III. Word-Form Normalization (with the POS specified)
from nltk.stem import WordNetLemmatizer
words_lematizer = []
wordnet_lematizer = WordNetLemmatizer()
for word, tag in token_words:
    if tag.startswith('NN'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='n')  # 'n' for nouns
    elif tag.startswith('VB'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='v')  # 'v' for verbs
    elif tag.startswith('JJ'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='a')  # 'a' for adjectives
    elif tag.startswith('R'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='r')  # 'r' for adverbs
    else:
        word_lematizer = wordnet_lematizer.lemmatize(word)           # default POS ('n')
    words_lematizer.append(word_lematizer)
print(words_lematizer)
Output:
['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expand', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contain', 'cluster', 'of', 'similar', 'density']
Note: the inflected forms have now been restored to their base forms: "expands" and "contains" have become "expand" and "contain". ("Finds" is left unchanged because pos_tag labels it as a proper noun, NNP, at the start of its sentence, so it is lemmatized as a noun.)
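The if/elif chain above is often factored into a small helper that maps Penn Treebank tags to WordNet POS codes. A sketch with the same behaviour (the name get_wordnet_pos is just illustrative):
def get_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('n', 'v', 'a' or 'r')."""
    if tag.startswith('VB'):
        return 'v'   # verb
    elif tag.startswith('JJ'):
        return 'a'   # adjective
    elif tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # nouns and everything else default to noun

words_lematizer = [wordnet_lematizer.lemmatize(word, pos=get_wordnet_pos(tag)) for word, tag in token_words]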
IV. Stop Word Removal
After tokenization and word-form normalization we have the base form of each word, but the list still contains prepositions, determiners and other function words that carry little weight in text analysis. Such words are called stop words, and we remove them next.
from nltk.corpus import stopwords
cleaned_words = [word for word in words_lematizer if word not in stopwords.words('english')]
print('Before:', words_lematizer)
print('After stop word removal:', cleaned_words)
Output:
Before: ['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expand', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contain', 'cluster', 'of', 'similar', 'density']
After stop word removal: ['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'Applications', 'Noise', '.', 'Finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', '.', 'Good', 'data', 'contain', 'cluster', 'similar', 'density']
Note: stop words such as of, for and and have been removed.
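As a small refinement, the stop word list can be converted to a set once (the list comprehension above re-reads it for every word), and lowercasing before the lookup also catches stop words capitalized at the start of a sentence; a sketch:
stop_words = set(stopwords.words('english'))   # build the lookup set once
cleaned_words = [word for word in words_lematizer if word.lower() not in stop_words]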
V. Removing Special Characters
Punctuation is of no use in text analysis either, so we remove it as well. Here we filter the word list against a custom list of characters; specific unwanted tokens can be added to the same list, for example we drop "DBSCAN" along with the punctuation.
characters = [',', '.','DBSCAN', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%','-','...','^','{','}']
words_list = [word for word in cleaned_words if word not in characters]
print(words_list)
Output:
['Density-Based', 'Spatial', 'Clustering', 'Applications', 'Noise', 'Finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', 'Good', 'data', 'contain', 'cluster', 'similar', 'density']
Note: the standalone "-" and "." tokens (and "DBSCAN") no longer appear in the processed word list.
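Instead of maintaining the punctuation list by hand, the standard library's string.punctuation covers the common ASCII symbols; multi-character tokens such as '...' and custom words such as 'DBSCAN' still have to be listed explicitly. A sketch:
import string
to_remove = set(string.punctuation) | {'...', 'DBSCAN'}   # ASCII punctuation plus extra tokens
words_list = [word for word in cleaned_words if word not in to_remove]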
VI. Case Conversion
To avoid the same word being counted as two different words because of capitalization, we also normalize case (here, everything is converted to lowercase).
words_lists = [x.lower() for x in words_list ]
print(words_lists)
Output:
['density-based', 'spatial', 'clustering', 'applications', 'noise', 'finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', 'good', 'data', 'contain', 'cluster', 'similar', 'density']
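A quick illustration of why this matters for the word counts in the next step: without lowercasing, the same word in different cases is counted separately (minimal sketch):
from nltk import FreqDist
print(FreqDist(['Cluster', 'cluster']))   # two distinct keys: 'Cluster' and 'cluster'
print(FreqDist(['cluster', 'cluster']))   # a single key 'cluster' with a count of 2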
VII. Text Analysis
After the six preprocessing steps above we have a clean word list, ready for text analysis or text mining (it can also be converted to a DataFrame first).
As an example, let's count word frequencies:
from nltk import FreqDist
freq = FreqDist(words_lists)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
Output:
density-based:1
spatial:1
clustering:1
applications:1
noise:1
finds:1
core:1
sample:1
high:1
density:2
expand:1
cluster:2
good:1
data:1
contain:1
similar:1
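FreqDist also provides most_common() for retrieving the top-n words directly:
print(freq.most_common(3))   # e.g. [('density', 2), ('cluster', 2), ('density-based', 1)]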
Visualization (line chart):
freq.plot(20,cumulative=False)
Visualization (word cloud):
To draw a word cloud, the word list first has to be joined into a single string:
words = ' '.join(words_lists)
words
Output:
'density-based spatial clustering applications noise finds core sample high density expand cluster good data contain cluster similar density'
Drawing the word cloud:
from wordcloud import WordCloud
from imageio import imread
import matplotlib.pyplot as plt
pic = imread('./picture/china.jpg')   # image used as the word-cloud mask
wc = WordCloud(mask=pic, background_color='white', width=800, height=600)
wwc = wc.generate(words)              # build the cloud from the joined string
plt.figure(figsize=(10,10))
plt.imshow(wwc)
plt.axis("off")
plt.show()
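Since the word frequencies were already computed in step VII, the cloud can alternatively be generated from the FreqDist directly instead of re-counting the joined string; a sketch:
wc_freq = wc.generate_from_frequencies(dict(freq))   # reuse the FreqDist counts (word -> count)
plt.imshow(wc_freq)
plt.axis("off")
plt.show()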
Conclusion: both the line chart and the word cloud show at a glance that "density" and "cluster" are the most frequent words; the more often a word occurs, the larger it is drawn in the word cloud.