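[python] Introductory notes on LDA code for document-topic distributions
The sample text comes from the blog post "文本分析之TFIDF/LDA/Word2vec实践" (Text Analysis in Practice: TFIDF/LDA/Word2vec), which I recommend reading.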
新春 备 年货 , 新年 联欢晚会
新春 节目单 , 春节 联欢晚会 红火
大盘 下跌 股市 散户
下跌 股市 赚钱
金猴 新春 红火 新年
新车 新年 年货 新春
股市 反弹 下跌
股市 散户 赚钱
新年 , 看 春节 联欢晚会
大盘 下跌 散户
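The output is the topic distribution of these ten documents. The shape (10L, 2L) means 10 documents and 2 topics (the L suffix is how Python 2 prints long integers).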
shape: (10L, 2L)
doc: 0 topic: 0
doc: 1 topic: 0
doc: 2 topic: 1
doc: 3 topic: 1
doc: 4 topic: 0
doc: 5 topic: 0
doc: 6 topic: 1
doc: 7 topic: 1
doc: 8 topic: 0
doc: 9 topic: 1
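The script also calls matplotlib.pyplot to plot the corresponding document-topic distributions. Doc0, Doc1 and Doc8 fall under Topic0 and are mainly about the Spring Festival (新春), while Doc2, Doc3 and Doc9 fall under Topic1 and are mainly about the stock market (股市).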
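To make the "doc: n topic: k" lines above concrete: each row of the document-topic matrix is a probability distribution over topics, and the printed label is simply the index of the largest entry in that row. Here is a minimal pure-Python sketch of that step. The probabilities are made-up illustrative values, not the model's actual output, and `dominant_topic` is a hypothetical helper name.

```python
def dominant_topic(row):
    """Return the index of the largest probability in one document's row."""
    return max(range(len(row)), key=lambda k: row[k])

# Hypothetical document-topic rows (each sums to 1); NOT real model output.
doc_topic = [
    [0.90, 0.10],   # mostly topic 0
    [0.25, 0.75],   # mostly topic 1
    [0.60, 0.40],   # mostly topic 0
]

labels = [dominant_topic(row) for row in doc_topic]
for n, label in enumerate(labels):
    print("doc: {} topic: {}".format(n, label))
```

With a NumPy array, `doc_topic[n].argmax()` does the same thing, which is what the script in this post uses.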
# coding=utf-8
import io

import numpy as np
import matplotlib.pyplot as plt
import lda
from sklearn.feature_extraction.text import CountVectorizer

if __name__ == "__main__":

    # Read the corpus: each line of test.txt is one document
    corpus = []
    with io.open('test.txt', 'r', encoding='utf-8') as fin:
        for line in fin:
            corpus.append(line.strip())
    # print(corpus)

    # Convert the documents to a term-frequency matrix:
    # element a[i][j] is the count of word j in document i
    vectorizer = CountVectorizer()
    print(vectorizer)
    X = vectorizer.fit_transform(corpus)
    weight = X.toarray()
    print(len(weight))
    print(weight[:5, :5])

    # Fit LDA on the integer count matrix
    print('LDA:')
    model = lda.LDA(n_topics=2, n_iter=500, random_state=1)
    model.fit(np.asarray(weight))     # model.fit_transform(X) is also available
    topic_word = model.topic_word_    # model.components_ also works

    # Document-topic distribution
    doc_topic = model.doc_topic_
    print("type(doc_topic): {}".format(type(doc_topic)))
    print("shape: {}".format(doc_topic.shape))

    # Most probable topic for each document
    for n in range(doc_topic.shape[0]):
        topic_most_pr = doc_topic[n].argmax()
        print("doc: {} topic: {}".format(n, topic_most_pr))

    # Plot the topic distribution of six selected documents
    f, ax = plt.subplots(6, 1, figsize=(8, 8), sharex=True)
    for i, k in enumerate([0, 1, 2, 3, 8, 9]):
        ax[i].stem(doc_topic[k, :], linefmt='r-',
                   markerfmt='ro', basefmt='w-')
        ax[i].set_xlim(-1, 2)      # x-axis range
        ax[i].set_ylim(0, 1.2)     # y-axis range
        ax[i].set_title("Document {}".format(k))
    plt.show()

    # Plot the word distribution of the two topics
    f, ax = plt.subplots(2, 1, figsize=(6, 6), sharex=True)
    for i, k in enumerate([0, 1]):
        ax[i].stem(topic_word[k, :], linefmt='b-',
                   markerfmt='bo', basefmt='w-')
        ax[i].set_ylim(0, 1)
        ax[i].set_title("topic {}".format(k))
    plt.show()
For the feature computation methods, see: Feature Extraction - scikit-learn
Features length: 15
下跌 反弹 大盘 年货 散户 新年 新春 新车 春节 红火 联欢晚会 股市 节目单 赚钱 金猴
0.0 0.0 0.0 0.579725686076 0.0 0.450929562568 0.450929562568 0.0 0.0 0.0 0.507191470855 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.356735384792 0.0 0.458627428458 0.458627428458 0.401244805261 0.0 0.539503693426 0.0 0.0
0.450929562568 0.0 0.579725686076 0.0 0.507191470855 0.0 0.0 0.0 0.0 0.0 0.0 0.450929562568 0.0 0.0 0.0
0.523221265036 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.523221265036 0.0 0.672665604612 0.0
0.0 0.0 0.0 0.0 0.0 0.410305398084 0.410305398084 0.0 0.0 0.52749830162 0.0 0.0 0.0 0.0 0.620519542315
0.0 0.0 0.0 0.52749830162 0.0 0.410305398084 0.410305398084 0.620519542315 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.482964462575 0.730404446714 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.482964462575 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.568243852685 0.0 0.0 0.0 0.0 0.0 0.0 0.505209504985 0.0 0.649509260872 0.0
0.0 0.0 0.0 0.0 0.0 0.505209504985 0.0 0.0 0.649509260872 0.0 0.568243852685 0.0 0.0 0.0 0.0
0.505209504985 0.0 0.649509260872 0.0 0.568243852685 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
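The weights above can be reproduced by hand. The numbers are consistent with scikit-learn's default TF-IDF settings: a smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, multiplied by the term count, followed by L2 normalization of each row. The sketch below applies that formula to the first document. `tfidf_row` is a hypothetical helper, not a scikit-learn function, and the document frequencies are counted by hand from the ten-line corpus in this post.

```python
import math

# Vocabulary in the order printed in the feature list
features = ["下跌", "反弹", "大盘", "年货", "散户", "新年", "新春", "新车",
            "春节", "红火", "联欢晚会", "股市", "节目单", "赚钱", "金猴"]

# Number of documents (out of 10) that contain each word, same order as features
doc_freq = [4, 1, 2, 2, 3, 4, 4, 1, 2, 2, 3, 4, 1, 2, 1]
n_docs = 10

def tfidf_row(counts, doc_freq, n_docs):
    """TF-IDF for one document: count * smoothed idf, then L2-normalize."""
    idf = [math.log((1.0 + n_docs) / (1.0 + df)) + 1.0 for df in doc_freq]
    raw = [c * w for c, w in zip(counts, idf)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

# Term counts of document 0: 年货, 新年, 新春 and 联欢晚会 each occur once
counts_doc0 = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
row0 = tfidf_row(counts_doc0, doc_freq, n_docs)
for name, value in zip(features, row0):
    if value > 0:
        print("{} {:.6f}".format(name, value))
```

Rarer words get a larger IDF, which is why 年货 (2 documents) ends up with a higher weight than 新年 or 新春 (4 documents each) in the same row.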
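Note that lda expects a matrix of integer word counts. Passing the floating-point TF-IDF weights above to model.fit raises:
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'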
Features length: 15
下跌 反弹 大盘 年货 散户 新年 新春 新车 春节 红火 联欢晚会 股市 节目单 赚钱 金猴
TF Weight:
0 0 0 1 0 1 1 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 1 1 1 0 1 0 0
1 0 1 0 1 0 0 0 0 0 0 1 0 0 0
1 0 0 0 0 0 0 0 0 0 0 1 0 1 0
0 0 0 0 0 1 1 0 0 1 0 0 0 0 1
0 0 0 1 0 1 1 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 1 0 0 0 0 0 0 1 0 1 0
0 0 0 0 0 1 0 0 1 0 1 0 0 0 0
1 0 1 0 1 0 0 0 0 0 0 0 0 0 0
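For readers curious about how this matrix comes about, here is a rough pure-Python approximation of the counting step. It assumes whitespace-separated tokens and drops tokens shorter than two characters, which mimics the effect of CountVectorizer's default token pattern on this particular corpus (the single-character words 备 and 看 and the commas disappear). `build_count_matrix` is a hypothetical helper, not part of scikit-learn, and it is not a general replacement for CountVectorizer.

```python
corpus = [
    "新春 备 年货 , 新年 联欢晚会",
    "新春 节目单 , 春节 联欢晚会 红火",
    "大盘 下跌 股市 散户",
    "下跌 股市 赚钱",
    "金猴 新春 红火 新年",
    "新车 新年 年货 新春",
    "股市 反弹 下跌",
    "股市 散户 赚钱",
    "新年 , 看 春节 联欢晚会",
    "大盘 下跌 散户",
]

def build_count_matrix(docs, min_len=2):
    """Return (sorted vocabulary, per-document term-count rows)."""
    # Keep only tokens with at least min_len characters
    tokenized = [[t for t in doc.split() if len(t) >= min_len] for doc in docs]
    # Sorted vocabulary defines the column order
    vocab = sorted({t for tokens in tokenized for t in tokens})
    index = {word: j for j, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(tokenized):
        for t in tokens:
            matrix[i][index[t]] += 1
    return vocab, matrix

vocab, matrix = build_count_matrix(corpus)
print("Features length: {}".format(len(vocab)))
print(" ".join(vocab))
for row in matrix:
    print(" ".join(str(v) for v in row))
```

Because every entry is an integer count, a matrix of this kind is what lda can be fitted on directly.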
[[0 0 0 1 0]
[0 0 0 0 0]
[1 0 1 0 1]
[1 0 0 0 0]
[0 0 0 0 0]]
import lda
model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
from sklearn.cluster import KMeans
clf = KMeans(n_clusters=4)  # scenic spots, animals, people, countries
s = clf.fit(weight)
shape: (12L, 4L)
doc: 0 topic: 1
doc: 1 topic: 1
doc: 2 topic: 1
doc: 3 topic: 3
doc: 4 topic: 3
doc: 5 topic: 3
doc: 6 topic: 0
doc: 7 topic: 0
doc: 8 topic: 0
doc: 9 topic: 2
doc: 10 topic: 2
doc: 11 topic: 2
print('LDA:')
model = lda.LDA(n_topics=2, n_iter=500, random_state=1)
model.fit(np.asarray(weight))     # model.fit_transform(X) is also available
topic_word = model.topic_word_    # model.components_ also works

# Vocabulary, in the same column order as the count matrix
# (scikit-learn 1.0 and later provide get_feature_names_out() instead)
word = vectorizer.get_feature_names()
for w in word:
    print(w)
print(topic_word[:, :3])

# Top n words of each topic, highest probability first
n = 5
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(word)[np.argsort(topic_dist)][:-(n+1):-1]
    print(u'*Topic {}\n- {}'.format(i, ' '.join(topic_words)))

# Document-topic distribution
doc_topic = model.doc_topic_
print("type(doc_topic): {}".format(type(doc_topic)))
print("shape: {}".format(doc_topic.shape))
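Calling word = vectorizer.get_feature_names() returns the vocabulary of the whole corpus; the TF-IDF values are the weights attached to these words. The keywords are then looked up by their positions in this list. The code prints the 5 keywords of each topic (the original post shows the result as a figure).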
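The slice `[:-(n+1):-1]` in the code is compact but easy to misread: `np.argsort` returns indices in ascending order of probability, and the reversed slice walks backwards over the last n of them, so the most probable word comes first. Below is a pure-Python sketch of the same selection. The topic-word row contains made-up values (not taken from a fitted model), and `top_words` is a hypothetical helper.

```python
def top_words(words, topic_dist, n=5):
    """Return the n words with the highest probability, highest first."""
    # Indices sorted by ascending probability (what np.argsort would return)
    ascending = sorted(range(len(topic_dist)), key=lambda j: topic_dist[j])
    # Same slice as the NumPy version: last n indices, in reverse order
    picked = ascending[:-(n + 1):-1]
    return [words[j] for j in picked]

words = ["下跌", "反弹", "大盘", "年货", "散户", "新年"]
# Hypothetical probabilities for one topic; NOT real model output
topic_dist = [0.30, 0.05, 0.20, 0.02, 0.25, 0.18]

print(" ".join(top_words(words, topic_dist, n=3)))
```

The lookup works only because the word list and the columns of the topic-word matrix share the same order, which is why the vocabulary must come from the same vectorizer that produced the count matrix.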