LDA: Printing Topics Together with Their Corresponding Document Indices

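Typical scikit-learn LDA usage looks like the code below: a few lines are enough to cluster the documents into topics.
However, most resources currently available online simply print the topics and do not map each topic back to its documents.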

import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# TfidfVectorizer builds TF-IDF weighted features; CountVectorizer builds raw term counts
from sklearn.decomposition import NMF, LatentDirichletAllocation


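random_state = 0
n_topics = 14  # number of topics, passed to LDA below
vec = TfidfVectorizer(max_features=5000, stop_words="english", max_df=0.95, min_df=2)
# text_arry is an iterable of raw document strings, one string per document
features = vec.fit_transform(text_arry)

lda = LatentDirichletAllocation(n_components=n_topics,  # number of topics
                                # max_iter=5,    # maximum number of passes over the training data
                                learning_method='batch',
                                learning_offset=50.,  # only meaningful when learning_method='online'; should be greater than 1; down-weights early iterations
                                random_state=random_state)
# docres has shape (n_documents, n_topics): the topic distribution of each document
docres = lda.fit_transform(features)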
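
To see what `fit_transform` returns before mapping topics to documents, here is a minimal sketch on a made-up toy corpus. The corpus and the `toy_*` names are my own assumptions for illustration, not part of the original code. I also use CountVectorizer here instead of the TfidfVectorizer used above, since LDA is usually described as a model over raw term counts; that swap is a choice made for this sketch only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: one raw string per document.
toy_docs = [
    "cats and dogs are popular pets",
    "dogs chase cats around the garden",
    "stocks and bonds are financial assets",
    "investors trade stocks on the market",
]

# Raw term counts (an assumption of this sketch; the post itself uses TF-IDF).
toy_vec = CountVectorizer(stop_words="english")
toy_features = toy_vec.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2,
                                    learning_method="batch",
                                    random_state=0)
toy_docres = toy_lda.fit_transform(toy_features)

# One row per document, one column per topic.
print(toy_docres.shape)
# Each row is a normalized topic distribution, so it sums to 1.
print(toy_docres.sum(axis=1))
# components_ has one row per topic and one column per vocabulary word.
print(toy_lda.components_.shape)
```

The two shapes are what matter for the rest of the post: rows of the `fit_transform` output index documents, and rows of `components_` index topics.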

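This post is mainly about how to map LDA topics to documents (not my original work; I will take it down on request).
The code above has already been covered by most blog posts, so I won't go over it again here.
Adjust the parameters to fit your own results, then run the code below to get the topic-to-document mapping.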

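import numpy as np

# Probability of each document belonging to each topic, shape (n_documents, n_topics)
LDA_corpus = np.array(docres)
print('Document-topic probabilities:\n', LDA_corpus)
# Topic-word weight matrix, shape (n_topics, n_features): one row per topic
# print('Topic-word matrix:\n', lda.components_)
# Pick the most probable topic for each document: argmax along each row (axis=1)
LDA_corpus_one = np.argmax(LDA_corpus, axis=1)
print('Topic assigned to each document:', LDA_corpus_one)
# Print the highest-weighted words of each topic
feature_names = vec.get_feature_names_out()  # on scikit-learn < 1.0 use vec.get_feature_names()
tt_matrix = lda.components_
for topic_id, tt_m in enumerate(tt_matrix):
    # Pair every word with its weight in this topic, sorted by weight (descending)
    tt_dict = sorted(zip(feature_names, tt_m), key=lambda x: x[1], reverse=True)
    # To keep only the words whose weight is greater than 0.6 instead:
    # tt_dict = [tt_threshold for tt_threshold in tt_dict if tt_threshold[1] > 0.6]
    # Keep the top 8 words of each topic
    tt_dict = tt_dict[:8]
    print('Topic %d:' % topic_id, tt_dict)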
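
The argmax step gives one topic per document. To go the other way and list the document indices under each topic, something like the following sketch could work. The helper name `documents_by_topic` and the small `example` matrix are hypothetical, made up for illustration; each document is assigned to its single most probable topic.

```python
import numpy as np


def documents_by_topic(doc_topic):
    """Group document indices by their most probable topic.

    doc_topic: array-like of shape (n_documents, n_topics).
    Returns a dict mapping topic index -> list of document indices.
    """
    doc_topic = np.asarray(doc_topic)
    assigned = doc_topic.argmax(axis=1)  # most probable topic per document
    return {topic: np.flatnonzero(assigned == topic).tolist()
            for topic in range(doc_topic.shape[1])}


# Hypothetical document-topic matrix: 4 documents, 3 topics.
example = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]

for topic, doc_ids in documents_by_topic(example).items():
    print('Topic %d: documents %s' % (topic, doc_ids))
```

With the fitted model from this post, calling `documents_by_topic(docres)` should give the same kind of mapping for the real corpus; topics that win no document come back with an empty list.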
