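I. What is LDA?

Put simply, LDA (Latent Dirichlet Allocation) is a generative topic model for documents with a three-layer structure: words, topics, and documents. The idea is to assume that every word in a document is produced by first choosing a topic with some probability and then choosing a word from that topic with some probability. Running an LDA topic analysis gives us the topic distribution of every document.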
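The generative story above can be sketched in a few lines. This is only a toy illustration, not the fitting procedure: the vocabulary, the topic-word probabilities, and the `alpha` values are all made up, and the Dirichlet draw is done by normalizing Gamma samples so that only the standard library is needed.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, topic_word_probs, vocab, alpha, rng):
    """Generate one toy document following the LDA generative story."""
    # 1. Draw this document's topic distribution (theta) from a Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to theta.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word_probs[topic])[0])
    return theta, words

# Hypothetical toy vocabulary and topic-word distributions (not from the lyrics data).
vocab = ["love", "kiss", "rain", "regret", "smile", "sun"]
topic_word_probs = [
    [0.05, 0.05, 0.05, 0.05, 0.40, 0.40],  # a "happiness"-like topic
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],  # a "regret"-like topic
    [0.40, 0.40, 0.05, 0.05, 0.05, 0.05],  # a "love"-like topic
]

rng = random.Random(0)
theta, words = generate_document(10, topic_word_probs, vocab, alpha=[0.5, 0.5, 0.5], rng=rng)
print([round(t, 3) for t in theta])
print(" ".join(words))
```

Fitting LDA runs this story in reverse: given only the words, it estimates the per-document topic distributions and the per-topic word distributions.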

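II. How do we run LDA?

01 Load the packages

# Load the packages
import numpy as np
import pandas as pd
import xlrd        # Excel reader (xlrd >= 2.0 reads only .xls files)
import openpyxl    # Excel writer; pandas also uses it to read .xlsx files

import re
import jieba
import jieba.analyse
#jieba.enable_parallel() # turn on parallel word segmentation

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Plotting (the pie charts in section III use plt)
import matplotlib.pyplot as plt

# Topic visualization
import pyLDAvis
import pyLDAvis.sklearn  # newer pyLDAvis releases (3.4+) expose this module as pyLDAvis.lda_model

pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore")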

02 Load the data and preprocess it

def stopwordslist():
    # One stop word per line; a set keeps the membership test below fast
    with open('./data/stopwords.txt', encoding='UTF-8') as f:
        return {line.strip() for line in f}

def data_preprocessing():
    # .xlsx files carry their own encoding, so no encoding argument is needed
    songs = pd.read_excel("data/周杰伦歌词.xlsx")
    # Keep only Chinese characters in the lyrics column ('歌词')
    songs['歌词'] = songs['歌词'].apply(lambda x: re.sub(r'[^\u4e00-\u9fa5]+', ' ', str(x), flags=re.U))
    # Segment into words and drop stop words; the tokens go into the '分词' column
    stopwords = stopwordslist()
    songs['分词'] = songs['歌词'].apply(lambda x: ' '.join([w for w in jieba.lcut(x.strip()) if w not in stopwords and w != ' ']))
    all_train = list(songs['分词'].values)
    return all_train, songs

corpus, songs = data_preprocessing()
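To make the two cleaning steps concrete, here is a minimal sketch on a made-up sentence. The sample text, the token list, and the stop words are all hypothetical; the token list stands in for what `jieba.lcut` would return, so the snippet runs without jieba or the data files.

```python
import re

# Hypothetical sample line mixing Chinese, Latin letters, digits and punctuation.
raw = "今天天气很好, 我们去海边! (Take 2)"

# Replace every run of characters outside U+4E00..U+9FA5 with a single space.
cleaned = re.sub(r'[^\u4e00-\u9fa5]+', ' ', raw).strip()
print(cleaned)

# Made-up tokens and stop words, standing in for jieba.lcut output and stopwords.txt.
tokens = ["今天", "天气", "很", "好", " ", "我们", "去", "海边"]
stopwords = {"很", "去"}
kept = [w for w in tokens if w not in stopwords and w != " "]
print(" ".join(kept))
```

The space-joined string is the shape `TfidfVectorizer` expects in the next step: one string per song, with tokens separated by whitespace.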

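03 Build the TF-IDF matrix

# min_df=5 keeps only words that appear in at least 5 songs.
# The default token_pattern keeps tokens of two or more characters, so single-character words are dropped.
tfidf_vectorizer = TfidfVectorizer(min_df=5)
tfidf_mat = tfidf_vectorizer.fit_transform(corpus)

print('Vocabulary size:', len(tfidf_vectorizer.vocabulary_))

Output: Vocabulary size: 306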
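The effect of `min_df` is easy to see on a toy corpus. The sketch below is a plain-Python stand-in for the document-frequency filter, not scikit-learn's implementation, and the four documents are made up.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df):
    """Return the sorted vocabulary of terms that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, no matter how often it repeats.
        doc_freq.update(set(doc.split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

# Hypothetical toy corpus of space-separated tokens (not the lyrics data).
docs = [
    "love rain love",
    "rain smile",
    "love smile sun",
    "rain love",
]
print(filter_by_document_frequency(docs, min_df=3))  # ['love', 'rain']
```

With a few hundred songs, a threshold of 5 removes words that only a handful of songs use, which is why the vocabulary ends up so small.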

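04 LDA topic clustering

n_topics = 3      # number of topics, chosen by hand
# batch_size only takes effect with learning_method='online'; the default is 'batch'
lda_model = LatentDirichletAllocation(n_components=n_topics, batch_size=8, random_state=0)
# Fit the LDA model on the TF-IDF matrix
lda_model.fit(tfidf_mat)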


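Let's visualize the result. On the multidimensional scaling plot the topics sit far apart from one another, and topics 2 and 3 are fairly compact, so the result looks acceptable.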

data = pyLDAvis.sklearn.prepare(lda_model, tfidf_mat, tfidf_vectorizer)
pyLDAvis.show(data)


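Let's print the 15 highest-weighted words of each topic to see what each topic is about.

# Helper that prints the top words of every topic
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % (topic_idx + 1))
        # argsort is ascending, so walk it backwards to get the heaviest words first
        print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

n_top_words = 15
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
tf_feature_names = tfidf_vectorizer.get_feature_names_out()
print_top_words(lda_model, tf_feature_names, n_top_words)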
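The rows of `components_` are unnormalized weights; dividing a row by its sum turns it into that topic's word distribution. Here is a minimal plain-Python sketch of that normalization followed by the top-word selection. The six-word vocabulary and the weights are invented for illustration.

```python
def top_words_with_probs(topic_weights, feature_names, n_top_words):
    """Normalize one topic's weights to probabilities and return the heaviest terms first."""
    total = sum(topic_weights)
    probs = [w / total for w in topic_weights]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(feature_names[i], probs[i]) for i in order[:n_top_words]]

# Hypothetical topic-word weights for a six-term vocabulary.
feature_names = ["love", "kiss", "rain", "regret", "smile", "sun"]
weights = [2.0, 0.5, 1.0, 0.5, 3.5, 2.5]

top = top_words_with_probs(weights, feature_names, 3)
for word, prob in top:
    print(word, round(prob, 2))
```

Looking at the probabilities as well as the words helps judge whether a topic is dominated by two or three terms or spread across many.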

[Figure: top words of each topic]

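Here is my attempt (read: wild guess) at what each topic conveys:
Topic 1: happiness
Topic 2: regret
Topic 3: love

Let's look at the topic distribution of each song on the album 七里香 (Qi Li Xiang):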

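# Per-song topic proportions: one row per song, one column per topic
doc_topic_matrix = lda_model.transform(tfidf_mat)
doc_topic_df = pd.DataFrame(doc_topic_matrix)
# '最大概率主题' = most probable topic, numbered from 1
songs['最大概率主题'] = np.argmax(doc_topic_matrix, axis=1) + 1
# '主题N比例' = share of topic N; these three columns must stay last, section III selects them by position
songs['主题1比例'], songs['主题2比例'], songs['主题3比例'] = doc_topic_df.iloc[:,0], doc_topic_df.iloc[:,1], doc_topic_df.iloc[:,2]

# Songs on the album 七里香, without the album ('专辑'), year ('年份') and token ('分词') columns
songs[songs['专辑'] == '七里香'].drop(columns=['专辑','年份','分词'])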

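[Table: topic distribution of the album 七里香]

Well, the song 七里香 really is the taste of first love: topic 3 accounts for 85% of it, which makes sense. As for the others... let's move on and compare how the topics of Jay Chou's own lyrics differ from those of Fang Wenshan.

III. Topic distribution of Fang Wenshan's lyrics vs. Jay Chou's lyrics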

# Split the songs by lyricist ('填词'): 方文山 = Fang Wenshan, 周杰伦 = Jay Chou
fang = songs[songs['填词'] == '方文山']
jay = songs[songs['填词'] == '周杰伦']

topics = ['About happiness','About regret','About Love']

# Mean of the last three columns, i.e. the average share of each topic
fang_topic = fang.iloc[:,-3:].mean()
plt.pie(fang_topic,
        labels=topics,
        startangle=90,
        shadow=True,
        explode=(0,0,0.1),  # pull the third slice (love) outward to draw attention
        autopct='%1.1f%%')  # print the percentage on each slice
plt.title("Topic Distribution of Fang Wenshan's songs")
plt.show()

[Figure: topic distribution of Fang Wenshan's lyrics]

jay_topic = jay.iloc[:,-3:].mean()
plt.pie(jay_topic,
        labels=topics,
        startangle=90,
        shadow=True,
        explode=(0,0,0.1),  # pull the third slice (love) outward to draw attention
        autopct='%1.1f%%')  # print the percentage on each slice
plt.title("Topic Distribution of Jay's songs")
plt.show()

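[Figure: topic distribution of Jay Chou's lyrics]

Overall, Fang Wenshan's lyrics concentrate on love stories, while the topics of Jay Chou's own lyrics are spread more evenly.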
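The comparison rests on a simple average: for each lyricist, take the mean of every topic's share over that lyricist's songs. A plain-Python sketch of that step is below. The lyricist labels and the proportions are invented for illustration and are not the article's results.

```python
from collections import defaultdict

def mean_topic_share(rows):
    """Average the per-song topic proportions for each lyricist."""
    grouped = defaultdict(list)
    for lyricist, shares in rows:
        grouped[lyricist].append(shares)
    means = {}
    for lyricist, share_rows in grouped.items():
        n = len(share_rows)
        # zip(*share_rows) iterates column by column, i.e. topic by topic.
        means[lyricist] = [sum(col) / n for col in zip(*share_rows)]
    return means

# Hypothetical (lyricist, [topic 1, topic 2, topic 3]) rows; the numbers are invented.
rows = [
    ("A", [0.1, 0.1, 0.8]),
    ("A", [0.2, 0.2, 0.6]),
    ("B", [0.5, 0.3, 0.2]),
    ("B", [0.3, 0.3, 0.4]),
]
means = mean_topic_share(rows)
for lyricist, shares in means.items():
    print(lyricist, [round(s, 2) for s in shares])
```

Because each song's proportions sum to 1, the averaged proportions also sum to 1, which is what makes a pie chart a reasonable way to show them.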



Exploring the topics of Jay Chou's lyrics with the LDA topic model