I've spent the last few days looking at the LDA topic model. Just reading the LDA数学八卦 PDF took two days, and while there is still a lot I don't understand, I did learn something. Before, I had no idea how LDA, given only a number of topics K and the hyperparameters alpha and beta, could spit out the corresponding topics.
It turns out the key is the Gibbs sampling algorithm it uses. Once K is specified, every word is initially assigned to one of the K topics at random; then the formula used in Gibbs sampling gives the probability of word i being reassigned to each of topics 1~K, and the sampler keeps reassigning words this way until it converges, i.e. the probability of each word being assigned to each of the K topics no longer changes. At that point you have the LDA model, namely the word-topic co-occurrence (count) matrix, from which the document-topic distributions can also be computed.
As for how the formula for the probability of a word belonging to a given topic in Gibbs sampling is derived, LDA数学八卦 walks through it from the very beginning to the end. Honestly, the document is excellent: even though I still don't fully understand it, I at least know the path now, from the Gamma function to the Beta distribution to the Dirichlet distribution and finally to the Gibbs sampling formula.
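For my own notes, the collapsed Gibbs sampling update that all of this builds up to is usually written as follows (this is the standard form; the notation in the PDF may differ slightly):

$$p(z_i = k \mid \vec z_{\neg i}, \vec w) \;\propto\; \frac{n_{k,\neg i}^{(w_i)} + \beta}{n_{k,\neg i}^{(\cdot)} + V\beta}\,\bigl(n_{m,\neg i}^{(k)} + \alpha\bigr)$$

where $n_{k,\neg i}^{(w_i)}$ is how many times word $w_i$ is currently assigned to topic k (excluding position i), $n_{k,\neg i}^{(\cdot)}$ is the total number of word tokens assigned to topic k, $n_{m,\neg i}^{(k)}$ is the number of tokens in document m assigned to topic k, and V is the vocabulary size.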
Right now I'm using the lda package in R, hoping to explore the topic distribution of some text data. The key inputs we have to supply ourselves are three: the number of topics K, the alpha that controls the document-topic distribution, and the beta that controls the topic-word distribution (the package seems to call it eta). A minimal sketch of the call is below.
LDA数学八卦 can be found here: http://www.52nlp.cn/lda-math-%E6%B1%87%E6%80%BB-lda%E6%95%B0%E5%AD%A6%E5%85%AB%E5%8D%A6
Still, I honestly don't really know how these three parameters should be set, or what their physical meaning is. Some say alpha = 50/K and beta = 0.01, 0.1, or 200/W, but my data is very large, beyond the usual scale, so those defaults may not apply well. To set alpha and beta sensibly, understanding what they mean seems essential, so I found a basic explanation online, quoted below:
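To make this concrete, here is a minimal sketch of how I would call the package; the toy documents and parameter values are placeholders for illustration, not my real data.

```r
# Minimal sketch of fitting LDA with the R "lda" package.
# Toy documents and parameter values below are placeholders.
library(lda)

docs_raw <- c("the cat sat on the mat",
              "dogs and cats are pets",
              "stock prices fell sharply today")

corpus <- lexicalize(docs_raw, lower = TRUE)   # builds $documents and $vocab

K     <- 10        # number of topics (to be tuned for the real data)
alpha <- 50 / K    # document-topic prior (common heuristic, see below)
eta   <- 0.01      # topic-word prior (the package calls beta "eta")

set.seed(1)
fit <- lda.collapsed.gibbs.sampler(corpus$documents, K, corpus$vocab,
                                   num.iterations = 500,
                                   alpha = alpha, eta = eta)

top.topic.words(fit$topics, num.words = 5)   # most probable words per topic
# fit$document_sums holds the per-document topic assignment counts
```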
http://stats.stackexchange.com/questions/37405/natural-interpretation-for-lda-hyperparameters:
The answer depends on whether you are assuming the symmetric or asymmetric dirichlet distribution (or, more technically, whether the base measure is uniform). Unless something else is specified, most implementations of LDA assume the distribution is symmetric.
For the symmetric distribution, a high alpha-value means that each document is likely to contain a mixture of most of the topics, and not any single topic specifically. A low alpha value puts less such constraints on documents and means that it is more likely that a document may contain mixture of just a few, or even only one, of the topics. Likewise, a high beta-value means that each topic is likely to contain a mixture of most of the words, and not any word specifically, while a low value means that a topic may contain a mixture of just a few of the words.
If, on the other hand, the distribution is asymmetric, a high alpha-value means that a specific topic distribution (depending on the base measure) is more likely for each document. Similarly, high beta-values means each topic is more likely to contain a specific word mix defined by the base measure.
In practice, a high alpha-value will lead to documents being more similar in terms of what topics they contain. A high beta-value will similarly lead to topics being more similar in terms of what words they contain.
So, yes, the alpha-parameters specify prior beliefs about topic sparsity/uniformity in the documents. I'm not entirely sure what you mean by "mutual exclusiveness of topics in terms of words" though.
More generally, these are concentration parameters for the dirichlet distribution used in the LDA model. To gain some intuitive understanding of how this works, this presentation contains some nice illustrations, as well as a good explanation of LDA in general.
An additional comment I'll put here, since I can't comment on your original question: From what I've seen, the alpha- and beta-parameters can somewhat confusingly refer to several different parameterizations. The underlying dirichlet distribution is usually parameterized with the vector (α1, α2, ..., αK), but this can be decomposed into the base measure u = (u1, u2, ..., uK) and the concentration parameter α, such that α∗u = (α1, α2, ..., αK). In the case where the alpha parameter is a scalar, it is usually meant the concentration parameter α, but it can also mean the values of (α1, α2, ..., αK), since these will be equal under the symmetrical dirichlet distribution. If it's a vector, it usually refers to (α1, α2, ..., αK). I'm not sure which parametrization is most common, but in my reply I assume you meant the alpha- and beta-values as the concentration parameters.
Roughly, the takeaway seems to be: with the symmetric Dirichlet, larger alpha and beta mean each document tends to contain a mixture of most of the topics and each topic a mixture of most of the words, while smaller values mean a document may concentrate on just a few topics and a topic on just a few words; in the asymmetric case, larger values instead pull each document or topic towards the specific distribution defined by the base measure.
Ah well, keep at it!
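A quick toy check of the symmetric case (my own, not from the answer above): draw topic proportions from a Dirichlet with a small and a large concentration value and compare how spread out they are. This uses gtools::rdirichlet, but any Dirichlet sampler would do.

```r
# Toy check of the symmetric Dirichlet intuition:
# low concentration  -> proportions pile up on a few topics,
# high concentration -> proportions spread over most topics.
library(gtools)   # for rdirichlet()

K <- 10
set.seed(1)
theta_low  <- rdirichlet(1, rep(0.1, K))   # low alpha
theta_high <- rdirichlet(1, rep(10,  K))   # high alpha

round(theta_low, 3)    # mostly near zero, with one or two large entries
round(theta_high, 3)   # all entries close to 1/K
```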
There is a paper, LDA-Based Document Models for Ad-hoc Retrieval, that uses fairly large data sets, so its parameter settings seem worth borrowing: alpha = 50/K, beta = 0.01. The original text:
We use symmetric Dirichlet priors in the LDA estimation with α = 50/K and β = 0.01, which are common settings in the literature. Our experience shows that retrieval results are not very sensitive to the values of these parameters.
As for choosing K, the paper settled on it experimentally: on their collections K = 800 gave the best average precision, but even a more parsimonious setting of K = 400, averaging the results of 30 iterations after the Gibbs sampler had converged as the final probability estimates, still gave statistically significant improvements. The original text for reference:
Selecting the right number of topics is also an important problem in topic modeling. Nonparametric models like the Chinese Restaurant Process (Blei et al, 2004; Teh et al, 2004) are not practical to use for large data sets to automatically decide the number of topics. A range of 50 to 300 topics is typically used in the topic modeling literature. 50 topics are often used for small collections and 300 for relatively large collections, which are still much smaller than the IR collections we use. It is well known that larger data sets may need more topics in general, and it is confirmed here by our experiments with different values of K (100, 200, …) on the AP collection. K=800 gives the best average precision, as shown in Table 2. This number is much less than the corresponding optimal K value (2000) in the cluster model (Liu and Croft, 2004). As we explained in Section 3.3, in the cluster model, one document can be based on one topic, and in the LDA model, the mixture of topics for each document is more powerful and expressive; thus a smaller number of topics is used. Empirically, even with more parsimonious parameter settings like K=400, 30 iterations, 2 Markov chains, statistically significant improvements can also be achieved on most of the collections.
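The paper picks K via retrieval experiments, which I can't reproduce on my data. A rough sanity check I can run with the same lda package is to fit a few candidate K values and compare the log-likelihood at the end of sampling; this is only a heuristic, not the paper's method, and it reuses the `corpus` object from the earlier sketch.

```r
# Rough heuristic for comparing candidate K values (not the paper's method):
# fit each K and look at the full log-likelihood at the last Gibbs iteration.
candidate_K <- c(50, 100, 200, 400)

loglik <- sapply(candidate_K, function(K) {
  fit <- lda.collapsed.gibbs.sampler(corpus$documents, K, corpus$vocab,
                                     num.iterations = 500,
                                     alpha = 50 / K, eta = 0.01,
                                     compute.log.likelihood = TRUE)
  tail(fit$log.likelihoods[1, ], 1)   # row 1: full log-likelihood per iteration
})

data.frame(K = candidate_K, log_likelihood = loglik)
```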
Finally, attached below are the sizes of the data sets used in the paper: