Learning a tiny bit of LDA

I've spent the last few days looking at the LDA topic model. Just reading the LDA数学八卦 PDF alone took me two days, and although a lot of it still went over my head, I did learn a few things. Before this, I had no idea how LDA could take nothing but a topic number K and the hyperparameters alpha and beta as input and hand you back the corresponding topics.

It turns out the key ingredient is the Gibbs sampling algorithm it uses. After K is specified, every word is first randomly assigned to one of the K topics. Then, using the formula from Gibbs sampling, the probability of reassigning each word i to each of topics 1~K is computed, and the assignments keep being resampled this way until the Gibbs sampler converges, i.e. the probability of each word belonging to each of the K topics essentially stops changing. At that point we have the LDA model, namely the word-topic co-occurrence matrix, and from it the document-topic distribution can also be computed.
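For reference, the formula used at each resampling step (the full conditional of the collapsed Gibbs sampler, written here with symmetric priors in the common notation, which may differ slightly from the symbols in LDA数学八卦) is:

$$ p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\; \left(n_{d,k}^{\neg i} + \alpha\right)\,\frac{n_{k,w_i}^{\neg i} + \beta}{n_{k}^{\neg i} + V\beta} $$

where $d$ is the document containing token $i$, $n_{d,k}^{\neg i}$ counts the tokens in document $d$ currently assigned to topic $k$, $n_{k,w_i}^{\neg i}$ counts how often word type $w_i$ is assigned to topic $k$, $n_k^{\neg i}$ is the total number of tokens assigned to topic $k$, $V$ is the vocabulary size, and the superscript $\neg i$ means token $i$ itself is excluded from the counts.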

As for where that probability formula for a word belonging to a topic comes from, LDA数学八卦 derives it from the very beginning all the way to the end. It really is an excellent document; even though I still don't fully get it, I at least now know the chain of the derivation: from the Gamma function to the Beta distribution to the Dirichlet distribution and finally to the Gibbs sampling formula.
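To make the procedure above concrete, here is a minimal toy sketch of a collapsed Gibbs sampler in base R. It only illustrates the resampling loop and the formula above; it is not the implementation inside any package, and all names in it are my own:

```r
# docs: list of integer vectors, each entry a word id in 1..V
toy_lda_gibbs <- function(docs, V, K, alpha, beta, n_iter = 200) {
  D   <- length(docs)
  ndk <- matrix(0L, D, K)   # document-topic counts
  nkw <- matrix(0L, K, V)   # topic-word counts
  nk  <- integer(K)         # tokens per topic
  # random initial topic assignment for every token
  z <- lapply(docs, function(w) sample.int(K, length(w), replace = TRUE))
  for (d in seq_len(D)) for (i in seq_along(docs[[d]])) {
    k <- z[[d]][i]; w <- docs[[d]][i]
    ndk[d, k] <- ndk[d, k] + 1L; nkw[k, w] <- nkw[k, w] + 1L; nk[k] <- nk[k] + 1L
  }
  for (iter in seq_len(n_iter)) {
    for (d in seq_len(D)) for (i in seq_along(docs[[d]])) {
      k <- z[[d]][i]; w <- docs[[d]][i]
      # take the current token out of all counts
      ndk[d, k] <- ndk[d, k] - 1L; nkw[k, w] <- nkw[k, w] - 1L; nk[k] <- nk[k] - 1L
      # full conditional p(z_i = k | rest), up to a constant
      p <- (ndk[d, ] + alpha) * (nkw[, w] + beta) / (nk + V * beta)
      k <- sample.int(K, 1, prob = p)
      # put the token back under its (possibly new) topic
      z[[d]][i] <- k
      ndk[d, k] <- ndk[d, k] + 1L; nkw[k, w] <- nkw[k, w] + 1L; nk[k] <- nk[k] + 1L
    }
  }
  list(doc_topic = ndk, topic_word = nkw)
}

# tiny example: 3 documents over a vocabulary of 5 word ids
set.seed(1)
docs <- list(c(1L, 2L, 1L, 3L), c(3L, 4L, 5L, 4L), c(1L, 5L, 2L, 2L))
fit  <- toy_lda_gibbs(docs, V = 5, K = 2, alpha = 0.1, beta = 0.01)
fit$topic_word   # the word-topic count matrix mentioned above
```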

Right now I'm using the lda package in R, hoping to run some text data through it and look at its topic distribution. The key inputs we have to provide ourselves are three: the number of topics K, the alpha that controls the document-topic distribution, and the beta that controls the topic-word distribution (the package seems to call it eta).
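Here is a minimal sketch of how the call looks as far as I can tell, using the lexicalize and lda.collapsed.gibbs.sampler functions of the lda package; the toy corpus and the parameter values below are just placeholders for illustration, not recommendations:

```r
library(lda)

# toy corpus: each element is one document as a whitespace-separated string
raw_docs <- c("apple banana apple fruit",
              "stock market stock price",
              "banana price market fruit")

corpus <- lexicalize(raw_docs)   # builds the documents structure and the vocab

K <- 2                           # number of topics, chosen by us
fit <- lda.collapsed.gibbs.sampler(documents      = corpus$documents,
                                   K              = K,
                                   vocab          = corpus$vocab,
                                   num.iterations = 500,
                                   alpha          = 50 / K,  # document-topic prior
                                   eta            = 0.01)    # topic-word prior (the package's name for beta)

top.topic.words(fit$topics, num.words = 3, by.score = TRUE)  # top words per topic
fit$document_sums   # K x D matrix of topic counts per document
```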

LDA数学八卦 is here: http://www.52nlp.cn/lda-math-%E6%B1%87%E6%80%BB-lda%E6%95%B0%E5%AD%A6%E5%85%AB%E5%8D%A6

That said, I really don't know how to set these three parameters well, or what their physical meaning is. Some sources suggest alpha = 50/K and beta = 0.01, 0.1, or 200/W, but my data is quite large, well beyond the usual scale, so those defaults don't feel directly applicable. To set alpha and beta sensibly, understanding their meaning is essential. I found a basic explanation online; see the excerpt below:

http://stats.stackexchange.com/questions/37405/natural-interpretation-for-lda-hyperparameters:

The answer depends on whether you are assuming the symmetric or asymmetric dirichlet distribution (or, more technically, whether the base measure is uniform). Unless something else is specified, most implementations of LDA assume the distribution is symmetric.

For the symmetric distribution, a high alpha-value means that each document is likely to contain a mixture of most of the topics, and not any single topic specifically. A low alpha value puts less such constraints on documents and means that it is more likely that a document may contain mixture of just a few, or even only one, of the topics. Likewise, a high beta-value means that each topic is likely to contain a mixture of most of the words, and not any word specifically, while a low value means that a topic may contain a mixture of just a few of the words.

If, on the other hand, the distribution is asymmetric, a high alpha-value means that a specific topic distribution (depending on the base measure) is more likely for each document. Similarly, high beta-values means each topic is more likely to contain a specific word mix defined by the base measure.

In practice, a high alpha-value will lead to documents being more similar in terms of what topics they contain. A high beta-value will similarly lead to topics being more similar in terms of what words they contain.

So, yes, the alpha-parameters specify prior beliefs about topic sparsity/uniformity in the documents. I'm not entirely sure what you mean by "mutual exclusiveness of topics in terms of words" though.

More generally, these are concentration parameters for the dirichlet distribution used in the LDA model. To gain some intuitive understanding of how this works, this presentation contains some nice illustrations, as well as a good explanation of LDA in general.

An additional comment I'll put here, since I can't comment on your original question: From what I've seen, the alpha- and beta-parameters can somewhat confusingly refer to several different parameterizations. The underlying dirichlet distribution is usually parameterized with the vector (α1, α2, ..., αK), but this can be decomposed into the base measure u = (u1, u2, ..., uK) and the concentration parameter α, such that α∗u = (α1, α2, ..., αK). In the case where the alpha parameter is a scalar, it is usually meant the concentration parameter α, but it can also mean the values of (α1, α2, ..., αK), since these will be equal under the symmetrical dirichlet distribution. If it's a vector, it usually refers to (α1, α2, ..., αK). I'm not sure which parametrization is most common, but in my reply I assume you meant the alpha- and beta-values as the concentration parameters.

Roughly, the takeaway seems to be: in the symmetric case, the larger alpha and beta are, the more each document tends to contain a mixture of most of the topics and each topic a mixture of most of the words, while small values mean a document concentrates on just a few topics and a topic on just a few words; in the asymmetric case, larger values instead pull each document (or topic) more strongly toward the specific distribution defined by the base measure.
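To get a feel for the symmetric case, it helps to draw topic proportions from a symmetric Dirichlet (by normalizing Gamma samples) and compare a large and a small concentration value. This little experiment is my own addition, not part of the quoted answer:

```r
# one draw from a symmetric Dirichlet(alpha, ..., alpha) over K topics
rdirichlet_sym <- function(K, alpha) {
  g <- rgamma(K, shape = alpha)
  g / sum(g)
}

set.seed(42)
K <- 10
round(rdirichlet_sym(K, alpha = 50 / K), 2)  # large alpha: mass spread over most topics
round(rdirichlet_sym(K, alpha = 0.1), 2)     # small alpha: mass piled on a few topics
```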

Anyway, keep at it!

There is a paper, LDA-Based Document Models for Ad-hoc Retrieval, whose datasets are fairly large, so its parameter settings seem worth borrowing: alpha = 50/K, beta = 0.01. The original text:

We use symmetric Dirichlet priors in the LDA estimation with α = 50/K and β = 0.01, which are common settings in the literature. Our experience shows that retrieval results are not very sensitive to the values of these parameters.

As for the choice of K, the paper settles it experimentally: on the AP collection K = 800 gave the best average precision, and even a more parsimonious setting of K = 400 with 30 iterations and 2 Markov chains still achieved statistically significant improvements, with the results of the iterations after convergence averaged to get the final probability estimates. The relevant passage:

Selecting the right number of topics is also an important problem in topic modeling. Nonparametric models like the Chinese Restaurant Process (Blei et al, 2004; Teh et al, 2004) are not practical to use for large data sets to automatically decide the number of topics. A range of 50 to 300 topics is typically used in the topic modeling literature. 50 topics are often used for small collections and 300 for relatively large collections, which are still much smaller than the IR collections we use. It is well known that larger data sets may need more topics in general, and it is confirmed here by our experiments with different values of K (100, 200, …) on the AP collection. K=800 gives the best average precision, as shown in Table 2. This number is much less than the corresponding optimal K value (2000) in the cluster model (Liu and Croft, 2004). As we explained in Section 3.3, in the cluster model, one document can be based on one topic, and in the LDA model, the mixture of topics for each document is more powerful and expressive; thus a smaller number of topics is used. Empirically, even with more parsimonious parameter settings like K=400, 30 iterations, 2 Markov chains, statistically significant improvements can also be achieved on most of the collections.
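If I read the lda package documentation correctly, something close to "average the iterations after convergence" can be done within a single chain through its burnin argument, which makes the sampler additionally return counts aggregated over the post-burn-in iterations. A hedged sketch that simply mirrors the paper's numbers (I have not checked that these settings are sensible for my own data):

```r
library(lda)
# corpus: a lexicalize() result for your own collection, as in the earlier sketch
K <- 400
fit <- lda.collapsed.gibbs.sampler(documents      = corpus$documents,
                                   K              = K,
                                   vocab          = corpus$vocab,
                                   num.iterations = 230,   # e.g. 200 burn-in + 30 kept
                                   burnin         = 200,
                                   alpha          = 50 / K,
                                   eta            = 0.01)
# document_expects aggregates each document's topic counts over the iterations
# after burn-in -- roughly the "average over the last 30 iterations" idea
doc_topic_props <- t(fit$document_expects) / colSums(fit$document_expects)
```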

Finally, for reference, the paper also lists the sizes of the collections it uses.

