Lecture 5

Class Project requirements

  • system architecture
    -get the data (through user interaction) and use it creatively
  • features
    -a challenging language, not English
    -or anything else that requires tagging or other deeper NLP techniques

There should be something new: getting data with the system, working with other languages, applying deeper NLP techniques, or anything else not touched on in class.

Assignments

  • See the Sakai wiki and the assignments; they're different.
  • Due in two weeks

Topic modeling

  • Latent Dirichlet Allocation (LDA)
    -documents --> topics: each document is a distribution over topics, i.e. a mixture of topics (a latent variable, an abstract learned representation)
    -topics --> words: each topic is a distribution over words
    -topics: the link between documents and words

  • These models are expensive to implement.

  • Algorithm
    -randomly assign words to topics
    -which topics are associated with which documents? (only the words in documents are observed; the topics are not)
    -optimization: variational inference or MCMC (Gibbs sampling); a toy sampler sketch follows
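
A minimal collapsed Gibbs sampler for LDA, to make the "randomly assign, then resample" loop concrete. This is a sketch under my own choices (function name, hyperparameters alpha/beta), not code from the lecture:

```python
import numpy as np

def lda_gibbs(docs, V, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. docs: lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))                 # document-topic counts
    nkw = np.zeros((K, V))                 # topic-word counts
    nk = np.zeros(K)                       # total words per topic
    z = [rng.integers(K, size=len(d)) for d in docs]   # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1   # remove word
                # full conditional over topics for this word
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())             # resample
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw   # document-topic and topic-word counts
```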

  • Python package
    -sklearn.decomposition.LatentDirichletAllocation (a usage sketch follows)
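
A minimal usage sketch; the toy corpus and hyperparameters are mine, not from the lecture. Note that doc_topics is the document distribution over topics and lda.components_ holds the (unnormalized) topic distributions over words, matching the two levels above:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on the market",
    "investors sold shares in the market",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)    # LDA expects raw term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)     # documents --> topics
print(doc_topics.round(2))

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):    # topics --> words
    top = topic.argsort()[-5:][::-1]
    print(f"topic {k}:", [vocab[i] for i in top])
```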

  • Resources
    -a LatentDirichletAllocation tutorial
    -variational inference / MCMC
    As usual, he didn't go through the details.

Tagging as classification

  • How do the START and END tokens affect the model, and when should we use them?

  • Looking for patterns that generalize: the identity of the word at a given position, the shape of the word (e.g. suffixes like -tion, -al), and distributional features

  • Data processing mentioned in class:
    -Given tokenized data, e.g. 3 sentences:
    [1, 2, 3, 4]
    [5, 6, 7, 8]
    [9, 10, 11, 12, 13]
    -Add START and END tokens and wrap each target word with context words, window length = 5:
    [S, S, 1, 2, 3]
    [S, 1, 2, 3, 4]
    [1, 2, 3, 4, 5]
    ...
    [11, 12, 13, E, E]
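
A small sketch of this windowing step. As written in the notes the windows run across sentence boundaries (e.g. [1, 2, 3, 4, 5] mixes sentences 1 and 2), so the sketch concatenates the sentences and pads only the ends; the helper name and token strings are illustrative:

```python
def make_windows(sentences, size=5, start="S", end="E"):
    """Pad the token stream with START/END tokens and emit one
    size-length window per target word (the window centre)."""
    pad = size // 2
    stream = [start] * pad + [w for s in sentences for w in s] + [end] * pad
    n_words = sum(len(s) for s in sentences)
    return [stream[i:i + size] for i in range(n_words)]

sents = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12, 13]]
windows = make_windows(sents)
print(windows[0])    # ['S', 'S', 1, 2, 3]
print(windows[-1])   # [11, 12, 13, 'E', 'E']
```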

  • Features are defined as rules over word patterns, e.g. feature_1 = [previous word = 'the']

  • Apply a linear classifier to the encoded feature matrix.

  • Optimization: SGD
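
Putting the last three bullets together: a sketch that encodes feature-template dicts as one-hot vectors and trains a linear classifier by SGD. The toy sentences, tagset, and exact templates are my own illustration:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy tagged sentences (made up for illustration).
sents = [
    (["they", "refuse", "to", "permit"], ["PRON", "VERB", "PRT", "VERB"]),
    (["the", "permit", "expired"], ["DET", "NOUN", "VERB"]),
]

def word_features(words, i):
    """Feature templates in the spirit of feature_1 = [previous word = 'the']."""
    return {
        "word": words[i],
        "prev": words[i - 1] if i > 0 else "START",
        "next": words[i + 1] if i < len(words) - 1 else "END",
        "suffix3": words[i][-3:],            # word shape, e.g. -ion, -al
    }

X = [word_features(w, i) for w, _ in sents for i in range(len(w))]
y = [t for _, tags in sents for t in tags]

# DictVectorizer one-hot encodes the features; SGDClassifier is a
# linear model optimized with stochastic gradient descent.
model = make_pipeline(DictVectorizer(),
                      SGDClassifier(loss="log_loss", random_state=0))
model.fit(X, y)
print(model.predict([word_features(["the", "permit"], 1)]))
```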

Combining search and learning

  • Graphical Model for Sequence Tagging
    We have a tag for each word in a sequence:
    e.g.
    -PRON VERB PRT VERB --> transition probabilities P(T_i | T_{i-1}) -- (1)
    -They refuse to permit --> emission probabilities P(W_i | T_i) -- (2)
    -There are strong independence assumptions among words: given its tag, each word is independent of the surrounding words, i.e. the previous word doesn't help. (This assumption is not realistic.)
    -How to get these probabilities: counting in a tagged corpus (see the sketch below).
    -For unseen sequences, we have to smooth the model.
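
A minimal sketch of estimating (1) and (2) by counting, with add-one smoothing so unseen transitions and words still get probability mass. The smoothing method and the constants are my own choices; the lecture only says the model must be smoothed:

```python
from collections import Counter

tagged = [  # toy tagged corpus
    [("they", "PRON"), ("refuse", "VERB"), ("to", "PRT"), ("permit", "VERB")],
]
TAGSET = {"PRON", "VERB", "PRT", "END"}
VOCAB_SIZE = 1000   # assumed vocabulary size for smoothing

trans, emit, tag_count = Counter(), Counter(), Counter()
for sent in tagged:
    prev = "START"
    for word, tag in sent:
        trans[(prev, tag)] += 1      # counts for P(T_i | T_{i-1})
        emit[(tag, word)] += 1       # counts for P(W_i | T_i)
        tag_count[tag] += 1
        prev = tag
    trans[(prev, "END")] += 1
tag_count["START"] = len(tagged)

def p_trans(prev, tag):   # add-one smoothed transition probability
    return (trans[(prev, tag)] + 1) / (tag_count[prev] + len(TAGSET))

def p_emit(tag, word):    # add-one smoothed emission probability
    return (emit[(tag, word)] + 1) / (tag_count[tag] + VOCAB_SIZE)
```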

  • Algorithm
    -Add START and END tokens to the sentence.
    -Find the probability of the tag sequence that might have generated the word sequence (a conditional probability).
    -Maximum likelihood (ML) of the tag sequence: maximize P(C | W, C_0, C_{n+1}), where C_0 = START, C_{n+1} = END, W is the word sequence, and C is the tag sequence.
    e.g. Following the example above: compute the joint P(START PRON VERB PRT VERB END, words) using equations (1) and (2), i.e. prod_i P(T_i | T_{i-1}) P(W_i | T_i). By Bayes' rule, maximizing P(C | W) over tag sequences is the same as maximizing this joint, since P(W) is fixed; the tag-sequence prior alone is P(START PRON VERB PRT VERB END) = prod_i P(T_i | T_{i-1}).
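
Continuing the counting sketch above, scoring one candidate tag sequence with the joint of (1) and (2). Picking the best sequence then means searching over candidates (brute force conceptually; Viterbi or beam search in practice):

```python
def score(words, tags):
    """Joint probability: prod_i P(T_i | T_{i-1}) * P(W_i | T_i),
    including the START and END transitions."""
    p, prev = 1.0, "START"
    for w, t in zip(words, tags):
        p *= p_trans(prev, t) * p_emit(t, w)
        prev = t
    return p * p_trans(prev, "END")

print(score(["they", "refuse", "to", "permit"],
            ["PRON", "VERB", "PRT", "VERB"]))
```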

  • Thinking:
    Find all the tag sequences in a document, count them, and order them from high to low frequency. What are the differences? Is there the same overfitting problem as with maximum likelihood? Does beam search help?

  • Conclusion:
    There is no learning in this model, but the probabilities extracted from the corpus can be interpreted as weights: fundamentally, we need scores in order to search (maximum likelihood). But there are no features, so there is no learning.

  • Improvement?
    Define features.
