Lecture 5

Class Project requirements

  • system architecture
    -get the data (through user interaction) and use it creatively
  • features
    -a challenging language, not English
    -or anything else that requires tagging or other deeper NLP techniques

There should be something new: getting data with the system, working with other languages, applying deeper NLP techniques, or anything else not touched on in class.

Assignments

  • See the Sakai wiki and the assignments; they're different.
  • Due in two weeks

Topic modeling

  • Latent Dirichlet Allocation (LDA)
    -documents --> topics: each document is a distribution over topics, i.e. a mixture of topics (a latent variable, an abstract learned representation)
    -topics --> words: each topic is a distribution over words
    -topics: the link between documents and words

  • These models are expensive to implement.

  • Algorithm
    -randomly assign words to topics
    -which topics are associated with which documents? (only the words in documents are observed; the topics are not)
    -optimization: variational inference or MCMC (Gibbs sampling); a toy sampler sketch follows
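
A minimal collapsed Gibbs sampler for LDA, to make the "randomly assign, then resample" loop concrete. This is a sketch under my own choices (function name, hyperparameters alpha/beta), not code from the lecture:

```python
import numpy as np

def lda_gibbs(docs, V, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. docs: lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))                 # document-topic counts
    nkw = np.zeros((K, V))                 # topic-word counts
    nk = np.zeros(K)                       # total words per topic
    z = [rng.integers(K, size=len(d)) for d in docs]   # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1   # remove word
                # full conditional over topics for this word
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())             # resample
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw   # document-topic and topic-word counts
```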

  • Python package
    -sklearn.decomposition.LatentDirichletAllocation (a usage sketch follows)
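
A minimal usage sketch; the toy corpus and hyperparameters are mine, not from the lecture. Note that doc_topics is the document distribution over topics and lda.components_ holds the (unnormalized) topic distributions over words, matching the two levels above:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell on the market",
    "investors sold shares in the market",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)    # LDA expects raw term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)     # documents --> topics
print(doc_topics.round(2))

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):    # topics --> words
    top = topic.argsort()[-5:][::-1]
    print(f"topic {k}:", [vocab[i] for i in top])
```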

  • Resources
    -a LatentDirichletAllocation tutorial
    -variational inference / MCMC
    As usual, he didn't go through the details.

Tagging as classification

  • How do the START and END tokens affect the model, and when should we use them?

  • Looking for patterns that generalize: the identity of the word at a given position, the shape of the word (e.g. suffixes like -tion, -al), and distributional features

  • Data processing mentioned in class:
    -Given tokenized data, e.g. 3 sentences:
    [1, 2, 3, 4]
    [5, 6, 7, 8]
    [9, 10, 11, 12, 13]
    -Add START and END tokens and wrap each target word with context words, window length = 5:
    [S, S, 1, 2, 3]
    [S, 1, 2, 3, 4]
    [1, 2, 3, 4, 5]
    ...
    [11, 12, 13, E, E]
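
A small sketch of this windowing step. As written in the notes the windows run across sentence boundaries (e.g. [1, 2, 3, 4, 5] mixes sentences 1 and 2), so the sketch concatenates the sentences and pads only the ends; the helper name and token strings are illustrative:

```python
def make_windows(sentences, size=5, start="S", end="E"):
    """Pad the token stream with START/END tokens and emit one
    size-length window per target word (the window centre)."""
    pad = size // 2
    stream = [start] * pad + [w for s in sentences for w in s] + [end] * pad
    n_words = sum(len(s) for s in sentences)
    return [stream[i:i + size] for i in range(n_words)]

sents = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12, 13]]
windows = make_windows(sents)
print(windows[0])    # ['S', 'S', 1, 2, 3]
print(windows[-1])   # [11, 12, 13, 'E', 'E']
```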

  • Features are defined as rules over word patterns, e.g. feature_1 = [previous word = 'the']

  • Apply a linear classifier to the encoded feature matrix.

  • Optimization: SGD
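
Putting the last three bullets together: a sketch that encodes feature-template dicts as one-hot vectors and trains a linear classifier by SGD. The toy sentences, tagset, and exact templates are my own illustration:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy tagged sentences (made up for illustration).
sents = [
    (["they", "refuse", "to", "permit"], ["PRON", "VERB", "PRT", "VERB"]),
    (["the", "permit", "expired"], ["DET", "NOUN", "VERB"]),
]

def word_features(words, i):
    """Feature templates in the spirit of feature_1 = [previous word = 'the']."""
    return {
        "word": words[i],
        "prev": words[i - 1] if i > 0 else "START",
        "next": words[i + 1] if i < len(words) - 1 else "END",
        "suffix3": words[i][-3:],            # word shape, e.g. -ion, -al
    }

X = [word_features(w, i) for w, _ in sents for i in range(len(w))]
y = [t for _, tags in sents for t in tags]

# DictVectorizer one-hot encodes the features; SGDClassifier is a
# linear model optimized with stochastic gradient descent.
model = make_pipeline(DictVectorizer(),
                      SGDClassifier(loss="log_loss", random_state=0))
model.fit(X, y)
print(model.predict([word_features(["the", "permit"], 1)]))
```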

Combining search and learning

  • Graphical Model for Sequence Tagging
    We have a tag for each word in a sequence:
    e.g.
    -PRON VERB PRT VERB --> transition probabilities P(T_i | T_{i-1}) -- (1)
    -They refuse to permit --> emission probabilities P(W_i | T_i) -- (2)
    -There are strong independence assumptions among words: given its tag, each word is independent of the surrounding words, i.e. the previous word doesn't help. (This assumption is not realistic.)
    -How to get these probabilities: counting in a tagged corpus (see the sketch below).
    -For unseen sequences, we have to smooth the model.
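
A minimal sketch of estimating (1) and (2) by counting, with add-one smoothing so unseen transitions and words still get probability mass. The smoothing method and the constants are my own choices; the lecture only says the model must be smoothed:

```python
from collections import Counter

tagged = [  # toy tagged corpus
    [("they", "PRON"), ("refuse", "VERB"), ("to", "PRT"), ("permit", "VERB")],
]
TAGSET = {"PRON", "VERB", "PRT", "END"}
VOCAB_SIZE = 1000   # assumed vocabulary size for smoothing

trans, emit, tag_count = Counter(), Counter(), Counter()
for sent in tagged:
    prev = "START"
    for word, tag in sent:
        trans[(prev, tag)] += 1      # counts for P(T_i | T_{i-1})
        emit[(tag, word)] += 1       # counts for P(W_i | T_i)
        tag_count[tag] += 1
        prev = tag
    trans[(prev, "END")] += 1
tag_count["START"] = len(tagged)

def p_trans(prev, tag):   # add-one smoothed transition probability
    return (trans[(prev, tag)] + 1) / (tag_count[prev] + len(TAGSET))

def p_emit(tag, word):    # add-one smoothed emission probability
    return (emit[(tag, word)] + 1) / (tag_count[tag] + VOCAB_SIZE)
```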

  • Algorithm
    -Add START and END tokens to the sentence.
    -Find the probability of the tag sequence that might have generated the word sequence (a conditional probability).
    -Maximum likelihood (ML) of the tag sequence: maximize P(C | W, C_0, C_{n+1}), where C_0 = START, C_{n+1} = END, W is the word sequence, and C is the tag sequence.
    e.g. Following the example above: compute the joint P(START PRON VERB PRT VERB END, words) using equations (1) and (2), i.e. prod_i P(T_i | T_{i-1}) P(W_i | T_i). By Bayes' rule, maximizing P(C | W) over tag sequences is the same as maximizing this joint, since P(W) is fixed; the tag-sequence prior alone is P(START PRON VERB PRT VERB END) = prod_i P(T_i | T_{i-1}).
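
Continuing the counting sketch above, scoring one candidate tag sequence with the joint of (1) and (2). Picking the best sequence then means searching over candidates (brute force conceptually; Viterbi or beam search in practice):

```python
def score(words, tags):
    """Joint probability: prod_i P(T_i | T_{i-1}) * P(W_i | T_i),
    including the START and END transitions."""
    p, prev = 1.0, "START"
    for w, t in zip(words, tags):
        p *= p_trans(prev, t) * p_emit(t, w)
        prev = t
    return p * p_trans(prev, "END")

print(score(["they", "refuse", "to", "permit"],
            ["PRON", "VERB", "PRT", "VERB"]))
```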

  • Thinking:
    Find all the tag sequences in a document, count them, and order them from high to low frequency. What are the differences? Is there the same overfitting problem as with maximum likelihood? Does beam search help?

  • Conclusion:
    There is no learning in this model, but the probabilities extracted from the corpus can be interpreted as weights: fundamentally, we need scores in order to search (maximum likelihood). But there are no features, so there is no learning.

  • Improvement?
    Define features.
