Text Mining

When modeling text corpora and other collections of discrete data, the goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments.

There are several methods for finding such short descriptions, including tf-idf, latent semantic indexing, and latent Dirichlet allocation.

TF-IDF

What is tf-idf?

TF-IDF is a numerical statistic that is intended to reflect the importance of a word to a document in a corpus. The tf-idf value increases proportionally to the number of times a word appears in a document but is offset by the number of documents in the corpus that contain the word. This offset is called normalization.

Several statistics related to TF-IDF

  • tf refers to term frequency, which measures the frequency of a word in a document.

\text{Term Frequency (TF)} = n = \text{the number of times a given term appears in a document}

  • idf refers to inverse document frequency, which measures how much information a word provides; it is the logarithm of the total number of documents divided by the number of documents containing the term
    \text{Inverse Document Frequency} = \log{\frac{N}{count(d\in D:t \in d)}}
    where N is the total number of documents in the corpus D.
  • TF-IDF (a short worked sketch follows this list)
    \text{TF-IDF} = \text{TF} \times \text{IDF} = n \times \log{\frac{N}{count(d\in D:t \in d)}}
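To make these formulas concrete, below is a minimal sketch that computes tf, idf, and tf-idf by hand for a tiny made-up corpus; the documents and terms are hypothetical and chosen only for illustration.

import math

# hypothetical toy corpus: each document is a list of tokens
docs = [
    ["the", "doctor", "was", "on", "time"],
    ["the", "doctor", "kept", "me", "waiting"],
    ["great", "doctor", "very", "friendly"],
]

def tf(term, doc):
    # term frequency: raw count of the term in the document
    return doc.count(term)

def idf(term, docs):
    # inverse document frequency: log(N / number of documents containing the term)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "doctor" appears in every document, so its idf (and hence its tf-idf) is 0
print(tf_idf("doctor", docs[0], docs))   # 0.0
# "time" appears in only one document, so it keeps a higher weight there
print(tf_idf("time", docs[0], docs))     # 1 * log(3/1) ≈ 1.10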

Why we need to normalize term frequency by inverse document frequency

Some uninformative words, such as "the", "a", and "an", tend to appear frequently in general, and normalization tones down the importance of these words. For example, if we aim to find words related to the punctuality of doctors based on a collection of reviews, words such as "doctor" and "dr" appear frequently across the entire collection and can be regarded as uninformative. Their high frequency does not mean they are strongly related to doctors being punctual. Therefore, we normalize term frequency by multiplying it by the inverse document frequency, so that less common words such as "time" and "hour" can gain high weights.
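As a hypothetical illustration of this effect, suppose the corpus contains N = 1000 reviews, the word "doctor" appears in 950 of them, and the word "time" appears in only 50 of them:

\text{idf}(\text{doctor}) = \log{\frac{1000}{950}} \approx 0.05 \qquad \text{idf}(\text{time}) = \log{\frac{1000}{50}} \approx 3.0

Even if "doctor" occurs more often within a single review, multiplying by such a small idf pulls its tf-idf weight toward zero, while "time" retains a high weight.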

When to use tf-idf?

The TF-IDF scheme reduces documents of arbitrary length to fixed-length lists of numbers. Therefore, it is usually applied to the problem of modeling text corpora, finding short descriptions of the members of a corpus that enable efficient processing of large collections while preserving the essential statistical relationships.

For example, it can be applied in text mining and information retrieval (search engines). If we try to analyze the sentiment of users' tweets, we can use tf-idf to build a vocabulary of informative words associated with a particular type of sentiment instead of searching for keywords one by one.

What are the advantages and disadvantages?

  • The approach can efficiently produce sets of words that are discriminative for documents in the corpus.

  • However, it provides only a relatively small amount of reduction in description length and reveals little in the way of inter- or intra-document statistical structure.

Example of how to code TF-IDF in python

Term Frequency by CountVectorizer

# create a collection of documents
corpus = ["you were born with potential",
          "you were born with goodness and trust",
          "you were born with ideals and dreams",
          "you were born with greatness",
          "you were born with wings",
          "you are not meant for crawling, so don't",
          "you have wings",
          "learn to use them and fly"]

# compute the term frequency with CountVectorizer
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
vectorized = vect.fit_transform(corpus)          # sparse document-term count matrix
feature_names = vect.get_feature_names_out()     # vocabulary learned from the corpus
tf_df = pd.DataFrame(vectorized.toarray(), columns=feature_names)
tf_df

[Output: the document-term frequency matrix, one row per document and one column per word]

Interpretation

We can see that words such as "you", "with", and "were" occur more frequently in general than other words, but they are uninformative words that cannot distinguish one document from another. So the importance of these words needs to be reduced so that informative words such as "potential", "trust", and "dreams" can gain higher weights.

Term Frequency - Inverse Document Frequency

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)                  # sparse tf-idf matrix
feature_names = vectorizer.get_feature_names_out()
tf_idf_df = pd.DataFrame(X.toarray(), columns=feature_names)
tf_idf_df
[Output: the tf-idf matrix for the same corpus]

Interpretation

  • The tf-idf value for potential in the first row is 0.682895, meaning that potential is a highly important word for the first document and can distinguish it from the other documents. We can also see the difference by looking at its value for the other documents, which is zero everywhere else.
  • trust has the highest tf-idf value in the second document, at 0.522, meaning that trust is a distinctive word for it. (A short sketch after this list shows how to extract the top-weighted term for each document programmatically.)
  • ...
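As a small follow-up sketch (assuming tf_idf_df from the block above is still in scope), pandas can pull out the highest-weighted term for each document directly, which should reproduce the observations above:

# for each document (row), find the term with the highest tf-idf weight
top_terms = tf_idf_df.idxmax(axis=1)
top_weights = tf_idf_df.max(axis=1)
for doc_id, (term, weight) in enumerate(zip(top_terms, top_weights)):
    print(f"document {doc_id}: '{term}' ({weight:.3f})")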

Latent Semantic Indexing

Given the shortcomings of tf-idf, a dimensionality reduction technique, Latent Semantic Indexing (LSI), uses a singular value decomposition of the X matrix to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection. This approach can achieve significant compression in large collections. However, it is not clear why one should adopt the LSI methodology; one can attempt to proceed more directly, fitting a model to the data using maximum likelihood or Bayesian methods.
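As a rough sketch of this idea (not necessarily the exact formulation used in the original LSI work), scikit-learn's TruncatedSVD can be applied to the tf-idf matrix X computed earlier; the number of latent dimensions below is an arbitrary choice for illustration.

from sklearn.decomposition import TruncatedSVD

# project the (documents x terms) tf-idf matrix onto a low-dimensional latent subspace
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsi = svd.fit_transform(X)             # documents x 2 latent dimensions
print(X_lsi.shape)
print(svd.explained_variance_ratio_)     # variance captured by each latent dimension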

An alternative to LSI is pLSI, which models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of topics. Thus each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a list of mixing proportions for these mixture components and is thereby reduced to a probability distribution on a fixed set of topics. This distribution is the "reduced description" associated with the document. However, the approach does not provide a probabilistic model at the level of documents.
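In pLSI, the probability of seeing word w in document d is written as a mixture over latent topics z:

P(w \mid d) = \sum_{z} P(w \mid z)\,P(z \mid d)

where the P(w | z) are the topic-specific word distributions and the P(z | d) are the per-document mixing proportions described above.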

LSI and pLSI are based on the "bag-of-words" assumption, in which the order of words in a document can be neglected. In the language of probability theory, this is an assumption of exchangeability for the words in a document. De Finetti (1990) established that any collection of exchangeable random variables has a representation as a mixture distribution. Thus, we want to build a mixture model that can capture the exchangeability of both words and documents. This line of thinking leads to Latent Dirichlet Allocation.

Latent Dirichlet Allocation

What is Latent Dirichlet Allocation?

LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. — Wikipedia

Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.

So we can state LDA's definition as follows: LDA is a generative statistical model used for dimensionality reduction. Its basic idea is that each document can be represented as a mixture over hidden topics, and each topic is characterized by a distribution over words.
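In generative terms, following the formulation of Blei, Ng, and Jordan, LDA assumes each document in the corpus is produced roughly as follows:

  • Draw topic proportions θ ~ Dirichlet(α) for the document.
  • For each word position in the document, draw a topic z ~ Multinomial(θ).
  • Draw the word w from the chosen topic's word distribution, w ~ Multinomial(β_z).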

Assumptions of LDA

  • Each document is treated as a bag of words, which means the order of words can be neglected.
  • Doc-topic distribution: each document can be explained by unobserved topics
  • Topic-word distribution: each topic is represented as a distribution over words

When doing topic modeling, the information available to us includes:

  • The text collection or corpus
  • Number of topics

The information not available to us, which must be estimated from the data (a code sketch follows these lists), includes:

  • The actual topics
  • Topic distribution for each document
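A minimal sketch of fitting LDA with scikit-learn, reusing the count matrix vectorized and the vocabulary feature_names from the CountVectorizer section above; the number of topics is an arbitrary choice made only for illustration.

from sklearn.decomposition import LatentDirichletAllocation

n_topics = 2  # the number of topics has to be supplied by us
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)

# doc-topic distribution: one row per document, one column per topic
doc_topic = lda.fit_transform(vectorized)
print(doc_topic.shape)

# topic-word distribution: lda.components_ holds one row of word weights per topic
for k, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {top_words}")

Here doc_topic corresponds to the doc-topic distribution and lda.components_ to the topic-word distribution listed in the assumptions above.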