When modeling text corpora and other collections of discrete data, the goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments.
There are several methods for finding such short descriptions, including tf-idf, latent semantic indexing, and latent Dirichlet allocation.
TF-IDF
What is tf-idf?
TF-IDF is a numerical statistic that is intended to reflect the importance of a word to a document in a corpus. The tf-idf value increases in proportion to the number of times a word appears in a document but is offset by the number of documents in the corpus that contain the word. This offset is called normalization.
Several statistics related to TF-IDF
- tf refers to term frequency, which measures the frequency of a word in a document.
- idf refers to inverse document frequency, which measures how rare a term is across the corpus: the inverse of the document frequency (the number of documents containing the term), usually taken on a log scale
- tf-idf is the product of the two: term frequency multiplied by inverse document frequency (a minimal sketch of the computation follows below)
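To make these definitions concrete, here is a minimal sketch of the raw computation (the exact weighting and smoothing vary between implementations; this follows the common tf × log(N/df) form, and the two toy documents are made up for illustration).
import math

def tf(term, document):
    # term frequency: number of times the term occurs in the (tokenized) document
    return document.count(term)

def idf(term, documents):
    # inverse document frequency: log of (total documents / documents containing the term)
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df)

def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)

docs = [["you", "were", "born", "with", "potential"],
        ["you", "were", "born", "with", "wings"]]
print(tf_idf("potential", docs[0], docs))   # rare word -> positive weight
print(tf_idf("you", docs[0], docs))         # word in every document -> weight 0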
Why do we need to normalize term frequency by inverse document frequency?
Some uninformative words, such as "the", "a", and "an", tend to appear frequently in general, and normalization tones down the importance of these words. For example, if we aim to find words related to the punctuality of doctors based on a collection of reviews, words such as "doctor" and "dr" can be regarded as uninformative words that appear frequently across the entire collection. But high frequency does not necessarily mean they are related to doctors being punctual. Therefore, we normalize these words by multiplying by the inverse document frequency so that less-common words such as "time" and "hour" can gain high weights.
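As a quick, hypothetical illustration of the doctor-review example (the reviews below are made up), the idf_ attribute of a fitted scikit-learn TfidfVectorizer shows that "doctor", which appears in every review, receives the lowest idf weight, while rarer words such as "time" and "hour" receive higher weights.
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["the doctor was on time",
           "the doctor kept me waiting for an hour",
           "great doctor, very friendly",
           "the doctor answered all my questions"]
review_vec = TfidfVectorizer().fit(reviews)
idf_weights = dict(zip(review_vec.get_feature_names_out(), review_vec.idf_))
print(idf_weights["doctor"], idf_weights["time"], idf_weights["hour"])
# "doctor" gets the lowest idf because it appears in every review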
When to use tf-idf?
The tf-idf scheme can reduce documents of arbitrary length to fixed-length lists of numbers. Therefore, it is usually applied to the problem of modeling text corpora, trying to find short descriptions of the members of a corpus that enable efficient processing of large collections while preserving the essential statistical relationships.
For example, it can be applied in text mining and information retrieval (search engines). If we try to analyze the sentiment of users' tweets, we can use tf-idf to build a vocabulary of informative words important to a particular type of sentiment instead of searching for keywords one by one.
What are the advantages and disadvantages?
The approach can efficiently produce sets of words that are discriminative for documents in the corpus.
However, it provides only a relatively small amount of reduction in description length and reveals little in the way of inter- or intra-document statistical structure.
Example of how to code TF-IDF in python
Term Frequency by CountVectorizer
# create a collection of documents
corpus = ["you were born with potential",
"you were born with goodness and trust",
"you were born with ideals and dreams",
"you were born with greatness",
"you were born with wings",
"you are not meant for crawling, so don't",
"you have wings",
"learn to use them and fly"
]
# compute the term frequency with CountVectorizer
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
vectorized = vect.fit_transform(corpus)        # sparse document-term count matrix
feature_names = vect.get_feature_names_out()   # vocabulary learned from the corpus
tf_df = pd.DataFrame(vectorized.toarray(), columns=feature_names)
tf_df
Interpretation
We can see that words such as "you", "with", and "were" occur more frequently in general than other words, but they are uninformative words that cannot distinguish one document from another. So the importance of these words needs to be reduced so that informative words such as "potential", "trust", and "dreams" can gain a high weight of importance.
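Assuming the tf_df DataFrame from above, one quick way to see this is to sum the counts of each word across documents:
# total count of each word across the whole corpus; "you", "were", "with",
# and "born" dominate even though they carry little meaning
tf_df.sum().sort_values(ascending=False).head(6)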
Term Frequency - Inverse Document Frequency
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)                 # tf-idf weighted document-term matrix
feature_names = vectorizer.get_feature_names_out()
tf_idf_df = pd.DataFrame(X.toarray(), columns=feature_names)
tf_idf_df
Interpretation
- The tf-idf value for the word potential in the first row is 0.682895, meaning that potential is highly important to the first document and helps distinguish it from the other documents. We can also see this by looking at its value for the other documents, which is zero everywhere else.
- trust has the highest tf-idf value in the second document, at 0.522. This means that trust is a distinctive word for it (see the sketch below for a quick way to rank the words within each document).
- ...
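Assuming the tf_idf_df DataFrame from above, a quick way to rank the words within each document is to sort each row's tf-idf values:
# top-weighted words for the first two documents
print(tf_idf_df.iloc[0].sort_values(ascending=False).head(3))   # "potential" ranks first
print(tf_idf_df.iloc[1].sort_values(ascending=False).head(3))   # "trust" and "goodness" rank highest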
Latent Semantic Indexing
Given the shortcomings of tf-idf, a dimensionality reduction technique, latent semantic indexing (LSI), uses a singular value decomposition of the term-document matrix X to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection. This approach can achieve significant compression in large collections. However, it is not clear why one should adopt the LSI methodology; one can attempt to proceed more directly, fitting the model to the data using maximum likelihood or Bayesian methods.
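Here is a minimal sketch of LSI using scikit-learn's TruncatedSVD on the tf-idf matrix X computed above; the number of latent dimensions (2) is an arbitrary choice for illustration.
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2, random_state=0)
doc_embeddings = svd.fit_transform(X)     # each document reduced to 2 latent dimensions
print(doc_embeddings.shape)               # (8, 2)
print(svd.explained_variance_ratio_)      # variance captured by each latent dimension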
An alternative approach to LSI is probabilistic latent semantic indexing (PLSI), which models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of topics. Thus each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a list of mixing proportions for these mixture components and is thereby reduced to a probability distribution on a fixed set of topics. This distribution is the "reduced description" associated with the document. However, the approach does not provide a probabilistic model at the level of documents.
LSI and PLSI are based on the "bag-of-words" assumption, in which the order of words in a document can be neglected. In the language of probability theory, this is an assumption of exchangeability for the words in a document. De Finetti (1990) establishes that any collection of exchangeable random variables has a representation as a mixture distribution. Thus, we want to build a mixture model that can capture the exchangeability of both words and documents. This line of thinking leads to latent Dirichlet allocation.
Latent Dirichlet Allocation
What is Latent Dirichlet Allocation?
LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. — Wikipedia
Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.
So we can summarize LDA in one statement: LDA is a generative statistical model used for dimensionality reduction. Its basic idea is that each document can be represented as a mixture over hidden topics, and each topic is characterized by a distribution over words.
Assumptions of LDA
- Bag-of-words: each document is treated as a bag of words, which means the order of words can be neglected.
- Doc-topic distribution: each document can be explained by unobserved topics
- Topic-word distribution: each topic is represented as a distribution over words (a toy sketch of this generative process follows below)
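To make these assumptions concrete, here is a toy sketch of LDA's generative story with a hypothetical vocabulary and hand-picked topic-word distributions (nothing here is fitted to data): draw a topic mixture for the document from a Dirichlet prior, then for each word draw a topic and then a word from that topic's distribution.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["potential", "dreams", "wings", "fly", "crawling", "learn"]
# hypothetical topic-word distributions (each row sums to 1)
topic_word = np.array([[0.40, 0.40, 0.10, 0.05, 0.03, 0.02],    # topic 0
                       [0.05, 0.05, 0.30, 0.30, 0.20, 0.10]])   # topic 1
alpha = [0.5, 0.5]                                               # Dirichlet prior over topics

def generate_document(n_words=8):
    theta = rng.dirichlet(alpha)                     # doc-topic distribution
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)          # pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])  # pick a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document()
print(theta)   # the document's hidden topic mixture
print(words)   # the generated bag of words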
When doing topic modeling, the information available to us includes:
- The text collection or corpus
- Number of topics
The information not available to us includes:
- The actual topics
- The topic distribution for each document (both are what LDA tries to infer, as the sketch below illustrates)
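Here is a minimal sketch of fitting LDA with scikit-learn on the count matrix built earlier (vectorized and feature_names come from the CountVectorizer example); the number of topics, 2, is an assumption, since the true number of topics is one of the unknowns.
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(vectorized)   # estimated doc-topic distribution (8 x 2)

# normalize components_ to get the estimated topic-word distributions
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
for k, weights in enumerate(topic_word):
    top_words = [feature_names[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {top_words}")
print(doc_topic.round(2))                   # topic mixture for each document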