MML(skl)——C3

Feature Extraction and Preprocessing

Extracting features from categorical variables

Categorical (nominal) variables are discrete, as opposed to continuous variables.
Encoding: one-of-K, also called one-hot encoding.
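
A minimal sketch of one-hot encoding with scikit-learn's DictVectorizer (the 'city' feature and its values are made-up examples):

from sklearn.feature_extraction import DictVectorizer
# each distinct categorical value becomes its own binary feature
onehot_encoder = DictVectorizer(sparse=False)
instances = [{'city': 'New York'}, {'city': 'San Francisco'}, {'city': 'Chapel Hill'}]
print(onehot_encoder.fit_transform(instances))
# each row contains exactly one 1, marking that instance's city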

Extracting features from text

The bag-of-words representation

Bag-of-words can be thought of as an extension of one-hot encoding.
vocabulary := the set of unique words in the corpus

CountVectorizer
: converts the characters in the documents to lowercase and tokenizes the documents. Tokenization is the process of splitting a string into tokens, or meaningful sequences of characters.

e.g.

from sklearn.feature_extraction.text import CountVectorizer
corpus = [  
    'UNC played Duke in basketball',  
    'Duke lost the basketball game',  
    'I ate a sandwich'  
]  
vectorizer = CountVectorizer()  
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)
out:  
[[0 1 1 0 1 0 1 0 0 1]  
 [0 1 1 1 0 1 0 0 1 0]  
 [1 0 0 0 0 0 0 1 0 0]]  
{'game': 3, 'ate': 0, 'sandwich': 7, 'lost': 5, 'duke': 2, 'unc': 9, 'played': 6, 'in': 4, 'the': 8, 'basketball': 1}  

Problems

sparse vectors: high-dimensional feature vectors that have many zero-valued elements
curse of dimensionality: as the dimensionality grows, more training data is needed for the model to generalize

Stop-word filtering

1. A basic strategy for reducing the dimensionality of the feature space is to convert all of the text to lowercase.
2. Stop-word filtering: remove stop words, i.e. words that are common to most of the documents in the corpus,
e.g. determiners like a, an, and the.
Usage:

vectorizer = CountVectorizer(stop_words='english')
...
[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
{u'duke': 2, u'basketball': 1, u'lost': 4, u'played': 5, u'game': 3, u'sandwich': 6, u'unc': 7, u'ate': 0}

Stemming and lemmatization

corpus = [ 'I am gathering ingredients for the sandwich.', 'There were many wizards at the gathering.']

We will use the Natural Language Toolkit (NLTK) to stem and lemmatize the corpus. NLTK can be installed using the instructions at http://www.nltk.org/install.html. After installation, execute the following code:

import nltk
nltk.download()  # opens the NLTK downloader; the 'wordnet' corpus is needed for WordNetLemmatizer
...
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('gathering', 'v'))  # lemmatize as a verb
print(lemmatizer.lemmatize('gathering', 'n'))  # lemmatize as a noun
  
out:  
gather  
gathering 
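
Stemming itself is not shown above; a minimal sketch with NLTK's PorterStemmer (rule-based suffix stripping, no POS tag or corpus download needed):

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('gathering'))  # gather
print(stemmer.stem('wizards'))    # wizard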

Extending bag-of-words with TF-IDF weights

Raw counts make documents of different lengths look dissimilar even when they discuss the same topic,
so we use frequencies instead of raw counts.
1. Normalizing raw term counts
tf(t,d) = \frac{f(t,d) + 1}{||x||} (or without the +1 in the numerator)
f(t,d): frequency of term t in document d
||x||: norm (generally L2) of the document's raw count vector

2. Logarithmically scaled term frequencies
mitigates the bias toward longer documents
tf(t,d) = \log(f(t,d) + 1)

3. Augmented term frequencies
further mitigates the bias toward longer documents
tf(t,d) = 0.5 + \frac{0.5 \cdot f(t,d)}{\max\{f(w,d) : w \in d\}}
\max\{f(w,d) : w \in d\} is the greatest frequency of any word in document d

4. IDF
The inverse document frequency (IDF) is a measure of how rare or common a word is in a corpus.
idf(t,D) = \log\frac{N}{1 + |\{d \in D : t \in d\}|} (or without the +1 in the denominator)
N: total number of documents in the corpus
|\{d \in D : t \in d\}|: the number of documents in the corpus that contain the term t

5. TF-IDF
tfidf(t,d) = tf(t,d) * idf(t,D)
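
Before the scikit-learn example below, a hand-rolled sketch of these formulas (using the term counts of the first document in the corpus below after English stop-word removal; the numbers are worked out by hand):

import numpy as np

# f(t,d) for the terms ['ate', 'dog', 'sandwich'] in the first document
counts = np.array([2.0, 1.0, 1.0])

tf_l2  = counts / np.linalg.norm(counts)    # 1. normalized raw counts: f(t,d) / ||x||
tf_log = np.log(counts + 1)                 # 2. logarithmically scaled: log(f(t,d) + 1)
tf_aug = 0.5 + 0.5 * counts / counts.max()  # 3. augmented: 0.5 + 0.5*f(t,d) / max f(w,d)

N  = 2                       # total documents in the corpus
df = np.array([1, 1, 2])     # documents containing each term
idf = np.log(N / (1 + df))   # 4. idf(t,D) = log(N / (1 + |{d in D : t in d}|))

print(tf_l2 * idf)           # 5. one flavour of TF-IDF

The values differ from the scikit-learn output below because scikit-learn uses smoothed variants of these formulas and L2-normalizes the final TF-IDF vectors.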

from sklearn.feature_extraction.text import TfidfVectorizer  
corpus = [  
'The dog ate a sandwich and I ate a sandwich',  
'The wizard transfigured a sandwich'  
]  
vectorizer = TfidfVectorizer(stop_words='english')  
print(vectorizer.fit_transform(corpus).todense())  
print(vectorizer.vocabulary_)  
  
out:  
[[ 0.75458397  0.37729199  0.53689271  0.          0.        ]  
 [ 0.          0.          0.44943642  0.6316672   0.6316672 ]]  
{'wizard': 4, 'transfigured': 3, 'ate': 0, 'dog': 1, 'sandwich': 2}  

NOTE: the resulting vectors have been L2-normalized, and scikit-learn uses slightly different formulas for tf and idf than those above.

Summary
1). Word frequencies matter for document meaning, but when comparing documents the document length must also be considered; scikit-learn's TfidfTransformer can normalize the term frequency feature vectors (L2 norm) so that vectors of different documents are comparable.

2). Logarithmically scaled term frequencies compress the counts into a smaller range.

3). Augmented term frequencies further reduce the differences caused by longer documents; scikit-learn has no ready-made augmented term frequency option, but it can be implemented on top of CountVectorizer.

Normalization, logarithmic scaling, and augmented term frequencies all remove the effect of document length on term frequency. One problem remains: high-frequency words still get large weights in the feature vector even when they also appear frequently in most other documents of the corpus. Such words can be regarded as corpus-specific stop words, because they are so common that they contribute nothing to distinguishing the meaning of documents.

4). Inverse document frequency (IDF) measures how rare or common a word is across the corpus.

A word's TF-IDF value is the product of its term frequency and its inverse document frequency. TfidfTransformer returns TF-IDF values by default (its use_idf parameter defaults to True). Because TF-IDF-weighted feature vectors are so commonly used to represent text, scikit-learn provides the TfidfVectorizer class, which combines CountVectorizer and TfidfTransformer; see the sketch below.
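
As a quick check of that last point, a minimal sketch (assuming default parameters) showing that CountVectorizer followed by TfidfTransformer yields the same matrix as TfidfVectorizer:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]

# counts first, then TF-IDF weighting, in two steps
counts = CountVectorizer(stop_words='english').fit_transform(corpus)
pipeline_tfidf = TfidfTransformer().fit_transform(counts)

# the same transformation in one step
direct_tfidf = TfidfVectorizer(stop_words='english').fit_transform(corpus)

print((pipeline_tfidf != direct_tfidf).nnz == 0)  # True: identical sparse matrices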


When I have time I want to study image processing; it looks very interesting.
