MML(skl)——C3

Feature Extraction and Preprocessing

Extracting features from categorical variables

Categorical (nominal) variables are discrete, as opposed to continuous variables.
Encoding: one-of-K, also called one-hot encoding.
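
A minimal sketch of one-hot encoding with scikit-learn's DictVectorizer (the 'city' feature and its values are made-up examples):

from sklearn.feature_extraction import DictVectorizer
# each distinct categorical value becomes its own binary feature
onehot_encoder = DictVectorizer(sparse=False)
instances = [{'city': 'New York'}, {'city': 'San Francisco'}, {'city': 'Chapel Hill'}]
print(onehot_encoder.fit_transform(instances))
# each row contains exactly one 1, marking that instance's city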

Extracting features from text

The bag-of-words representation

Bag-of-words can be thought of as an extension of one-hot encoding.
vocabulary := the set of unique words in the corpus

CountVectorizer
: converts the characters in the documents to lowercase and tokenizes the documents. Tokenization is the process of splitting a string into tokens, or meaningful sequences of characters.

e.g.

from sklearn.feature_extraction.text import CountVectorizer
corpus = [  
    'UNC played Duke in basketball',  
    'Duke lost the basketball game',  
    'I ate a sandwich'  
]  
vectorizer = CountVectorizer()  
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)
out:  
[[0 1 1 0 1 0 1 0 0 1]  
 [0 1 1 1 0 1 0 0 1 0]  
 [1 0 0 0 0 0 0 1 0 0]]  
{'game': 3, 'ate': 0, 'sandwich': 7, 'lost': 5, 'duke': 2, 'unc': 9, 'played': 6, 'in': 4, 'the': 8, 'basketball': 1}  

Problems

sparse vectors: high-dimensional feature vectors that have many zero-valued elements
curse of dimensionality: as the dimensionality grows, more training data is needed for the model to generalize

Stop-word filtering

1. A basic strategy for reducing the dimensionality of the feature space is to convert all of the text to lowercase.
2. Stop-word filtering: remove stop words, i.e. words that are common to most of the documents in the corpus,
e.g. determiners like a, an, and the.
Usage:

vectorizer = CountVectorizer(stop_words='english')
...
[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
{u'duke': 2, u'basketball': 1, u'lost': 4, u'played': 5, u'game': 3, u'sandwich': 6, u'unc': 7, u'ate': 0}

Stemming and lemmatization

corpus = [ 'I am gathering ingredients for the sandwich.', 'There were many wizards at the gathering.']

We will use the Natural Language Toolkit (NLTK) to stem and lemmatize the corpus. NLTK can be installed using the instructions at http://www.nltk.org/install.html. After installation, execute the following code:

import nltk
nltk.download()  # opens the NLTK downloader; the 'wordnet' corpus is needed for WordNetLemmatizer
...
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('gathering', 'v'))  # lemmatize as a verb
print(lemmatizer.lemmatize('gathering', 'n'))  # lemmatize as a noun
  
out:  
gather  
gathering 
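
Stemming itself is not shown above; a minimal sketch with NLTK's PorterStemmer (rule-based suffix stripping, no POS tag or corpus download needed):

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('gathering'))  # gather
print(stemmer.stem('wizards'))    # wizard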

Extending bag-of-words with TF-IDF weights

Raw counts make documents of different lengths look dissimilar even when they discuss the same topic,
so we use frequencies instead of raw counts.
1. Normalizing raw term counts
tf(t,d) = \frac{f(t,d) + 1}{||x||} (or without the +1 in the numerator)
f(t,d): frequency of term t in document d
||x||: norm (generally L2) of the document's raw count vector

2. Logarithmically scaled term frequencies
mitigates the bias toward longer documents
tf(t,d) = \log(f(t,d) + 1)

3. Augmented term frequencies
further mitigates the bias toward longer documents
tf(t,d) = 0.5 + \frac{0.5 \cdot f(t,d)}{\max\{f(w,d) : w \in d\}}
\max\{f(w,d) : w \in d\} is the greatest frequency of any word in document d

4. IDF
The inverse document frequency (IDF) is a measure of how rare or common a word is in a corpus.
idf(t,D) = \log\frac{N}{1 + |\{d \in D : t \in d\}|} (or without the +1 in the denominator)
N: total number of documents in the corpus
|\{d \in D : t \in d\}|: the number of documents in the corpus that contain the term t

5. TF-IDF
tfidf(t,d) = tf(t,d) * idf(t,D)
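
Before the scikit-learn example below, a hand-rolled sketch of these formulas (using the term counts of the first document in the corpus below after English stop-word removal; the numbers are worked out by hand):

import numpy as np

# f(t,d) for the terms ['ate', 'dog', 'sandwich'] in the first document
counts = np.array([2.0, 1.0, 1.0])

tf_l2  = counts / np.linalg.norm(counts)    # 1. normalized raw counts: f(t,d) / ||x||
tf_log = np.log(counts + 1)                 # 2. logarithmically scaled: log(f(t,d) + 1)
tf_aug = 0.5 + 0.5 * counts / counts.max()  # 3. augmented: 0.5 + 0.5*f(t,d) / max f(w,d)

N  = 2                       # total documents in the corpus
df = np.array([1, 1, 2])     # documents containing each term
idf = np.log(N / (1 + df))   # 4. idf(t,D) = log(N / (1 + |{d in D : t in d}|))

print(tf_l2 * idf)           # 5. one flavour of TF-IDF

The values differ from the scikit-learn output below because scikit-learn uses smoothed variants of these formulas and L2-normalizes the final TF-IDF vectors.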

from sklearn.feature_extraction.text import TfidfVectorizer  
corpus = [  
'The dog ate a sandwich and I ate a sandwich',  
'The wizard transfigured a sandwich'  
]  
vectorizer = TfidfVectorizer(stop_words='english')  
print(vectorizer.fit_transform(corpus).todense())  
print(vectorizer.vocabulary_)  
  
out:  
[[ 0.75458397  0.37729199  0.53689271  0.          0.        ]  
 [ 0.          0.          0.44943642  0.6316672   0.6316672 ]]  
{'wizard': 4, 'transfigured': 3, 'ate': 0, 'dog': 1, 'sandwich': 2}  

NOTE: the resulting vectors have been L2-normalized, and scikit-learn uses slightly different formulas for tf and idf than those above.

Summary
1). Word frequencies matter for document meaning, but when comparing documents the document length must also be considered; scikit-learn's TfidfTransformer can normalize the term frequency feature vectors (L2 norm) so that vectors of different documents are comparable.

2). Logarithmically scaled term frequencies compress the counts into a smaller range.

3). Augmented term frequencies further reduce the differences caused by longer documents; scikit-learn has no ready-made augmented term frequency option, but it can be implemented on top of CountVectorizer.

Normalization, logarithmic scaling, and augmented term frequencies all remove the effect of document length on term frequency. One problem remains: high-frequency words still get large weights in the feature vector even when they also appear frequently in most other documents of the corpus. Such words can be regarded as corpus-specific stop words, because they are so common that they contribute nothing to distinguishing the meaning of documents.

4). Inverse document frequency (IDF) measures how rare or common a word is across the corpus.

A word's TF-IDF value is the product of its term frequency and its inverse document frequency. TfidfTransformer returns TF-IDF values by default (its use_idf parameter defaults to True). Because TF-IDF-weighted feature vectors are so commonly used to represent text, scikit-learn provides the TfidfVectorizer class, which combines CountVectorizer and TfidfTransformer; see the sketch below.
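
As a quick check of that last point, a minimal sketch (assuming default parameters) showing that CountVectorizer followed by TfidfTransformer yields the same matrix as TfidfVectorizer:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]

# counts first, then TF-IDF weighting, in two steps
counts = CountVectorizer(stop_words='english').fit_transform(corpus)
pipeline_tfidf = TfidfTransformer().fit_transform(counts)

# the same transformation in one step
direct_tfidf = TfidfVectorizer(stop_words='english').fit_transform(corpus)

print((pipeline_tfidf != direct_tfidf).nnz == 0)  # True: identical sparse matrices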


When I have time I want to study image processing; it looks very interesting.
