jupyter_Text Feature Extraction_2 Preprocessing (Normalization, Stop Words, Stemming)

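#This notebook covers vector normalization, stop words, and stemming
#Author: thirsd
#Normalization. If a post is repeated several times, intuition says it should be exactly as similar to a query as the original post is. Raw Euclidean distance grows with the repeated counts, however, so the repeated post looks less similar.
#Stop words. Words such as "of" and "most" occur in many posts and carry little distinguishing information. Removing them lowers the dimensionality, improves efficiency, and keeps them from skewing the overall features.
#Stemming. "apple" and "apples", or go/goes/went, share the same root and are semantically close.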

####################################Normalization#####################################################

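#1.1 Problem description
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer(min_df=1)
#List all the posts
posts=["How to format my disk","hard disk format at","How to format my disk How to format my disk How to format my disk"]
#Convert the posts into bag-of-words vectors
x=vectorizer.fit_transform(posts)
#Note: scikit-learn 1.0 and later provide get_feature_names_out(); get_feature_names() was removed in 1.2
print("feature_name:%s" % vectorizer.get_feature_names())

#Get the number of samples and the number of features
num_samples,num_features=x.shape


#Vectorize the new post
newpost="how to format my computer's disk"
new_post_vec=vectorizer.transform([newpost])

#Define the similarity of two posts as the Euclidean distance between their word-count vectors
import scipy as sp
import scipy.linalg  #load the linalg submodule explicitly so that sp.linalg is available
def dist_raw(v1,v2):
    delta=v1-v2
    return sp.linalg.norm(delta)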

feature_name:[u'at', u'disk', u'format', u'hard', u'how', u'my', u'to']



#Compute the Euclidean distance (dist_raw) between new_post and every post, and record the closest one
best_dist=float("inf")  #works in Python 2 and 3; sys.maxint exists only in Python 2
best_i=None
for i in range(0,num_samples):
    post=posts[i]
    if post==newpost:
        continue
    post_vec=x.getrow(i)
    print("post_vec's shape:%s, new_post_vec's shape:%s" %(post_vec.shape,new_post_vec.shape))
    dist=dist_raw(post_vec.toarray(),new_post_vec.toarray())
    print("=== Post %i with dist=%.2f: %s" %(i,dist,post))
    if dist<best_dist:
        best_dist=dist
        best_i=i
print ("newpost :%s" % newpost)
print ("Best post is %i with dist=%.2f. Post Content:%s" %(best_i,best_dist,posts[best_i]))

post_vec's shape:(1, 7), new_post_vec's shape:(1, 7)
=== Post 0 with dist=0.00: How to format my disk
post_vec's shape:(1, 7), new_post_vec's shape:(1, 7)
=== Post 1 with dist=2.24: hard disk format at
post_vec's shape:(1, 7), new_post_vec's shape:(1, 7)
=== Post 2 with dist=4.47: How to format my disk How to format my disk How to format my disk
newpost :how to format my computer's disk
Best post is 0 with dist=0.00. Post Content:How to format my disk



#Intuitively, post 0 and post 2 should be equally similar to the new post, yet the Euclidean distance for post 2 is larger

print(x.getrow(0).toarray())
print(x.getrow(2).toarray())
print(new_post_vec.toarray())

[[0 1 1 0 1 1 1]]
[[0 3 3 0 3 3 3]]
[[0 1 1 0 1 1 1]]



#1.2 Solution
#Normalize each post vector to unit length (norm 1) before measuring the distance
def dist_norm(v1,v2):
    v1_normed = v1/sp.linalg.norm(v1.toarray())
    v2_normed = v2/sp.linalg.norm(v2.toarray())
    delta = v1_normed - v2_normed
    return sp.linalg.norm(delta.toarray())

#The only difference is that the normalized distance function is called
#Compute the normalized Euclidean distance (dist_norm) between new_post and every post, and record the closest one
best_dist=float("inf")  #works in Python 2 and 3; sys.maxint exists only in Python 2
best_i=None
for i in range(0,num_samples):
    post=posts[i]
    if post==newpost:
        continue
    post_vec=x.getrow(i)
    print("post_vec's shape:%s, new_post_vec's shape:%s" %(post_vec.shape,new_post_vec.shape))
    dist=dist_norm(post_vec,new_post_vec)
    print("=== Post %i with dist=%.2f: %s" %(i,dist,post))
    if dist<best_dist:
        best_dist=dist
        best_i=i
print ("newpost :%s" % newpost)
print ("Best post is %i with dist=%.2f. Post Content:%s" %(best_i,best_dist,posts[best_i]))

post_vec's shape:(1, 7), new_post_vec's shape:(1, 7)
=== Post 0 with dist=0.00: How to format my disk
post_vec's shape:(1, 7), new_post_vec's shape:(1, 7)
=== Post 1 with dist=1.05: hard disk format at
post_vec's shape:(1, 7), new_post_vec's shape:(1, 7)
=== Post 2 with dist=0.00: How to format my disk How to format my disk How to format my disk
newpost :how to format my computer's disk
Best post is 0 with dist=0.00. Post Content:How to format my disk



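#After normalization, the repeated post has the same Euclidean distance to the new post as the original, so the two are equally similar.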
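
#The effect can also be checked by hand on the three count vectors printed in 1.1. The sketch below is a
#standalone illustration in plain Python rather than the notebook's scipy code; the helper names
#l2_norm, dist_raw_list and dist_norm_list are made up for this example.

```python
import math

def l2_norm(v):
    """Euclidean length of a vector given as a list of numbers."""
    return math.sqrt(sum(x * x for x in v))

def dist_raw_list(v1, v2):
    """Euclidean distance between two raw count vectors."""
    return l2_norm([a - b for a, b in zip(v1, v2)])

def dist_norm_list(v1, v2):
    """Euclidean distance after scaling both vectors to unit length."""
    n1, n2 = l2_norm(v1), l2_norm(v2)
    u1 = [a / n1 for a in v1]
    u2 = [b / n2 for b in v2]
    return dist_raw_list(u1, u2)

# Count vectors copied from the output in section 1.1
post0 = [0, 1, 1, 0, 1, 1, 1]
post2 = [0, 3, 3, 0, 3, 3, 3]      # post 0 repeated three times
new_post = [0, 1, 1, 0, 1, 1, 1]

raw_0 = dist_raw_list(post0, new_post)
raw_2 = dist_raw_list(post2, new_post)
norm_2 = dist_norm_list(post2, new_post)

print("raw distances: %.2f %.2f" % (raw_0, raw_2))
print("normalized distance to post 2: %.2f" % norm_2)
```

#Scaling to unit length keeps only the direction of a count vector, so repeating a post any number of
#times leaves its normalized vector unchanged.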

####################################Stop words#####################################################

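#2.1 Problem description
#Words such as "of", "to" and "at" are not very important. They occur frequently in all kinds of texts and are called stop words.
#The best option is to remove all such high-frequency words, because they do little to distinguish one text from another.
print("original feature_name:%s" % vectorizer.get_feature_names())
print("original samples: %d ,#features: %d" % (num_samples,num_features))

original feature_name:[u'at', u'disk', u'format', u'hard', u'how', u'my', u'to']
original samples: 3 ,#features: 7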



# Words such as "at" and "to" carry little meaning and matter little for similarity; with many dimensions they also waste computing resources

# 2.2 Solution
vectorizer_stopword=CountVectorizer(min_df=1,stop_words="english")
#Convert the posts into bag-of-words vectors
x_stopword=vectorizer_stopword.fit_transform(posts)
print("new feature_name:%s" % vectorizer_stopword.get_feature_names())
#Get the number of samples and the number of features
num_samples_stopword,num_features_stopword=x_stopword.shape
print("new samples: %d ,#features: %d" % (num_samples_stopword,num_features_stopword))

#You can maintain your own stop-word list. If you are unsure what to include, use the built-in English stop-word list of 318 words.
print("english stop list sample:%s" % sorted(vectorizer_stopword.get_stop_words())[0:20])

new feature_name:[u'disk', u'format', u'hard']
new samples: 3 ,#features: 3
english stop list sample:['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst']



#After filtering, the bag-of-words vocabulary shrinks to 3 terms. This is especially effective when the corpus is very large
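
#For a hand-maintained stop-word list, the filtering step itself is small. The sketch below is a
#plain-Python illustration of that step, not scikit-learn's implementation: MY_STOP_WORDS is a
#hypothetical list chosen for this example, and the regular expression is the same pattern that
#CountVectorizer uses as its default token_pattern. CountVectorizer's stop_words parameter also
#accepts such a list directly.

```python
import re

# Hypothetical hand-maintained stop list; the entries are illustrative only.
MY_STOP_WORDS = frozenset(["how", "to", "my", "at"])

# Tokens are runs of two or more word characters, as in CountVectorizer's default token_pattern.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text, stop_words=frozenset()):
    """Lowercase the text, split it into tokens, and drop any token in stop_words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in stop_words]

posts = [
    "How to format my disk",
    "hard disk format at",
    "How to format my disk How to format my disk How to format my disk",
]

# Vocabulary without and with stop-word filtering
vocab_all = sorted(set(t for p in posts for t in tokenize(p)))
vocab_filtered = sorted(set(t for p in posts for t in tokenize(p, MY_STOP_WORDS)))

print(vocab_all)
print(vocab_filtered)
```

#With this particular list the filtered vocabulary is the same three terms that stop_words="english"
#produced above; a different list would of course give a different vocabulary.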

new_post_vec_stopword=vectorizer_stopword.transform([newpost])

#The only difference is that the distance is now computed on normalized, stop-word-filtered vectors
#Reset the best match so that results from the previous loop do not carry over
best_dist=float("inf")
best_i=None
for i in range(0,num_samples):
    post=posts[i]
    if post==newpost:
        continue
    post_vec=x_stopword.getrow(i)
    #Print the shape of the stop-word-filtered query vector (the original printed the unfiltered new_post_vec)
    print("post_vec's shape:%s, new_post_vec's shape:%s" %(post_vec.shape,new_post_vec_stopword.shape))
    dist=dist_norm(post_vec,new_post_vec_stopword)
    print("=== Post %i with dist=%.2f: %s" %(i,dist,post))
    if dist<best_dist:
        best_dist=dist
        best_i=i
print ("newpost :%s" % newpost)
print ("Best post is %i with dist=%.2f. Post Content:%s" %(best_i,best_dist,posts[best_i]))

post_vec's shape:(1, 3), new_post_vec's shape:(1, 3)
=== Post 0 with dist=0.00: How to format my disk
post_vec's shape:(1, 3), new_post_vec's shape:(1, 3)
=== Post 1 with dist=0.61: hard disk format at
post_vec's shape:(1, 3), new_post_vec's shape:(1, 3)
=== Post 2 with dist=0.00: How to format my disk How to format my disk How to format my disk
newpost :how to format my computer's disk
Best post is 0 with dist=0.00. Post Content:How to format my disk



#The best match clearly does not change (efficiency does improve, of course, although that only shows up on very large data sets)

####################################Stemming#####################################################

#3.0 About the NLTK dependency
#scikit-learn has no built-in stemmer, but one is available from NLTK (the Natural Language Toolkit)
#A very good NLTK tutorial can be found in Python Text Processing with NLTK 2.0 Cookbook
import nltk.stem
s=nltk.stem.SnowballStemmer('english')
print(s.stem("go"))
print(s.stem("goes"))
print(s.stem("going"))
print(s.stem("went"))

go
goe
go
went



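#Note: the output of stemming is not necessarily a valid English word

#3.2 Solution
#The posts need to be stemmed before they reach CountVectorizer. The class lets us customize its preprocessing and tokenization stages.
#An alternative that spares us from tokenizing and normalizing the words ourselves is to override the build_analyzer method
import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        #Reuse the parent analyzer (lowercasing, tokenization, stop-word removal), then stem each token
        analyzer = super(StemmedCountVectorizer,self).build_analyzer()
        return lambda doc:(english_stemmer.stem(word) for word in analyzer(doc))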


#List all the posts
posts_root=["How to format my disks","hard disk formating at","How to formated my disks"]

#Without stemming
vectorizer_stopword_noroot=CountVectorizer(min_df=1,stop_words="english")
#Convert the posts into bag-of-words vectors
x_stopword_noroot=vectorizer_stopword_noroot.fit_transform(posts_root)
print("new feature_name_noroot:%s" % vectorizer_stopword_noroot.get_feature_names())
#Get the number of samples and the number of features
num_samples_stopword_noroot,num_features_stopword_noroot=x_stopword_noroot.shape
print("new samples_noroot: %d ,#features_noroot: %d" % (num_samples_stopword_noroot,num_features_stopword_noroot))


#With stemming
#Build the stemming vectorizer
vectorizer_stopword_root=StemmedCountVectorizer(min_df=1,stop_words="english")
#Convert the posts into bag-of-words vectors
x_stopword_root=vectorizer_stopword_root.fit_transform(posts_root)
print("new feature_name_root:%s" % vectorizer_stopword_root.get_feature_names())
#Get the number of samples and the number of features
num_samples_stopword_root,num_features_stopword_root=x_stopword_root.shape
print("new samples_root: %d ,#features_root: %d" % (num_samples_stopword_root,num_features_stopword_root))

new feature_name_noroot:[u'disk', u'disks', u'format', u'formated', u'formating', u'hard']
new samples_noroot: 3 ,#features_noroot: 6
new feature_name_root:[u'disk', u'format', u'hard']
new samples_root: 3 ,#features_root: 3



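#As shown, stemming cuts the number of features from 6 to 3, and different forms of the same word now map to a single feature, which helps generalization

# 4 A new problem
#Each feature value is the number of times a term occurs in a post. By default, a larger value is taken to mean that the term matters more to the post and is more discriminative.
#That does not hold for words such as subject/sender/receiver in email, which appear in every message
#We could drop such words with the max_df parameter, but should the threshold be 90% or 89%? Whatever value we choose, the problem remains that some terms are simply more discriminative than others.

#The only way out is to count term frequencies per post and to discount the weight of terms that occur in many posts.
#In other words, a term that occurs often in a few particular posts and rarely elsewhere is given a higher weight.

#Term frequency-inverse document frequency (TF-IDF) addresses exactly this problem. TF is the counting part, and IDF applies the weight discount.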
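
#To make the weighting concrete, here is a minimal sketch of one textbook TF-IDF variant in plain
#Python. It is an illustration only: the helper names tf, idf and tfidf and the toy corpus are
#invented for this example, and scikit-learn's TfidfVectorizer uses a smoothed formula by default,
#so its numbers will differ.

```python
import math

def tf(term, doc):
    """Term frequency: share of the tokens in doc that equal term."""
    return doc.count(term) / float(len(doc))

def idf(term, corpus):
    """Inverse document frequency: log(N / df), where df is the number of docs containing term.
    Assumes the term occurs in at least one document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / float(df))

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Toy corpus of already-tokenized posts; "subject" appears in every post.
corpus = [
    ["subject", "disk", "format"],
    ["subject", "disk", "hard"],
    ["subject", "stemming"],
]

w_subject = tfidf("subject", corpus[0], corpus)  # in all 3 posts
w_disk = tfidf("disk", corpus[0], corpus)        # in 2 of 3 posts
w_format = tfidf("format", corpus[0], corpus)    # in 1 of 3 posts

print("subject=%.3f disk=%.3f format=%.3f" % (w_subject, w_disk, w_format))
```

#All three terms occur once in the first post, so their TF is identical; only the IDF discount
#separates them, giving the rarest term the highest weight and the ubiquitous term a weight of zero.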

Last edited: 2018.01.28 16:36:36
