Getting Started with NLP (Part 2): Exploring the Principles of TF-IDF

An Introduction to TF-IDF

  TF-IDF is a statistical method commonly used in NLP to evaluate how important a word is to one document within a document collection or corpus; it is typically used to extract text features, i.e. keywords. A word's importance increases in proportion to the number of times it appears in the document, but decreases in proportion to how often it appears across the corpus.
  In NLP, TF-IDF is computed as follows:

tfidf = tf*idf.

Here, tf stands for term frequency and idf for inverse document frequency.
  tf, the term frequency, is the frequency with which a word occurs in a document: if a word appears i times in a document containing N words in total, then tf = i/N.
  idf, the inverse document frequency, is defined as follows: if the corpus contains n articles and the word appears in k of them, then

idf=\log_{2}(\frac{n}{k}).

Of course, the exact idf formula varies slightly from source to source. For example, some versions add 1 to the denominator k to prevent division by zero, and others add 1 to both the numerator and the denominator, which is a smoothing trick. In this article we stick with the original idf formula above, because it is the one gensim uses.
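  To make these variants concrete, here is a minimal Python sketch (the function names are only illustrative; idf_plain is the version used in the rest of this article):

import math

# n: number of articles in the corpus; k: number of articles containing the word
def idf_plain(n, k):
    return math.log2(n / k)              # the original formula used in this article

def idf_add_one(n, k):
    return math.log2(n / (k + 1))        # add 1 to the denominator to avoid division by zero

def idf_smooth(n, k):
    return math.log2((n + 1) / (k + 1))  # add 1 to both numerator and denominator (smoothing)
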
  Suppose the whole corpus contains D articles. Then the tfidf value of word i in article j, which is the formula gensim uses, is

tfidf_{i,j} = tf_{i,j} \times idf_{i} = tf_{i,j} \times \log_{2}(\frac{D}{d_{i}}),

where d_{i} is the number of articles that contain word i.
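
  As a quick sanity check: suppose a word appears 3 times in a 100-word article, and the word occurs in only 1 of the 3 articles in the corpus. Then

tf = 3/100 = 0.03, \quad idf = \log_{2}(\frac{3}{1}) \approx 1.585, \quad tfidf = tf \times idf \approx 0.0475.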

  That is how TF-IDF is computed.

The Texts and Their Preprocessing

  We will use the following three sample texts:

text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, 
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) 
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard 
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is 
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops 
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with 
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period 
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a 
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before 
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches 
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across 
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

These three passages introduce football, basketball, and volleyball respectively, and together they make up our small corpus.
  Next comes the text preprocessing.
  First we strip the newline characters from each text, then split it into sentences, tokenize the sentences into words, and finally remove the punctuation. The complete Python code is below; the input parameter is the article text:

import nltk
import string

# Text preprocessing
# Function: split text into sentences, tokenize into words, and remove punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence splitting
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # word tokenization
            if word not in string.punctuation: # remove punctuation
                tokens.append(word)
    return tokens
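
  One practical note: nltk.sent_tokenize and nltk.word_tokenize rely on NLTK's punkt tokenizer models, and the stopword filtering in the next step relies on NLTK's stopwords corpus. If these resources have never been downloaded on your machine, a one-off download along the following lines may be needed first:

import nltk

nltk.download('punkt')      # tokenizer models used by sent_tokenize / word_tokenize
nltk.download('stopwords')  # stopword lists used by nltk.corpus.stopwords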

  Next, we remove the common stopwords from each article and count the number of occurrences of every word. The complete Python code is below; the input parameter is the article text:

from nltk.corpus import stopwords     # stopwords
from collections import Counter       # word counting

# Remove stopwords from the raw text
# and build a count dictionary, i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]    # remove stopwords
    count = Counter(filtered)
    return count

Taking text3 as an example, the resulting count dictionary is as follows:

Counter({'ball': 4, 'net': 4, 'teammate': 3, 'returned': 2, 'bat': 2, 'court': 2, 'team': 2, 'across': 2, 'touches': 2, 'back': 2, 'players': 2, 'touch': 1, 'must': 1, 'usually': 1, 'side': 1, 'player': 1, 'area': 1, 'Volleyball': 1, 'hands': 1, 'may': 1, 'toward': 1, 'A': 1, 'third': 1, 'two': 1, 'six': 1, 'opposing': 1, 'within': 1, 'prevent': 1, 'allowed': 1, '’': 1, 'playing': 1, 'played': 1, 'volley': 1, 'surface—that': 1, 'volleys': 1, 'opponents': 1, 'use': 1, 'high': 1, 'teams': 1, 'bats': 1, 'To': 1, 'game': 1, 'make': 1, 'forth': 1, 'three': 1, 'trying': 1})

TF-IDF in Gensim

  After preprocessing, each of the three sample texts yields a count dictionary recording how often each word occurs in that text. Next, we use the TF-IDF model already implemented in gensim to output the three words with the highest TF-IDF scores in each article, along with their tfidf values. The complete code is as follows:

from nltk.corpus import stopwords     # stopwords
from gensim import corpora, models, matutils

# training with gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    return filtered

# get text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]
# training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v:k for k,v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d"%(i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)    #type=list
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(new_dict[num], round(score, 5)))

The output is as follows:

Training by gensim Tfidf Model.......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: cm, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: across, TF-IDF: 0.22888

The output is fairly consistent with our expectations: the article about football yields the keywords football and rugby, the article about basketball yields play and cm, and the article about volleyball yields net and teammate.

Implementing TF-IDF Ourselves

  With the understanding of the TF-IDF model we now have, we can implement it ourselves, which is the best way to learn an algorithm!
  Below is my own TF-IDF implementation (it follows on from the preprocessing code above):

import math

# compute tf
def tf(word, count):
    return count[word] / sum(count.values())
# count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# compute idf
def idf(word, count_list):
    return math.log2(len(count_list) / (n_containing(word, count_list)))    # base-2 logarithm
# compute tf-idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    # sorted_words = matutils.unitvec(sorted_words)
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

The output is as follows:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.30677
    Word: rugby, TF-IDF: 0.07669
    Word: known, TF-IDF: 0.05113
Top words in document 2
    Word: play, TF-IDF: 0.05283
    Word: inches, TF-IDF: 0.03522
    Word: worth, TF-IDF: 0.03522
Top words in document 3
    Word: net, TF-IDF: 0.10226
    Word: teammate, TF-IDF: 0.07669
    Word: across, TF-IDF: 0.05113

As you can see, the keywords extracted by my hand-rolled TF-IDF model agree with gensim's. As for why the second and third words differ for the basketball article: those words happen to have identical tfidf values, so the tie is simply broken differently. There is one problem, though: the computed tfidf values themselves are not the same. What is the reason for that?
  Checking the source code where gensim computes tf-idf values (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/tfidfmodel.py):

Parameters of the TfidfModel class
Documentation of the normalize parameter

In other words, gensim normalizes the resulting tf-idf vector, turning it into a unit vector. We therefore need to add this normalization step to our earlier code, as follows:

import numpy as np

# normalize the vector to unit (L2) length
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst)*np.array(lst)))
    unit_vector = [(item[0], item[1]/L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    sorted_words = unitvec(sorted_words)   # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

The output is as follows:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: shooting, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: back, TF-IDF: 0.22888

Now the output agrees with the results from gensim!
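
  Incidentally, since matutils was already imported from gensim above, the hand-rolled unitvec function could presumably be replaced by gensim's own helper, and the check can also be run the other way round by asking gensim not to normalize at all. A minimal sketch, assuming matutils.unitvec accepts a list of (term, weight) pairs and TfidfModel accepts a normalize flag:

from gensim import matutils, models

# normalize our (word, score) pairs with gensim's helper instead of the unitvec() defined above
sorted_words = matutils.unitvec(sorted_words)

# or: ask gensim to skip its L2 normalization, so its scores should match our un-normalized tfidf values
tfidf_raw = models.TfidfModel(corpus2, normalize=False)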

Summary

  Gensim is a well-known Python library for NLP; when you have time, it is well worth reading its source code! In later posts we will continue to introduce other applications of TF-IDF. Comments and discussion are welcome~

Note: I now run a WeChat official account, Python爬虫与算法 (WeChat ID: easy_web_scrape). You are welcome to follow it~~

The complete code for this article is given below:

import nltk
import math
import string
from nltk.corpus import stopwords     # stopwords
from collections import Counter       # word counting
from gensim import corpora, models, matutils

text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, 
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) 
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard 
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is 
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops 
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with 
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period 
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a 
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before 
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches 
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across 
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

# Text preprocessing
# Function: split text into sentences, tokenize into words, and remove punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence splitting
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # word tokenization
            if word not in string.punctuation: # remove punctuation
                tokens.append(word)
    return tokens

# Remove stopwords from the raw text
# and build a count dictionary, i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]    # remove stopwords
    count = Counter(filtered)
    return count

# compute tf
def tf(word, count):
    return count[word] / sum(count.values())
# count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# compute idf
def idf(word, count_list):
    return math.log2(len(count_list) / (n_containing(word, count_list)))    # base-2 logarithm
# compute tf-idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

import numpy as np

# normalize the vector to unit (L2) length
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst)*np.array(lst)))
    unit_vector = [(item[0], item[1]/L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    sorted_words = unitvec(sorted_words)   # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

# training with gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    return filtered

# get text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]
# training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v:k for k,v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d"%(i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)    #type=list
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(new_dict[num], round(score, 5)))
        
"""
Output:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: word, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: inches, TF-IDF: 0.19915
    Word: points, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: bat, TF-IDF: 0.22888

Training by gensim Tfidf Model.......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: cm, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: across, TF-IDF: 0.22888
"""