This post records a Python implementation of TF-IDF for future reference.
Introduction to TF-IDF
TF
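TF stands for term frequency: how often a word occurs within a document. The more often a word occurs in a document, the more important it is to that document. The formula is as follows (i is the word, j is the document, and n_{i,j} is the number of times word i occurs in document j):
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$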
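As a minimal sketch of this definition, the term frequency of every word in one tokenized document could be computed as below. The helper name `term_frequency` and the sample document are made up for illustration and are not part of the original code.

```python
from collections import Counter

def term_frequency(doc):
    """Return {term: count / len(doc)} for one tokenized document (hypothetical helper)."""
    counts = Counter(doc)   # raw number of occurrences of each term
    total = len(doc)        # document length in words
    return {term: count / total for term, count in counts.items()}

# Sample document, assumed to be tokenized already.
doc = ["the", "cat", "sat", "on", "the", "mat"]
tf = term_frequency(doc)
print(tf["the"])  # "the" accounts for 2 of the 6 words
```

Because each count is divided by the document length, the term frequencies of a document sum to 1.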
IDF
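IDF stands for inverse document frequency: the more documents a word appears in, the less important it is, as with stop words. It is based on the total number of documents divided by the number of documents that contain word i; the code below takes the base-10 logarithm of that ratio. The formula is as follows (N is the total number of documents and df_i is the number of documents containing word i):
$$idf_i = \log_{10} \frac{N}{df_i}$$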
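A small sketch of the IDF step on its own, assuming the base-10 logarithm used in the implementation further down. The function name `inverse_document_frequency` and the toy corpus are illustrative assumptions.

```python
from math import log10

def inverse_document_frequency(doc_list):
    """Return {term: log10(N / df)}, where df is the number of documents containing the term."""
    doc_count = len(doc_list)
    df = {}
    for doc in doc_list:
        for term in set(doc):  # set() so a document counts at most once per term
            df[term] = df.get(term, 0) + 1
    return {term: log10(doc_count / n) for term, n in df.items()}

# Toy corpus: "the" occurs in two of the three documents, "cat" in only one.
corpus = [["the", "cat"], ["the", "dog"], ["a", "bird"]]
idf = inverse_document_frequency(corpus)
print(idf["the"], idf["cat"])
```

A word that occurs in every document gets an IDF of log10(1) = 0, which is how very common words such as stop words end up with no weight.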
TF-IDF
The TF-IDF score is simply the product of tf and idf:
$$tfidf_{i,j} = tf_{i,j} \times idf_i$$
Python implementation
If you spot a mistake in the code, please point it out in the comments so that it does not mislead anyone :)
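from math import log10

# docList is the corpus: each element is a doc, and each doc is a list of words.
# Returns {doc_id: {term: tf-idf score}}, where doc_id is the index of the doc in docList.
def tfidf(docList):
    docNum = len(docList)
    # Document frequency: the number of docs that contain each term.
    term_df = dict()
    for doc in docList:
        for term in set(doc):
            if term not in term_df:
                term_df[term] = 1.0
            else:
                term_df[term] += 1.0
    # Turn each document frequency into an IDF value in place: log10(N / df).
    for term in term_df:
        term_df[term] = log10(docNum / term_df[term])
    term_tfidf = dict()
    doc_id = 0
    for doc in docList:
        term_tfidf[doc_id] = dict()
        # Raw count of each term in this doc.
        term_tf = dict()
        for term in doc:
            if term not in term_tf:
                term_tf[term] = 1.0
            else:
                term_tf[term] += 1.0
        docLen = len(doc)
        for term in doc:
            # tf (count / doc length) multiplied by idf.
            score = term_tf[term] / docLen * term_df[term]
            term_tfidf[doc_id][term] = score
        doc_id += 1
    return term_tfidf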
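
For a quick sanity check, here is a compact, self-contained variant of the same computation together with a toy corpus. It is only a sketch: the name `tfidf_compact`, the use of `collections.Counter`, and the sample data are assumptions for illustration, not part of the code above. It is meant to return the same structure, a dict that maps each document index to `{term: tf-idf}`.

```python
from collections import Counter
from math import log10

def tfidf_compact(doc_list):
    """Return {doc_id: {term: tf * idf}} for a corpus given as a list of word lists."""
    doc_count = len(doc_list)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in doc_list for term in set(doc))
    idf = {term: log10(doc_count / n) for term, n in df.items()}
    result = {}
    for doc_id, doc in enumerate(doc_list):
        counts = Counter(doc)  # raw term counts within this document
        result[doc_id] = {
            term: count / len(doc) * idf[term] for term, count in counts.items()
        }
    return result

# Toy corpus of two tokenized documents.
corpus = [["apple", "banana", "apple"], ["banana", "cherry"]]
scores = tfidf_compact(corpus)
for doc_id, weights in scores.items():
    print(doc_id, weights)
```

In this corpus "banana" appears in both documents, so its IDF is log10(2/2) = 0 and its score is 0 in each of them, while "apple" and "cherry" each appear in only one document and get positive scores.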