Generally speaking, text topic extraction mainly involves two methods:
- TF-IDF-based keyword extraction.
- TextRank-based keyword extraction.
Beyond these, many other models and methods can extract text topics, especially from large texts, but that is a separate subject; interested readers can consult other tutorials. This chapter focuses on TF-IDF and TextRank keyword extraction.
Extracting Keywords with TF-IDF
After the target text has been cleaned and stopwords removed, the remaining words can generally be assumed to carry meaningful content. To extract features further, we should extract the elements that best represent the article: words, phrases, sentences, punctuation, and other information. From the perspective of words, this means extracting the words that contribute most to what the article expresses.
The TF-IDF formula is defined as follows. For a word $i$ in document $j$:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \log\frac{N}{\mathrm{df}_i}$$

where:
- $\mathrm{tf}_{i,j}$ is the number of times word $i$ appears in document $j$.
- $\mathrm{df}_i$ is the number of documents that contain word $i$.
- $N$ is the total number of documents.
Reading the formula, the main idea of TF-IDF emerges: if a word or phrase has a high frequency TF (Term Frequency) in one article but rarely appears in other documents, it is considered to have good discriminating power between categories and is suitable for classification. Here, TF is the frequency with which a term appears in an article.
Computing the term frequency TF
The term frequency is defined as

$$\mathrm{TF} = \frac{\text{number of occurrences of the word in a single document}}{\text{number of occurrences of the word in the entire corpus}}$$
Computing the inverse document frequency IDF
The main idea of IDF (Inverse Document Frequency) is that the fewer documents contain a word, the more discriminative the word is, and therefore the larger its IDF. The IDF formula is:

$$\mathrm{IDF}_i = \log\frac{N}{\mathrm{df}_i}$$
The TF-IDF value is then simply the product of the two:

$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$
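To make the formulas concrete, here is a small worked example with invented numbers: suppose the corpus holds $N = 4$ documents, a word occurs 3 times in a given document, and 2 of the 4 documents contain it. Using the raw count as TF (the variant implemented below),

$$\mathrm{IDF} = \log\frac{4}{2} \approx 0.693, \qquad \mathrm{TF\text{-}IDF} = 3 \times 0.693 \approx 2.079.$$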
TF-IDF is a commonly used weighting technique for information retrieval and text mining, and a statistical method for measuring how important a word is to a document within a collection. The more times a word appears in a given document, the more important it is; the more documents in the collection contain that word, the less important it is. The algorithm is widely used in data mining, text processing, and information retrieval, and its most common application is extracting keywords from an article.
Implementing TF-IDF
First comes the IDF computation. The code is as follows:
import jax

def idf(corpus):
    dfi = {}   # document frequency: number of documents containing each word
    N = 0.0    # total number of documents
    for document in corpus:
        N += 1
        counted = []   # words already counted for this document
        for word in document:
            if word not in counted:
                counted.append(word)
                if word in dfi:
                    dfi[word] += 1
                else:
                    dfi[word] = 1
    # IDF = log(N / df) for every word in the corpus
    idfs = {}
    for word in dfi:
        idfs[word] = jax.numpy.log(N / float(dfi[word]))
    return idfs
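As a quick check, the function can be exercised on a tiny hand-tokenized corpus (the two documents below are invented for illustration). A word that appears in every document gets an IDF of zero, while rarer words get larger values:

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
]
idfs = idf(corpus)
print(float(idfs["the"]))   # log(2/2) = 0.0, appears in both documents
print(float(idfs["cat"]))   # log(2/1) ≈ 0.693, appears in only one document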
The next step is to use the computed IDF values to calculate the TF-IDF value of every word in each document:
def tfidf(corpus):
    idfs = idf(corpus)   # IDF values are shared by all documents
    tfidf_strings = []
    for document in corpus:
        # Raw term counts for this document
        word_occurences_in_given_document = {}
        for word in document:
            if word in word_occurences_in_given_document:
                word_occurences_in_given_document[word] += 1
            else:
                word_occurences_in_given_document[word] = 1
        # Weight each count by the word's IDF to obtain its TF-IDF value
        for word in word_occurences_in_given_document:
            word_occurences_in_given_document[word] *= idfs[word]
        # Keep the words sorted by descending TF-IDF weight
        sorted_values = sorted(word_occurences_in_given_document.items(), key = lambda item: item[1], reverse = True)
        sorted_values = [value[0] for value in sorted_values]
        tfidf_strings.append(sorted_values)
    return tfidf_strings
Note that the code above implements the formula $\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{IDF}_i$ with $\mathrm{tf}_{i,j}$ taken as the raw number of times the word appears in the document; it does not implement the normalized term frequency $\mathrm{TF} = \frac{\text{number of occurrences of the word in a single document}}{\text{number of occurrences of the word in the entire corpus}}$.
The complete TF-IDF code is as follows:
import jax

def idf(corpus):
    dfi = {}   # document frequency: number of documents containing each word
    N = 0.0    # total number of documents
    for document in corpus:
        N += 1
        counted = []   # words already counted for this document
        for word in document:
            if word not in counted:
                counted.append(word)
                if word in dfi:
                    dfi[word] += 1
                else:
                    dfi[word] = 1
    idfs = {}
    for word in dfi:
        idfs[word] = jax.numpy.log(N / float(dfi[word]))
    return idfs

def tfidf(corpus):
    idfs = idf(corpus)   # IDF values are shared by all documents
    tfidf_strings = []
    for document in corpus:
        # Raw term counts for this document
        word_occurences_in_given_document = {}
        for word in document:
            if word in word_occurences_in_given_document:
                word_occurences_in_given_document[word] += 1
            else:
                word_occurences_in_given_document[word] = 1
        # Weight each count by the word's IDF
        for word in word_occurences_in_given_document:
            word_occurences_in_given_document[word] *= idfs[word]
        # Sort each document's words by descending TF-IDF weight
        sorted_values = sorted(word_occurences_in_given_document.items(), key = lambda item: item[1], reverse = True)
        sorted_values = [value[0] for value in sorted_values]
        tfidf_strings.append(sorted_values)
    return tfidf_strings
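A minimal usage sketch of tfidf on an invented toy corpus; each returned entry lists one document's words from most to least distinctive:

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["stocks", "fell", "on", "monday"],
]
ranked = tfidf(corpus)
print(ranked[2])   # e.g. ['stocks', 'fell', 'monday', 'on'], since "on" also occurs in document 0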
Building the Word Matrix
With the TF-IDF computation in place, each description can be reduced to its words ranked by importance and fed into the Word2Vec training pipeline from the previous chapters:
import sys
sys.path.append("../52/")
import AgNewsCsvReader
sys.path.append("../53/")
import Word2VecGenSim

def train():
    labels, titles, descriptions = AgNewsCsvReader.setup()
    name = "/tmp/CorpusWord2Vec.bin"
    # LossLogger is the gensim training callback; it is defined in the class version below
    callback = LossLogger()
    # Reorder each description's words by TF-IDF weight before retraining
    descriptions = tfidf(descriptions)
    model = Word2VecGenSim.retrain(descriptions, name, callback)
    text = "Inspection now is in progress"
    Word2VecGenSim.vectorize(model, text)

def main():
    train()

if __name__ == "__main__":
    main()
Wrapping TF-IDF into a Class
We consolidate the TF-IDF computation into a single class for later use. The code is as follows:
import jax
import sys
import gensim
sys.path.append("../52/")
import AgNewsCsvReader
sys.path.append("../53/")
import Word2VecGenSim
class TfIdf:
    def __init__(self, corpus, model = None):
        self.corpus = corpus
        self.model = model
        self.idfs = self.__idf()

    def __idf(self):
        dfi = {}   # document frequency: number of documents containing each word
        N = 0.0    # total number of documents
        for document in self.corpus:
            N += 1
            counted = []   # words already counted for this document
            for word in document:
                if word not in counted:
                    counted.append(word)
                    if word in dfi:
                        dfi[word] += 1
                    else:
                        dfi[word] = 1
        idfs = {}
        for word in dfi:
            idfs[word] = jax.numpy.log(N / float(dfi[word]))
        return idfs
    def tfIdf(self):
        tfidf_strings = []
        for document in self.corpus:
            # Raw term counts for this document
            word_occurences_in_given_document = {}
            for word in document:
                if word in word_occurences_in_given_document:
                    word_occurences_in_given_document[word] += 1
                else:
                    word_occurences_in_given_document[word] = 1
            # Weight each count by the word's IDF
            for word in word_occurences_in_given_document:
                word_occurences_in_given_document[word] *= self.idfs[word]
            sorted_values = sorted(word_occurences_in_given_document.items(), key = lambda item: item[1], reverse = True)
            sorted_values = [value[0] for value in sorted_values]
            tfidf_strings.append(sorted_values)
        return tfidf_strings
class LossLogger(gensim.models.callbacks.CallbackAny2Vec):
    """
    Output loss at each epoch
    """
    def __init__(self):
        self.epoch = 1
        self.losses = []

    def on_train_begin(self, model):
        print("Train started")

    def on_epoch_begin(self, model):
        print(f"Epoch {self.epoch}", end = '\t')

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        self.losses.append(loss)
        print(f"Loss: {loss}")
        self.epoch += 1

    def on_train_end(self, model):
        print("Train ended")
def train():
    labels, titles, descriptions = AgNewsCsvReader.setup()
    name = "/tmp/CorpusWord2Vec.bin"
    callback = LossLogger()
    tfIdf = TfIdf(descriptions)
    descriptions = tfIdf.tfIdf()
    model = Word2VecGenSim.retrain(descriptions, name, callback)
    text = "Inspection now is in progress"
    Word2VecGenSim.vectorize(model, text)

def main():
    train()

if __name__ == "__main__":
    main()
The printed output from a run is as follows:
Train started
Epoch 1 Loss: 0.0
Epoch 2 Loss: 0.0
Epoch 3 Loss: 0.0
Epoch 4 Loss: 0.0
Epoch 5 Loss: 0.0
Train ended
[[ 4.96069849e-01 -1.67209613e+00 8.89949799e-02 1.07414138e+00
4.26580548e-01 -4.97303933e-01 -1.49051189e+00 -4.17422444e-01
8.84718835e-01 -6.95034862e-01 -1.05203867e+00 7.81460464e-01
-1.31654954e+00 7.27441072e-01 -2.07940507e+00 1.23919857e+00
-3.00522029e-01 -8.13091099e-01 -1.17256045e+00 7.90289760e-01
-8.54466200e-01 -6.26001135e-02 -1.21285820e+00 -1.00276256e+00
2.50522971e-01 -1.04276180e+00 -1.89150259e-01 1.97796082e+00
6.57185256e-01 6.98286355e-01 7.66776383e-01 9.28378165e-01
1.07921612e+00 -6.29490077e-01 -4.23403442e-01 8.48681927e-01
2.88632929e-01 -1.38907087e+00 -1.18036139e+00 5.19349515e-01
1.43541837e+00 -5.25178015e-02 -2.84197122e-01 1.09127915e+00
1.10283756e+00 2.31341094e-01 -7.93484390e-01 8.79477441e-01
9.37548280e-02 1.71932125e+00 -2.86833227e-01 -3.72722864e-01
1.90851021e+00 -2.20313538e-02 -6.29367590e-01 -8.73027518e-02
7.52119124e-02 -2.04270408e-01 -3.61324817e-01 -6.27556801e-01
-6.35196030e-01 1.59466898e+00 -3.46995920e-01 -2.06190050e-01]
[ 1.76200703e-01 -3.89839470e-01 8.96097273e-02 4.98303622e-02
7.93483481e-02 -3.81868362e-01 1.21796411e-02 -1.25424609e-01
-4.28376019e-01 -5.14898300e-02 1.83273166e-01 -7.06694126e-02
-2.95564890e-01 -3.16819400e-01 7.89013066e-05 2.81881005e-01
-2.93145925e-01 1.28112026e-02 -1.75852761e-01 5.40404558e-01
3.26927722e-01 1.76672205e-01 -4.41782363e-02 -3.85691732e-01
-4.31915186e-02 -2.13175290e-03 -1.21829897e-01 2.62107879e-01
6.77114204e-02 1.48053750e-01 3.76765169e-02 1.20105937e-01
2.62329858e-02 -1.63009286e-01 -1.02827944e-01 2.02095389e-01
-2.39163697e-01 -1.48303628e-01 -2.10601334e-02 2.05104262e-01
5.38723320e-02 2.39459842e-01 -1.60894677e-01 7.97450393e-02
4.51797899e-03 -1.90956011e-01 9.13655236e-02 -1.37388455e-02
-8.62598494e-02 1.93306252e-01 1.72400519e-01 5.38419932e-02
-1.43037345e-02 1.72264293e-01 -1.47748575e-01 -2.80846089e-01
2.30912566e-01 -3.10832918e-01 -3.37104760e-02 5.44575155e-02
-1.02778643e-01 -3.06686126e-02 -5.33233434e-02 7.89295533e-04]
[ 5.65004468e-01 -6.24979317e-01 2.49599636e-01 -7.79090524e-02
2.90717721e-01 -7.93466926e-01 -2.15552717e-01 -5.00133634e-01
-9.31059599e-01 -2.92390734e-01 4.29491341e-01 -3.81187648e-02
-6.44094110e-01 -1.68293715e-01 1.33224621e-01 6.90573633e-01
-4.43393201e-01 -8.41982961e-02 -2.42455527e-01 8.32186162e-01
3.22877705e-01 2.47170150e-01 1.32656902e-01 -4.53047246e-01
1.65028945e-01 -1.78530104e-02 -3.06671411e-01 9.25685465e-01
2.99194723e-01 5.34004033e-01 -4.79875840e-02 4.43493307e-01
-2.58198410e-01 4.23368961e-02 -2.16563329e-01 4.20637995e-01
-5.77895164e-01 2.23520352e-03 -1.28733888e-01 3.48236859e-01
4.42022383e-01 5.56681812e-01 -3.32573265e-01 2.27029875e-01
1.26723479e-03 -3.69715035e-01 1.63047358e-01 -3.05177510e-01
-1.79782454e-02 4.74121124e-01 3.50482136e-01 -7.71325082e-02
-3.62917930e-02 5.43722570e-01 -2.28166834e-01 -6.83868110e-01
7.66589046e-01 -3.43827784e-01 2.08208486e-01 -2.25625828e-01
-3.03718925e-01 -2.49719307e-01 -6.51984140e-02 -2.95166135e-01]
[-2.90451795e-01 -1.19539177e+00 1.79127604e-01 2.12810606e-01
4.19164628e-01 7.84275353e-01 5.16766191e-01 8.23276818e-01
8.44484866e-01 -1.07573509e+00 1.07982826e+00 1.60321876e-01
-7.97771215e-01 -1.78780711e+00 2.93222931e-03 -1.84084345e-02
-1.21997036e-01 -7.68034160e-01 -2.12774023e-01 1.57086396e+00
-3.01641613e-01 2.08482426e-02 -9.64548945e-01 -2.98457444e-01
-5.57536125e-01 -8.49156618e-01 1.06047344e+00 4.81399029e-01
3.82250071e-01 -1.22886352e-01 2.83816409e+00 -1.97423249e-02
-1.87647969e-01 3.27942916e-03 3.06393355e-01 2.02040291e+00
-1.26779592e+00 -5.64161897e-01 4.54727113e-01 2.41001773e+00
1.78840065e+00 1.30741343e-01 1.27594578e+00 5.93739867e-01
2.46201262e-01 -1.06042480e+00 5.33313632e-01 3.21452580e-02
-9.09399569e-01 7.06232369e-01 8.45255256e-01 6.77717090e-01
9.28814650e-01 1.49612474e+00 -8.37044001e-01 4.98314053e-02
1.54765964e+00 -5.33700138e-02 5.08897789e-02 -1.28234315e+00
3.74768227e-01 1.30633920e-01 6.30517244e-01 8.68744016e-01]]
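For a quick test without the AG News pipeline, the TfIdf class can also be exercised directly on an invented toy corpus:

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["stocks", "fell", "on", "monday"],
]
ranked = TfIdf(corpus).tfIdf()
print(ranked[1])   # e.g. ['dog', 'barked', 'at', 'the', 'cat'], distinctive words first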
Extracting Keywords with TextRank
The core idea of the TextRank algorithm comes from PageRank, the well-known web page ranking algorithm.
PageRank was proposed by Sergey Brin and Larry Page at the WWW7 conference in 1998 to solve the problem of ranking web pages in link analysis. When measuring a page's rank, one can assume:
- The more pages link to a page, the higher that page ranks.
- A highly ranked page should carry more voting power: when a page is linked to by highly ranked pages, its own importance rises accordingly.
Similar to PageRank, the TextRank algorithm splits text into its minimal units (words), which serve as network nodes and form a word graph model. Like PageRank, TextRank in principle needs edge weights when iteratively computing word weights; to simplify the computation, the same initial weight is usually assumed for every node, and a node's weight is divided evenly when distributed among neighboring words.
The TextRank Keyword Extraction Method
TextRank extracts keywords from text through the following steps:
- Split the given text T into complete sentences.
- Tokenize and part-of-speech tag each sentence, filter out stopwords, and keep only words of the specified parts of speech, such as nouns, verbs, and adjectives.
- Construct the keyword graph $G = (V, E)$, where $V$ is the set of nodes (the retained words) and the edges are weighted by the similarity between words.
- Iteratively propagate the node weights according to the following formula until convergence (a simplified variant of this update, used by the implementation below, is shown after these steps):

$$WS(V_i) = (1 - d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$

- Sort the node weights in descending order to obtain the keywords ranked by importance.
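A note on the implementation that follows: it uses uniform edge weights and omits the damping factor, so each iteration simply splits a node's current score evenly among its window neighbors, which corresponds to the simplified update

$$WS(V_i) = \sum_{V_j \in In(V_i)} \frac{WS(V_j)}{|Out(V_j)|}$$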
Implementing a TextRank Class
The following implements the TextRank computation class. The code is as follows:
class TextRank:
    def __init__(self, strings):
        self.strings = strings
        self.filters = self.__get_words()
        self.win = self.__get_win()
        self.dictionary = self.__get_dictionary()

    def __get_words(self):
        # Flatten the tokenized strings into a single word sequence
        words = []
        for string in self.strings:
            for word in string:
                words.append(word)
        return words

    def __get_win(self):
        # For each word, record the words that co-occur with it inside a
        # sliding window around position i (the slice also includes the word itself)
        win = {}
        for i in range(len(self.filters)):
            if self.filters[i] not in win.keys():
                win[self.filters[i]] = set()
            if i - 5 < 0:
                index = 0
            else:
                index = i - 5
            for j in self.filters[index: i + 5]:
                win[self.filters[i]].add(j)
        return win

    def __get_dictionary(self):
        # Iteratively redistribute each word's score evenly among its window neighbors
        time = 0
        scores = {w: 1.0 for w in self.filters}
        while time < 50:
            for key, value in self.win.items():
                score = scores[key] / len(value)
                scores[key] = 0
                for i in value:
                    scores[i] += score
            time += 1
        _dictionary = {}
        for key in scores:
            _dictionary[key] = scores[key]
        return _dictionary

    def __get_text_rank(self, string):
        # Rank the words of one document by their converged TextRank scores
        _dictionary = {}
        values = []
        for word in string:
            if word in self.dictionary.keys():
                _dictionary[word] = self.dictionary[word]
        sorted_values = sorted(_dictionary.items(), key = lambda word_tfidf: word_tfidf[1], reverse = True)
        for value in sorted_values:
            values.append(value[0])
        return values
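A minimal usage sketch on an invented tokenized corpus. Since __get_text_rank is a private method, the converged score dictionary is read directly here:

corpus = [
    ["textrank", "builds", "a", "graph", "of", "words"],
    ["words", "vote", "for", "neighboring", "words"],
]
tr = TextRank(corpus)
top = sorted(tr.dictionary.items(), key = lambda item: item[1], reverse = True)
print([word for word, score in top[:3]])   # the three highest-scoring words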
TextRank is one method for implementing keyword extraction. As mentioned earlier, keyword extraction is not strictly necessary for our dataset, so readers can decide for themselves whether to explore it further.
Conclusion
This chapter introduced two text keyword extraction methods, TF-IDF and TextRank, together with the formulas behind each. This is an optional chapter; study it as your needs dictate.