Generally speaking, text topic extraction mainly involves two methods:
- TF-IDF-based keyword extraction.
- TextRank-based keyword extraction.
Beyond these, many other models and methods can extract text topics, especially from large texts, but that is a separate subject; interested readers can consult other tutorials. This chapter focuses on TF-IDF and TextRank keyword extraction.
Extracting Keywords with TF-IDF
After the target text has been cleaned and stopwords removed, the remaining words can generally be assumed to carry meaningful content. To extract features further, we should extract the elements that best represent the article: words, phrases, sentences, punctuation, and other information. From the perspective of words, this means extracting the words that contribute most to what the article expresses.
The TF-IDF formula is defined as follows. For a word $i$ in document $j$:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \log\frac{N}{\mathrm{df}_i}$$

where:
- $\mathrm{tf}_{i,j}$ is the number of times word $i$ appears in document $j$.
- $\mathrm{df}_i$ is the number of documents that contain word $i$.
- $N$ is the total number of documents.
Reading the formula, the main idea of TF-IDF emerges: if a word or phrase has a high frequency TF (Term Frequency) in one article but rarely appears in other documents, it is considered to have good discriminating power between categories and is suitable for classification. Here, TF is the frequency with which a term appears in an article.
Computing the term frequency TF
The term frequency is defined as

$$\mathrm{TF} = \frac{\text{number of occurrences of the word in a single document}}{\text{number of occurrences of the word in the entire corpus}}$$
Computing the inverse document frequency IDF
The main idea of IDF (Inverse Document Frequency) is that the fewer documents contain a word, the more discriminative the word is, and therefore the larger its IDF. The IDF formula is:

$$\mathrm{IDF}_i = \log\frac{N}{\mathrm{df}_i}$$
The TF-IDF value is then simply the product of the two:

$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$
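To make the formulas concrete, here is a small worked example with invented numbers: suppose the corpus holds $N = 4$ documents, a word occurs 3 times in a given document, and 2 of the 4 documents contain it. Using the raw count as TF (the variant implemented below),

$$\mathrm{IDF} = \log\frac{4}{2} \approx 0.693, \qquad \mathrm{TF\text{-}IDF} = 3 \times 0.693 \approx 2.079.$$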
TF-IDF is a commonly used weighting technique for information retrieval and text mining, and a statistical method for measuring how important a word is to a document within a collection. The more times a word appears in a given document, the more important it is; the more documents in the collection contain that word, the less important it is. The algorithm is widely used in data mining, text processing, and information retrieval, and its most common application is extracting keywords from an article.
Implementing TF-IDF
First comes the IDF computation. The code is as follows:
import jax

def idf(corpus):
    dfi = {}   # document frequency: number of documents containing each word
    N = 0.0    # total number of documents
    for document in corpus:
        N += 1
        counted = []   # words already counted for this document
        for word in document:
            if word not in counted:
                counted.append(word)
                if word in dfi:
                    dfi[word] += 1
                else:
                    dfi[word] = 1
    # IDF = log(N / df) for every word in the corpus
    idfs = {}
    for word in dfi:
        idfs[word] = jax.numpy.log(N / float(dfi[word]))
    return idfs
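As a quick check, the function can be exercised on a tiny hand-tokenized corpus (the two documents below are invented for illustration). A word that appears in every document gets an IDF of zero, while rarer words get larger values:

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
]
idfs = idf(corpus)
print(float(idfs["the"]))   # log(2/2) = 0.0, appears in both documents
print(float(idfs["cat"]))   # log(2/1) ≈ 0.693, appears in only one document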
The next step is to use the computed IDF values to calculate the TF-IDF value of every word in each document:
def tfidf(corpus):
    idfs = idf(corpus)   # IDF values are shared by all documents
    tfidf_strings = []
    for document in corpus:
        # Raw term counts for this document
        word_occurences_in_given_document = {}
        for word in document:
            if word in word_occurences_in_given_document:
                word_occurences_in_given_document[word] += 1
            else:
                word_occurences_in_given_document[word] = 1
        # Weight each count by the word's IDF to obtain its TF-IDF value
        for word in word_occurences_in_given_document:
            word_occurences_in_given_document[word] *= idfs[word]
        # Keep the words sorted by descending TF-IDF weight
        sorted_values = sorted(word_occurences_in_given_document.items(), key = lambda item: item[1], reverse = True)
        sorted_values = [value[0] for value in sorted_values]
        tfidf_strings.append(sorted_values)
    return tfidf_strings
Note that the code above implements the formula $\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{IDF}_i$ with $\mathrm{tf}_{i,j}$ taken as the raw number of times the word appears in the document; it does not implement the normalized term frequency $\mathrm{TF} = \frac{\text{number of occurrences of the word in a single document}}{\text{number of occurrences of the word in the entire corpus}}$.
The complete TF-IDF code is as follows:
import jax

def idf(corpus):
    dfi = {}   # document frequency: number of documents containing each word
    N = 0.0    # total number of documents
    for document in corpus:
        N += 1
        counted = []   # words already counted for this document
        for word in document:
            if word not in counted:
                counted.append(word)
                if word in dfi:
                    dfi[word] += 1
                else:
                    dfi[word] = 1
    idfs = {}
    for word in dfi:
        idfs[word] = jax.numpy.log(N / float(dfi[word]))
    return idfs

def tfidf(corpus):
    idfs = idf(corpus)   # IDF values are shared by all documents
    tfidf_strings = []
    for document in corpus:
        # Raw term counts for this document
        word_occurences_in_given_document = {}
        for word in document:
            if word in word_occurences_in_given_document:
                word_occurences_in_given_document[word] += 1
            else:
                word_occurences_in_given_document[word] = 1
        # Weight each count by the word's IDF
        for word in word_occurences_in_given_document:
            word_occurences_in_given_document[word] *= idfs[word]
        # Sort each document's words by descending TF-IDF weight
        sorted_values = sorted(word_occurences_in_given_document.items(), key = lambda item: item[1], reverse = True)
        sorted_values = [value[0] for value in sorted_values]
        tfidf_strings.append(sorted_values)
    return tfidf_strings
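A minimal usage sketch of tfidf on an invented toy corpus; each returned entry lists one document's words from most to least distinctive:

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["stocks", "fell", "on", "monday"],
]
ranked = tfidf(corpus)
print(ranked[2])   # e.g. ['stocks', 'fell', 'monday', 'on'], since "on" also occurs in document 0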
Building the Word Matrix
With the TF-IDF computation in place, each description can be reduced to its words ranked by importance and fed into the Word2Vec training pipeline from the previous chapters:
import sys
sys.path.append("../52/")
import AgNewsCsvReader
sys.path.append("../53/")
import Word2VecGenSim

def train():
    labels, titles, descriptions = AgNewsCsvReader.setup()
    name = "/tmp/CorpusWord2Vec.bin"
    # LossLogger is the gensim training callback; it is defined in the class version below
    callback = LossLogger()
    # Reorder each description's words by TF-IDF weight before retraining
    descriptions = tfidf(descriptions)
    model = Word2VecGenSim.retrain(descriptions, name, callback)
    text = "Inspection now is in progress"
    Word2VecGenSim.vectorize(model, text)

def main():
    train()

if __name__ == "__main__":
    main()
Wrapping TF-IDF into a Class
We consolidate the TF-IDF computation into a single class for later use. The code is as follows:
import jax
import sys
import gensim
sys.path.append("../52/")
import AgNewsCsvReader
sys.path.append("../53/")
import Word2VecGenSim
class TfIdf:
    def __init__(self, corpus, model = None):
        self.corpus = corpus
        self.model = model
        self.idfs = self.__idf()

    def __idf(self):
        dfi = {}   # document frequency: number of documents containing each word
        N = 0.0    # total number of documents
        for document in self.corpus:
            N += 1
            counted = []   # words already counted for this document
            for word in document:
                if word not in counted:
                    counted.append(word)
                    if word in dfi:
                        dfi[word] += 1
                    else:
                        dfi[word] = 1
        idfs = {}
        for word in dfi:
            idfs[word] = jax.numpy.log(N / float(dfi[word]))
        return idfs
    def tfIdf(self):
        tfidf_strings = []
        for document in self.corpus:
            # Raw term counts for this document
            word_occurences_in_given_document = {}
            for word in document:
                if word in word_occurences_in_given_document:
                    word_occurences_in_given_document[word] += 1
                else:
                    word_occurences_in_given_document[word] = 1
            # Weight each count by the word's IDF
            for word in word_occurences_in_given_document:
                word_occurences_in_given_document[word] *= self.idfs[word]
            sorted_values = sorted(word_occurences_in_given_document.items(), key = lambda item: item[1], reverse = True)
            sorted_values = [value[0] for value in sorted_values]
            tfidf_strings.append(sorted_values)
        return tfidf_strings
class LossLogger(gensim.models.callbacks.CallbackAny2Vec):
    """
    Output loss at each epoch
    """
    def __init__(self):
        self.epoch = 1
        self.losses = []

    def on_train_begin(self, model):
        print("Train started")

    def on_epoch_begin(self, model):
        print(f"Epoch {self.epoch}", end = '\t')

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        self.losses.append(loss)
        print(f"Loss: {loss}")
        self.epoch += 1

    def on_train_end(self, model):
        print("Train ended")
def train():
    labels, titles, descriptions = AgNewsCsvReader.setup()
    name = "/tmp/CorpusWord2Vec.bin"
    callback = LossLogger()
    tfIdf = TfIdf(descriptions)
    descriptions = tfIdf.tfIdf()
    model = Word2VecGenSim.retrain(descriptions, name, callback)
    text = "Inspection now is in progress"
    Word2VecGenSim.vectorize(model, text)

def main():
    train()

if __name__ == "__main__":
    main()
The printed output from a run is as follows:
Train started
Epoch 1 Loss: 0.0
Epoch 2 Loss: 0.0
Epoch 3 Loss: 0.0
Epoch 4 Loss: 0.0
Epoch 5 Loss: 0.0
Train ended
[[ 4.96069849e-01 -1.67209613e+00 8.89949799e-02 1.07414138e+00
4.26580548e-01 -4.97303933e-01 -1.49051189e+00 -4.17422444e-01
8.84718835e-01 -6.95034862e-01 -1.05203867e+00 7.81460464e-01
-1.31654954e+00 7.27441072e-01 -2.07940507e+00 1.23919857e+00
-3.00522029e-01 -8.13091099e-01 -1.17256045e+00 7.90289760e-01
-8.54466200e-01 -6.26001135e-02 -1.21285820e+00 -1.00276256e+00
2.50522971e-01 -1.04276180e+00 -1.89150259e-01 1.97796082e+00
6.57185256e-01 6.98286355e-01 7.66776383e-01 9.28378165e-01
1.07921612e+00 -6.29490077e-01 -4.23403442e-01 8.48681927e-01
2.88632929e-01 -1.38907087e+00 -1.18036139e+00 5.19349515e-01
1.43541837e+00 -5.25178015e-02 -2.84197122e-01 1.09127915e+00
1.10283756e+00 2.31341094e-01 -7.93484390e-01 8.79477441e-01
9.37548280e-02 1.71932125e+00 -2.86833227e-01 -3.72722864e-01
1.90851021e+00 -2.20313538e-02 -6.29367590e-01 -8.73027518e-02
7.52119124e-02 -2.04270408e-01 -3.61324817e-01 -6.27556801e-01
-6.35196030e-01 1.59466898e+00 -3.46995920e-01 -2.06190050e-01]
[ 1.76200703e-01 -3.89839470e-01 8.96097273e-02 4.98303622e-02
7.93483481e-02 -3.81868362e-01 1.21796411e-02 -1.25424609e-01
-4.28376019e-01 -5.14898300e-02 1.83273166e-01 -7.06694126e-02
-2.95564890e-01 -3.16819400e-01 7.89013066e-05 2.81881005e-01
-2.93145925e-01 1.28112026e-02 -1.75852761e-01 5.40404558e-01
3.26927722e-01 1.76672205e-01 -4.41782363e-02 -3.85691732e-01
-4.31915186e-02 -2.13175290e-03 -1.21829897e-01 2.62107879e-01
6.77114204e-02 1.48053750e-01 3.76765169e-02 1.20105937e-01
2.62329858e-02 -1.63009286e-01 -1.02827944e-01 2.02095389e-01
-2.39163697e-01 -1.48303628e-01 -2.10601334e-02 2.05104262e-01
5.38723320e-02 2.39459842e-01 -1.60894677e-01 7.97450393e-02
4.51797899e-03 -1.90956011e-01 9.13655236e-02 -1.37388455e-02
-8.62598494e-02 1.93306252e-01 1.72400519e-01 5.38419932e-02
-1.43037345e-02 1.72264293e-01 -1.47748575e-01 -2.80846089e-01
2.30912566e-01 -3.10832918e-01 -3.37104760e-02 5.44575155e-02
-1.02778643e-01 -3.06686126e-02 -5.33233434e-02 7.89295533e-04]
[ 5.65004468e-01 -6.24979317e-01 2.49599636e-01 -7.79090524e-02
2.90717721e-01 -7.93466926e-01 -2.15552717e-01 -5.00133634e-01
-9.31059599e-01 -2.92390734e-01 4.29491341e-01 -3.81187648e-02
-6.44094110e-01 -1.68293715e-01 1.33224621e-01 6.90573633e-01
-4.43393201e-01 -8.41982961e-02 -2.42455527e-01 8.32186162e-01
3.22877705e-01 2.47170150e-01 1.32656902e-01 -4.53047246e-01
1.65028945e-01 -1.78530104e-02 -3.06671411e-01 9.25685465e-01
2.99194723e-01 5.34004033e-01 -4.79875840e-02 4.43493307e-01
-2.58198410e-01 4.23368961e-02 -2.16563329e-01 4.20637995e-01
-5.77895164e-01 2.23520352e-03 -1.28733888e-01 3.48236859e-01
4.42022383e-01 5.56681812e-01 -3.32573265e-01 2.27029875e-01
1.26723479e-03 -3.69715035e-01 1.63047358e-01 -3.05177510e-01
-1.79782454e-02 4.74121124e-01 3.50482136e-01 -7.71325082e-02
-3.62917930e-02 5.43722570e-01 -2.28166834e-01 -6.83868110e-01
7.66589046e-01 -3.43827784e-01 2.08208486e-01 -2.25625828e-01
-3.03718925e-01 -2.49719307e-01 -6.51984140e-02 -2.95166135e-01]
[-2.90451795e-01 -1.19539177e+00 1.79127604e-01 2.12810606e-01
4.19164628e-01 7.84275353e-01 5.16766191e-01 8.23276818e-01
8.44484866e-01 -1.07573509e+00 1.07982826e+00 1.60321876e-01
-7.97771215e-01 -1.78780711e+00 2.93222931e-03 -1.84084345e-02
-1.21997036e-01 -7.68034160e-01 -2.12774023e-01 1.57086396e+00
-3.01641613e-01 2.08482426e-02 -9.64548945e-01 -2.98457444e-01
-5.57536125e-01 -8.49156618e-01 1.06047344e+00 4.81399029e-01
3.82250071e-01 -1.22886352e-01 2.83816409e+00 -1.97423249e-02
-1.87647969e-01 3.27942916e-03 3.06393355e-01 2.02040291e+00
-1.26779592e+00 -5.64161897e-01 4.54727113e-01 2.41001773e+00
1.78840065e+00 1.30741343e-01 1.27594578e+00 5.93739867e-01
2.46201262e-01 -1.06042480e+00 5.33313632e-01 3.21452580e-02
-9.09399569e-01 7.06232369e-01 8.45255256e-01 6.77717090e-01
9.28814650e-01 1.49612474e+00 -8.37044001e-01 4.98314053e-02
1.54765964e+00 -5.33700138e-02 5.08897789e-02 -1.28234315e+00
3.74768227e-01 1.30633920e-01 6.30517244e-01 8.68744016e-01]]
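For a quick test without the AG News pipeline, the TfIdf class can also be exercised directly on an invented toy corpus:

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["stocks", "fell", "on", "monday"],
]
ranked = TfIdf(corpus).tfIdf()
print(ranked[1])   # e.g. ['dog', 'barked', 'at', 'the', 'cat'], distinctive words first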
Extracting Keywords with TextRank
The core idea of the TextRank algorithm comes from PageRank, the well-known web page ranking algorithm.
PageRank was proposed by Sergey Brin and Larry Page at the WWW7 conference in 1998 to solve the problem of ranking web pages in link analysis. When measuring a page's rank, one can assume:
- The more pages link to a page, the higher that page ranks.
- A highly ranked page should carry more voting power: when a page is linked to by highly ranked pages, its own importance rises accordingly.
Similar to PageRank, the TextRank algorithm splits text into its minimal units (words), which serve as network nodes and form a word graph model. Like PageRank, TextRank in principle needs edge weights when iteratively computing word weights; to simplify the computation, the same initial weight is usually assumed for every node, and a node's weight is divided evenly when distributed among neighboring words.
The TextRank Keyword Extraction Method
TextRank extracts keywords from text through the following steps:
- Split the given text T into complete sentences.
- Tokenize and part-of-speech tag each sentence, filter out stopwords, and keep only words of the specified parts of speech, such as nouns, verbs, and adjectives.
- Construct the keyword graph $G = (V, E)$, where $V$ is the set of nodes (the retained words) and the edges are weighted by the similarity between words.
- Iteratively propagate the node weights according to the following formula until convergence (a simplified variant of this update, used by the implementation below, is shown after these steps):

$$WS(V_i) = (1 - d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$

- Sort the node weights in descending order to obtain the keywords ranked by importance.
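A note on the implementation that follows: it uses uniform edge weights and omits the damping factor, so each iteration simply splits a node's current score evenly among its window neighbors, which corresponds to the simplified update

$$WS(V_i) = \sum_{V_j \in In(V_i)} \frac{WS(V_j)}{|Out(V_j)|}$$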
Implementing a TextRank Class
The following implements the TextRank computation class. The code is as follows:
class TextRank:
    def __init__(self, strings):
        self.strings = strings
        self.filters = self.__get_words()
        self.win = self.__get_win()
        self.dictionary = self.__get_dictionary()

    def __get_words(self):
        # Flatten the tokenized strings into a single word sequence
        words = []
        for string in self.strings:
            for word in string:
                words.append(word)
        return words

    def __get_win(self):
        # For each word, record the words that co-occur with it inside a
        # sliding window around position i (the slice also includes the word itself)
        win = {}
        for i in range(len(self.filters)):
            if self.filters[i] not in win.keys():
                win[self.filters[i]] = set()
            if i - 5 < 0:
                index = 0
            else:
                index = i - 5
            for j in self.filters[index: i + 5]:
                win[self.filters[i]].add(j)
        return win

    def __get_dictionary(self):
        # Iteratively redistribute each word's score evenly among its window neighbors
        time = 0
        scores = {w: 1.0 for w in self.filters}
        while time < 50:
            for key, value in self.win.items():
                score = scores[key] / len(value)
                scores[key] = 0
                for i in value:
                    scores[i] += score
            time += 1
        _dictionary = {}
        for key in scores:
            _dictionary[key] = scores[key]
        return _dictionary

    def __get_text_rank(self, string):
        # Rank the words of one document by their converged TextRank scores
        _dictionary = {}
        values = []
        for word in string:
            if word in self.dictionary.keys():
                _dictionary[word] = self.dictionary[word]
        sorted_values = sorted(_dictionary.items(), key = lambda word_tfidf: word_tfidf[1], reverse = True)
        for value in sorted_values:
            values.append(value[0])
        return values
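A minimal usage sketch on an invented tokenized corpus. Since __get_text_rank is a private method, the converged score dictionary is read directly here:

corpus = [
    ["textrank", "builds", "a", "graph", "of", "words"],
    ["words", "vote", "for", "neighboring", "words"],
]
tr = TextRank(corpus)
top = sorted(tr.dictionary.items(), key = lambda item: item[1], reverse = True)
print([word for word, score in top[:3]])   # the three highest-scoring words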
TextRank is one method for implementing keyword extraction. As mentioned earlier, keyword extraction is not strictly necessary for our dataset, so readers can decide for themselves whether to explore it further.
Conclusion
This chapter introduced two text keyword extraction methods, TF-IDF and TextRank, together with the formulas behind each. This is an optional chapter; study it as your needs dictate.