Gensim Models (2): Doc2Vec

Doc2Vec Model

Introduces Gensim's Doc2Vec model and demonstrates its use on the Lee Corpus.

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

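Doc2Vec is a model that represents each document as a vector. This tutorial introduces the model and demonstrates how to train and assess it.

Here is a list of what we will be doing:

  1. Review the relevant models: bag-of-words, Word2Vec, and Doc2Vec
  2. Load and preprocess the training and test corpora (see Corpus)
  3. Train a Doc2Vec model using the training corpus
  4. Demonstrate how the trained model can be used to infer a vector
  5. Assess the model
  6. Test the model on the test corpus

Review: Bag-of-words

Note: Feel free to skip these review sections if you are already familiar with the models.

You may already be familiar with the bag-of-words model from the Vectors section. This model transforms each document into a fixed-length vector of integers. For example, given the sentences: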

John likes to watch movies. Mary likes movies too.

John also likes to watch football games. Mary hates football.

the model outputs the vectors:

[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]

[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]

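Each vector has 11 elements, where each element counts the number of times a particular word occurs in the document. The order of the elements is arbitrary. In the example above, the order of the elements corresponds to the words: ["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games", "hates"].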
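The counting step can be sketched in a few lines of plain Python. This is a minimal illustration rather than Gensim's implementation: the `bow_vector` helper and the simple regex tokenizer are assumptions made for this example, and the vocabulary order is the one listed above.

```python
import re
from collections import Counter

# Vocabulary in the order given in the text.
VOCAB = ["John", "likes", "to", "watch", "movies", "Mary",
         "too", "also", "football", "games", "hates"]

def bow_vector(text, vocab=VOCAB):
    """Count how often each vocabulary word occurs in `text` (hypothetical helper)."""
    counts = Counter(re.findall(r"[A-Za-z]+", text))
    return [counts[word] for word in vocab]

doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John also likes to watch football games. Mary hates football."

print(bow_vector(doc1))
print(bow_vector(doc2))
```

Both printed vectors have one entry per vocabulary word, so their length equals the vocabulary size.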

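Bag-of-words models are surprisingly effective, but they have several weaknesses.

First, they lose all information about word order: "John likes Mary" and "Mary likes John" correspond to identical vectors. There is a solution: bag of n-grams models consider word phrases of length n to represent documents as fixed-length vectors. This captures local word order, but suffers from data sparsity and high dimensionality.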
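A small sketch makes the word-order point concrete. The `ngrams` helper below is a hypothetical name introduced for this example; it compares unigram counts (bag-of-words) with bigram counts for the two sentences mentioned above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "John likes Mary".split()
b = "Mary likes John".split()

# Unigram counts (bag-of-words) are identical for both sentences...
same_bow = Counter(a) == Counter(b)
# ...but bigram counts differ, because bigrams keep local word order.
same_bigrams = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))

print(same_bow, same_bigrams)
```

The price is dimensionality: every distinct bigram becomes its own vector element, which is where the sparsity problem comes from.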

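Second, the model does not attempt to learn the meaning of the underlying words. As a consequence, the distance between vectors does not always reflect the difference in meaning. The Word2Vec model addresses this second problem.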

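Review: Word2Vec Model

Word2Vec is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word-vectors where vectors close together in vector space have similar meanings based on context, and word-vectors distant from each other have differing meanings. For example, strong and powerful would be close together, whereas strong and Paris would be relatively far apart.

Gensim's Word2Vec class implements this model.

With the Word2Vec model, we can calculate a vector for each word in a document. But what if we want to calculate a vector for the entire document? We could average the vectors of all the words in the document. This is quick and crude, but it is often useful. However, there is a better way...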
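The averaging baseline is easy to sketch. The three-dimensional vectors below are made-up numbers for illustration only (real Word2Vec vectors are learned and much longer), and `average_vectors` is a hypothetical helper name.

```python
def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

# Made-up toy word vectors, not output from a trained model.
toy_word_vectors = {
    "strong": [1.0, 0.0, 2.0],
    "powerful": [3.0, 0.0, 2.0],
    "paris": [-1.0, 3.0, 2.0],
}

document = ["strong", "powerful", "paris"]
doc_vector = average_vectors([toy_word_vectors[w] for w in document])
print(doc_vector)
```

Note that the average ignores word order entirely, which is one reason a dedicated document-level model can do better.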

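Introducing: Paragraph Vector

Important: In Gensim, we refer to the Paragraph Vector model as Doc2Vec.

Le and Mikolov introduced the Doc2Vec algorithm in 2014. It usually outperforms simple averaging of Word2Vec vectors.

The basic idea is: act as if a document has another floating word-like vector, which contributes to all training predictions and is updated like the other word-vectors, but which we will call a doc-vector. Gensim's Doc2Vec class implements this algorithm.

There are two implementations:

  1. Paragraph Vector - Distributed Memory (PV-DM)
  2. Paragraph Vector - Distributed Bag of Words (PV-DBOW)

Important: Don't let the implementation details below scare you. They are advanced material: if it is too much, move on to the next section.

PV-DM is analogous to Word2Vec CBOW. The doc-vectors are obtained by training a neural network on a synthetic task: predicting a center word from an average of the context word-vectors and the full document's doc-vector.

PV-DBOW is analogous to Word2Vec SG. The doc-vectors are obtained by training a neural network on a synthetic task: predicting a target word from the full document's doc-vector alone. (It is also common to combine this with skip-gram training, using both the doc-vector and nearby word-vectors to predict a single target word, but only one at a time.)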
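To make the two synthetic tasks more tangible, the sketch below lists the (input, target) pairs each variant would train on for one tiny document. It is a simplified illustration under stated assumptions, not Gensim's training code: the helper names `pv_dm_examples` and `pv_dbow_examples` are hypothetical, and details such as sub-sampling, negative sampling, and randomly shrunk windows are left out.

```python
def pv_dm_examples(doc_tag, tokens, window=2):
    """PV-DM: input is the doc tag plus surrounding words; target is the center word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append(((doc_tag, context), target))
    return examples

def pv_dbow_examples(doc_tag, tokens):
    """PV-DBOW: the doc tag alone is the input; each word in the document is a target."""
    return [(doc_tag, target) for target in tokens]

tokens = ["mary", "likes", "movies"]
for example in pv_dm_examples(0, tokens, window=1):
    print("PV-DM  ", example)
for example in pv_dbow_examples(0, tokens):
    print("PV-DBOW", example)
```

In Gensim's Doc2Vec class, the variant is selected with the `dm` parameter (`dm=1` for PV-DM, `dm=0` for PV-DBOW), and `dbow_words=1` adds the skip-gram word training mentioned above.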

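Prepare the Training and Test Data

For this tutorial, we train our model on the Lee Background Corpus included in Gensim. This corpus contains 300 documents selected from the Australian Broadcasting Corporation's news mail service, which provides text e-mails of headline stories and covers a number of broad topics.

We will also test our model by eye using the much shorter Lee Corpus, which contains 50 documents.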

import os
import gensim
# Set file names for train and test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
lee_test_file = os.path.join(test_data_dir, 'lee.cor')

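Define a Function to Read and Preprocess Text

Below, we define a function to:

  1. Open the train/test file (with Latin-1, i.e. iso-8859-1, encoding)
  2. Read the file line by line
  3. Preprocess each line (tokenize the text into individual words, remove punctuation, set to lowercase, etc.)

The file we are reading is a corpus. Each line of the file is a document.

Important: To train the model, we need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the zero-based line number.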

import smart_open

def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

Let's take a look at the training corpus:

print(train_corpus[:2])

Output:

[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', 'caused', 'the', 'fire', 'to', 'burn', 'in', 'finger', 'formation', 'have', 'now', 'eased', 'and', 'about', 'fire', 'units', 'in', 'and', 'around', 'hill', 'top', 'are', 'optimistic', 'of', 'defending', 'all', 'properties', 'as', 'more', 'than', 'blazes', 'burn', 'on', 'new', 'year', 'eve', 'in', 'new', 'south', 'wales', 'fire', 'crews', 'have', 'been', 'called', 'to', 'new', 'fire', 'at', 'gunning', 'south', 'of', 'goulburn', 'while', 'few', 'details', 'are', 'available', 'at', 'this', 'stage', 'fire', 'authorities', 'says', 'it', 'has', 'closed', 'the', 'hume', 'highway', 'in', 'both', 'directions', 'meanwhile', 'new', 'fire', 'in', 'sydney', 'west', 'is', 'no', 'longer', 'threatening', 'properties', 'in', 'the', 'cranebrook', 'area', 'rain', 'has', 'fallen', 'in', 'some', 'parts', 'of', 'the', 'illawarra', 'sydney', 'the', 'hunter', 'valley', 'and', 'the', 'north', 'coast', 'but', 'the', 'bureau', 'of', 'meteorology', 'claire', 'richards', 'says', 'the', 'rain', 'has', 'done', 
'little', 'to', 'ease', 'any', 'of', 'the', 'hundred', 'fires', 'still', 'burning', 'across', 'the', 'state', 'the', 'falls', 'have', 'been', 'quite', 'isolated', 'in', 'those', 'areas', 'and', 'generally', 'the', 'falls', 'have', 'been', 'less', 'than', 'about', 'five', 'millimetres', 'she', 'said', 'in', 'some', 'places', 'really', 'not', 'significant', 'at', 'all', 'less', 'than', 'millimetre', 'so', 'there', 'hasn', 'been', 'much', 'relief', 'as', 'far', 'as', 'rain', 'is', 'concerned', 'in', 'fact', 'they', 've', 'probably', 'hampered', 'the', 'efforts', 'of', 'the', 'firefighters', 'more', 'because', 'of', 'the', 'wind', 'gusts', 'that', 'are', 'associated', 'with', 'those', 'thunderstorms'], tags=[0]),
 TaggedDocument(words=['indian', 'security', 'forces', 'have', 'shot', 'dead', 'eight', 'suspected', 'militants', 'in', 'night', 'long', 'encounter', 'in', 'southern', 'kashmir', 'the', 'shootout', 'took', 'place', 'at', 'dora', 'village', 'some', 'kilometers', 'south', 'of', 'the', 'kashmiri', 'summer', 'capital', 'srinagar', 'the', 'deaths', 'came', 'as', 'pakistani', 'police', 'arrested', 'more', 'than', 'two', 'dozen', 'militants', 'from', 'extremist', 'groups', 'accused', 'of', 'staging', 'an', 'attack', 'on', 'india', 'parliament', 'india', 'has', 'accused', 'pakistan', 'based', 'lashkar', 'taiba', 'and', 'jaish', 'mohammad', 'of', 'carrying', 'out', 'the', 'attack', 'on', 'december', 'at', 'the', 'behest', 'of', 'pakistani', 'military', 'intelligence', 'military', 'tensions', 'have', 'soared', 'since', 'the', 'raid', 'with', 'both', 'sides', 'massing', 'troops', 'along', 'their', 'border', 'and', 'trading', 'tit', 'for', 'tat', 'diplomatic', 'sanctions', 'yesterday', 'pakistan', 'announced', 'it', 'had', 'arrested', 'lashkar', 'taiba', 'chief', 'hafiz', 'mohammed', 'saeed', 'police', 'in', 'karachi', 'say', 'it', 'is', 'likely', 'more', 'raids', 'will', 'be', 'launched', 'against', 'the', 'two', 'groups', 'as', 'well', 'as', 'other', 'militant', 'organisations', 'accused', 'of', 'targetting', 'india', 'military', 'tensions', 'between', 'india', 'and', 'pakistan', 'have', 'escalated', 'to', 'level', 'not', 'seen', 'since', 'their', 'war'], tags=[1])]

Similarly, the test corpus looks like this:

print(test_corpus[:2])

Output:

[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to', 'june', 'chief', 'executive', 'paul', 'batchelor', 'said', 'the', 'result', 'was', 'solid', 'in', 'what', 'he', 'described', 'as', 'the', 'worst', 'conditions', 'for', 'stock', 'markets', 'in', 'years', 'amp', 'half', 'year', 'profit', 'sank', 'per', 'cent', 'to', 'million', 'or', 'share', 'as', 'australia', 'largest', 'investor', 'and', 'fund', 'manager', 'failed', 'to', 'hit', 'projected', 'per', 'cent', 'earnings', 'growth', 'targets', 'and', 'was', 'battered', 'by', 'falling', 'returns', 'on', 'share', 'markets']]

Note that the test corpus is just a list of lists and does not contain any tags.
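The token lists above contain no punctuation, digits, or one-letter words such as "a". A rough stand-in for that preprocessing can be sketched with a regular expression. This is only an approximation written for illustration, not the code of gensim.utils.simple_preprocess: `rough_preprocess` is a hypothetical helper, it assumes plain ASCII text, and the length limits of 2 and 15 characters are assumed to mirror that function's defaults.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep runs of letters, and drop very short or very long tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

# A sentence modeled on the first training document.
print(rough_preprocess("Strong winds today pushed a huge bushfire towards Hill Top."))
```

Splitting on the apostrophe and then dropping one-letter tokens is also why fragments such as 'hasn' appear in the output above.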

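Training the Model

Now, we will instantiate a Doc2Vec model with a vector size of 50 dimensions, iterating over the training corpus 40 times. We set the minimum word count to 2 in order to discard words with very few occurrences. (Without a variety of representative examples, retaining such infrequent words can often make a model worse!) Typical iteration counts in the published Paragraph Vector paper results, which use tens of thousands to millions of documents, are 10-20. More iterations take more time and eventually reach a point of diminishing returns.

However, this is a very small dataset (300 documents) with short documents (a few hundred words each). Adding training passes can sometimes help with such small datasets.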

model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)

Build a vocabulary:

model.build_vocab(train_corpus)

Output:

2020-09-30 21:08:55,026 : INFO : collecting all words and their counts
2020-09-30 21:08:55,027 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2020-09-30 21:08:55,043 : INFO : collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
2020-09-30 21:08:55,043 : INFO : Loading a fresh vocabulary
2020-09-30 21:08:55,064 : INFO : effective_min_count=2 retains 3955 unique words (56% of original 6981, drops 3026)
2020-09-30 21:08:55,064 : INFO : effective_min_count=2 leaves 55126 word corpus (94% of original 58152, drops 3026)
2020-09-30 21:08:55,098 : INFO : deleting the raw counts dictionary of 6981 items
2020-09-30 21:08:55,100 : INFO : sample=0.001 downsamples 46 most-common words
2020-09-30 21:08:55,100 : INFO : downsampling leaves estimated 42390 word corpus (76.9% of prior 55126)
2020-09-30 21:08:55,149 : INFO : estimated required memory for 3955 words and 50 dimensions: 3679500 bytes
2020-09-30 21:08:55,149 : INFO : resetting layer weights

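Essentially, the vocabulary is a list (accessible via model.wv.index_to_key) of all the unique words extracted from the training corpus that survive the min_count cutoff. Additional attributes for each word are available through the model.wv.get_vecattr() method. For example, to see how many times "penalty" appeared in the training corpus:

print(f"Word 'penalty' appeared {model.wv.get_vecattr('penalty', 'count')} times in the training corpus.")

Output: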

Word 'penalty' appeared 4 times in the training corpus.
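The vocabulary-building log reports that a minimum count of 2 retains 3955 of the 6981 word types. The pruning idea itself can be sketched on a toy corpus in plain Python. This is an illustration only, not Gensim's implementation: `prune_vocab` and the three toy documents are made up for this example.

```python
from collections import Counter

def prune_vocab(documents, min_count=2):
    """Return {word: count} for words seen at least `min_count` times overall."""
    counts = Counter(word for doc in documents for word in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

# Toy corpus, invented for illustration.
toy_corpus = [
    ["fire", "crews", "fight", "fire"],
    ["rain", "eases", "fire", "danger"],
    ["rain", "helps", "crews"],
]
vocab = prune_vocab(toy_corpus, min_count=2)
print(sorted(vocab.items()))
```

Words that occur only once ("fight", "eases", "danger", "helps") are dropped, just as the rare word types are dropped from the real vocabulary.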

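Next, train the model on the corpus. With optimized Gensim (with a BLAS library), training should take no more than 3 seconds. Without a BLAS library, it should take no more than 2 minutes. So if you value your time, use optimized Gensim with BLAS.

model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

Output: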

2021-11-16 10:05:02,148 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 3 workers on 3955 vocabulary and 50 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2021-11-16T10:05:02.148868', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 12:59:45) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'train'}
2021-11-16 10:05:02,205 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,209 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,209 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,210 : INFO : EPOCH - 1 : training on 58152 raw words (42680 effective words) took 0.1s, 773539 effective words/s
2021-11-16 10:05:02,257 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,259 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,260 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,260 : INFO : EPOCH - 2 : training on 58152 raw words (42645 effective words) took 0.0s, 888781 effective words/s
2021-11-16 10:05:02,305 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,307 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,308 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,309 : INFO : EPOCH - 3 : training on 58152 raw words (42665 effective words) took 0.0s, 937587 effective words/s
2021-11-16 10:05:02,354 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,355 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,356 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,357 : INFO : EPOCH - 4 : training on 58152 raw words (42653 effective words) took 0.0s, 955655 effective words/s
2021-11-16 10:05:02,401 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,403 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,404 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,404 : INFO : EPOCH - 5 : training on 58152 raw words (42717 effective words) took 0.0s, 945565 effective words/s
2021-11-16 10:05:02,445 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,446 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,447 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,448 : INFO : EPOCH - 6 : training on 58152 raw words (42625 effective words) took 0.0s, 1035609 effective words/s
2021-11-16 10:05:02,486 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,487 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,488 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,489 : INFO : EPOCH - 7 : training on 58152 raw words (42800 effective words) took 0.0s, 1099399 effective words/s
2021-11-16 10:05:02,525 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,527 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,527 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,528 : INFO : EPOCH - 8 : training on 58152 raw words (42803 effective words) took 0.0s, 1150303 effective words/s
2021-11-16 10:05:02,566 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,568 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,568 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,569 : INFO : EPOCH - 9 : training on 58152 raw words (42763 effective words) took 0.0s, 1105952 effective words/s
2021-11-16 10:05:02,604 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,608 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,609 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,609 : INFO : EPOCH - 10 : training on 58152 raw words (42715 effective words) took 0.0s, 1150022 effective words/s
2021-11-16 10:05:02,649 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,651 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,651 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,651 : INFO : EPOCH - 11 : training on 58152 raw words (42628 effective words) took 0.0s, 1100282 effective words/s
2021-11-16 10:05:02,689 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,690 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,691 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,692 : INFO : EPOCH - 12 : training on 58152 raw words (42673 effective words) took 0.0s, 1115292 effective words/s
2021-11-16 10:05:02,731 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,732 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,733 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,733 : INFO : EPOCH - 13 : training on 58152 raw words (42519 effective words) took 0.0s, 1093006 effective words/s
2021-11-16 10:05:02,770 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,771 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,772 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,773 : INFO : EPOCH - 14 : training on 58152 raw words (42698 effective words) took 0.0s, 1154425 effective words/s
2021-11-16 10:05:02,809 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,809 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,810 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,810 : INFO : EPOCH - 15 : training on 58152 raw words (42717 effective words) took 0.0s, 1198759 effective words/s
2021-11-16 10:05:02,848 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,850 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,852 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,852 : INFO : EPOCH - 16 : training on 58152 raw words (42670 effective words) took 0.0s, 1070404 effective words/s
2021-11-16 10:05:02,889 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,890 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,890 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,891 : INFO : EPOCH - 17 : training on 58152 raw words (42785 effective words) took 0.0s, 1181380 effective words/s
2021-11-16 10:05:02,928 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,929 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,929 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,930 : INFO : EPOCH - 18 : training on 58152 raw words (42781 effective words) took 0.0s, 1151716 effective words/s
2021-11-16 10:05:02,967 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,967 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,968 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,968 : INFO : EPOCH - 19 : training on 58152 raw words (42722 effective words) took 0.0s, 1191799 effective words/s
2021-11-16 10:05:03,005 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,006 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,006 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,007 : INFO : EPOCH - 20 : training on 58152 raw words (42545 effective words) took 0.0s, 1196913 effective words/s
2021-11-16 10:05:03,043 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,046 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,047 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,048 : INFO : EPOCH - 21 : training on 58152 raw words (42669 effective words) took 0.0s, 1088880 effective words/s
2021-11-16 10:05:03,087 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,088 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,089 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,089 : INFO : EPOCH - 22 : training on 58152 raw words (42641 effective words) took 0.0s, 1085220 effective words/s
2021-11-16 10:05:03,127 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,128 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,129 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,130 : INFO : EPOCH - 23 : training on 58152 raw words (42682 effective words) took 0.0s, 1118567 effective words/s
2021-11-16 10:05:03,170 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,171 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,172 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,172 : INFO : EPOCH - 24 : training on 58152 raw words (42579 effective words) took 0.0s, 1068513 effective words/s
2021-11-16 10:05:03,229 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,232 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,235 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,236 : INFO : EPOCH - 25 : training on 58152 raw words (42758 effective words) took 0.1s, 688556 effective words/s
2021-11-16 10:05:03,285 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,286 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,286 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,287 : INFO : EPOCH - 26 : training on 58152 raw words (42724 effective words) took 0.0s, 940922 effective words/s
2021-11-16 10:05:03,328 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,330 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,330 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,331 : INFO : EPOCH - 27 : training on 58152 raw words (42712 effective words) took 0.0s, 1043624 effective words/s
2021-11-16 10:05:03,368 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,370 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,371 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,371 : INFO : EPOCH - 28 : training on 58152 raw words (42606 effective words) took 0.0s, 1107016 effective words/s
2021-11-16 10:05:03,407 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,408 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,409 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,409 : INFO : EPOCH - 29 : training on 58152 raw words (42713 effective words) took 0.0s, 1192136 effective words/s
2021-11-16 10:05:03,446 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,448 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,449 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,449 : INFO : EPOCH - 30 : training on 58152 raw words (42619 effective words) took 0.0s, 1140147 effective words/s
2021-11-16 10:05:03,486 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,487 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,487 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,488 : INFO : EPOCH - 31 : training on 58152 raw words (42653 effective words) took 0.0s, 1161469 effective words/s
2021-11-16 10:05:03,524 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,527 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,527 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,528 : INFO : EPOCH - 32 : training on 58152 raw words (42698 effective words) took 0.0s, 1135948 effective words/s
2021-11-16 10:05:03,565 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,566 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,567 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,567 : INFO : EPOCH - 33 : training on 58152 raw words (42689 effective words) took 0.0s, 1161094 effective words/s
2021-11-16 10:05:03,603 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,605 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,606 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,606 : INFO : EPOCH - 34 : training on 58152 raw words (42571 effective words) took 0.0s, 1153284 effective words/s
2021-11-16 10:05:03,643 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,644 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,645 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,645 : INFO : EPOCH - 35 : training on 58152 raw words (42741 effective words) took 0.0s, 1136146 effective words/s
2021-11-16 10:05:03,683 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,684 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,684 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,685 : INFO : EPOCH - 36 : training on 58152 raw words (42825 effective words) took 0.0s, 1160456 effective words/s
2021-11-16 10:05:03,722 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,723 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,724 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,724 : INFO : EPOCH - 37 : training on 58152 raw words (42707 effective words) took 0.0s, 1152339 effective words/s
2021-11-16 10:05:03,763 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,764 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,764 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,764 : INFO : EPOCH - 38 : training on 58152 raw words (42561 effective words) took 0.0s, 1126831 effective words/s
2021-11-16 10:05:03,801 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,803 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,803 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,803 : INFO : EPOCH - 39 : training on 58152 raw words (42737 effective words) took 0.0s, 1154606 effective words/s
2021-11-16 10:05:03,842 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,842 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,843 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,843 : INFO : EPOCH - 40 : training on 58152 raw words (42875 effective words) took 0.0s, 1138368 effective words/s
2021-11-16 10:05:03,844 : INFO : Doc2Vec lifecycle event {'msg': 'training on 2326080 raw words (1707564 effective words) took 1.7s, 1008257 effective words/s', 'datetime': '2021-11-16T10:05:03.844291', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 12:59:45) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'train'}

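Now, we can use the trained model to infer a vector for any piece of text by passing a list of words to the model.infer_vector function. This vector can then be compared with other vectors via cosine similarity.

vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
print(vector)

Output: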

[-0.08478509  0.05011684  0.0675064  -0.19926868 -0.1235586   0.01768214
 -0.12645927  0.01062329  0.06113973  0.35424358  0.01320948  0.07561274
 -0.01645093  0.0692549   0.08346193 -0.01599065  0.08287009 -0.0139379
 -0.17772709 -0.26271465  0.0442089  -0.04659882 -0.12873884  0.28799203
 -0.13040264  0.12478471 -0.14091878 -0.09698066 -0.07903259 -0.10124907
 -0.28239366  0.13270256  0.04445919 -0.24210942 -0.1907376  -0.07264525
 -0.14167067 -0.22816683 -0.00663796  0.23165748 -0.10436232 -0.01028251
 -0.04064698  0.08813146  0.01072008 -0.149789    0.05923386  0.16301566
  0.05815683  0.1258063]

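Note that infer_vector() does not take a string as input, but rather a list of string tokens, which should have been tokenized in the same way as the words property of the original training document objects.

Also note that because the underlying training/inference algorithms are an iterative approximation problem that makes use of internal randomization, repeated inferences of the same text will return slightly different vectors.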
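For reference, cosine similarity between two vectors is their dot product divided by the product of their lengths. The plain-Python sketch below is for illustration; `cosine_similarity` is a helper defined here, not a Gensim function, and it assumes both vectors are non-zero and of equal length.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the measure depends only on direction, two inferred vectors of different magnitude can still count as highly similar.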

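Assessing the Model

To assess our new model, we will first infer a new vector for each document of the training corpus, compare the inferred vectors with the training corpus, and then return the rank of each document based on self-similarity. Basically, we are pretending that the training corpus is some new unseen data and then seeing how it compares with the trained model. The expectation is that we have probably overfit our model (i.e., all of the ranks will be less than 2), so we should be able to find similar documents very easily. Additionally, we keep track of the second-ranked result for a comparison with less similar documents.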

ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    second_ranks.append(sims[1])

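Let's count how each document ranks with respect to the training corpus. Results vary between runs because of random seeding and the very small corpus.

import collections

counter = collections.Counter(ranks)
print(counter)

Output: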

Counter({0: 292, 1: 8})

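In this run, about 97% of the inferred documents (292 of 300) are found to be most similar to themselves, and about 3% (8 of 300) are mistakenly most similar to another document. Checking an inferred vector against the training vectors is a kind of "sanity check" as to whether the model is behaving in a usefully consistent manner, though it is not a real "accuracy" value.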
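The proportions can be computed directly from the counter. The short sketch below hard-codes the counts from the output above, since they differ from run to run.

```python
from collections import Counter

counter = Counter({0: 292, 1: 8})  # counts copied from the output above
total = sum(counter.values())
self_match = counter[0] / total    # share of documents ranked closest to themselves
print(f"{self_match:.1%} ranked themselves first; {1 - self_match:.1%} did not")
```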

This is great and not entirely surprising. We can take a look at an example:

print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Output:

Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (299, 0.9490002989768982): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»

SECOND-MOST (104, 0.7925528883934021): «australian cricket captain steve waugh has supported fast bowler brett lee after criticism of his intimidatory bowling to the south african tailenders in the first test in adelaide earlier this month lee was fined for giving new zealand tailender shane bond an unsportsmanlike send off during the third test in perth waugh says tailenders should not be protected from short pitched bowling these days you re earning big money you ve got responsibility to learn how to bat he said mean there no times like years ago when it was not professional and sort of bowlers code these days you re professional our batsmen work very hard at their batting and expect other tailenders to do likewise meanwhile waugh says his side will need to guard against complacency after convincingly winning the first test by runs waugh says despite the dominance of his side in the first test south africa can never be taken lightly it only one test match out of three or six whichever way you want to look at it so there lot of work to go he said but it nice to win the first battle definitely it gives us lot of confidence going into melbourne you know the big crowd there we love playing in front of the boxing day crowd so that will be to our advantage as well south africa begins four day match against new south wales in sydney on thursday in the lead up to the boxing day test veteran fast bowler allan donald will play in the warm up match and is likely to take his place in the team for the second test south african captain shaun pollock expects much better performance from his side in the melbourne test we still believe that we didn play to our full potential so if we can improve on our aspects the output we put out on the field will be lot better and we still believe we have side that is good enough to beat australia on our day he said»

MEDIAN (57, 0.24077531695365906): «afghanistan new interim government is to meet for the first time later today after an historic inauguration ceremony in the afghan capital kabul interim president hamid karzai and his fellow cabinet members are looking to start rebuilding afghanistan war ravaged economy mr karzai says he expects the reconstruction to cost many billions of dollars after years of war afghanistan must go from an economy of war to an economy of peace mr karzai said those people who ve earned living by taking the gun must be enabled with programs with plans with projects to put the gun aside and go to the various other forms of economic activity that can bring them livelihood he said»

LEAST (243, -0.0900598019361496): «four afghan factions have reached agreement on an interim cabinet during talks in germany the united nations says the administration which will take over from december will be headed by the royalist anti taliban commander hamed karzai it concludes more than week of negotiations outside bonn and is aimed at restoring peace and stability to the war ravaged country the year old former deputy foreign minister who is currently battling the taliban around the southern city of kandahar is an ally of the exiled afghan king mohammed zahir shah he will serve as chairman of an interim authority that will govern afghanistan for six month period before loya jirga or grand traditional assembly of elders in turn appoints an month transitional government meanwhile united states marines are now reported to have been deployed in eastern afghanistan where opposition forces are closing in on al qaeda soldiers reports from the area say there has been gun battle between the opposition and al qaeda close to the tora bora cave complex where osama bin laden is thought to be hiding in the south of the country american marines are taking part in patrols around the air base they have secured near kandahar but are unlikely to take part in any assault on the city however the chairman of the joint chiefs of staff general richard myers says they are prepared for anything they are prepared for engagements they re robust fighting force and they re absolutely ready to engage if that required he said»

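Note that the most similar document above (usually the same text) has a similarity score approaching 1.0. The score for the second-ranked document, however, should be significantly lower (assuming the documents really are different), and the reason becomes obvious once we examine the texts themselves.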
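The scores reported by `most_similar` are cosine similarities, so a vector compared with itself scores exactly 1.0. In the listing above the top score is about 0.95 rather than 1.0, most likely because the query vector is freshly inferred and is therefore close to, but not identical to, the vector stored during training. The following is a minimal sketch of that behavior in plain Python; the three vectors are hypothetical values chosen for illustration, not output from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, for illustration only
trained = [0.5, -1.0, 2.0]
reinferred = [0.6, -0.9, 1.8]  # same document inferred again: close, not identical
unrelated = [2.0, 1.0, 0.0]    # orthogonal to `trained`

same = cosine_similarity(trained, trained)
near = cosine_similarity(trained, reinferred)
far = cosine_similarity(trained, unrelated)
print(same, near, far)
```

A score near 1.0 means the two vectors point in almost the same direction, a score near 0 means they are unrelated, and a negative score means they point in opposing directions.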

We can run the next cell repeatedly to see a sampling of other target-document comparisons.

import random

# Pick a random document from the training corpus
doc_id = random.randint(0, len(train_corpus) - 1)

# Print it next to its second-most-similar document, looked up in the
# precomputed second_ranks list (each entry is a (doc_id, similarity) pair)
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Output:

Train Document (158): «the afl leading goal kicker tony lockett will nominate for the pre season draft after all lockett approached the sydney swans about return to the game last week but after much media speculation decided it was not in the best interests of his family to come out of retirement today the year old changed his mind he has informed the swans of his intention to nominate for next tuesday pre season draft in statement released short time ago lockett says last week he felt rushed and did not feel comfortable with his decision he says over the weekend he had time to think the matter through with his family who support his comeback sydney says it is delighted lockett has decided to make return and it intends to draft him»

Similar Document (246, 0.7817429900169373): «the afl all time leading goalkicker tony lockett will decide within the next week if he will make comeback lockett has told the sydney swans he is interested in coming out of retirement and placing himself in this month pre season draft lockett retired at the end of the season and will turn in march swans chief executive kelvin templeton says the club would welcome lockett back we re not putting any undue pressure on him mr templeton said the approach really came from tony to us rather than the other way mr templeton says if lockett does make comeback the club would not expect him to play every game he certainly could play role albeit reduced role from the one the fans knew him to hold couple of years back he said»

Testing the Model

Using the same approach as above, we infer a vector for a randomly chosen test document and compare the document to our model by eye.

# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Output:

Test Document (19): «the united nations was determined that its showpiece environment summit the biggest conference the world has ever witnessed should be staged in africa the venue however could not be further removed from the grim realities of life in the rest of africa johannesburg exclusive and formerly whites only suburb of sandton is the wealthiest neighbourhood in the continent just few kilometres from sandton begins the sprawling alexandra township where nearly million people live in squalor organisers of the conference which begins today seem determined that the two worlds should be kept as far apart as possible»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (298, 0.6037775278091431): «university of canberra academic proposal for republic will be one of five discussed at an historic conference starting in corowa today the conference is part of centenary of federation celebrations and recognises the corowa conference of which began the process towards the federation of australia in university of canberra law lecturer bedeharris is proposing three referenda to determine the republic issue they would decide on whether the monarchy should be replaced the codification powers for head of state and the choice of republic model doctor harris says any constitutional change must involve all australians think it is very important that the people of australia be given the opporunity to choose or be consulted at every stage of the process»

MEDIAN (270, 0.27405408024787903): «businessmen solomon lew and lindsay fox have called on the federal government to help break qantas dominance to ensure their bid for ansett is successful the pair met with the victorian premier steve bracks yesterday to update him on the progress of the bid over the weekend the federal government ruled out further assistance for the proposal mr lew says he has not requested financial assistance from the government but review of trade practices could be important he says he is also hopeful the government will help break qantas dominance of the aviation industry we are concerned of the fact that at this point in time the largest competitor has over per cent market share and the deputy prime minister john anderson did quote both to lindsay and myself and publicly that he would regulate it to per cent mr lew said he says the bid does not require any other government help at no time did we ever ask the government for any grant or any cash payment or any dollars from taxpayers what we asked for was for business from the government which will be forthcoming in our opinion and an assurance that there would be trade practices review of the current airline situation»

LEAST (153, -0.07414346933364868): «at least two helicopters have landed near tora bora mountain in eastern afghanistan in what could be the start of raid against al qaeda fighters an afp journalist said the helicopters landed around pm local time am aedt few hours after al qaeda fighters rejected deadline set by afghan militia leaders for them to surrender or face death us warplanes have been bombing the network of caves and tunnels for eight days as part of the hunt for al qaeda leader osama bin laden several witnesses have spoken in recent days of seeing members of us or british special forces near the frontline between the local afghan militia and the followers of bin laden they could not be seen but could be clearly heard as they came into land and strong lights were seen in the same district us bombers and other warplanes staged series of attacks on the al qaeda positions in the white mountains after bin laden fighters failed to surrender all four crew members of us bomber that has crashed in the indian ocean near diego garcia have been rescued us military officials said pentagon spokesman navy captain timothy taylor said initial reports said that all four were aboard the destroyer uss russell which was rushed to the scene after the crash the bomber which usually carries crew of four and is armed with bombs and cruise missiles was engaged in the air war over afghanistan pentagon officials said they had heard about the crash just after am aedt and were unable to say whether the plane was headed to diego garcia or flying from the indian ocean island it is thought the australian arrested in afghanistan for fighting alongside the taliban is from adelaide northern suburbs but the salisbury park family of year old david hicks is remaining silent the president of adelaide islamic society walli hanifi says mr hicks approached him in having just returned from kosovo where he had developed an interest in islam he says mr hicks wanted to know more about the faith but 
left after few weeks late yesterday afternoon mr hicks salisbury park family told media the australian federal police had told them not to comment local residents confirmed member of the family called mr hicks had travelled to kosovo in recent years and has not been seen for around three years but most including karen white agree they cannot imagine mr hicks fighting for terrorist regime not unless he changed now but when he left here no he wasn he just normal teenage adult boy she said but man known as nick told channel ten he is sure the man detained in afghanistan is his friend david he says in david told him about training in the kosovo liberation army he gone through six weeks basic training how he been in the trenches you know killed few people you know confirmed kills and had few of his mates killed as well the man said»
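The MOST, MEDIAN and LEAST labels in the listings above are simply three positions in a list of (document id, similarity) pairs sorted by descending similarity: the first entry, the middle entry and the last entry. The sketch below isolates that selection step in plain Python. The helper name `pick_most_median_least` and the scores are hypothetical and are not taken from the model.

```python
def pick_most_median_least(sims):
    """Return the first, middle and last entries of a list
    that is already sorted by descending similarity."""
    return [
        ('MOST', sims[0]),
        ('MEDIAN', sims[len(sims) // 2]),
        ('LEAST', sims[-1]),
    ]

# Hypothetical (doc_id, similarity) pairs, for illustration only
scores = [(3, 0.12), (0, 0.91), (4, -0.05), (1, 0.47), (2, 0.30)]

# Sort from most to least similar, as the ranked listing is assumed to be
ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)

picks = pick_most_median_least(ranked)
for label, (doc_id, score) in picks:
    print(label, doc_id, score)
```

Looking at all three positions rather than only the top hit gives a quick feel for how the scores are spread. In the test run above, for example, even the best match reaches only about 0.60, which is what one might expect for a document the model never saw during training.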