Corpora and Vector Spaces
Demonstrates transforming text into the Vector Space Model representation.
It also introduces corpus streaming and persisting corpora to disk in various formats.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
First, let's create a small corpus of nine short documents:
From Strings to Vectors
This time, let's start from documents represented as strings:
documents = [
"Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey",
]
This is a tiny corpus of nine documents, each consisting of only a single sentence.
First, let's tokenize the documents, remove common words (using a toy stoplist) as well as words that only appear once in the corpus:
from pprint import pprint
from collections import defaultdict
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1] for text in texts]
pprint(texts)
Output:
[['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']]
The way to process documents may vary; here, I only lowercase all words and then split on whitespace to tokenize. In fact, I use this particular (simplistic and inefficient) setup to mimic the experiment done in Deerwester et al.'s original LSA article.
The ways to process documents are so varied and application- and language-dependent that I decided to not constrain them by any interface. Instead, a document is represented by the features extracted from it, not by its "surface" string form: how you get to the features is up to you. Below I describe one common, general-purpose approach (called bag-of-words), but keep in mind that different application domains call for different features, and, as always, it's garbage in, garbage out.
To convert documents to vectors, we'll use a document representation called bag-of-words. In this representation, each document is represented by one vector where each vector element is a question-answer pair, in the style of:
Question: How many times does the word "system" appear in the document?
Answer: Once.
It is advantageous to represent the questions only by their (integer) ids. The mapping between the questions and the ids is called a dictionary:
from gensim import corpora
dictionary = corpora.Dictionary(texts)
dictionary.save("./deerwester.dict")
print(dictionary)
Output:
2021-11-03 09:12:16,942 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-11-03 09:12:16,942 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
2021-11-03 09:12:16,967 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)", 'datetime': '2021-11-03T09:12:16.943477', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 12:59:45) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'created'}
2021-11-03 09:12:16,969 : INFO : Dictionary lifecycle event {'fname_or_handle': './deerwester.dict', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-11-03T09:12:16.969476', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 12:59:45) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'saving'}
2021-11-03 09:12:16,971 : INFO : saved ./deerwester.dict
Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
Here we assigned a unique integer id to all words appearing in the corpus with the gensim.corpora.dictionary.Dictionary class. This class sweeps across the texts, collecting word counts and relevant statistics. In the end, we see there are twelve distinct words in the processed corpus, which means each document will be represented by twelve numbers (i.e. by a 12-dimensional vector). To see the mapping between words and their ids:
pprint(dictionary.token2id)
Output:
{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}
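Since the dictionary was saved to ./deerwester.dict above, it can also be loaded back in a later session; a quick sketch using the same save/load API:
loaded_dict = corpora.Dictionary.load('./deerwester.dict')
print(loaded_dict.token2id == dictionary.token2id)  # True: the word-id mapping round-trips through disk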
To actually convert tokenized documents to vectors:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored
The function doc2bow() simply counts the number of occurrences of each distinct word, converts the word to its integer word id, and returns the result as a sparse vector. The sparse vector [(0, 1), (1, 1)] therefore reads: in the document "Human computer interaction", the words computer (id 0) and human (id 1) appear once each; the other ten dictionary words appear (implicitly) zero times.
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('./deerwester.mm', corpus) # store to disk, for later use
print(corpus)
Output:
2021-11-03 10:18:06,847 : INFO : storing corpus in Matrix Market format to ./deerwester.mm
2021-11-03 10:18:06,848 : INFO : saving sparse matrix to ./deerwester.mm
2021-11-03 10:18:06,849 : INFO : PROGRESS: saving document #0
2021-11-03 10:18:06,851 : INFO : saved 9x12 matrix, density=25.926% (28/108)
2021-11-03 10:18:06,852 : INFO : saving MmCorpus index to ./deerwester.mm.index
[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]
By now it should be clear that the vector feature with id=10 stands for the question "How many times does the word graph appear in the document?". The answer is "zero" for the first six documents and "one" for the remaining three.
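To double-check this programmatically, we can map the id back to its token and count it in each bag-of-words vector (a small illustrative snippet reusing the dictionary and corpus from above):
print(dictionary[10])  # 'graph': a Dictionary also maps ids back to tokens
for doc_no, bow in enumerate(corpus):
    # a bow vector is a list of (id, count) pairs; absent ids count as zero
    print(doc_no, dict(bow).get(10, 0))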
Corpus Streaming – One Document at a Time
Note that the corpus above resides fully in memory, as a plain Python list. In this simple example it doesn't matter much, but just to make things clear, let's assume there are millions of documents in the corpus. Storing all of them in RAM won't do. Instead, let's assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus must be able to return one document vector at a time:
from smart_open import open  # for transparently opening remote files

class MyCorpus:
    def __iter__(self):
        for line in open('https://radimrehurek.com/mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())
The full power of Gensim comes from the fact that a corpus doesn't have to be a list, a NumPy array, a Pandas dataframe, or anything else. Gensim accepts any object that, when iterated over, successively yields documents.
# This flexibility allows you to create your own corpus classes that stream the
# documents directly from disk, network, database, dataframes... The models
# in Gensim are implemented such that they don't require all vectors to reside
# in RAM at once. You can even create the documents on the fly!
Download the sample mycorpus.txt file here. The assumption that each document occupies one line in a single file is not important; you can mold the __iter__ function to fit your input format, whatever it is: walking directories, parsing XML, accessing the network... Just parse your input to retrieve a clean list of tokens in each document, then convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside __iter__.
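For example, a corpus class that streams one document per plain-text file from a directory could look like the sketch below (the ./docs directory layout and the *.txt naming are hypothetical):
import os

class TxtDirCorpus:
    """Stream one bag-of-words vector per *.txt file in a directory."""
    def __init__(self, dirname, dictionary):
        self.dirname = dirname
        self.dictionary = dictionary

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            if not fname.endswith('.txt'):
                continue
            with open(os.path.join(self.dirname, fname)) as fin:
                # one document per file: tokenize the whole file on whitespace
                yield self.dictionary.doc2bow(fin.read().lower().split())

corpus_from_dir = TxtDirCorpus('./docs', dictionary)  # hypothetical directory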
corpus_memory_friendly = MyCorpus() # doesn't load the corpus into memory!
print(corpus_memory_friendly)
Output:
<__main__.MyCorpus object at 0x7f92f7c17d90>
The corpus is now an object. We didn't define any way to print it, so print just outputs the address of the object in memory. Not very useful. To see the constituent vectors, let's iterate over the corpus and print each document vector (one at a time):
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)
Output:
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.
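As an aside, this is also why MyCorpus defines __iter__ on a class rather than being a plain generator function: Gensim models may need to iterate over a corpus more than once, and a bare generator is exhausted after a single pass, whereas each call to __iter__ on the class starts a fresh one. A quick sketch:
def corpus_generator():
    for line in open('https://radimrehurek.com/mycorpus.txt'):
        yield dictionary.doc2bow(line.lower().split())

gen = corpus_generator()
print(sum(1 for _ in gen))  # 9: the first pass sees every document
print(sum(1 for _ in gen))  # 0: the generator is now exhausted
print(sum(1 for _ in MyCorpus()))  # 9: a class-based corpus restarts on each pass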
Similarly, to construct the dictionary without loading all the texts into memory:
# collect statistics about all tokens
dictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/mycorpus.txt'))
# remove stop words and words that appear only once
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids) # remove stop words and words that appear only once
dictionary.compactify() # remove gaps in id sequence after words that were removed
print(dictionary)
Output:
2021-11-03 11:19:05,517 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-11-03 11:19:05,518 : INFO : built Dictionary(42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 69 corpus positions)
2021-11-03 11:19:05,519 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 69 corpus positions)", 'datetime': '2021-11-03T11:19:05.519274', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 12:59:45) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'created'}
Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
And that is all there is to it! At least as far as the bag-of-words representation is concerned. Of course, what we do with such a corpus is another question; it is not at all clear how counting the frequency of distinct words could be useful. As it turns out, it isn't, and we will need to apply a transformation on this simple representation first before we can use it to compute any meaningful document-to-document similarities. Transformations are covered in the next tutorial (Topics and Transformations), but before that, let's briefly turn our attention to corpus persistency.
Corpus Formats
There exist several file formats for serializing a Vector Space corpus (a sequence of vectors) to disk. Gensim implements them via the streaming corpus interface mentioned earlier: documents are read from (or stored to) disk in a lazy fashion, one document at a time, without the whole corpus being read into main memory at once.
One of the more notable file formats is the Matrix Market format. To save a corpus in the Matrix Market format:
# create a toy corpus of 2 documents, as a plain Python list
corpus = [[(1, 0.5)], []] # make one document empty, for the heck of it
corpora.MmCorpus.serialize('./corpus.mm', corpus)
Output:
2021-11-04 09:42:02,152 : INFO : storing corpus in Matrix Market format to ./corpus.mm
2021-11-04 09:42:02,153 : INFO : saving sparse matrix to ./corpus.mm
2021-11-04 09:42:02,154 : INFO : PROGRESS: saving document #0
2021-11-04 09:42:02,154 : INFO : saved 2x2 matrix, density=25.000% (1/4)
2021-11-04 09:42:02,155 : INFO : saving MmCorpus index to ./corpus.mm.index
Other formats include Joachim's SVMlight format, Blei's LDA-C format and the GibbsLDA++ format.
corpora.SvmLightCorpus.serialize('./corpus.svmlight', corpus)
corpora.BleiCorpus.serialize('./corpus.lda-c', corpus)
corpora.LowCorpus.serialize('./corpus.low', corpus)
Output:
2021-11-04 09:51:35,949 : INFO : converting corpus to SVMlight format: ./corpus.svmlight
2021-11-04 09:51:35,950 : INFO : saving SvmLightCorpus index to ./corpus.svmlight.index
2021-11-04 09:51:35,951 : INFO : no word id mapping provided; initializing from corpus
2021-11-04 09:51:35,952 : INFO : storing corpus in Blei's LDA-C format into ./corpus.lda-c
2021-11-04 09:51:35,953 : INFO : saving vocabulary of 2 words to ./corpus.lda-c.vocab
2021-11-04 09:51:35,954 : INFO : saving BleiCorpus index to ./corpus.lda-c.index
2021-11-04 09:51:35,955 : INFO : no word id mapping provided; initializing from corpus
2021-11-04 09:51:35,956 : INFO : storing corpus in List-Of-Words format into ./corpus.low
2021-11-04 09:51:35,956 : WARNING : List-of-words format can only save vectors with integer elements; 1 float entries were truncated to integer value
2021-11-04 09:51:35,957 : INFO : saving LowCorpus index to ./corpus.low.index
Conversely, to load a corpus iterator from a Matrix Market file:
corpus = corpora.MmCorpus('./corpus.mm')
Output:
2021-11-04 09:53:50,068 : INFO : loaded corpus index from ./corpus.mm.index
2021-11-04 09:53:50,069 : INFO : initializing cython corpus reader from ./corpus.mm
2021-11-04 09:53:50,071 : INFO : accepted corpus with 2 documents, 2 features, 1 non-zero entries
Corpus objects are streams, so you typically won't be able to print them directly:
print(corpus)
Output:
MmCorpus(2 documents, 2 features, 1 non-zero entries)
Instead, to view the contents of a corpus:
# one way of printing a corpus: load it entirely into memory
print(list(corpus)) # calling list() will convert any sequence to a plain Python list
Output:
[[(1, 0.5)], []]
Or:
# another way of doing it: print one document at a time, making use of the streaming interface
for doc in corpus:
    print(doc)
Output:
[(1, 0.5)]
[]
The second way is obviously more memory-friendly, but for testing and development purposes, nothing beats the simplicity of calling list(corpus).
To save the same Matrix Market document stream in Blei's LDA-C format:
corpora.BleiCorpus.serialize('./corpus.lda-c', corpus)
Output:
2021-11-04 10:38:29,423 : INFO : no word id mapping provided; initializing from corpus
2021-11-04 10:38:29,424 : INFO : storing corpus in Blei's LDA-C format into ./corpus.lda-c
2021-11-04 10:38:29,425 : INFO : saving vocabulary of 2 words to ./corpus.lda-c.vocab
2021-11-04 10:38:29,426 : INFO : saving BleiCorpus index to ./corpus.lda-c.index
In this way, gensim can also be used as a memory-efficient I/O format conversion tool: just load a document stream using one format and immediately save it in another format. Adding new formats is dead easy; check out the code for the SVMlight corpus for an example.
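For example, converting the Matrix Market file from above into SVMlight format in a single streaming pass, using only the classes already shown:
mm = corpora.MmCorpus('./corpus.mm')  # lazily streams documents from disk
corpora.SvmLightCorpus.serialize('./corpus.svmlight', mm)  # re-serialize in another format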
Compatibility with NumPy and SciPy
Gensim also contains efficient utility functions to help converting from/to NumPy matrices:
import gensim
import numpy as np
numpy_matrix = np.random.randint(10, size=[5, 2]) # random matrix as an example
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
# numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms=number_of_corpus_features)
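The commented-out line above hints at the inverse conversion. A concrete round trip might look like this (note that Dense2Corpus treats matrix columns as documents by default, so num_terms equals the number of rows):
dense = gensim.matutils.corpus2dense(corpus, num_terms=numpy_matrix.shape[0])
print(np.allclose(dense, numpy_matrix))  # True: the round trip preserves the matrix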
...and from/to scipy.sparse matrices:
import scipy.sparse
scipy_sparse_matrix = scipy.sparse.random(5, 2) # random sparse matrix as example
corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
What Next?
Read on to the next tutorial: Topics and Transformations.
Reference
For a complete reference (want to prune the dictionary to a smaller size? optimize the conversion between corpora and NumPy/SciPy arrays?), see the API reference.
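As one example of the pruning mentioned above, Dictionary.filter_extremes can cap the vocabulary (a sketch; the threshold values below are arbitrary):
# keep tokens that appear in at least 2 documents and at most 50% of them, capped at 100k tokens
dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=100000)
print(dictionary)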