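When processing URL strings, every string has to be padded or truncated to a uniform length.
The VocabularyProcessor API in TensorFlow is used here for that purpose (it lives in tf.contrib, so it is only available in TensorFlow 1.x).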
Function Interface:
tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None)
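Parameters:
max_document_length: the maximum document length. A text longer than this is truncated; a shorter one is padded with 0.
min_frequency: the minimum word frequency. Words that occur fewer times than this are not added to the vocabulary.
vocabulary: a CategoricalVocabulary object to reuse; by default a new one is created.
tokenizer_fn: the tokenizer function; by default a built-in tokenizer is used.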
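Since the inputs here are URLs rather than sentences, a character-level tokenizer is one plausible choice for tokenizer_fn. The sketch below is only an illustration: the name char_tokenizer and the sample URL are made up, and it assumes that tokenizer_fn is called with an iterable of documents and must yield one token list per document.

```python
def char_tokenizer(documents):
    """Yield one list of single-character tokens per document.

    Hypothetical helper: assumes the tokenizer receives an iterable
    of strings and must yield a token list for each of them.
    """
    for document in documents:
        yield list(document)


# Made-up sample URL, used only for illustration.
tokens = list(char_tokenizer(["http://a.io/x"]))
print(tokens[0])
```

Under that assumption it could be passed as VocabularyProcessor(max_document_length, tokenizer_fn=char_tokenizer), in which case max_document_length would be counted in characters rather than words.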
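Using this API generally involves the following steps:
1. Build a vocabulary from the words in the input list;
2. Number the words in order of first appearance, so that every word has an index (starting from 1; <UNK> has index 0);
3. Replace every word with its index, keeping the order of the original list;
4. Truncate documents longer than the maximum length and pad documents shorter than it with 0;
5. Convert the result to a list and then to an array.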
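The steps above can be imitated in plain Python, which makes it easier to see where each id comes from. This is a minimal sketch and not the TensorFlow implementation: the function name fit_transform is reused only for readability, words are split on single spaces, and the vocabulary is built in the same pass as the transformation.

```python
def fit_transform(documents, max_document_length):
    """Pure-Python imitation of the steps above (not the TensorFlow code)."""
    vocab = {"<UNK>": 0}  # id 0 is reserved for unknown words and padding
    rows = []
    for document in documents:
        ids = []
        for word in document.split(" "):
            # Steps 1-2: a new word gets the next free id, in order of appearance.
            if word not in vocab:
                vocab[word] = len(vocab)
            # Step 3: replace the word with its id.
            ids.append(vocab[word])
        # Step 4: truncate long documents, pad short ones with 0.
        ids = ids[:max_document_length]
        ids += [0] * (max_document_length - len(ids))
        rows.append(ids)
    # Step 5: the rows are plain lists, ready to be wrapped in an array.
    return rows, vocab


x_text = ['This is a cat', 'This must be boy', 'This is a a dog']
max_document_length = max(len(doc.split(" ")) for doc in x_text)
rows, vocab = fit_transform(x_text, max_document_length)
print(rows)   # [[1, 2, 3, 4, 0], [1, 5, 6, 7, 0], [1, 2, 3, 3, 8]]
print(vocab)
```

The first two documents have only four words, so they end with a padding 0; the third one fills all five positions.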
import numpy as np
from tensorflow.contrib import learn
x_text = ['This is a cat', 'This must be boy', 'This is a a dog']
# The longest document, counted in space-separated words, sets the common length.
max_document_length = max(len(x.split(" ")) for x in x_text)
# Create the VocabularyProcessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
# Fit the vocabulary and transform the documents into arrays of word ids.
x = np.array(list(vocab_processor.fit_transform(x_text)))
# Extract the word -> id mapping from the object (_mapping is a private attribute).
vocab_dict = vocab_processor.vocabulary_._mapping
# Sort the vocabulary dictionary by value (id).
# Equivalent alternative (requires `import operator`):
# sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key=lambda item: item[1])
# Treat the ids as list indices and build a list of words in ascending order of id:
# the word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0])
print(vocabulary)
print(x)
Step-by-step results:
>>> w=list(vocab_processor.fit_transform(x_text))
>>> w
[array([1, 2, 3, 4, 0]), array([1, 5, 6, 7, 0]), array([1, 2, 3, 3, 8])]
>>> vocab_dict = vocab_processor.vocabulary_._mapping
>>> vocab_dict
{'cat': 4, '<UNK>': 0, 'a': 3, 'be': 6, 'dog': 8, 'boy': 7, 'is': 2, 'must': 5, 'This': 1}
>>> sorted_vocab = sorted(vocab_dict.items(), key = lambda x : x[1])
>>> sorted_vocab
[('<UNK>', 0), ('This', 1), ('is', 2), ('a', 3), ('cat', 4), ('must', 5), ('be', 6), ('boy', 7), ('dog', 8)]
>>> vocabulary = list(list(zip(*sorted_vocab))[0])
>>> vocabulary
['<UNK>', 'This', 'is', 'a', 'cat', 'must', 'be', 'boy', 'dog']
>>> w=list(zip(*sorted_vocab))
>>> w
[('<UNK>', 'This', 'is', 'a', 'cat', 'must', 'be', 'boy', 'dog'), (0, 1, 2, 3, 4, 5, 6, 7, 8)]
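Because the word with id i sits at index i of vocabulary, the list also works as a reverse lookup table. A minimal sketch, reusing the values printed above: the helper name decode is made up for illustration, and it skips id 0, so padding and unknown words are both dropped.

```python
vocabulary = ['<UNK>', 'This', 'is', 'a', 'cat', 'must', 'be', 'boy', 'dog']
rows = [[1, 2, 3, 4, 0], [1, 5, 6, 7, 0], [1, 2, 3, 3, 8]]


def decode(ids, vocabulary):
    """Map ids back to words, skipping id 0 (padding / <UNK>)."""
    return " ".join(vocabulary[i] for i in ids if i != 0)


for ids in rows:
    print(decode(ids, vocabulary))
```

Each row decodes back to its original sentence, which is a quick sanity check that the mapping and the transformed array agree.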