用数字表示文本
机器学习模型将向量(数字数组)作为输入。在处理文本时,我们必须先想出一种策略,将字符串转换为数字(或将文本“向量化”),然后再其馈入模型。在本部分中,我们将探究实现这一目标的三种策略。
独热编码
作为第一个想法,我们可以对词汇表中的每个单词进行“独热”编码。考虑这样一句话:“The cat sat on the mat”。这句话中的词汇(或唯一单词)是(cat、mat、on、sat、the)。为了表示每个单词,我们将创建一个长度等于词汇量的零向量,然后在与该单词对应的索引中放置一个 1。下图显示了这种方法。
为了创建一个包含句子编码的向量,我们可以将每个单词的独热向量连接起来。
要点:这种方法效率低下。一个独热编码向量十分稀疏(这意味着大多数索引为零)。假设我们的词汇表中有 10,000 个单词。为了对每个单词进行独热编码,我们将创建一个其中 99.99% 的元素都为零的向量。
用一个唯一的数字编码每个单词
我们可以尝试的第二种方法是使用唯一的数字来编码每个单词。继续上面的示例,我们可以将 1 分配给“cat”,将 2 分配给“mat”,依此类推。然后,我们可以将句子“The cat sat on the mat”编码为一个密集向量,例如 [5, 1, 4, 3, 5, 2]。这种方法是高效的。现在,我们有了一个密集向量(所有元素均已满),而不是稀疏向量。
但是,这种方法有两个缺点:
整数编码是任意的(它不会捕获单词之间的任何关系)。
对于要解释的模型而言,整数编码颇具挑战。例如,线性分类器针对每个特征学习一个权重。由于任何两个单词的相似性与其编码的相似性之间都没有关系,因此这种特征权重组合没有意义。
单词嵌入向量
单词嵌入向量为我们提供了一种使用高效、密集表示的方法,其中相似的单词具有相似的编码。重要的是,我们不必手动指定此编码。嵌入向量是浮点值的密集向量(向量的长度是您指定的参数)。它们是可以训练的参数(模型在训练过程中学习的权重,与模型学习密集层权重的方法相同),无需手动为嵌入向量指定值。8 维的单词嵌入向量(对于小型数据集)比较常见,而在处理大型数据集时最多可达 1024 维。维度更高的嵌入向量可以捕获单词之间的细粒度关系,但需要更多的数据来学习。
上面是一个单词嵌入向量的示意图。每个单词都表示为浮点值的 4 维向量。还可以将嵌入向量视为“查找表”。学习完这些权重后,我们可以通过在表中查找对应的密集向量来编码每个单词。
下面来看看代码
import io
import os
import re
import shutil
import string
import tensorflow as tf
from datetime import datetime
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
print(tf.__version__)
"""
输出:2.5.0-dev20201226
"""
下载IMDB数据集
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
untar=True, cache_dir='.',
cache_subdir='')
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)
"""
输出:
Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84131840/84125825 [==============================] - 138s 2us/step
['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']
"""
train文件夹中有pos与neg两个关于电影评论的文件夹,其中数据分别被标记为positive与negative,你可以使用这两个文件夹中的数据去训练一个二元分类模型
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)
"""
输出:
['labeledBow.feat',
'neg',
'pos',
'unsup',
'unsupBow.feat',
'urls_neg.txt',
'urls_pos.txt',
'urls_unsup.txt']
"""
在创建数据集之前应该先移除多余的文件夹,例如unsup
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)
下一步,用tf.keras.preprocessing.text_dataset_from_directory函数创建一个tf.data.Dataset。用train文件夹中数据创建train与validation数据集,validation所占比例为20%(即validation_split为0.2)
batch_size = 1024
seed = 123
train_ds = tf.keras.preprocessing.text_dataset_from_directory(
'aclImdb/train', batch_size=batch_size, validation_split=0.2,
subset='training', seed=seed)
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
'aclImdb/train', batch_size=batch_size, validation_split=0.2,
subset='validation', seed=seed)
"""
输出:
Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
"""
检查数据集中的评论数据以及对应的标签
for text_batch, label_batch in train_ds.take(1):
for i in range(5):
print(label_batch[i].numpy(), text_batch.numpy()[i])
"""
输出:
0 b"Oh My God! Please, for the love of all that is holy, Do Not Watch This Movie! It it 82 minutes of my life I will never get back. Sure, I could have stopped watching half way through. But I thought it might get better. It Didn't. Anyone who actually enjoyed this movie is one seriously sick and twisted individual. No wonder us Australians/New Zealanders have a terrible reputation when it comes to making movies. Everything about this movie is horrible, from the acting to the editing. I don't even normally write reviews on here, but in this case I'll make an exception. I only wish someone had of warned me before I hired this catastrophe"
1 b'This movie is SOOOO funny!!! The acting is WONDERFUL, the Ramones are sexy, the jokes are subtle, and the plot is just what every high schooler dreams of doing to his/her school. I absolutely loved the soundtrack as well as the carefully placed cynicism. If you like monty python, You will love this film. This movie is a tad bit "grease"esk (without all the annoying songs). The songs that are sung are likable; you might even find yourself singing these songs once the movie is through. This musical ranks number two in musicals to me (second next to the blues brothers). But please, do not think of it as a musical per say; seeing as how the songs are so likable, it is hard to tell a carefully choreographed scene is taking place. I think of this movie as more of a comedy with undertones of romance. You will be reminded of what it was like to be a rebellious teenager; needless to say, you will be reminiscing of your old high school days after seeing this film. Highly recommended for both the family (since it is a very youthful but also for adults since there are many jokes that are funnier with age and experience.'
0 b"Alex D. Linz replaces Macaulay Culkin as the central figure in the third movie in the Home Alone empire. Four industrial spies acquire a missile guidance system computer chip and smuggle it through an airport inside a remote controlled toy car. Because of baggage confusion, grouchy Mrs. Hess (Marian Seldes) gets the car. She gives it to her neighbor, Alex (Linz), just before the spies turn up. The spies rent a house in order to burglarize each house in the neighborhood until they locate the car. Home alone with the chicken pox, Alex calls 911 each time he spots a theft in progress, but the spies always manage to elude the police while Alex is accused of making prank calls. The spies finally turn their attentions toward Alex, unaware that he has rigged devices to cleverly booby-trap his entire house. Home Alone 3 wasn't horrible, but probably shouldn't have been made, you can't just replace Macauley Culkin, Joe Pesci, or Daniel Stern. Home Alone 3 had some funny parts, but I don't like when characters are changed in a movie series, view at own risk."
0 b"There's a good movie lurking here, but this isn't it. The basic idea is good: to explore the moral issues that would face a group of young survivors of the apocalypse. But the logic is so muddled that it's impossible to get involved.<br /><br />For example, our four heroes are (understandably) paranoid about catching the mysterious airborne contagion that's wiped out virtually all of mankind. Yet they wear surgical masks some times, not others. Some times they're fanatical about wiping down with bleach any area touched by an infected person. Other times, they seem completely unconcerned.<br /><br />Worse, after apparently surviving some weeks or months in this new kill-or-be-killed world, these people constantly behave like total newbs. They don't bother accumulating proper equipment, or food. They're forever running out of fuel in the middle of nowhere. They don't take elementary precautions when meeting strangers. And after wading through the rotting corpses of the entire human race, they're as squeamish as sheltered debutantes. You have to constantly wonder how they could have survived this long... and even if they did, why anyone would want to make a movie about them.<br /><br />So when these dweebs stop to agonize over the moral dimensions of their actions, it's impossible to take their soul-searching seriously. Their actions would first have to make some kind of minimal sense.<br /><br />On top of all this, we must contend with the dubious acting abilities of Chris Pine. His portrayal of an arrogant young James T Kirk might have seemed shrewd, when viewed in isolation. But in Carriers he plays on exactly that same note: arrogant and boneheaded. It's impossible not to suspect that this constitutes his entire dramatic range.<br /><br />On the positive side, the film *looks* excellent. It's got an over-sharp, saturated look that really suits the southwestern US locale. But that can't save the truly feeble writing nor the paper-thin (and annoying) characters. Even if you're a fan of the end-of-the-world genre, you should save yourself the agony of watching Carriers."
0 b'I saw this movie at an actual movie theater (probably the $2.00 one) with my cousin and uncle. We were around 11 and 12, I guess, and really into scary movies. I remember being so excited to see it because my cool uncle let us pick the movie (and we probably never got to do that again!) and sooo disappointed afterwards!! Just boring and not scary. The only redeeming thing I can remember was Corky Pigeon from Silver Spoons, and that wasn\'t all that great, just someone I recognized. I\'ve seen bad movies before and this one has always stuck out in my mind as the worst. This was from what I can recall, one of the most boring, non-scary, waste of our collective $6, and a waste of film. I have read some of the reviews that say it is worth a watch and I say, "Too each his own", but I wouldn\'t even bother. Not even so bad it\'s good.'
"""
创建一个高性能的数据集(dataset)
这是加载数据时应该使用的两种重要方法,以确保I/O不会阻塞
-
.cache()
:将数据从磁盘加载后保留在内存中。这将确保数据集在训练模型时不会成为瓶颈。如果数据集太大,无法放入内存,也可以使用此方法创建一个性能良好的磁盘缓存,它比许多小文件读取效率更高。 -
.prefetch()
:使数据预处理与模型的训练交替进行
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
使用嵌入层(Embedding层)
Embedding层可以理解成一个从整数索引(代表特定词汇)映射到密集向量(该单词对应的embeddings)的一个查找表。你可以通过试验确定最佳嵌入维度,就和你确定Dense层的最佳神经元个数那样做。
# 输入1000个单词,每个单词用5个维度的向量表示
embedding_layer = tf.keras.layers.Embedding(1000, 5)
当你创建Embedding层时,Embedding层的权重(weights)将会和其他层(layer)一样被随机初始化。在训练过程中,权重会逐渐通过反向传播来进行调整。训练过后,embeddings层将会粗略的编码词汇之间的相似性(这个是针对你所训练模型的特定问题的)。
如果将整数传递给嵌入层,则结果将用嵌入表中的向量替换每个整数。
result = embedding_layer(tf.constant([1,2,3]))
result.numpy()
"""
输出:
array([[-0.01827962, 0.033703 , 0.02065292, 0.00335936, -0.00998179],
[ 0.00618695, -0.02138543, -0.01288087, 0.03814398, -0.02176479],
[-0.02900024, 0.03794893, -0.03229412, 0.04951945, 0.03212232]],
dtype=float32)
"""
对于文本或序列问题,嵌入向量层采用整数组成的 2D 张量,其形状为 (samples, sequence_length),其中每个条目都是一个整数序列。它可以嵌入可变长度的序列。您可以在形状为 (32, 10)(32 个长度为 10 的序列组成的批次)或 (64, 15)(64 个长度为 15 的序列组成的批次)的批次上方嵌入向量层。
返回的张量比输入多一个轴,嵌入向量沿新的最后一个轴对齐。向其传递 (2, 3) 输入批次,输出为 (2, 3, N)
result = embedding_layer(tf.constant([[0,1,2],[3,4,5]]))
result.shape
"""
输出:TensorShape([2, 3, 5])
"""
当给定一个序列批次作为输入时,嵌入向量层将返回形状为 (samples, sequence_length, embedding_dimensionality) 的 3D 浮点张量。
代码可以私信我
注: 本文参考了官网并对其进行了删减以及部分注释与修改