1. Abbreviation
The paper "A Neural Probabilistic Language Model", commonly abbreviated NNLM and first-authored by Yoshua Bengio, is a classic neural language model.
2. Abstract
The goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: the traditional but very successful n-gram approach obtains generalization by concatenating very short overlapping word sequences seen in the training set.
We propose to fight the curse of dimensionality by learning a distributed representation for words. The model simultaneously learns:
- a distributed representation for each word
- the probability function for word sequences, expressed in terms of these representations
Generalization is obtained because a word sequence that has never been seen before gets a high probability if it is made of words that are similar (in the sense of having nearby representations) to the words of an already seen sentence.
Training such a large model (with millions of parameters) within a reasonable amount of time is itself a significant challenge.
The paper reports on experiments that use neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models and that it allows taking advantage of longer contexts.
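To get a concrete sense of the curse of dimensionality mentioned above, here is an illustrative back-of-the-envelope calculation (toy numbers, not taken from the abstract): a full table of joint probabilities over n consecutive words from a vocabulary V has on the order of |V|^n - 1 free parameters.

V, n = 100_000, 3   # hypothetical vocabulary size and context length
print(V ** n - 1)   # ~10^15 entries for a plain trigram table: far too many to estimate from data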
3. Core
More precisely, the neural network computes the following function with a softmax output layer, which guarantees positive probabilities summing to 1:

$$\hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$

Here the $y_i$ are the unnormalized log-probabilities for each output word $i$, computed as follows with parameters $b$, $W$, $U$, $d$ and $H$:

$$y = b + Wx + U\tanh(d + Hx)$$

where the hyperbolic tangent $\tanh$ is applied element by element, $W$ is optionally zero (no direct connections), and $x$ is the word features layer activation vector, i.e. the concatenation of the input word features taken from the matrix $C$:

$$x = \left(C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1})\right)$$

Let $h$ be the number of hidden units and $m$ the number of features associated with each word. When no direct connections from the word features to the outputs are desired, the matrix $W$ is set to 0. The free parameters of the model are the output biases $b$ (with $|V|$ elements), the hidden layer biases $d$ (with $h$ elements), the hidden-to-output weights $U$ (a $|V| \times h$ matrix), the word-features-to-output weights $W$ (a $|V| \times (n-1)m$ matrix), the hidden layer weights $H$ (an $h \times (n-1)m$ matrix), and the word features $C$ (a $|V| \times m$ matrix):

$$\theta = (b, d, W, U, H, C)$$
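As a minimal sketch of the computation above (toy sizes and random values assumed for illustration; this is not the paper's code), $y = b + Wx + U\tanh(d + Hx)$ can be written out with explicit matrices in PyTorch:

import torch

V, n, m, h = 100, 3, 10, 30          # toy vocabulary size, n-gram order, features per word, hidden units
C = torch.randn(V, m)                # word feature matrix, one m-dimensional row per word
H = torch.randn(h, (n - 1) * m)      # hidden layer weights
d = torch.randn(h)                   # hidden layer biases
U = torch.randn(V, h)                # hidden-to-output weights
W = torch.randn(V, (n - 1) * m)      # direct word-features-to-output weights (optionally zero)
b = torch.randn(V)                   # output biases

context = torch.tensor([5, 17])      # hypothetical indices of w_{t-1}, w_{t-2}
x = C[context].reshape(-1)           # concatenation of the input word features, size (n-1)m
y = b + W @ x + U @ torch.tanh(d + H @ x)   # unnormalized log-probabilities, size |V|
p = torch.softmax(y, dim=0)          # P(w_t | context): positive and summing to 1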
The number of free parameters is $|V|(1 + nm + h) + h(1 + (n-1)m)$; the dominating factor is $|V|(nm + h)$. Note that, in theory, if there were weight decay on the weights $W$ and $H$ but not on $C$, then $W$ and $H$ could converge towards zero while $C$ would blow up. In practice, this behavior was not observed when training with stochastic gradient ascent.
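A quick sanity check of this count with the toy sizes used above (an illustration, not from the paper):

V, n, m, h = 100, 3, 10, 30
total = V * (1 + n * m + h) + h * (1 + (n - 1) * m)   # |V|(1 + nm + h) + h(1 + (n-1)m)
# same as summing the individual parameter blocks b, d, U, W, H, C
assert total == V + h + V * h + V * (n - 1) * m + h * (n - 1) * m + V * m
print(total)   # 6730 here; as |V| grows, the |V|(nm + h) term dominates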
Stochastic gradient ascent on the neural network consists in performing the following iterative update after presenting the $t$-th word of the training corpus:

$$\theta \leftarrow \theta + \varepsilon\, \frac{\partial \log \hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1})}{\partial \theta}$$

where $\varepsilon$ is the learning rate.
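In practice, gradient ascent on $\log \hat{P}$ is the same as gradient descent on the negative log-likelihood, which is what the PyTorch code below optimizes via cross-entropy. Here is a minimal self-contained sketch of one such update with explicit parameters (toy sizes and random values; not the paper's code):

import torch

V, n, m, h, eps = 100, 3, 10, 30, 0.001
# toy parameters theta = (b, d, W, U, H, C) with gradients enabled
b = torch.randn(V, requires_grad=True)
d = torch.randn(h, requires_grad=True)
W = torch.randn(V, (n - 1) * m, requires_grad=True)
U = torch.randn(V, h, requires_grad=True)
H = torch.randn(h, (n - 1) * m, requires_grad=True)
C = torch.randn(V, m, requires_grad=True)

context, target = torch.tensor([5, 17]), 42        # hypothetical training example (w_{t-1}, w_{t-2}) -> w_t
x = C[context].reshape(-1)
y = b + W @ x + U @ torch.tanh(d + H @ x)
loss = -torch.log_softmax(y, dim=0)[target]        # -log P(w_t | w_{t-1}, ..., w_{t-n+1})
loss.backward()
with torch.no_grad():
    for param in (b, d, W, U, H, C):
        param -= eps * param.grad                  # theta <- theta + eps * d log P / d theta
        param.grad = None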
Note that a large fraction of the parameters does not need to be updated or visited after each example: the word features $C(j)$ of all words $j$ that do not occur in the input window stay untouched.
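This sparsity already shows up in the sketch above: after `loss.backward()`, only the rows of `C.grad` belonging to the words in the input window are nonzero. In PyTorch, `nn.Embedding` can also be asked to produce sparse gradients so that an optimizer only touches the looked-up rows; a small illustration (not part of the paper or of the code below):

import torch
import torch.nn as nn

emb = nn.Embedding(100, 10, sparse=True)   # sparse=True: only looked-up rows receive gradient entries
window = torch.tensor([5, 17])             # hypothetical input window
emb(window).sum().backward()
print(emb.weight.grad)                     # a sparse gradient tensor touching only rows 5 and 17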
Code
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# Build the vocabulary and the word<->index lookup dictionaries
vocab = set(sentence)
print(vocab)
word2index = {w:i for i, w in enumerate(vocab)}
print(word2index)
index2word = {i:w for i, w in enumerate(vocab)}
print(index2word)
# Prepare the n-gram training data: each tuple is ([word_i-2, word_i-1], target_word)
trigrams = [([sentence[i], sentence[i+1]], sentence[i+2]) for i in range(len(sentence)-2)]
print(trigrams[0])
# Model hyperparameters
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# Define the model
class NGramLanguageModeler(nn.Module):
    def __init__(self, vocab_size, context_size, embedding_dim, hidden_dim):
        super(NGramLanguageModeler, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)            # word feature matrix C
        self.linear1 = nn.Linear(context_size * embedding_dim, hidden_dim)  # computes d + Hx
        self.linear2 = nn.Linear(context_size * embedding_dim, vocab_size)  # direct connections (the Wx term)
        self.linear3 = nn.Linear(hidden_dim, vocab_size)                    # hidden-to-output (the U tanh(.) term)

    def forward(self, inputs):
        embeds = self.embedding(inputs).view(1, -1)       # x: concatenated word features
        out = torch.tanh(self.linear1(embeds))            # tanh(d + Hx)
        out = self.linear3(out) + self.linear2(embeds)    # y = b + Wx + U tanh(d + Hx)
        return out
losses = []
loss_function = nn.CrossEntropyLoss()  # cross-entropy = -log softmax probability of the target word
model = NGramLanguageModeler(len(vocab), CONTEXT_SIZE, EMBEDDING_DIM, 128)
optimizer = optim.SGD(model.parameters(), lr=0.001)
for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:
        # Step 1. Prepare the inputs to be passed to the model
        context_idx = torch.tensor([[word2index[w]] for w in context], dtype=torch.long)
        # Step 2. Before passing in a new instance, zero out the gradients from the old instance
        model.zero_grad()
        # Step 3. Run the forward pass
        out = model(context_idx)
        # Step 4. Compute the loss
        loss = loss_function(out, torch.tensor([word2index[target]], dtype=torch.long))
        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()
        # Get the Python number from a 1-element tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreases with every pass over the training data
# Results
{'and', 'the', 'answer', 'gazed', 'besiege', 'To', "'This", 'mine', 'old', 'Thy', 'own', 'blood', 'now,', 'thy', 'say,', "youth's", 'worth', 'thriftless', 'of', 'Will', 'a', 'use,', 'thine', 'where', 'count,', 'Shall', 'Where', 'sum', 'much', "deserv'd", 'succession', 'new', 'held:', 'to', 'And', 'praise.', 'When', 'livery', 'all-eating', "beauty's", 'within', 'be', 'treasure', 'weed', 'How', 'deep', 'all', 'trenches', 'more', 'eyes,', "feel'st", 'beauty', 'sunken', 'forty', 'winters', 'This', 'shall', 'my', 'thou', 'proud', 'Proving', 'when', 'warm', 'dig', 'shame,', 'lusty', 'in', 'small', 'field,', 'an', 'it', 'couldst', 'make', 'thine!', "excuse,'", 'being', 'Then', 'art', 'brow,', 'see', 'cold.', 'fair', 'were', 'his', 'so', 'lies,', 'made', 'days;', 'child', 'If', 'on', 'praise', 'by', 'asked,', 'old,', "totter'd", 'Were'}
{'and': 0, 'the': 1, 'answer': 2, 'gazed': 3, 'besiege': 4, 'To': 5, "'This": 6, 'mine': 7, 'old': 8, 'Thy': 9, 'own': 10, 'blood': 11, 'now,': 12, 'thy': 13, 'say,': 14, "youth's": 15, 'worth': 16, 'thriftless': 17, 'of': 18, 'Will': 19, 'a': 20, 'use,': 21, 'thine': 22, 'where': 23, 'count,': 24, 'Shall': 25, 'Where': 26, 'sum': 27, 'much': 28, "deserv'd": 29, 'succession': 30, 'new': 31, 'held:': 32, 'to': 33, 'And': 34, 'praise.': 35, 'When': 36, 'livery': 37, 'all-eating': 38, "beauty's": 39, 'within': 40, 'be': 41, 'treasure': 42, 'weed': 43, 'How': 44, 'deep': 45, 'all': 46, 'trenches': 47, 'more': 48, 'eyes,': 49, "feel'st": 50, 'beauty': 51, 'sunken': 52, 'forty': 53, 'winters': 54, 'This': 55, 'shall': 56, 'my': 57, 'thou': 58, 'proud': 59, 'Proving': 60, 'when': 61, 'warm': 62, 'dig': 63, 'shame,': 64, 'lusty': 65, 'in': 66, 'small': 67, 'field,': 68, 'an': 69, 'it': 70, 'couldst': 71, 'make': 72, 'thine!': 73, "excuse,'": 74, 'being': 75, 'Then': 76, 'art': 77, 'brow,': 78, 'see': 79, 'cold.': 80, 'fair': 81, 'were': 82, 'his': 83, 'so': 84, 'lies,': 85, 'made': 86, 'days;': 87, 'child': 88, 'If': 89, 'on': 90, 'praise': 91, 'by': 92, 'asked,': 93, 'old,': 94, "totter'd": 95, 'Were': 96}
{0: 'and', 1: 'the', 2: 'answer', 3: 'gazed', 4: 'besiege', 5: 'To', 6: "'This", 7: 'mine', 8: 'old', 9: 'Thy', 10: 'own', 11: 'blood', 12: 'now,', 13: 'thy', 14: 'say,', 15: "youth's", 16: 'worth', 17: 'thriftless', 18: 'of', 19: 'Will', 20: 'a', 21: 'use,', 22: 'thine', 23: 'where', 24: 'count,', 25: 'Shall', 26: 'Where', 27: 'sum', 28: 'much', 29: "deserv'd", 30: 'succession', 31: 'new', 32: 'held:', 33: 'to', 34: 'And', 35: 'praise.', 36: 'When', 37: 'livery', 38: 'all-eating', 39: "beauty's", 40: 'within', 41: 'be', 42: 'treasure', 43: 'weed', 44: 'How', 45: 'deep', 46: 'all', 47: 'trenches', 48: 'more', 49: 'eyes,', 50: "feel'st", 51: 'beauty', 52: 'sunken', 53: 'forty', 54: 'winters', 55: 'This', 56: 'shall', 57: 'my', 58: 'thou', 59: 'proud', 60: 'Proving', 61: 'when', 62: 'warm', 63: 'dig', 64: 'shame,', 65: 'lusty', 66: 'in', 67: 'small', 68: 'field,', 69: 'an', 70: 'it', 71: 'couldst', 72: 'make', 73: 'thine!', 74: "excuse,'", 75: 'being', 76: 'Then', 77: 'art', 78: 'brow,', 79: 'see', 80: 'cold.', 81: 'fair', 82: 'were', 83: 'his', 84: 'so', 85: 'lies,', 86: 'made', 87: 'days;', 88: 'child', 89: 'If', 90: 'on', 91: 'praise', 92: 'by', 93: 'asked,', 94: 'old,', 95: "totter'd", 96: 'Were'}
(['When', 'forty'], 'winters')
[542.6012270450592, 536.4575519561768, 530.3622291088104, 524.314457654953, 518.3134853839874, 512.3586511611938, 506.44934606552124, 500.58502769470215, 494.7652368545532, 488.98955368995667]
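As a usage note, the trained model can be queried for the most likely next word of a context. A minimal sketch (it assumes model, word2index and index2word from the script above are still in scope):

context = ['When', 'forty']
context_idx = torch.tensor([[word2index[w]] for w in context], dtype=torch.long)
with torch.no_grad():
    logits = model(context_idx)            # unnormalized scores over the vocabulary
probs = torch.softmax(logits, dim=1)
predicted = index2word[int(torch.argmax(probs, dim=1))]
print(predicted)                           # ideally 'winters' after enough training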
References
- Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3, 1137-1155.