Bidirectional LSTM-CRF Models for Sequence Tagging (2015)
Abstract
In this paper, we propose a variety of Long Short-Term Memory (LSTM) based models for sequence tagging. These models include LSTM networks, bidirectional LSTM (BI-LSTM) networks, LSTM with a Conditional Random Field (CRF) layer (LSTM-CRF), and bidirectional LSTM with a CRF layer (BI-LSTM-CRF). Our work is the first to apply a bidirectional LSTM CRF (denoted as BI-LSTM-CRF) model to NLP benchmark sequence tagging data sets.
We show that the BI-LSTM-CRF model can efficiently use both past and future input features thanks to a bidirectional LSTM component. It can also use sentence-level tag information thanks to a CRF layer. The BI-LSTM-CRF model can produce state-of-the-art (or close to it) accuracy on POS, chunking and NER data sets. In addition, it is robust and has less dependence on word embeddings than previously observed.
1 Introduction
Sequence tagging, including part-of-speech tagging (POS), chunking, and named entity recognition (NER), has been a classic NLP task. It has drawn research attention for a few decades. The output of taggers can be used for downstream applications. For example, a named entity recognizer trained on user search queries can be utilized to identify which spans of text are products, thus triggering ads for certain products. Another example is that such tag information can be used by a search engine to find relevant webpages.
Most existing sequence tagging models are linear statistical models, including Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMMs) (McCallum et al., 2000), and Conditional Random Fields (CRF) (Lafferty et al., 2001). Convolutional network based models (Collobert et al., 2011) have recently been proposed to tackle the sequence tagging problem. We denote such a model as Conv-CRF, as it consists of a convolutional network and a CRF layer on the output (the term sentence-level log-likelihood (SLL) was used in the original paper). The Conv-CRF model has generated promising results on sequence tagging tasks.
In the speech language understanding community, models based on recurrent neural networks (Mesnil et al., 2013; Yao et al., 2014) and convolutional nets (Xu and Sarikaya, 2013) have recently been proposed. Other relevant work includes (Graves et al., 2005; Graves et al., 2013), which proposed bidirectional recurrent neural networks for speech recognition.
In this paper, we propose a variety of neural network based models for the sequence tagging task. These models include LSTM networks, bidirectional LSTM networks (BI-LSTM), LSTM networks with a CRF layer (LSTM-CRF), and bidirectional LSTM networks with a CRF layer (BI-LSTM-CRF). Our contributions can be summarized as follows.
- We systematically compare the performance of the aforementioned models on NLP tagging data sets;
- Our work is the first to apply a bidirectional LSTM CRF (denoted as BI-LSTM-CRF) model to NLP benchmark sequence tagging data sets. This model can use both past and future input features thanks to a bidirectional LSTM component. In addition, this model can use sentence-level tag information thanks to a CRF layer. Our model can produce state-of-the-art (or close to it) accuracy on POS, chunking and NER data sets;
- We show that the BI-LSTM-CRF model is robust and has less dependence on word embeddings than previously observed (Collobert et al., 2011). It can produce accurate tagging performance without resorting to word embeddings.
The remainder of the paper is organized as follows. Section 2 describes the sequence tagging models used in this paper. Section 3 shows the training procedure. Section 4 reports the experimental results. Section 5 discusses related research. Finally, Section 6 draws conclusions.
2 Models
In this section, we describe the models used in this paper: LSTM, BI-LSTM, CRF, LSTM-CRF and BI-LSTM-CRF.
2.1 LSTM Networks
Recurrent neural networks (RNNs) have been employed to produce promising results on a variety of tasks, including language modeling (Mikolov et al., 2010; Mikolov et al., 2011) and speech recognition (Graves et al., 2005). An RNN maintains a memory based on history information, which enables the model to predict the current output conditioned on long-distance features.
Figure 1 shows the RNN structure (Elman, 1990), which has an input layer x, a hidden layer h and an output layer y. In the named entity tagging context, x represents input features and y represents tags. Figure 1 illustrates a named entity recognition system in which each word is tagged as Other (O) or as one of four entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). The sentence "EU rejects German call to boycott British lamb." is tagged as B-ORG O B-MISC O O O B-MISC O O, where the B- and I- tags indicate the beginning and intermediate positions of entities.
The input layer represents features at time t. They could be one-hot encodings of word features, dense vector features, or sparse features. The input layer has the same dimensionality as the feature size. The output layer represents a probability distribution over labels at time t; it has the same dimensionality as the number of labels. Compared to a feedforward network, an RNN introduces a connection between the previous hidden state and the current hidden state (and thus the recurrent layer weight parameters). This recurrent layer is designed to store history information. The values in the hidden and output layers are computed as follows:
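h(t) = f(U x(t) + W h(t-1))
y(t) = g(V h(t))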
where U, W, and V are the connection weights learned at training time, and f(z) and g(z) are the sigmoid and softmax activation functions, f(z) = 1 / (1 + e^(-z)) and g(z_m) = e^(z_m) / Σ_k e^(z_k).
In this paper, we apply Long Short-Term Memory (Hochreiter and Schmidhuber, 1997; Graves et al., 2005) to sequence tagging. Long Short-Term Memory networks are the same as RNNs, except that the hidden layer updates are replaced by purpose-built memory cells. As a result, they may be better at finding and exploiting long-range dependencies in the data. Fig. 2 illustrates a single LSTM memory cell (Graves et al., 2005). The LSTM memory cell is implemented as follows:
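i_t = σ(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)
c_t = f_t c_{t-1} + i_t tanh(W_xc x_t + W_hc h_{t-1} + b_c)
o_t = σ(W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o)
h_t = o_t tanh(c_t)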
where σ is the logistic sigmoid function, and i, f, o and c are the input gate, forget gate, output gate and cell vectors, all of which are the same size as the hidden vector h. The weight matrix subscripts have the meaning their names suggest: for example, W_hi is the hidden-input gate matrix, W_xo is the input-output gate matrix, and so on. The weight matrices from the cell to the gate vectors (e.g. W_ci) are diagonal, so element m of each gate vector only receives input from element m of the cell vector.
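As a rough sketch, one step of such a memory cell could be written as below. The parameter names mirror the subscripts above and are purely illustrative (not the paper's code); the diagonal cell-to-gate matrices are represented as vectors and applied elementwise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, p):
    """One step of a peephole LSTM memory cell; `p` is a dict of parameters
    whose names mirror the weight subscripts in the text (illustrative only)."""
    # Input, forget and output gates; w_ci / w_cf / w_co are the diagonal
    # cell-to-gate weights, stored as vectors and multiplied elementwise.
    i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["w_co"] * c + p["b_o"])
    h = o * np.tanh(c)   # hidden vector passed to the next time step / output layer
    return h, c
```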
Fig. 3 shows an LSTM sequence tagging model which employs the aforementioned LSTM memory cells (dashed boxes with rounded corners).
2.2 Bidirectional LSTM Networks
In a sequence tagging task, we have access to both past and future input features at a given time; we can thus utilize a bidirectional LSTM network (Figure 4), as proposed in (Graves et al., 2013). In doing so, we can efficiently make use of past features (via forward states) and future features (via backward states) for a specific time frame. We train bidirectional LSTM networks using back-propagation through time (BPTT) (Boden, 2002).
The forward and backward passes over the unfolded network over time are carried out in a way similar to the forward and backward passes of a regular network, except that we need to unfold the hidden states for all time steps. We also need special treatment at the beginning and the end of the data points. In our implementation, we do a forward and backward pass over whole sentences, and we only need to reset the hidden states to 0 at the beginning of each sentence. We have a batch implementation which enables multiple sentences to be processed at the same time.
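A minimal sketch of such a bidirectional LSTM tagger in PyTorch might look like the following; class and parameter names are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal bidirectional LSTM tagger sketch (hypothetical names)."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, word_ids):
        # word_ids: (batch, seq_len). Passing no initial state is equivalent to
        # resetting the forward and backward hidden states to 0 for each sentence.
        states, _ = self.lstm(self.embed(word_ids))  # concatenated fwd/bwd states
        return self.out(states)                      # per-position tag scores

# Example: tag scores for a batch of 2 sentences of length 7
scores = BiLSTMTagger(10000, 50, 100, 9)(torch.randint(0, 10000, (2, 7)))
```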
2.3 CRF networks
There are two different ways to make use of neighboring tag information when predicting current tags. The first is to predict a distribution of tags for each time step and then use beam-like decoding to find optimal tag sequences. The maximum entropy classifier (Ratnaparkhi, 1996) and Maximum Entropy Markov Models (MEMMs) (McCallum et al., 2000) fall into this category. The second is to focus on the sentence level instead of individual positions, which leads to Conditional Random Field (CRF) models (Lafferty et al., 2001) (Fig. 5). Note that the inputs and outputs are directly connected, as opposed to LSTM and bidirectional LSTM networks where memory cells/recurrent components are employed.
It has been shown that CRFs can produce higher tagging accuracy in general. It is interesting that the relation between these two ways of using tag information bears a resemblance to the two ways of using input features (see the aforementioned LSTM and BI-LSTM networks), and the results in this paper confirm the superiority of BI-LSTM over LSTM.
2.4 LSTM-CRF networks
We combine an LSTM network and a CRF network to form an LSTM-CRF model, which is shown in Fig. 6. This network can efficiently use past input features via an LSTM layer and sentence-level tag information via a CRF layer.
A CRF layer is represented by lines which connect consecutive output layers. A CRF layer has a state transition matrix as parameters. With such a layer, we can efficiently use past and future tags to predict the current tag, which is similar to the use of past and future input features via a bidirectional LSTM network. We consider the matrix of scores f_θ([x]_1^T) output by the network. We drop the input [x]_1^T to simplify notation. The element [f_θ]_{i,t} of the matrix is the score output by the network with parameters θ, for the sentence [x]_1^T and for the i-th tag, at the t-th word.
We introduce a transition score [A]_{i,j} to model the transition from the i-th state to the j-th state for a pair of consecutive time steps. Note that this transition matrix is position independent. We now denote the new parameters of our network as θ̃ = θ ∪ {[A]_{i,j} ∀ i, j}. The score of a sentence [x]_1^T along with a path of tags [i]_1^T is then given by the sum of transition scores and network scores:
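s([x]_1^T, [i]_1^T, θ̃) = Σ_{t=1}^{T} ( [A]_{[i]_{t-1}, [i]_t} + [f_θ]_{[i]_t, t} )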
Dynamic programming (Rabiner, 1989) can be used to efficiently compute [A]_{i,j} during training and the optimal tag sequences during inference. See (Lafferty et al., 2001) for details.
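A minimal sketch of the inference-time dynamic programming (Viterbi decoding) is given below; names are illustrative and START/STOP transitions are omitted for brevity.

```python
import torch

def viterbi_decode(emissions, transitions):
    """Best-scoring tag path via dynamic programming (illustrative sketch).
    emissions:   (T, K) per-word tag scores from the network
    transitions: (K, K) matrix A, transitions[i, j] = score of tag i -> tag j"""
    T, K = emissions.shape
    score = emissions[0]                        # best score ending in each tag at word 0
    backpointers = []
    for t in range(1, T):
        # total[i, j] = score[i] + A[i, j] + emission[t, j]
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)     # best previous tag for each current tag
        backpointers.append(best_prev)
    best_last = int(score.argmax())
    path = [best_last]
    for best_prev in reversed(backpointers):    # follow backpointers to recover the path
        path.append(int(best_prev[path[-1]]))
    return list(reversed(path))
```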
- The final score for word 0 has three components: the score of each tag at word -1 (the start), plus the emission score of word 0 (given by the RNN), plus the transition score (from the START transitions).
- Taking a logsumexp over the rows of the resulting matrix with torch.logsumexp then gives the final score of each tag at word 0.
- Looping over the sentence gives such a score for every word.
- Next, compute the score of the transition from the last word to STOP; this total path score (the log-sum over all paths) should be as small as possible, while the score of the gold path should be as large as possible.
- The trainable components are the RNN and the CRF transition matrix.
- Prediction uses Viterbi decoding (see the sketch above); a sketch of the forward scoring computation follows this list.
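The forward scoring described in the notes above can be sketched as follows; function and variable names are illustrative, and START/STOP transitions are again omitted.

```python
import torch

def crf_log_partition(emissions, transitions):
    """Log of the summed score over all tag paths (forward algorithm).
    emissions:   (T, K) per-word tag scores from the (BI-)LSTM
    transitions: (K, K) matrix A, transitions[i, j] = score of tag i -> tag j"""
    T, K = emissions.shape
    alpha = emissions[0]                                   # scores of each tag at word 0
    for t in range(1, T):
        # alpha[i] + A[i, j] + emission[t, j], summed over previous tag i in log space
        scores = alpha.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        alpha = torch.logsumexp(scores, dim=0)
    return torch.logsumexp(alpha, dim=0)

def crf_gold_score(emissions, transitions, tags):
    """Score of one concrete tag path (emission plus transition scores)."""
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score = score + transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score

# Training minimizes crf_log_partition(...) - crf_gold_score(...), i.e. the
# negative log-likelihood of the gold path; the Viterbi sketch above is used
# at prediction time (logsumexp replaced by max plus backpointers).
```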
2.5 BI-LSTM-CRF networks
Similar to the LSTM-CRF network, we combine a bidirectional LSTM network and a CRF network to form a BI-LSTM-CRF network (Fig. 7). In addition to the past input features and sentence-level tag information used in an LSTM-CRF model, a BI-LSTM-CRF model can also use future input features. The extra features can boost tagging accuracy, as we will show in the experiments.
3 Training procedure
All models used in this paper share a generic SGD forward and backward training procedure. We choose the most complicated model, BI-LSTM-CRF, to illustrate the training algorithm, as shown in Algorithm 1. In each epoch, we divide the whole training data into batches and process one batch at a time. Each batch contains a list of sentences, determined by the batch size parameter. In our experiments, we use a batch size of 100, which means each batch includes sentences whose total length is no greater than 100.
For each batch, we first run the bidirectional LSTM-CRF model forward pass, which includes the forward pass for both the forward and backward states of the LSTM. As a result, we get the output scores f_θ([x]_1^T) for all tags at all positions. We then run the CRF layer forward and backward pass to compute gradients for the network outputs and state transition edges. After that, we back-propagate the errors from the output to the input, which includes the backward pass for both the forward and backward states of the LSTM. Finally, we update the network parameters, which include the state transition matrix [A]_{i,j} ∀ i, j, and the original bidirectional LSTM parameters θ.
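A rough sketch of this epoch/batch loop is given below, assuming the hypothetical BiLSTMTagger, crf_log_partition and crf_gold_score helpers from the earlier sketches; it is illustrative only, not the paper's code.

```python
import torch

def train_bilstm_crf(model, transitions, batches, epochs=10, lr=0.1):
    """batches: list of batches, each a list of (word_ids, gold_tags) pairs.
    transitions: (K, K) leaf tensor created with requires_grad=True."""
    params = list(model.parameters()) + [transitions]   # LSTM parameters theta and matrix A
    optimizer = torch.optim.SGD(params, lr=lr)
    for _ in range(epochs):
        for batch in batches:
            optimizer.zero_grad()
            loss = torch.zeros(())
            for word_ids, tags in batch:
                # BI-LSTM forward pass: per-position tag scores for one sentence
                emissions = model(word_ids.unsqueeze(0)).squeeze(0)
                # CRF negative log-likelihood of the gold tag path
                loss = loss + crf_log_partition(emissions, transitions) \
                            - crf_gold_score(emissions, transitions, tags)
            loss.backward()      # back-propagate through the CRF and both LSTM directions
            optimizer.step()     # update A and theta
    return model, transitions
```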