Stanford / Winter 2020 CS224n A4,A5笔记

NMT 机器翻译模型

在Assignment4,5中，进一步理解encoder-decoder模型，并且认识到在实现项目之前，要清楚的了解每一个输入输出的矩阵的维度。

A4 词向量翻译模型

词向量输入的模型为嵌入LSTM的encoder-decoder。如下图所示，这是在翻译时，某一时刻的状态：

将这些嵌入内容馈送到双向编码器，从而为前向（→）和向后（←）LSTM生成隐藏状态和单元状态。

$\mathbf{h}_{i}^{\mathrm{enc}}=[\overleftarrow{\mathbf{h}_{i}^{\mathrm{enc}}} ; \overrightarrow{\mathbf{h}_{i}^{\mathrm{ent}}}] \text { where } \mathbf{h}_{i}^{\mathrm{enc}} \in \mathbb{R}^{2 h \times 1}, \overleftarrow{\mathbf{h}_{i}^{\mathrm{enc}}}, \overrightarrow{\mathbf{h}_{i}^{\mathrm{enc}}} \in \mathbb{R}^{h \times 1}$
$\mathbf{c}_{i}^{\mathrm{enc}}=[\overleftarrow{\mathbf{c}_{i}^{\mathrm{enc}}} ; \overrightarrow{\mathbf{c}_{i}^{\mathrm{enc}}}] \text { where } \mathbf{c}_{i}^{\mathrm{enc}} \in \mathbb{R}^{2 h \times 1}, \overleftarrow{\mathbf{c}_{i}^{\mathrm{enc}}}, \overrightarrow{\mathbf{c}_{i}^{\mathrm{en}}} \in \mathbb{R}^{h \times 1}$

使用编码器的最终隐藏状态和最终单元状态的线性投影来初始化解码器的第一个隐藏状态 $h_{0}^{dec}$ 和单元状态 $c_{0}^{dec}$ 。

$\begin{aligned} &\mathbf{h}_{0}^{ \mathrm{dec}}=\mathbf{W}_{h}[\overleftarrow{\mathbf{h}_{1}^{\mathrm{enc}}} ; \overrightarrow{\mathbf{h}_{m}^{\mathrm{enc}}}] \text { where } \mathbf{h}_{0}^{\mathrm{dec}} \in \mathbb{R}^{h \times 1}, \mathbf{W}_{h} \in \mathbb{R}^{h \times 2 h}\\ &\mathbf{c}_{0}^{\mathrm{dec}}=\mathbf{W}_{c}[\overleftarrow{\mathbf{c}_{1}^{\mathrm{enc}}} ; \overrightarrow{\mathbf{c}_{m}^{\mathrm{enc}}}] \text { where } \mathbf{c}_{0}^{\mathrm{dec}} \in \mathbb{R}^{h \times 1}, \mathbf{W}_{c} \in \mathbb{R}^{h \times 2 h} \end{aligned}$

将 $y_t$ 与上一个时间步的组合输出矢量 $o_{t-1}∈\mathbb{R}^{h×1}$ 连接起来，以产生 $\overline{y_t}∈\mathbb{R}^{（e+h）×1}$ 。请注意，对于第一个目标字（即开始标记）， $o_0$ 是零向量。然后将 $\overline{y_t}$ 作为输入提供给解码器。

$\mathbf{h}_{t}^{\text {dec }}, \mathbf{c}_{t}^{\text {dec }}=\operatorname{Decoder}\left(\overline{\mathbf{y}_{t}}, \mathbf{h}_{t-1}^{\text {dec }}, \mathbf{c}_{t-1}^{\text {dec }}\right) \text { where } \mathbf{h}_{t}^{\text {dec }} \in \mathbb{R}^{h \times 1}, \mathbf{c}_{t}^{\text {dec }} \in \mathbb{R}^{h \times 1}$

使用 $h_{t}^{dec}$ 计算注意力层 $e_t \in \mathbb{R}^{m×1}$

$\mathbf{e}_{t, i}=\left(\mathbf{h}_{t}^{\text {dec }}\right)^{T} \mathbf{W}_{\text {attProj }} \mathbf{h}_{i}^{\text {enc }} \text { where } \mathbf{e}_{t} \in \mathbb{R}^{m \times 1}, \mathbf{W}_{\text {attProj }} \in \mathbb{R}^{h \times 2 h}$
$\begin{aligned} &\alpha_{t}=\operatorname{softmax} \left(\mathbf{e}_{t}\right) \text { where } \alpha_{t} \in \mathbb{R}^{m \times 1}\\ &\mathbf{a}_{t}=\sum_{i=1}^{m} \alpha_{t, i} \mathbf{h}_{i}^{\mathrm{enc}} \text { where } \mathbf{a}_{t} \in \mathbb{R}^{2 h \times 1} \end{aligned}$

将注意输出与解码器隐藏状态 $h^{dec}_t$ 连接起来，并将其通过线性层 $tanh$ 和 $dropout$ 以获得组合输出矢量 $o_t$ 。

$\begin{array}{c} \mathbf{u}_{t}=\left[\mathbf{a}_{t} ; \mathbf{h}_{t}^{\mathrm{dec}}\right] \text { where} \mathbf{u}_{t} \in \mathbb{R}^{3 h \times 1} \\ \mathbf{v}_{t}=\mathbf{W}_{u} \mathbf{u}_{t} \text { where } \mathbf{v}_{t} \in \mathbb{R}^{h \times 1}, \mathbf{W}_{u} \in \mathbb{R}^{h \times 3 h} \\ \mathbf{o}_{t}=\operatorname{dropout}\left(\tanh \left(\mathbf{v}_{t}\right)\right) \text { where } \mathbf{o}_{t} \in \mathbb{R}^{h \times 1} \end{array}$

然后，在第t个时间步生成目标词的概率分布 $P_t$ ：

$\mathbf{P}_{t}=\operatorname{softmax}\left(\mathbf{W}_{\text {vocab }} \mathbf{o}_{t}\right) \text { where } \mathbf{P}_{t} \in \mathbb{R}^{V_{t} \times 1}, \mathbf{W}_{\text {vocab }} \in \mathbb{R}^{V_{t} \times h}$

最后，为了训练网络，计算 $P_t$ 与 $g_t$ 之间的 $softmax$ 交叉熵损失，其中 $g_t$ 是目标单词在时间步 $t$ 处的一热向量：

$J_{t}(\theta)=\text { CrossEntropy }\left(\mathbf{P}_{t}, \mathbf{g}_{t}\right)$

A5 字符型翻译模型

首先通过CNN卷积得到输入的特征向量，然后将其作为特征词向量输入到模型中。

其他步骤和词模型一样，只不过提取的特征是具有字符特征的，当预测出<UNK>字符时，启动简单的LSTM预测结构，进行输出词的生成。

$\mathbf{h}_{t}, \mathbf{c}_{t}=\text { CharDecoderLSTM }\left(\mathbf{x}_{t}, \mathbf{h}_{t-1}, \mathbf{c}_{t-1}\right) \text { where } \mathbf{h}_{t}, \mathbf{c}_{t} \in \mathbb{R}^{h}$
$\mathbf{s}_{t}=\mathbf{W}_{\mathrm{dec}} \mathbf{h}_{t}+\mathbf{b}_{\mathrm{dec}} \in \mathbb{R}^{V_{\text {char }}}$
$\begin{aligned} \mathbf{p}_{t} &=\operatorname{softmax}\left(\mathbf{s}_{t}\right) \in \mathbb{R}^{V_{\text {char }}} \quad \forall t \in\{1, \ldots, n\} \\ \text { loss_char_dec } &=-\sum_{t=1}^{n} \log \mathbf{p}_{t}\left(x_{t+1}\right) \in \mathbb{R} \end{aligned}$