Paper: A Knowledge-Grounded Multimodal Search-Based Conversational Agent
Paper link: https://arxiv.org/pdf/1810.11954.pdf
Overall, this paper builds on and improves the HRED model.
HRED itself extends the Seq2Seq model to the multi-turn dialogue setting.
The HRED model consists of three different RNNs: an encoder, a context RNN, and a decoder (sketched in code after this list).
- The encoder handles the data at the lower hierarchical level, processing each word of a sentence to encode the sentence into a fixed-length vector. This sentence vector is then given as input to the context module.
- The context RNN handles the data at the higher hierarchical level by iteratively processing all the sentence vectors of a conversation and updating its hidden state after every sentence. By doing so, the context vector comes to represent the entire conversation up to the last sentence received.
- The output of the context RNN is then used to initialize the hidden state of the decoder RNN (similarly to the plain encoder-decoder, where the encoder's hidden state initializes the decoder), which generates the output sentence.
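To make the three-RNN structure concrete, here is a minimal HRED sketch in PyTorch; all module names and dimensions are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Minimal HRED sketch: utterance encoder -> context RNN -> decoder.
# All sizes are illustrative, not the paper's.
class HRED(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.utt_encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)  # words -> sentence vector
        self.ctx_encoder = nn.GRU(hid_dim, hid_dim, batch_first=True)  # sentence vectors -> context
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, turns, target):
        # turns: (batch, n_turns, seq_len) token ids; target: (batch, tgt_len) token ids
        b, n, t = turns.shape
        _, sent = self.utt_encoder(self.embed(turns.view(b * n, t)))  # (1, b*n, hid)
        sent = sent.view(b, n, -1)                          # one vector per sentence
        _, ctx = self.ctx_encoder(sent)                     # (1, b, hid): conversation so far
        dec_out, _ = self.decoder(self.embed(target), ctx)  # context initializes the decoder
        return self.out(dec_out)                            # logits over the vocabulary

model = HRED()
logits = model(torch.randint(1000, (2, 3, 5)), torch.randint(1000, (2, 4)))
print(logits.shape)  # torch.Size([2, 4, 1000])
```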
The paper makes three main improvements on top of HRED:
- the input becomes multimodal (text + images)
- an attention mechanism is added
- an external knowledge base is introduced
The dataset used is the MMD dataset, as shown in the figure below:
The model's input is text + images; the images appearing in the conversation are handled by concatenating their feature vectors and feeding them together through a linear layer, giving a 'global' image context per turn. The figure below shows two turns of context modeling:
As shown, the text is encoded by a bidirectional GRU to produce a text representation, while images are represented by the FC6-layer feature vectors of VGG-19; the two are concatenated and fed into the context encoder.
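A sketch of how one turn's multimodal representation could be assembled; the 4096-dimensional FC6 feature follows the paper, while the hidden sizes and the fixed number of image slots are my own assumptions.

```python
import torch
import torch.nn as nn

emb_dim, hid_dim, img_feat, img_slots = 64, 128, 4096, 5  # img_slots is an assumption

text_encoder = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)
image_proj = nn.Linear(img_slots * img_feat, hid_dim)  # all images of the turn, concatenated

tokens = torch.randn(2, 10, emb_dim)           # (batch, seq_len, emb): embedded utterance
images = torch.randn(2, img_slots * img_feat)  # VGG-19 FC6 features, flattened (zero-padded slots)

_, h = text_encoder(tokens)                    # h: (2 directions, batch, hid)
text_vec = torch.cat([h[0], h[1]], dim=-1)     # (batch, 2*hid): bidirectional summary
img_vec = image_proj(images)                   # (batch, hid): 'global' image context
turn_vec = torch.cat([text_vec, img_vec], dim=-1)  # per-turn input to the context encoder
print(turn_vec.shape)  # torch.Size([2, 384])
```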
The overall model architecture looks like this:
Knowledge from an external knowledge base (KB) is injected at every decoder time step.
The KB vector h_kb is formed by combining h_query and h_entity; an example is shown below.
h_kb is then combined with the decoder input.
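As a rough sketch of this combination step: the exact mixing of h_query and h_entity is not pinned down here, so the linear-plus-tanh below is an assumption, not the repo's exact code.

```python
import torch
import torch.nn as nn

hid = 128
combine = nn.Linear(2 * hid, hid)  # assumed mixing; the actual combination may differ

h_query = torch.randn(2, hid)      # encoding of the KB query for this context
h_entity = torch.randn(2, hid)     # encoding of the retrieved KB entities
h_kb = torch.tanh(combine(torch.cat([h_query, h_entity], dim=-1)))  # KB vector

dec_input = torch.randn(2, 64)     # decoder input embedding at one time step
dec_input_with_kb = torch.cat([dec_input, h_kb], dim=-1)  # injected at every step
print(dec_input_with_kb.shape)     # torch.Size([2, 192])
```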
The decoder uses input-feeding attention, as shown in the figure below.
The code is available here: https://github.com/shubhamagarwal92/mmd
Below are some notes I took while reading the code.
image_encoder:
- used the 4096-dimensional FC6-layer image representations from VGG
- the image vectors are concatenated together and passed through a linear layer to form the ‘global’ image context for a single turn
encoderRNN:
- 1-layer bidirectional GRU cells
- input needs to be 'packed' before use to handle variable-length input sequences
- sort → pack → embedding → encoder → unpack (see the sketch after this list)
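This is the standard PyTorch pattern for variable-length batches; a self-contained sketch with toy data and illustrative sizes:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

embed = nn.Embedding(1000, 64, padding_idx=0)
encoder = nn.GRU(64, 128, batch_first=True, bidirectional=True)

seqs = torch.tensor([[4, 7, 9, 2], [5, 3, 0, 0]])  # padded batch, already sorted by length
lengths = torch.tensor([4, 2])                     # true lengths, descending

packed = pack_padded_sequence(embed(seqs), lengths, batch_first=True)
packed_out, h = encoder(packed)                    # the RNN skips the padding
out, _ = pad_packed_sequence(packed_out, batch_first=True)  # back to a padded tensor
print(out.shape, h.shape)  # torch.Size([2, 4, 256]) torch.Size([2, 2, 128])
```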
bridge:
- used to pass the encoder's final representation (layers*directions, batch, features) to the decoder
- the encoder states are forwarded through a dense layer followed by a non-linearity, here a ReLU (sketched below)
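A minimal bridge under those assumptions (bidirectional 1-layer encoder, 1-layer decoder; sizes illustrative):

```python
import torch
import torch.nn as nn

enc_hid, dec_hid = 128, 128
bridge = nn.Sequential(nn.Linear(2 * enc_hid, dec_hid), nn.ReLU())  # dense layer + ReLU

h = torch.randn(2, 4, enc_hid)        # (layers*directions, batch, features) from a bi-GRU
h = h.transpose(0, 1).reshape(4, -1)  # merge the two directions per example
dec_h0 = bridge(h).unsqueeze(0)       # (1, batch, dec_hid): initial state for a 1-layer decoder
print(dec_h0.shape)  # torch.Size([1, 4, 128])
```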
contextRNN:
- 1-layer GRU cells
- handles the data at the higher hierarchical level by iteratively processing all the sentence vectors of a conversation and updating its hidden state after every sentence
kb_encoder:
- 1-layer GRU cells
- sort → pack → embedding → encoder → unpack
decoder:
attention:
- the decoder applies attention over the source sequence and implements input feeding by default
- input feeding is an approach that feeds the attentional vectors as inputs to the next time steps, informing the model about past alignment decisions (see the sketch below)
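A single Luong-style input-feeding decoder step might look like this; a minimal rendering of the idea, not the repo's exact code.

```python
import torch
import torch.nn as nn

emb, hid = 64, 128
cell = nn.GRUCell(emb + hid, hid)          # input embedding + previous attentional vector
attn_combine = nn.Linear(2 * hid, hid)

enc_out = torch.randn(2, 10, hid)          # encoder states over the source sequence
h = torch.randn(2, hid)                    # decoder hidden state
input_feed = torch.zeros(2, hid)           # previous attentional vector (zero at t=0)

for _ in range(3):                         # a few decoding steps with dummy inputs
    x = torch.randn(2, emb)                # current target-side embedding
    h = cell(torch.cat([x, input_feed], dim=-1), h)
    scores = torch.bmm(enc_out, h.unsqueeze(-1)).squeeze(-1)        # dot-product attention
    ctx = torch.bmm(scores.softmax(-1).unsqueeze(1), enc_out).squeeze(1)
    input_feed = torch.tanh(attn_combine(torch.cat([h, ctx], -1)))  # fed to the next step
print(input_feed.shape)  # torch.Size([2, 128])
```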