Multi-headed Attention

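A single attention head may place most of its weight on one position and so fail to extract rich information; several heads are therefore used and their outputs fused.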

Illustration of the multi-head attention, which jointly attends to different representation subspaces (colored boxes) at different positions (darker color denotes higher attention probability).

Fusion/Aggregation

The main methods are:

Concatenation
Mean/max pooling
Higher-order bilinear pooling
Routing-by-agreement algorithm
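
The first two fusion options in the list above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming each head has already produced one context vector of the same dimension; the function names are hypothetical and not taken from any cited paper. Bilinear pooling and routing-by-agreement are not covered here.

```python
def concat_heads(heads):
    """Concatenate per-head context vectors into one long vector."""
    return [x for head in heads for x in head]


def mean_pool_heads(heads):
    """Element-wise mean over heads (all heads share one dimension)."""
    return [sum(vals) / len(heads) for vals in zip(*heads)]


def max_pool_heads(heads):
    """Element-wise max over heads."""
    return [max(vals) for vals in zip(*heads)]


# Three heads, each producing a 2-dimensional context vector.
heads = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]
fused_concat = concat_heads(heads)    # dimension grows to 3 * 2 = 6
fused_mean = mean_pool_heads(heads)   # dimension stays 2
fused_max = max_pool_heads(heads)     # dimension stays 2
```

Concatenation keeps every head's output but grows the dimension with the number of heads, whereas pooling keeps the dimension fixed at the cost of discarding per-head detail.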

Diversity

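Recall that in the attention mechanism, the scores are computed from $K$ and $Q$ by one of several scoring operations, a softmax turns the scores into attention weights, and the context vector is the average of $V$ weighted by those attention weights (see the figure below). $Q$ and $K$ are both projections of $V$, or of its corresponding state sequence, onto some subspace. Different heads project onto different subspaces, which by itself provides a degree of diversity.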
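
The pipeline just described (scores, softmax, weighted average) can be sketched as follows. This assumes scaled dot-product scoring, which is only one of the possible scoring operations, and uses hand-picked projection matrices `W_head1` and `W_head2` as hypothetical stand-ins for learned per-head projections.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def project(X, W):
    """Multiply each row of X (n x d_in) by W (d_in x d_out)."""
    return [[sum(x[i] * W[i][j] for i in range(len(W)))
             for j in range(len(W[0]))] for x in X]


def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d = len(K[0])
    contexts, weights = [], []
    for q in Q:
        # Score of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        w = softmax(scores)
        weights.append(w)
        # Context vector: average of the rows of V weighted by w.
        contexts.append([sum(wj * v[c] for wj, v in zip(w, V))
                         for c in range(len(V[0]))])
    return contexts, weights


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_head1 = [[1.0], [0.0]]  # assumed projection onto the first coordinate
W_head2 = [[0.0], [1.0]]  # assumed projection onto the second coordinate
per_head = []
for W in (W_head1, W_head2):
    Q = K = project(X, W)
    per_head.append(attention(Q, K, X))
```

Because the two heads see different projections of the same input, their attention weights over the three positions differ, which is the source of diversity mentioned above.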
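Alternatively, regularization terms can be added to force different attention heads to choose different coefficients. ^[1] proposes three regularization terms, acting on the subspaces, the attended positions, and the outputs respectively. ^[2] adds a dynamic regularization term: the inner product of the attention weights. ^[3] proposes three regularization terms, covering the scores and contexts across different heads and the context within a single head. In each case, the point to note is where the regularization term acts.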
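
As a rough illustration of a penalty that acts on attention weights, the sketch below averages the inner product between the attention distributions of every pair of heads. It is a simplified stand-in written for these notes, not the exact formulation of any of the cited papers; the name `position_overlap_penalty` and the coefficient `lam` in the usage note are hypothetical.

```python
def position_overlap_penalty(head_weights):
    """Mean inner product between attention distributions of distinct heads.

    head_weights[h] is head h's attention distribution over positions
    for one query. The value is high when heads attend to the same
    positions and 0 when their supports are disjoint.
    """
    num_heads = len(head_weights)
    total, pairs = 0.0, 0
    for a in range(num_heads):
        for b in range(a + 1, num_heads):
            total += sum(x * y for x, y in
                         zip(head_weights[a], head_weights[b]))
            pairs += 1
    # With fewer than two heads there is nothing to compare.
    return total / pairs if pairs else 0.0


same = position_overlap_penalty([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
disjoint = position_overlap_penalty([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
```

In training, such a term would typically be added to the task loss as `loss = task_loss + lam * penalty`, so that minimizing the loss pushes heads toward different positions.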

Positional Encoding

To make attention carry order information without changing the network architecture, the position can be encoded into the embedding ^[4].
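
A minimal sketch of the sinusoidal encoding from ^[4], where even dimensions use $\sin(pos / 10000^{2k/d})$ and odd dimensions use the matching cosine. The helper `add_positions` and the toy embedding values are assumptions made for illustration.

```python
import math


def positional_encoding(max_len, d_model):
    """Sinusoidal position table of shape max_len x d_model."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # i walks over the even dimension indices 0, 2, 4, ...
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def add_positions(embeddings, pe):
    """Add the position vector to each token embedding, row by row."""
    return [[e + p for e, p in zip(emb, pos_vec)]
            for emb, pos_vec in zip(embeddings, pe)]


pe = positional_encoding(max_len=3, d_model=4)
tokens = [[1.0, 1.0, 1.0, 1.0]] * 3  # toy embeddings, identical on purpose
with_order = add_positions(tokens, pe)
```

The three toy tokens start out identical, and after the addition each row differs, so the downstream attention can tell positions apart without any architectural change.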

Others

Location-Aware

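In NLP problems, content-based attention is the usual choice. In speech recognition, elements that are far apart in time in the acoustic model are only weakly correlated, and the alignment is unidirectional, so location-based or hybrid attention is generally used.
Considering the information around the attended position ^[5]: the rationale is that the attention mechanism ignores order and relative position. The neighbors of a heavily weighted word also carry relevant information (the closer they are, the stronger the relevance) and should be taken into account as well. This motivates localness modeling, which introduces the notion of a "window" so that the information within a certain range is extracted together. The paper ^[6] also describes LocationAttention and can be consulted for reference.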
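
One way to realize such a window is to add a Gaussian bias to the scores before the softmax, so that positions far from a chosen center are down-weighted. The sketch below assumes a fixed `center` and `window` supplied by hand, which is a simplification for illustration; how the center and window are actually obtained in ^[5] is not reproduced here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def local_attention_weights(scores, center, window):
    """Attention weights with a Gaussian locality bias.

    scores: raw content-based scores for one query, one per position.
    center: position the window is centered on (assumed given).
    window: window size; the standard deviation is set to window / 2.
    """
    sigma = window / 2.0
    biased = [s - (j - center) ** 2 / (2.0 * sigma ** 2)
              for j, s in enumerate(scores)]
    return softmax(biased)


# With flat content scores, only the locality bias shapes the weights.
weights = local_attention_weights([0.0] * 5, center=2, window=2.0)
```

With uniform scores the weights peak at the center and decay symmetrically on both sides, which matches the intuition that closer words are more relevant.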

Relation-Aware

^[7]

Reference

[1] Multi-Head Attention with Disagreement Regularization
[2] Reconstructing Attention with Dynamic Regularization
[3] Orthogonality Constrained Multi-Head Attention for Keyword Spotting
[4] Attention Is All You Need
[5] Modeling Localness for Self-Attention Networks
[6] Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
[7] Self-Attention with Relative Position Representations

Enhancement of Attention Mechanism