1. The input dimension of w_qs / w_ks / w_vs is the model dimension, i.e. the model dimension coming out of the previous layer's linear transformation (256). The output dimension is d_k = d_q multiplied by the number of heads:
   1024 = 64 * 16   (d_k = 64, n_head = 16)
   A sketch of this projection and head split follows the printout below.
(layer_stack): ModuleList(
  (0): EncoderLayer(
    (slf_attn): MultiHeadAttention(
      (w_qs): Linear(in_features=256, out_features=1024, bias=True)
      (w_ks): Linear(in_features=256, out_features=1024, bias=True)
      (w_vs): Linear(in_features=256, out_features=1024, bias=True)
      (attention): ScaledDotProductAttention(
        (dropout): Dropout(p=0.1, inplace=False)
        (softmax): Softmax(dim=2)
      )
      (layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (fc): Linear(in_features=1024, out_features=256, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
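A minimal, self-contained sketch of how these printed dimensions fit together: the three Linear layers project from d_model=256 up to n_head*d_k = 16*64 = 1024, the result is split into 16 heads of 64 dims each, scaled dot-product attention runs per head, and fc projects the concatenated heads back to 256. The variable names and the (batch, n_head, len, d_k) layout are illustrative assumptions, not the repo's exact code (the repo appears to reshape to (n_head*batch, len, d_k) instead, which would explain the Softmax(dim=2) in the printout).

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions taken from the printout: 16 heads * 64 dims = 1024.
d_model, n_head, d_k, d_v = 256, 16, 64, 64

w_qs = nn.Linear(d_model, n_head * d_k)   # 256 -> 1024
w_ks = nn.Linear(d_model, n_head * d_k)   # 256 -> 1024
w_vs = nn.Linear(d_model, n_head * d_v)   # 256 -> 1024
fc = nn.Linear(n_head * d_v, d_model)     # 1024 -> 256

x = torch.randn(2, 10, d_model)           # (batch, seq_len, d_model)
b, l, _ = x.shape

# Project, then split the 1024-dim output into 16 heads of 64 dims each.
q = w_qs(x).view(b, l, n_head, d_k).transpose(1, 2)  # (b, n_head, l, d_k)
k = w_ks(x).view(b, l, n_head, d_k).transpose(1, 2)
v = w_vs(x).view(b, l, n_head, d_v).transpose(1, 2)

# Scaled dot-product attention per head; softmax is over the key positions.
scores = torch.matmul(q, k.transpose(-2, -1)) / (d_k ** 0.5)  # (b, n_head, l, l)
attn = F.softmax(scores, dim=-1)
out = torch.matmul(attn, v)                                   # (b, n_head, l, d_v)

# Concatenate the heads back to 1024 dims, then fc projects to d_model=256.
out = out.transpose(1, 2).contiguous().view(b, l, n_head * d_v)
out = fc(out)
print(out.shape)  # torch.Size([2, 10, 256])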
2. After the attention computation, a fully connected layer (fc: 1024 -> 256) projects back to the 256 model dimension. 1280 is d_inner: the model-dimension output is then passed through a position-wise feed-forward block whose inner dimension you set yourself. Looking at the function, it also seems to apply a residual connection, adding the input back to the final output (see the sketch after the printout below).
    (pos_ffn): PositionwiseFeedForward(
      (w_1): Linear(in_features=256, out_features=1280, bias=True)
      (w_2): Linear(in_features=1280, out_features=256, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    )
  )
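A minimal sketch of the position-wise feed-forward block with the printed dimensions (256 -> 1280 -> 256) plus the residual connection mentioned in note 2. The class name mirrors the printout, but the exact ordering of dropout, residual add, and layer_norm is an assumption here (post-norm); the repo's own forward() is the authoritative reference.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionwiseFeedForward(nn.Module):
    # Two-layer FFN applied to every position, with residual + layer norm.
    # Dimensions follow the printout: d_model=256, d_inner=1280.
    # Post-norm ordering (norm after the residual add) is an assumption.
    def __init__(self, d_model=256, d_inner=1280, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_inner)   # 256 -> 1280
        self.w_2 = nn.Linear(d_inner, d_model)   # 1280 -> 256
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (batch, seq_len, 256)
        residual = x
        out = self.w_2(F.relu(self.w_1(x)))      # expand to 1280, back to 256
        out = self.dropout(out)
        return self.layer_norm(out + residual)   # add residual, then normalize

ffn = PositionwiseFeedForward()
x = torch.randn(2, 10, 256)
print(ffn(x).shape)  # torch.Size([2, 10, 256])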