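Using torch.nn.LayerNorm is recommended; it is more flexible than torch.layer_norm.
It can normalize a tensor over any number of its trailing dimensions, as specified by normalized_shape.
Official example: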
>>> # NLP Example
>>> batch, sentence_length, embedding_dim = 20, 5, 10
>>> embedding = torch.randn(batch, sentence_length, embedding_dim)
>>> layer_norm = nn.LayerNorm(embedding_dim)
>>> # Activate module
>>> layer_norm(embedding)
>>>
>>> # Image Example
>>> N, C, H, W = 20, 5, 10, 10
>>> input = torch.randn(N, C, H, W)
>>> # Normalize over the last three dimensions (i.e. the channel and spatial dimensions)
>>> # as shown in the image below
>>> layer_norm = nn.LayerNorm([C, H, W])
>>> output = layer_norm(input)
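The official example shows that LN is used differently in NLP and in CV.
In NLP, LN behaves like IN (instance normalization) with the token axis playing the role of channels: it normalizes only over the elements of the last dimension (Figure 1, rightmost);
in CV, LN normalizes over all elements of the C, H and W dimensions (Figure 1, middle).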
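Both claims can be checked by looking at the statistics of the outputs. The sketch below is not part of the official example. It assumes elementwise_affine=False so that only the normalization itself is compared, and it brings in nn.InstanceNorm1d with the token axis treated as the channel axis to illustrate the LN/IN correspondence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# NLP case: statistics are computed over the last (embedding) dimension only.
batch, sentence_length, embedding_dim = 20, 5, 10
embedding = torch.randn(batch, sentence_length, embedding_dim)
ln_nlp = nn.LayerNorm(embedding_dim, elementwise_affine=False)
out_nlp = ln_nlp(embedding)
nlp_mean = out_nlp.mean(dim=-1)  # shape (20, 5): one mean per token

# With the token axis read as "channels", InstanceNorm1d normalizes
# over the same elements (the last dimension of each token).
inst_norm = nn.InstanceNorm1d(sentence_length, affine=False)
out_in = inst_norm(embedding)
nlp_in_gap = (out_nlp - out_in).abs().max()

# CV case: statistics are computed jointly over C, H and W for each sample.
N, C, H, W = 20, 5, 10, 10
image = torch.randn(N, C, H, W)
ln_cv = nn.LayerNorm([C, H, W], elementwise_affine=False)
out_cv = ln_cv(image)
cv_mean = out_cv.mean(dim=(1, 2, 3))  # shape (20,): one mean per sample

print(nlp_mean.abs().max(), nlp_in_gap, cv_mean.abs().max())
```

All three printed values should be close to zero, up to floating-point rounding.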
In addition, torch.nn.LayerNorm defaults to elementwise_affine=True, so it has learnable scale and bias parameters, like BN.
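One difference from BN is worth seeing directly: the LN parameters have the shape of normalized_shape (one scale and bias per normalized element), while BatchNorm2d keeps one scale and bias per channel. A small sketch that only inspects parameter shapes, using the same sizes as the image example:

```python
import torch.nn as nn

N, C, H, W = 20, 5, 10, 10

# LayerNorm: one weight and one bias per element of normalized_shape.
layer_norm = nn.LayerNorm([C, H, W])  # elementwise_affine=True by default
ln_weight_shape = tuple(layer_norm.weight.shape)
ln_bias_shape = tuple(layer_norm.bias.shape)

# BatchNorm2d: one weight and one bias per channel.
bn = nn.BatchNorm2d(C)  # affine=True by default
bn_weight_shape = tuple(bn.weight.shape)

# With elementwise_affine=False the parameters are not created at all.
plain_ln = nn.LayerNorm([C, H, W], elementwise_affine=False)

print(ln_weight_shape, ln_bias_shape, bn_weight_shape, plain_ln.weight)
```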
Hypothesis and experiment
Given the examples above, a natural guess is that setting layer_norm = nn.LayerNorm([N, H, W])
turns layer_norm into BN.
import torch

N, C, H, W = 20, 5, 10, 10
input = torch.randn(N, C, H, W)
layer_norm = torch.nn.LayerNorm([N, H, W])
output = layer_norm(input)  # raises the RuntimeError shown next
RuntimeError: Given normalized_shape=[20, 10, 10], expected input with shape [*, 20, 10, 10], but got input of size[20, 5, 10, 10]
It raises an error: this configuration is not allowed.
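The constraint in the error message is that normalized_shape must equal the trailing dimensions of the input. A minimal sketch of that check follows; matches_trailing_dims is a hypothetical helper written for illustration, not a PyTorch function.

```python
import torch

def matches_trailing_dims(x, normalized_shape):
    """Hypothetical helper: True if normalized_shape equals the last dims of x."""
    k = len(normalized_shape)
    return tuple(x.shape[-k:]) == tuple(normalized_shape)

N, C, H, W = 20, 5, 10, 10
x = torch.randn(N, C, H, W)

# The last three dims of x are (C, H, W) = (5, 10, 10).
ok_chw = matches_trailing_dims(x, [C, H, W])
# (N, H, W) = (20, 10, 10) does not match them, hence the RuntimeError.
ok_nhw = matches_trailing_dims(x, [N, H, W])

print(ok_chw, ok_nhw)
```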
However, swapping the N and C dimensions of the tensor achieves the same effect:
N, C, H, W = 20, 5, 10, 10
input = torch.randn(N, C, H, W)
# LN over (N, H, W): swap N and C, normalize, then swap back
layer_norm = torch.nn.LayerNorm([N, H, W], elementwise_affine=False)
output = layer_norm(input.transpose(0, 1)).transpose(0, 1)
# BN (training mode, so batch statistics are used)
bn = torch.nn.BatchNorm2d(C, affine=False)
output_bn = bn(input)
# Sum of the element-wise differences
torch.sum(output - output_bn)
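Result:
tensor(8.2701e-07)
The difference is at the level of floating-point rounding error, so the hypothesis holds (with BN in training mode and no affine parameters on either layer).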
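A sum of signed differences can hide errors that cancel each other, so a maximum absolute difference is a stricter check. The sketch below is one way to do it, not the only one. It assumes the functional API (torch.nn.functional.layer_norm and batch_norm) instead of modules, BN in training mode without running statistics, and the default eps of 1e-5, and it adds a manual per-channel reference computed over N, H and W.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, C, H, W = 20, 5, 10, 10
x = torch.randn(N, C, H, W)

# LN over (N, H, W): swap N and C, normalize, then swap back.
ln_out = F.layer_norm(x.transpose(0, 1), [N, H, W]).transpose(0, 1)

# BN in training mode (batch statistics), no affine parameters, no running stats.
bn_out = F.batch_norm(x, None, None, training=True)

# Manual reference: per-channel mean and biased variance over N, H, W.
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + 1e-5)

max_gap_ln_bn = (ln_out - bn_out).abs().max()
max_gap_ln_manual = (ln_out - manual).abs().max()
print(max_gap_ln_bn, max_gap_ln_manual)
```

In eval mode BN would use its running statistics instead of the batch statistics, so the match is only expected in training mode.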