Preface

This post is a set of study notes. If you repost it, please cite the source: https://www.jianshu.com/p/59aaaaab746a

Normalization

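Normalization is a way of processing a network's features so that they keep a well-behaved distribution during training. It is usually applied before the activation function. The main normalization methods today are Batch Normalization (BN), Layer Normalization (LN), Instance Normalization (IN), and Group Normalization (GN).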



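1. Batch Normalization

For the theory behind BN, see the blog post Batch Normalization. BN normalizes a batch of feature maps channel by channel: the mean and variance are computed over the batch and spatial dimensions, so they have shape $1\times C \times 1\times 1$. The normalized result is then multiplied by a scale factor and offset by a shift factor.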

import numpy as np


def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Forward pass for batch normalization.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)

    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        # Statistics of each column (feature), computed over the batch
        sample_mean = np.mean(x, axis=0)
        sample_var = np.var(x, axis=0)
        x_hat = (x - sample_mean) / np.sqrt(sample_var + eps)

        out = gamma * x_hat + beta
        cache = (x, sample_mean, sample_var, x_hat, eps, gamma, beta)
        # Exponential running averages, used later in test mode
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var

    elif mode == 'test':
        # Test mode uses the running statistics instead of batch statistics
        out = gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache


def batchnorm_backward(dout, cache):
    """
    Backward pass for batch normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    N = dout.shape[0]
    x, sample_mean, sample_var, x_hat, eps, gamma, beta = cache
    std = np.sqrt(sample_var + eps)

    dgamma = np.sum(dout * x_hat, axis=0)
    dbeta = np.sum(dout, axis=0)

    # Gradient with respect to the normalized input x_hat
    dhat = dout * gamma
    # x reaches x_hat along three paths: directly, via the variance, via the mean
    dx_1 = dhat / std
    dvar = np.sum(dhat * (x - sample_mean), axis=0) * (-0.5) * (sample_var + eps) ** (-1.5)
    dmean = np.sum(-dhat, axis=0) / std + dvar * np.mean(2 * sample_mean - 2 * x, axis=0)

    dx_var = dvar * 2.0 * (x - sample_mean) / N
    dx_mean = dmean / N
    dx = dx_1 + dx_var + dx_mean
    return dx, dgamma, dbeta

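The code above is BN for fully connected layers. To use BN in a convolutional network, move the channel axis last and reshape the conv feature map to $(N\times H\times W, C)$, then reshape the result back afterwards. In addition, during training BN computes the mean and variance of each batch at every step, whereas at test time it uses the running averages of those batch statistics accumulated during training.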
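The reshape described above can be sketched as follows. This is a minimal training-mode sketch only; `spatial_batchnorm_train` is a hypothetical helper name that does not appear in the original post, and running averages are left out for brevity.

```python
import numpy as np

def spatial_batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-mode BN for conv features x of shape (N, C, H, W).

    Hypothetical helper: moves the channel axis last, flattens to
    (N*H*W, C), normalizes each column, then restores the layout.
    """
    N, C, H, W = x.shape
    x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)      # (N*H*W, C)
    mean = x_flat.mean(axis=0)                           # (C,)
    var = x_flat.var(axis=0)                             # (C,)
    out_flat = gamma * (x_flat - mean) / np.sqrt(var + eps) + beta
    out = out_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
    return out, mean, var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 3, 5, 5))
gamma, beta = np.ones(3), np.zeros(3)
out, mean, var = spatial_batchnorm_train(x, gamma, beta)

# The same per-channel statistics, computed directly over (N, H, W).
direct_mean = x.mean(axis=(0, 2, 3))
direct_var = x.var(axis=(0, 2, 3))
print(np.allclose(mean, direct_mean), np.allclose(var, direct_var))
```

The transpose before the reshape matters: reshaping an (N, C, H, W) array straight to (N*H*W, C) would mix values from different channels in the same column.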
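Advantages of BN:

It tolerates larger learning rates;
It tolerates poorer weight initialization;
It has a regularizing effect.

Disadvantages of BN:
The mean and variance are computed over the batch. If the batch size is too small, the batch statistics do not represent the overall data distribution; if it is too large, the batch may exceed GPU memory, and training becomes slower with less frequent parameter updates. Typical choices are 32, 64, or 128.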
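To make the small-batch problem concrete, here is a small illustrative experiment. It uses synthetic standard-normal data rather than real network features, and `batch_mean_spread` is a hypothetical helper. It measures how much the batch mean fluctuates from batch to batch at different batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for one feature's values over a whole dataset.
population = rng.normal(loc=0.0, scale=1.0, size=100_000)

def batch_mean_spread(batch_size, n_batches=2000):
    """Standard deviation of the batch mean across many random batches."""
    idx = rng.integers(0, population.size, size=(n_batches, batch_size))
    return population[idx].mean(axis=1).std()

spread = {bs: batch_mean_spread(bs) for bs in (2, 8, 32, 128)}
for bs, s in spread.items():
    print(f"batch size {bs:4d}: std of batch mean = {s:.3f}")
```

For independent samples, the spread shrinks roughly as one over the square root of the batch size, which is why very small batches give noisy normalization statistics.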


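2. Layer Normalization

LN normalizes each of the N samples over all of its own features, so the mean and variance have shape $N\times 1 \times 1\times 1$. Put simply, one mean and one variance are computed per sample. The code is therefore the same for training and testing, and no running averages are needed.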
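As a quick illustration of why LN needs no running averages, the sketch below normalizes a sample once as part of a batch and once on its own. `layernorm_simple` is a hypothetical, forward-only helper using `keepdims` instead of transposes; it is not the post's implementation.

```python
import numpy as np

def layernorm_simple(x, gamma, beta, eps=1e-5):
    """Forward-only LN for x of shape (N, D): one mean/variance per row."""
    mean = x.mean(axis=1, keepdims=True)   # shape (N, 1)
    var = x.var(axis=1, keepdims=True)     # shape (N, 1)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))
gamma, beta = np.ones(6), np.zeros(6)

out_batch = layernorm_simple(x, gamma, beta)        # sample 0 inside a batch of 4
out_single = layernorm_simple(x[:1], gamma, beta)   # sample 0 alone (batch size 1)
print(np.allclose(out_batch[:1], out_single))
```

Because the statistics come from the sample itself, the output for a sample does not depend on which other samples share its batch.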

def layernorm_forward(x, gamma, beta, ln_param):
    """
    Forward pass for layer normalization.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - ln_param: Dictionary with the following keys:
        - eps: Constant for numeric stability

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    eps = ln_param.get('eps', 1e-5)

    # Transpose to (D, N) so that the per-sample statistics, of shape (N,),
    # broadcast along axis 0 exactly as in batchnorm_forward.
    x_T = x.T
    sample_mean = np.mean(x_T, axis=0)
    sample_var = np.var(x_T, axis=0)

    x_norm_T = (x_T - sample_mean) / np.sqrt(sample_var + eps)
    x_norm = x_norm_T.T
    out = x_norm * gamma + beta
    cache = (x, sample_mean, sample_var, x_norm, eps, gamma, beta)
    return out, cache


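def layernorm_backward(dout, cache):
    """
    Backward pass for layer normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from layernorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    x, sample_mean, sample_var, x_norm, eps, gamma, beta = cache
    dgamma = np.sum(dout * x_norm, axis=0)
    dbeta = np.sum(dout, axis=0)

    # Work on transposed (D, N) arrays, mirroring batchnorm_backward.
    dout_T = dout.T
    x_T = x.T
    D = dout_T.shape[0]  # number of features averaged per sample
    std = np.sqrt(sample_var + eps)

    dhat = dout_T * gamma[:, np.newaxis]
    dx_1 = dhat / std
    dvar = np.sum(dhat * (x_T - sample_mean), axis=0) * (-0.5) * (sample_var + eps) ** (-1.5)
    dmean = np.sum(-dhat, axis=0) / std + dvar * np.mean(2 * sample_mean - 2 * x_T, axis=0)

    dx_var = dvar * 2.0 * (x_T - sample_mean) / D
    dx_mean = dmean / D
    dx = (dx_1 + dx_var + dx_mean).T
    return dx, dgamma, dbeta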

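Advantages of LN: it does not require batched training, since normalization happens within a single sample. It can therefore be used in networks trained with a batch size of 1 and in RNNs. In general, BN suits CNNs better than LN does, while LN suits RNNs better than BN does.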

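3. Instance Normalization

IN was proposed mainly for style transfer networks. IN normalizes each channel of each sample over its spatial dimensions (H and W), so the mean and variance have shape $N \times C \times 1\times 1$. In image style transfer, the generated result depends mainly on the individual image instance, so normalizing across the batch or across channels is unsuitable; each instance and each channel must be normalized independently.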
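The post gives no code for IN, so here is a minimal forward-only sketch under the assumption that `gamma` and `beta` are per-channel vectors of shape (C,). The function name `instancenorm_forward` is hypothetical.

```python
import numpy as np

def instancenorm_forward(x, gamma, beta, eps=1e-5):
    """Forward-only IN sketch. x: (N, C, H, W); gamma, beta: (C,)."""
    # One mean/variance per (sample, channel) pair, over the spatial axes.
    mean = x.mean(axis=(2, 3), keepdims=True)   # (N, C, 1, 1)
    var = x.var(axis=(2, 3), keepdims=True)     # (N, C, 1, 1)
    x_hat = (x - mean) / np.sqrt(var + eps)
    out = gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
    return out, mean, var

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=(2, 3, 8, 8))
out, mean, var = instancenorm_forward(x, np.ones(3), np.zeros(3))
print(mean.shape, var.shape)
```

As with LN, the statistics do not involve other samples in the batch, so the same code serves for training and testing.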


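4. Group Normalization

GN was proposed mainly to address the accuracy drop of BN at small batch sizes, where the batch statistics are poor estimates of the overall statistics.
GN splits the input $x: N\times C \times H\times W$ by channel into groups, $N\times G \times (C/G) \times H\times W$, and normalizes within each group. The computation therefore does not depend on the batch size. The mean and variance have shape $N\times G \times 1 \times 1\times 1$.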
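One way to see how GN relates to the other methods is to look at its two extremes: with a single group it uses the same statistics as LN, and with one group per channel it uses the same statistics as IN. The sketch below checks this numerically without the scale and shift parameters; `group_normalize` and `normalize_over` are hypothetical helpers.

```python
import numpy as np

def group_normalize(x, G, eps=1e-5):
    """Normalize x of shape (N, C, H, W) within G channel groups (no gamma/beta)."""
    N, C, H, W = x.shape
    x_group = x.reshape(N, G, C // G, H, W)
    mean = x_group.mean(axis=(2, 3, 4), keepdims=True)   # (N, G, 1, 1, 1)
    var = x_group.var(axis=(2, 3, 4), keepdims=True)
    x_hat = (x_group - mean) / np.sqrt(var + eps)
    return x_hat.reshape(N, C, H, W), mean.shape

def normalize_over(x, axes, eps=1e-5):
    """Normalize x using statistics taken over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6, 4, 4))

gn_two_groups, stats_shape = group_normalize(x, G=2)
gn_one_group, _ = group_normalize(x, G=1)       # all channels in one group
gn_per_channel, _ = group_normalize(x, G=6)     # one channel per group
ln_like = normalize_over(x, axes=(1, 2, 3))     # LN statistics
in_like = normalize_over(x, axes=(2, 3))        # IN statistics

print(stats_shape)
print(np.allclose(gn_one_group, ln_like), np.allclose(gn_per_channel, in_like))
```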


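def spatial_groupnorm_forward(x, gamma, beta, G, gn_param):
    """
    Forward pass for spatial group normalization.

    Input:
    - x: Data of shape (N, C, H, W)
    - gamma: Scale parameter, broadcastable to x, e.g. shape (1, C, 1, 1)
    - beta: Shift parameter, broadcastable to x, e.g. shape (1, C, 1, 1)
    - G: Number of groups; C must be divisible by G
    - gn_param: Dictionary with the following keys:
      - eps: Constant for numeric stability

    Returns a tuple of:
    - out: of shape (N, C, H, W)
    - cache: A tuple of values needed in the backward pass
    """
    eps = gn_param.get('eps', 1e-5)

    N, C, H, W = x.shape
    # Split the channels into G groups and normalize within each group
    x_group = np.reshape(x, (N, G, C // G, H, W))
    mean = np.mean(x_group, axis=(2, 3, 4), keepdims=True)
    var = np.var(x_group, axis=(2, 3, 4), keepdims=True)
    x_groupnorm = (x_group - mean) / np.sqrt(var + eps)
    x_norm = np.reshape(x_groupnorm, (N, C, H, W))
    out = x_norm * gamma + beta
    cache = (G, x, x_norm, mean, var, gamma, beta, eps)

    return out, cache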


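def spatial_groupnorm_backward(dout, cache):
    """
    Backward pass for spatial group normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, C, H, W)
    - cache: Variable of intermediates from spatial_groupnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, C, H, W)
    - dgamma: Gradient with respect to gamma, of shape (1, C, 1, 1)
    - dbeta: Gradient with respect to beta, of shape (1, C, 1, 1)
    """
    G, x, x_norm, mean, var, gamma, beta, eps = cache
    N, C, H, W = dout.shape
    dbeta = np.sum(dout, axis=(0, 2, 3), keepdims=True)
    dgamma = np.sum(dout * x_norm, axis=(0, 2, 3), keepdims=True)

    dx_norm = dout * gamma
    dx_groupnorm = dx_norm.reshape((N, G, C // G, H, W))
    x_group = x.reshape((N, G, C // G, H, W))
    std = np.sqrt(var + eps)

    # Number of elements that share one mean/variance
    N_group = C // G * H * W
    dvar = np.sum(dx_groupnorm * -0.5 * (x_group - mean) * (var + eps) ** (-1.5),
                  axis=(2, 3, 4), keepdims=True)
    dmean1 = np.sum(-dx_groupnorm / std, axis=(2, 3, 4), keepdims=True)
    dmean2 = dvar * -2.0 / N_group * np.sum(x_group - mean, axis=(2, 3, 4), keepdims=True)
    dmean = dmean1 + dmean2

    # Sum the three paths: direct, via the mean, via the variance
    dx_group1 = dx_groupnorm / std
    dx_group2 = dmean / N_group
    dx_group3 = dvar * 2.0 / N_group * (x_group - mean)
    dx = (dx_group1 + dx_group2 + dx_group3).reshape((N, C, H, W))

    return dx, dgamma, dbeta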

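Summary

BN normalizes over the batch (and spatial dimensions), keeping the channel dimension C; LN normalizes over all features of each sample, keeping the sample dimension N; IN normalizes over the spatial dimensions of each image channel, keeping N and C; GN splits the channels into groups and normalizes within each group, keeping N and G (the number of groups).
BN and GN are better suited to CNNs; LN is better suited to RNNs; IN is mainly used for style transfer.
BN uses different code for training and testing: at test time it relies on the running averages. The momentum of the running average can be tuned to obtain more accurate estimates of the mean and variance.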
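The axes each method averages over, and the resulting shape of the statistics, can be checked with a short sketch. The input size and `G = 3` are arbitrary choices for illustration.

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(8, 6, 5, 5))   # (N, C, H, W)
N, C, H, W = x.shape
G = 3  # number of groups for GN; C must be divisible by G

stats_shapes = {
    "BN": x.mean(axis=(0, 2, 3), keepdims=True).shape,   # keeps C
    "LN": x.mean(axis=(1, 2, 3), keepdims=True).shape,   # keeps N
    "IN": x.mean(axis=(2, 3), keepdims=True).shape,      # keeps N and C
    "GN": x.reshape(N, G, C // G, H, W)
           .mean(axis=(2, 3, 4), keepdims=True).shape,   # keeps N and G
}
for name, shape in stats_shapes.items():
    print(name, shape)
```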
