Paper excerpts: ResNet & Batch Normalization

Deep Residual Learning for Image Recognition

Getting straight to the point: the problems the paper raises

Problem 1: An obstacle to answering this question was the notorious problem of vanishing/exploding gradients, which hamper convergence from the beginning.

  • This problem, however, has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with back-propagation.

Problem 2: When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.
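For context on how the paper addresses this degradation (a sketch of the idea, not the paper's exact convolutional design): the stacked layers are asked to fit the residual F(x) = H(x) - x instead of the desired mapping H(x) directly, and an identity shortcut adds x back, giving y = F(x) + x. A minimal NumPy version with hypothetical fully connected layers standing in for the paper's convolutions:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Basic residual block forward pass: y = relu(F(x) + x).

    The stacked layers only have to learn the residual F(x) = H(x) - x;
    the identity shortcut carries x through unchanged, so an extra block
    can always fall back to (roughly) the identity mapping.
    """
    out = relu(x @ W1)        # first weight layer + nonlinearity
    out = out @ W2            # second weight layer (pre-activation)
    return relu(out + x)      # add the identity shortcut, then activate

# Toy usage with made-up sizes: one 64-dimensional example, random weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 64))
W1 = rng.normal(scale=0.1, size=(64, 64))
W2 = rng.normal(scale=0.1, size=(64, 64))
print(residual_block(x, W1, W2).shape)   # (1, 64)
```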

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

BN can be regarded as a standard component of modern network architectures; for example, Inception V4, ResNet, and other architectures all adopt BN.

  • the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities.

  • While stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters.

  • The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers, so that small changes to the network parameters amplify as the network becomes deeper.

  • The change in the distributions of layers’ inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer.

  • Take the tanh activation as an example for the network's input. Without normalization, when the input has a large absolute value, the input to tanh is large and tanh sits in its saturated regime; the network is then no longer sensitive to its input and essentially stops learning, since the activation output is either +1 or -1 (see the sketch after this list).

  • Because the layers of a network are applied in sequence, the statistical distribution at each layer is affected by the processing of all preceding layers. During training, distribution errors introduced in shallow layers (by noisy samples, poorly designed activation functions, and so on) therefore propagate forward to the loss function, and this error is further amplified during gradient back-propagation. This deviation of the distributions from the ideal distribution during training is called Internal Covariate Shift. The concept of Covariate Shift was introduced in statistics in 2000; the authors extend the original end-to-end notion of Covariate Shift to every layer of the network, hence the name Internal Covariate Shift.

  • Normalization/standardization: normalizing the data is a very important step; it rescales the input values to ensure better convergence during back-propagation. The usual approach is to subtract the mean and then divide by the standard deviation. Otherwise, features with larger scales get a larger weighting in the cost function. Normalizing the data puts all features on the same scale so they are weighted equally (see the sketch after this list).
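A small NumPy check of the two points above (tanh saturation and mean/std standardization); the input scale of 50 is an arbitrary choice made only to push tanh into its saturated regime:

```python
import numpy as np

rng = np.random.default_rng(0)

# Raw inputs on a large scale: tanh saturates, outputs pile up at +/-1,
# and the gradient 1 - tanh(x)^2 is almost zero, so learning stalls.
x_raw = rng.normal(loc=0.0, scale=50.0, size=10_000)
act_raw = np.tanh(x_raw)
grad_raw = 1.0 - act_raw ** 2

# Standardize: subtract the mean, then divide by the standard deviation.
x_norm = (x_raw - x_raw.mean()) / x_raw.std()
act_norm = np.tanh(x_norm)
grad_norm = 1.0 - act_norm ** 2

print(f"saturated outputs (|tanh| > 0.99): raw {np.mean(np.abs(act_raw) > 0.99):.1%}, "
      f"standardized {np.mean(np.abs(act_norm) > 0.99):.1%}")
print(f"mean gradient: raw {grad_raw.mean():.4f}, standardized {grad_norm.mean():.4f}")
```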

Batch Normalization: grasp the idea, then the form
Batch Normalization (batch standardization)

  • As such it is advantageous for the distribution of x to remain fixed over time. Θ2 does not have to readjust to compensate for the change in the distribution of x.
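Concretely, the Batch Normalizing Transform normalizes each feature with the statistics of the current mini-batch and then applies a learned scale γ and shift β. A training-mode sketch in NumPy (at inference the paper replaces the mini-batch statistics with population estimates such as moving averages; ε below is a small constant for numerical stability):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform over a mini-batch (training mode).

    x: (batch_size, num_features) activations for one layer.
    gamma, beta: learned per-feature scale and shift parameters.
    """
    mu = x.mean(axis=0)                     # mini-batch mean, per feature
    var = x.var(axis=0)                     # mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to ~zero mean, unit variance
    return gamma * x_hat + beta             # scale and shift: y = gamma * x_hat + beta

# Toy usage: a batch of 32 examples with 4 features on very different scales.
rng = np.random.default_rng(0)
x = rng.normal(loc=[0, 10, -5, 100], scale=[1, 50, 0.1, 20], size=(32, 4))
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```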

Effects:

  • Batch Normalization enables higher learning rates
  • Batch Normalization regularizes the model
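One reason higher learning rates become tolerable, per the paper's discussion: batch normalization makes a layer's output invariant to the scale of its weights, i.e. BN(Wu) = BN((aW)u), so inflated weights no longer inflate the activations and their gradients. A quick self-contained numerical check (the layer shapes and the factor a = 1000 are arbitrary choices for illustration):

```python
import numpy as np

def bn(x, eps=1e-5):
    """Normalize each feature of x using its mini-batch mean and variance."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(64, 16))        # a mini-batch of layer inputs
W = rng.normal(size=(16, 8))         # layer weights
a = 1000.0                           # an arbitrary rescaling of the weights

# BN(Wu) == BN((aW)u): scaling the weights does not change the normalized
# output, so large weights no longer amplify activations layer by layer.
print(np.allclose(bn(u @ W), bn(u @ (a * W)), atol=1e-4))  # True
```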

Deep Residual Learning
