Large-batch training
-
Linear scaling learning rate
- e.g. ResNet-50 trained with SGD: batch size 256, initial learning rate 0.1
- initial learning rate is $0.1 \times b/256$, where $b$ is the batch size
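The linear scaling rule can be sketched in a few lines. This is a minimal illustration, assuming the ResNet-50 baseline above (0.1 at batch size 256); the function and parameter names are hypothetical.

```python
def scaled_learning_rate(batch_size, base_lr=0.1, base_batch_size=256):
    """Scale the initial learning rate linearly with the batch size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return base_lr * batch_size / base_batch_size


# Doubling the batch size doubles the initial learning rate.
for b in (256, 512, 1024):
    print(b, scaled_learning_rate(b))
```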
-
Learning rate warm up
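- at the beginning of training, the parameters are far from the final solution, so a large learning rate can cause numerical instability
- e.g. we use the first $m$ batches to warm up and the initial learning rate is $\eta$; at batch $i$, where $1 \le i \le m$, set the learning rate to $i\eta/m$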
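A small sketch of linear warmup, assuming batches are counted from 1, the warmup lasts `m` batches, and the target rate is `eta`. The function name is hypothetical, and what happens after warmup (here: hold at `eta`) is an assumption; in practice a decay schedule would take over.

```python
def warmup_learning_rate(i, m, eta):
    """Learning rate at batch i (1-indexed) with linear warmup over m batches."""
    if i < 1 or m < 1:
        raise ValueError("i and m must be at least 1")
    if i <= m:
        # Ramp linearly from eta/m up to eta.
        return i * eta / m
    # Assumption: hold the rate at eta once warmup is over.
    return eta


for i in (1, 2, 5, 6):
    print(i, warmup_learning_rate(i, m=5, eta=0.1))
```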
-
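Zero $\gamma$
- Batch Normalization: normally, the elements of $\gamma$ and $\beta$ are initialized to 1s and 0s, respectively
- Instead, initialize $\gamma = 0$ for all BN layers that sit at the end of a residual block (the last BN layer in each block)
- each residual block then initially returns only its input, so the network behaves as if it had fewer layers and is easier to train at the initial stage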
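The effect of a zero scale in the last BN layer can be checked on a toy residual block. This is only a sketch: the branch is a hypothetical stand-in for the convolution layers, the final ReLU of a real block is omitted, and batch normalization is computed over a plain list of numbers.

```python
import math


def batch_norm(values, gamma, beta, eps=1e-5):
    """Normalize a list of activations, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def residual_block(x, gamma):
    """Toy block: output = x + BN(branch(x))."""
    branch = [3.0 * v - 1.0 for v in x]  # hypothetical stand-in for the conv layers
    normalized = batch_norm(branch, gamma, beta=0.0)
    return [xi + bi for xi, bi in zip(x, normalized)]


x = [0.5, -1.0, 2.0, 4.0]
print(residual_block(x, gamma=0.0))  # the block passes its input through unchanged
print(residual_block(x, gamma=1.0))  # the branch now contributes to the output
```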
-
No bias decay
- Weight decay is usually applied to all learnable parameters, both weights and biases
- it is recommended to apply the regularization only to the weights, to avoid overfitting; biases and the BN parameters $\gamma$ and $\beta$ are left unregularized
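One way to implement this is to split the parameters into two groups with different decay values. The sketch below works on parameter names and shapes only; the names, the default decay value, and the rule "2 or more dimensions means weight" are assumptions for illustration. The group layout loosely mirrors the per-parameter option dictionaries that optimizers such as `torch.optim.SGD` accept.

```python
def split_decay_groups(named_shapes, weight_decay=1e-4):
    """Return two parameter groups: decayed weights, and everything else undecayed.

    named_shapes maps a parameter name to its shape tuple. Assumption: conv and
    fully-connected weights have 2 or more dimensions, while biases and the BN
    gamma/beta vectors are 1-D.
    """
    decay, no_decay = [], []
    for name, shape in named_shapes.items():
        if len(shape) >= 2:
            decay.append(name)
        else:
            no_decay.append(name)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Hypothetical parameter names and shapes.
params = {
    "conv1.weight": (64, 3, 7, 7),
    "bn1.gamma": (64,),
    "bn1.beta": (64,),
    "fc.weight": (1000, 2048),
    "fc.bias": (1000,),
}
for group in split_decay_groups(params):
    print(group)
```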
Low-precision training (reduced bit width)
- Normal setting: 32-bit floating point (FP32) precision
- Trick: switch to FP16 together with a larger batch size (1024); this gives slightly higher accuracy than the FP32 baseline
Model Tweaks
ResNet Architecture
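- ResNet-B
- avoids the information loss caused by the 1x1 conv with stride 2 in the downsampling block
- ResNet-C
- reduces computation by replacing the 7x7 conv in the input stem with three 3x3 convs
- ResNet-D
- the 1x1 conv with stride 2 in path B of ResNet-B still loses information; adding an avgpool with stride 2 before it (and changing the conv stride to 1) effectively avoids this loss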
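Why a 1x1 convolution with stride 2 loses information can be seen on a tiny single-channel example. This is a sketch under simplifying assumptions: one channel, a conv weight of 1, and a 2x2 average pooling window; the function names are hypothetical. The strided 1x1 conv reads only one pixel out of every 2x2 window, while average pooling uses all four.

```python
def conv1x1_stride2(image):
    """1x1 conv with weight 1 and stride 2: keeps every second row and column."""
    return [row[::2] for row in image[::2]]


def avgpool2x2_stride2(image):
    """2x2 average pooling with stride 2: every input pixel contributes."""
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, len(image[0]) - 1, 2)
        ]
        for r in range(0, len(image) - 1, 2)
    ]


image = [
    [1, 9, 2, 9],
    [9, 9, 9, 9],
    [3, 9, 4, 9],
    [9, 9, 9, 9],
]
print(conv1x1_stride2(image))     # [[1, 2], [3, 4]]: the 9s are never read
print(avgpool2x2_stride2(image))  # [[7.0, 7.25], [7.5, 7.75]]
```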
Training Refinement
-
Cosine Learning Rate Decay
- $\eta_t = \frac{1}{2}\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)\eta$
- where $T$ is the total number of batches (ignoring the warmup stage)
- $t$ is the current batch
- $\eta$ is the initial learning rate
- potentially improves the training progress
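A minimal sketch of the cosine schedule, assuming the usual form $\eta_t = \frac{1}{2}(1 + \cos(t\pi/T))\eta$ with `t` counted from 0 after warmup. The function and argument names are hypothetical.

```python
import math


def cosine_learning_rate(t, total_batches, eta):
    """Cosine-decayed learning rate at batch t, out of total_batches batches."""
    if not 0 <= t <= total_batches:
        raise ValueError("t must lie in [0, total_batches]")
    return 0.5 * (1.0 + math.cos(t * math.pi / total_batches)) * eta


# The rate starts at eta, falls slowly at first, fastest mid-training,
# and slowly again as it approaches 0.
for t in (0, 25, 50, 75, 100):
    print(t, round(cosine_learning_rate(t, total_batches=100, eta=0.1), 4))
```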
-
Label Smoothing
-
Knowledge Distillation
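- Start with a complex network (N1)
- Train N1 on the data to obtain the model (M1)
- Design a simple network (N0) based on the complex one
- Set T=20 in the softmax of M1 and run it on the data to obtain the soft targets
- Combine the soft targets and the hard targets with weights to get the final target (recommended 0.1:0.9)
- Train N0 (with T=20) on the dataset labeled with this combined target to obtain M0
- Set T=1; M0 is then the trained compact model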
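The target construction in the steps above can be sketched with a temperature softmax. This is an illustration only: the teacher logits are made up, the function names are hypothetical, and the weights follow the 0.1:0.9 (soft:hard) ratio from the notes.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_target(teacher_logits, label, temperature=20.0, soft_weight=0.1):
    """Blend the teacher's soft target with the one-hot hard target."""
    soft = softmax(teacher_logits, temperature)
    hard = [1.0 if k == label else 0.0 for k in range(len(teacher_logits))]
    return [soft_weight * s + (1.0 - soft_weight) * h for s, h in zip(soft, hard)]


teacher_logits = [8.0, 2.0, -1.0]  # made-up teacher outputs for one example
print(softmax(teacher_logits, temperature=1.0))   # sharply peaked
print(softmax(teacher_logits, temperature=20.0))  # much flatter soft target
print(distillation_target(teacher_logits, label=0))
```

With T=20 the teacher's distribution keeps noticeable mass on the non-target classes, which is the extra signal the simple network learns from.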
-
Mixup Training
Data Augmentation
-
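Weighted linear interpolation
- $\hat{x} = \lambda x_i + (1 - \lambda) x_j$
- $\hat{y} = \lambda y_i + (1 - \lambda) y_j$
- where $(x_i, y_i)$ and $(x_j, y_j)$ are two randomly sampled examples and $\lambda \in [0, 1]$ is drawn from a $\mathrm{Beta}(\alpha, \alpha)$ distribution
- In mixup training, we only use the new example $(\hat{x}, \hat{y})$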
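Mixup blends two training examples by weighted linear interpolation of both inputs and one-hot labels, with the mixing weight drawn from a Beta distribution. The sketch below uses plain lists; the function name, the `rng` argument, and the value `alpha=0.2` are assumptions for illustration, not values taken from these notes.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two examples (inputs and one-hot labels) with a Beta-distributed weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x_hat = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_hat = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_hat, y_hat, lam


rng = random.Random(0)  # seeded generator so the run is repeatable
x_hat, y_hat, lam = mixup([0.0, 1.0, 2.0], [1.0, 0.0],
                          [4.0, 3.0, 2.0], [0.0, 1.0], rng=rng)
print(lam, x_hat, y_hat)
```

Only the blended pair is fed to the network; the two original examples are not used directly in that step.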
Result