2019-10-18

https://www.kaggle.com/kernels/scriptcontent/20478888/data
Gradient accumulation: because of GPU memory limits, gradients have to be accumulated over several iterations before doing a single weight update.
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255

1: loss = loss / self.accumulation_steps. The loss is divided by self.accumulation_steps; I think there is no need to do that, so why divide it?

2: if (itr + 1) % self.accumulation_steps == 0: self.optimizer.step(); self.optimizer.zero_grad()
The gradients are not applied immediately. Is that because of the very small batch size?

That's part of gradient accumulation: 32 means we accumulate the gradients from 32 samples and only then take an optimizer step. In this way we can train with an effective batch size of 32 even though GPU memory constraints don't allow us to use 32 samples at a time. See the Hugging Face post linked above for details.
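Below is a minimal PyTorch sketch of that loop; the tiny linear model, random data, and hyperparameters are placeholders, not the actual setup from the Kaggle kernel. It also answers question 1: backward() adds gradients into the .grad buffers, so scaling each micro-batch loss by 1/accumulation_steps makes the accumulated gradient equal to the average over the effective batch, the same as one large batch would give.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup (illustrative only; the real kernel trains a much larger model).
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=1)   # micro-batch that fits in GPU memory

accumulation_steps = 32                      # effective batch size = 1 * 32 = 32

optimizer.zero_grad()
for itr, (inputs, targets) in enumerate(loader):
    loss = criterion(model(inputs), targets)

    # Question 1: backward() *adds* to the .grad buffers, so dividing each
    # micro-batch loss by accumulation_steps makes the accumulated gradient
    # the average over the 32 samples, matching a single batch of 32.
    loss = loss / accumulation_steps
    loss.backward()

    # Question 2: the weights are updated only every accumulation_steps
    # iterations, not immediately; in between, gradients keep accumulating.
    if (itr + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()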

