https://www.kaggle.com/kernels/scriptcontent/20478888/data
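Gradient accumulation: because GPU memory is limited, gradients are accumulated over several iterations and then applied in a single update.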
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
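The arithmetic behind gradient accumulation can be sketched in plain Python. This is only a toy illustration, not the kernel's code: the one-parameter model, the data, and the helper names `mse_grad` and `accumulated_grad` are all hypothetical. It assumes a mean-reduced loss and micro-batches of equal size, and shows that scaling each micro-batch loss by 1 / accumulation_steps makes the accumulated gradient match the gradient of one large batch.

```python
# Hypothetical toy setup: one-parameter linear model y_hat = w * x
# trained with mean squared error. None of these names come from the kernel.

def mse_grad(w, batch):
    """Gradient of the mean squared error over `batch` with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, micro_batches):
    """Accumulate per-micro-batch gradients, each scaled by 1 / accumulation_steps."""
    accumulation_steps = len(micro_batches)
    grad = 0.0  # plays the role of the gradient buffer right after zero_grad()
    for batch in micro_batches:
        # Dividing the loss by accumulation_steps divides its gradient too.
        grad += mse_grad(w, batch) / accumulation_steps
    return grad


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
micro_batches = [data[:2], data[2:]]  # two micro-batches of two samples each
w = 0.5

full = mse_grad(w, data)                              # one batch of four samples
scaled = accumulated_grad(w, micro_batches)           # accumulation with loss scaling
unscaled = sum(mse_grad(w, b) for b in micro_batches)  # accumulation without scaling

print("full batch:      ", full)
print("scaled accum.:   ", scaled)
print("unscaled accum.: ", unscaled)
```

Without the division, the accumulated gradient is accumulation_steps times larger than the full-batch gradient, which acts like an unintended increase of the learning rate.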
1: loss = loss / self.accumulation_steps : the loss is divided by self.accumulation_steps. I don't think that is necessary. Why is it divided?
2: if (itr + 1) % self.accumulation_steps == 0: self.optimizer.step() self.optimizer.zero_grad()
The weights are not updated immediately after each batch. Is that because of the very small batch_size?
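That's part of gradient accumulation. With accumulation_steps set to 32, the gradients of 32 consecutive mini-batches are summed before a single optimizer step, so the update behaves like one computed on a batch 32 times larger than what fits in GPU memory at once. Dividing the loss by accumulation_steps turns that sum into an average, which matches the gradient of the larger batch when the loss is mean-reduced. In this way we can train with an effective batch size of 32 even though GPU memory constraints don't allow us to process 32 samples at a time. Check this out: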