Notes on Language Model Training Techniques

[Training a Language Model on a Single GPU in One Day](https://arxiv.org/pdf/2212.14034.pdf)

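Basic Setup

A vocabulary size of $2^{15}$ works best. Increasing it to $2^{16}$ brings no clear improvement, while smaller vocabularies ($2^{12}, 2^{13}, 2^{14}$) degrade performance.
Removing <sep> during training has a very small effect on performance; removing <cls> has no effect.
A shorter sequence length, such as 128, is sufficient for most downstream tasks and reduces the cost of attention.
Because of the memory limit of the RTX 2080 Ti, only small microbatches of 64-96 sequences fit on the GPU; gradient accumulation can be used to reach a larger effective batch size and better results.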

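Model Architecture

The per-token learning efficiency of a model depends strongly on model size and only weakly on the type of transformer. In the compute-limited experiments, larger models learn more per token, while smaller models learn less per token but can process more batches within the same budget; in the end all models reach a loss of around 1.9.
Attention
Use pre-normalization.
Rotary embeddings (Su et al., 2021; Black et al., 2022) bring a small performance gain but reduce training speed.
FlashAttention shows no clear gain at a sequence length of 128.
FFN
Remove the QKV bias terms and set the number of heads to 12.
Removing the QKV bias terms slightly speeds up computation.
Reducing the number of heads improves computational throughput but slightly lowers finetuning performance.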

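Embedding
Use scaled sinusoidal positional embeddings and apply layer normalization at the end of the embedding block.

Scaled sinusoidal positional embeddings bring more gain than learned positional embeddings or unscaled sinusoidal positional embeddings.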
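As a rough illustration of the embedding discussed above, the following sketch builds a sinusoidal position table in plain Python. The `scale` argument is a hypothetical fixed multiplier standing in for the scaling the notes mention; how the paper's implementation actually scales the table is not stated here, so treat the value `1/sqrt(dim)` as an assumption.

```python
import math


def sinusoidal_embedding(seq_len, dim, scale=1.0):
    """Return a seq_len x dim table of sinusoidal position embeddings.

    `scale` is a hypothetical fixed multiplier (assumption); scale=1.0
    gives the plain, unscaled variant.
    """
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, dim, 2):
            # Even index i uses sin, the following odd index uses cos.
            angle = pos / (10000 ** (i / dim))
            row.append(scale * math.sin(angle))
            if i + 1 < dim:
                row.append(scale * math.cos(angle))
        table.append(row)
    return table


# Sequence length 128 as in the notes; dim=8 only keeps the example small.
emb = sinusoidal_embedding(seq_len=128, dim=8, scale=1.0 / math.sqrt(8))
```

Position 0 always maps to alternating `0` and `scale` values, which makes the effect of the multiplier easy to inspect.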

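Layer structure
Use pre-normalization with layer norm instead of post-normalization.

The authors argue that the main value of pre-norm is that it stabilizes training, which allows a larger learning rate and fewer warm-up steps; used on its own it brings limited benefit.

Apart from pre-norm, none of the other variants tested gave additional improvement.
Replacing layer normalization with RMS normalization also brought no gain.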
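The difference between the two placements of layer norm can be shown with a minimal sketch. It works on plain Python lists, omits the learnable gain and bias of a real layer norm, and uses `toy_sublayer` as a hypothetical stand-in for attention or the FFN.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_block(x, sublayer):
    # Post-norm: normalize after the residual addition.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_block(x, sublayer):
    # Pre-norm: normalize only the sublayer input; the residual path is untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(x):
    # Hypothetical stand-in for attention or FFN.
    return [0.5 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
pre = pre_norm_block(x, toy_sublayer)
post = post_norm_block(x, toy_sublayer)
```

In the pre-norm variant the input `x` passes through the block unchanged on the residual path, which is the property usually credited with the more stable training mentioned above.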

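Head block
Use a linear head and sparse token prediction, and add a layer norm at the end of the block.

Removing the nonlinear head has no negative effect.
Sparse token prediction saves memory.
Adding a layer norm at the end of the block stabilizes training.
Can the following approach reduce the decoder bias? From Language Models are Unsupervised Multitask Learners: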

Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016) and an additional layer normalization was added after the final self-attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of $\frac{1}{\sqrt{N}}$ where N is the number of residual layers.
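The $\frac{1}{\sqrt{N}}$ scaling in the quoted passage can be illustrated with a small sketch. The base standard deviation `std=0.02`, the Gaussian draw, and both function names are assumptions made for the example; the quote itself only specifies the scaling factor.

```python
import math
import random


def residual_scale(n_residual_layers):
    """Scaling factor 1/sqrt(N) from the quoted passage."""
    return 1.0 / math.sqrt(n_residual_layers)


def init_residual_weights(fan_in, fan_out, n_residual_layers, std=0.02, seed=0):
    """Draw a fan_in x fan_out weight matrix and scale it by 1/sqrt(N).

    std=0.02 is an assumed base standard deviation, not taken from the notes.
    """
    rng = random.Random(seed)
    scale = residual_scale(n_residual_layers)
    return [[rng.gauss(0.0, std) * scale for _ in range(fan_out)]
            for _ in range(fan_in)]


# With 16 residual layers, weights on the residual path are shrunk by a factor of 4.
weights = init_residual_weights(fan_in=4, fan_out=3, n_residual_layers=16)
```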

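Hyperparameter Settings

Objective
Use only the MLM objective with a masking rate of 15%. Of the masked positions, 10% are replaced with a random token and 10% keep the original token.

A higher masking rate did not improve performance.
Other loss functions, such as mean-squared error (Hui & Belkin, 2021) or L1 loss, showed no benefit.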
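The masking rule above can be sketched in a few lines. This is an illustrative version, not the paper's code: it assumes the remaining 80% of selected positions are replaced by a mask token, uses `-100` as the "ignore this position" label (a common convention, assumed here), and the token ids, `mask_id`, and `vocab_size` values are made up for the example.

```python
import random


def mask_tokens(token_ids, vocab_size, mask_id, mask_rate=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 marks positions excluded from the loss
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_rate:
            continue  # position not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_id                    # assumed 80%: mask token
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels


tokens = [i % 1000 + 5 for i in range(10000)]  # made-up token ids
inputs, labels = mask_tokens(tokens, vocab_size=2 ** 15, mask_id=4)
```

Only positions whose label is not `-100` would contribute to the loss, which is also what makes sparse token prediction possible.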

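Optimizer
Use Adam with weight decay 0.01, β1 = 0.9, β2 = 0.98, and ε = $10^{-12}$. Enable gradient clipping with a clip value of 0.5.

Varying these parameters within a reasonable range causes no significant change.
Other first-order and higher-order optimizers were also tried and showed no significant advantage; note that higher-order optimizers vary considerably in how they are implemented.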
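To make the hyperparameters above concrete, here is a plain-Python sketch of one Adam step with gradient-norm clipping. It assumes decoupled weight decay (the AdamW formulation) and clipping by global L2 norm; the notes do not say which weight-decay variant is used, so that choice and the learning rate `lr=1e-3` in the signature are assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their global L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        grads = [g * max_norm / total for g in grads]
    return grads


def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.98,
              eps=1e-12, weight_decay=0.01, clip=0.5):
    """One Adam update with clipping and decoupled weight decay (assumption)."""
    grads = clip_by_global_norm(grads, clip)
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = state["v"][i] / (1 - beta2 ** t)  # bias-corrected second moment
        p = p - lr * weight_decay * p             # decoupled weight decay
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
    return new_params


state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
params = adam_step([1.0, -2.0], [3.0, 4.0], state)
```

In the example the gradient `[3.0, 4.0]` has norm 5 and is rescaled to norm 0.5 before the update, so the first step moves each parameter by roughly `lr` in the direction opposite its gradient.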

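Learning
Use a one-cycle learning rate schedule with a peak learning rate of 0.001. The schedule is tied to the compute budget, so the learning rate decays as the remaining budget shrinks.

Viewed globally, the different learning rate schedules produce similar loss curves, but the choice of schedule still matters: under a fixed compute budget, the one-cycle schedule with a peak of 0.001 performed best.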
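A budget-based one-cycle schedule can be sketched as a triangular function of the fraction of the budget already used. The peak of 0.001 comes from the notes; the position of the peak (`peak_at=0.5`) and the linear shape of both ramps are assumptions made for illustration.

```python
def one_cycle_lr(progress, peak_lr=1e-3, peak_at=0.5):
    """Learning rate as a function of budget progress in [0, 1].

    Rises linearly to peak_lr at `peak_at` (assumed position), then
    decays linearly to zero when the budget is exhausted.
    """
    progress = min(max(progress, 0.0), 1.0)
    if progress < peak_at:
        return peak_lr * progress / peak_at
    return peak_lr * (1.0 - progress) / (1.0 - peak_at)


# Sample the schedule at a few points of the budget.
schedule = [one_cycle_lr(p / 10) for p in range(11)]
```

Because `progress` is measured against the budget instead of a fixed step count, the same function works whether the budget is expressed in wall-clock time or in steps.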

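Steps
On the 2080 Ti the microbatch size is set to 96. A batch size of around 1536 gives the lowest pretraining loss, but a batch size of 4032 gives the best downstream performance. On GPUs with more memory, such as the A4000 and A6000, the microbatch size is set to 128/256 and the batch size to 4032.

Ordinary training performs one parameter update per batch. Because BERT needs a large batch size, gradient accumulation is used here: the gradients are kept after each microbatch, and the parameter update is performed only after n microbatches have been accumulated.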
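The relationship between microbatch size, batch size, and the number of parameter updates can be checked with a toy loop. No model is involved; `count_updates` and `accumulation_steps` are hypothetical helper names, and the loop only mirrors the "accumulate samples, then update" logic described above.

```python
import math


def accumulation_steps(batch_size, microbatch_size):
    """Number of microbatches needed to reach (at least) one full batch."""
    return math.ceil(batch_size / microbatch_size)


def count_updates(num_microbatches, batch_size, microbatch_size):
    """Count parameter updates in a toy gradient-accumulation loop."""
    accumulated_samples = 0
    updates = 0
    for _ in range(num_microbatches):
        # A real loop would run forward/backward here and add to the stored gradients.
        accumulated_samples += microbatch_size
        if accumulated_samples >= batch_size:
            updates += 1             # optimizer step, then gradients are zeroed
            accumulated_samples = 0
    return updates


steps_4032 = accumulation_steps(4032, 96)   # microbatch 96, batch 4032
steps_1536 = accumulation_steps(1536, 96)   # microbatch 96, batch 1536
```

With a microbatch of 96, a batch of 4032 takes 42 microbatches per update and a batch of 1536 takes 16.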

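1. Linearly increasing the batch size (the number of accumulated microbatches) over the course of training yields more progress early in training and a small benefit to the final result.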

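# cramming/cramming/backend/torch_default.py
# class TorchEngine
    def backward(self, loss):
        # Track how many samples have contributed gradients since the last update.
        self.accumulated_samples += self.cfg_impl.microbatch_size
        # Divide by the expected number of accumulation steps so the summed gradients average over the batch.
        return self.scaler.scale(loss / self.accumulation_steps_expected).backward()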

    @torch.no_grad()
    def forward_inference(self, *inputs, **kwargs):
        with torch.autocast(**self.amp_settings):
            outputs = self.model(*inputs, **kwargs)["logits"]
        if outputs.shape[-1] == 1:
            predictions = outputs.squeeze(dim=-1)
        else:
            predictions = outputs.argmax(dim=-1)
        return outputs, predictions

    def optimizer_step(self):
        """Requires a scheduler that is based on iterations instead of epochs."""
        self.steps += 1
        if self.accumulated_samples >= self.current_batch_size:
            self.accumulated_samples = 0

            if self.cfg_train.gradient_clipping is not None:
                self.scaler.unscale_(self.optimizer)
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.cfg_train.gradient_clipping, norm_type=2.0)
            self.scaler.step(self.optimizer)
            self.scaler.update()
            self.optimizer.zero_grad()
            self.schedule_batch_size()
            self.schedule_curriculum()
            self.moving_average_computation()
        self.scheduler.step()  # Trigger in every step, otherwise things get annoying with grad accumulation
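The excerpt calls `schedule_batch_size()` without showing it. A linear batch-size ramp could look roughly like the sketch below; this is not the repository's implementation, and the function name, its arguments, and the choice to start from a single microbatch are all assumptions.

```python
def linear_batch_size(step, total_steps, microbatch_size, final_batch_size,
                      start_microbatches=1):
    """Batch size at `step`, growing linearly in whole microbatches.

    Hypothetical helper: starts at `start_microbatches` microbatches and
    reaches `final_batch_size` at the end of training.
    """
    final_microbatches = max(final_batch_size // microbatch_size, start_microbatches)
    frac = min(max(step / total_steps, 0.0), 1.0)
    n = start_microbatches + int(frac * (final_microbatches - start_microbatches))
    return n * microbatch_size


# Microbatch 96 ramping up to a batch size of 4032 over 1000 steps.
sizes = [linear_batch_size(s, 1000, 96, 4032) for s in range(0, 1001, 100)]
```

Early in training the threshold in `optimizer_step` would then be reached after fewer microbatches, which gives more frequent parameter updates at the start.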

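Dropout
Disable dropout during pretraining, and use a dropout rate of 0.1 for downstream tasks.

When training data is scarce relative to the available compute, dropout helps prevent overfitting.
When compute is the limiting factor, dropout reduces the number of updates each parameter receives per second, which lowers the efficiency of parameter updates.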

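Dataset Settings

Two data-based approaches are used to scale down more effectively: first, filter, preprocess, and sort the existing data; second, swap the data source.
To this end, the authors trained on several datasets, including the Pile subsets Gutenberg, Books3, and Wikipedia (en), as well as Common Crawl (C4), taking the first $4 \times 10^6$ entries from the Pile and the first $20 \times 10^6$ entries from Common Crawl.
Without extra processing, the Pile performs best on the downstream task MNLI, while C4 improves further after some additional processing. Deduplication was tried and brought no improvement; removing entries based on their compression ratio did improve performance.
To improve results further, the sequences are also sorted and the batch size is increased toward the end of training.

Deduplicating the dataset following Deduplicating Training Data Makes Language Models Better had no effect on downstream performance.
Removing hard-to-compress data improved performance. The filter tokenizes each entry with the tokenizer built for the dataset and removes entries whose token count exceeds 0.3 times the number of raw characters.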

We use the tokenizer itself to remove all training sequences from C4 set that cannot be compressed well; we simply set a threshold t, e.g. t = 0.3, and drop all entries from the dataset where the number of tokens in the entry is larger than t times the number of raw characters. This removes, for example, sequences consisting of hard-to-compress HTML or markdown code.
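The threshold rule in the quote can be sketched as a simple filter. The regex-based `toy_tokenize` is a hypothetical stand-in for the real trained tokenizer, so the exact ratios differ from what a subword tokenizer would give; only the comparison against `t` times the character count follows the quoted rule.

```python
import re


def toy_tokenize(text):
    # Hypothetical stand-in tokenizer: words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def keep_entry(text, tokenize=toy_tokenize, threshold=0.3):
    """Keep an entry unless its token count exceeds threshold * character count."""
    return len(tokenize(text)) <= threshold * len(text)


entries = [
    "the quick brown fox jumps over the lazy dog",  # 9 tokens, 43 characters
    "<div><p></p></div>",                           # 14 tokens, 18 characters
]
filtered = [e for e in entries if keep_entry(e)]
```

The markup-heavy entry produces almost one token per character and is dropped, while the plain sentence stays well under the threshold.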
The data is then sorted by average (unigram) token prevalence.
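One possible reading of this sorting step is sketched below: estimate unigram frequencies over the tokenized corpus, score each sequence by the mean frequency of its tokens, and put the highest-scoring sequences first. The sort direction and the helper name are assumptions; the notes only name the metric.

```python
from collections import Counter


def sort_by_token_prevalence(sequences):
    """Sort tokenized sequences by average unigram token prevalence, highest first."""
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values())

    def avg_prevalence(seq):
        if not seq:
            return 0.0
        return sum(counts[tok] / total for tok in seq) / len(seq)

    # Descending order is an assumption: sequences made of common tokens come first.
    return sorted(sequences, key=avg_prevalence, reverse=True)


# Made-up token ids: 1 and 2 are frequent, 3, 4 and 5 are rare.
corpus = [[3, 4, 5], [1, 1, 2], [1, 2, 2]]
ordered = sort_by_token_prevalence(corpus)
```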