Paper Reading: InstructGPT - Training language models to follow instructions with human feedback

Paper: https://export.arxiv.org/pdf/2203.02155.pdf

Citation: Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 2022, 35: 27730-27744.

Background and Challenges

Large language models (LMs) can be "prompted" to perform a range of natural language processing (NLP) tasks, given some examples of the task as input. However, these models often express unintended behaviors such as making up facts, generating biased or toxic text, or simply not following user instructions.

The objective used to train many recent large LMs, predicting the next token on a webpage from the internet, is different from the objective "follow the user's instructions helpfully and safely". Thus, we say that the language modeling objective is misaligned.

Related Work

Research on alignment and learning from human feedback.

These techniques were originally developed for training simple robots in simulated environments and on Atari games, and have recently been applied to fine-tuning language models to summarize text. This work was in turn influenced by similar work using human feedback as a reward in domains such as dialogue, translation, semantic parsing, story generation, review generation, and evidence extraction.

Training language models to follow instructions.

A consistent finding in this line of work is that fine-tuning LMs on a range of NLP tasks with instructions improves their downstream performance in both zero-shot and few-shot settings.

Evaluating the harms of language models.

Language models can produce biased output, leak private data, generate misinformation, and be used maliciously.

Modifying the behavior of language models to mitigate harms.

Fine-tuning an LM on a small, values-targeted dataset improves the model's ability to adhere to these values on a question-answering task.

Another approach filters the pretraining dataset by removing documents on which a language model has a high conditional likelihood of generating a set of researcher-written trigger phrases. LMs trained on this filtered dataset generate less harmful text, at the cost of a slight decrease in language modeling performance.

Various methods have been used to improve the safety of chatbots, including data filtering, blocking certain words or n-grams during generation, and safety-specific control tokens.

Other approaches to mitigating the bias generated by LMs use word-embedding regularization, data augmentation, null-space projection to make the distribution over sensitive tokens more uniform, or causal mediation analysis.

A second (usually smaller) language model can also be used to steer the main model's generation; variants of this idea have been applied to reducing language model toxicity.


Approach

1. We make progress on aligning language models by training them to act in accordance with the user's intention. This encompasses both explicit intentions such as following instructions and implicit intentions such as staying truthful, and not being biased, toxic, or otherwise harmful.

2. We focus on fine-tuning approaches to aligning language models. Specifically, we use reinforcement learning from human feedback (RLHF) to fine-tune GPT-3 to follow a broad class of written instructions (see Figure 2).

3. We evaluate our models by having our labelers rate the quality of model outputs on our test set, consisting of prompts from held-out customers (who are not represented in the training data).

Experiments

Datasets

Our prompt dataset consists primarily of text prompts submitted to the OpenAI API, specifically those using an earlier version of the InstructGPT models (trained via supervised learning on a subset of our demonstration data) on the Playground interface.

We asked labelers to write three kinds of prompts:

• Plain: We simply ask the labelers to come up with an arbitrary task, while ensuring the tasks had sufficient diversity.

• Few-shot: We ask the labelers to come up with an instruction, and multiple query/response pairs for that instruction.

• User-based: We had a number of use-cases stated in waitlist applications to the OpenAI API. We asked labelers to come up with prompts corresponding to these use cases.


From these prompts, we produce three different datasets used in our fine-tuning procedure:

(1) our SFT dataset, with labeler demonstrations used to train our SFT models; it contains about 13k training prompts (from the API and labeler-written);

(2) our RM dataset, with labeler rankings of model outputs used to train our RMs; it has 33k training prompts (from the API and labeler-written);

(3) our PPO dataset, without any human labels, used as inputs for RLHF fine-tuning; it has 31k training prompts (only from the API).


Tasks

Our training tasks are from two sources:

(1) a dataset of prompts written by our labelers, and

(2) a dataset of prompts submitted to early InstructGPT models on our API (see Table 6).


Models

We start with the GPT-3 pretrained language models from Brown et al. (2020). These models are trained on a broad distribution of Internet data and are adaptable to a wide range of downstream tasks, but have poorly characterized behavior.

We then train models with three different techniques:

Supervised fine-tuning (SFT). We fine-tune GPT-3 on our labeler demonstrations using supervised learning. The SFT models overfit on validation loss after one epoch; nevertheless, we find that training for more epochs helps both the RM score and human preference ratings, despite this overfitting.
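
A minimal sketch of what this SFT step looks like in code, assuming a small public GPT-2 checkpoint as a stand-in for GPT-3 (which is not publicly available) and the Hugging Face transformers API; masking the prompt tokens out of the loss is a common convention assumed here, not a detail the paper specifies.

```python
# SFT sketch: next-token cross-entropy on one (prompt, labeler demonstration) pair.
# "gpt2" is a public stand-in for the GPT-3 models actually used in the paper.
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = AdamW(model.parameters(), lr=1e-5)

prompt = "Explain the moon landing to a 6 year old."
demonstration = " Some people went to the moon, took pictures, and sent them back to Earth."

input_ids = tokenizer(prompt + demonstration, return_tensors="pt").input_ids
prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]

labels = input_ids.clone()
labels[:, :prompt_len] = -100  # ignore prompt tokens; learn only the demonstration

optimizer.zero_grad()
loss = model(input_ids, labels=labels).loss  # standard causal-LM cross-entropy
loss.backward()
optimizer.step()
```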

Reward modeling (RM). Starting from the SFT model with the final unembedding layer removed, we train a model to take in a prompt and response and output a scalar reward. In this paper we only use 6B RMs, as this saves a lot of compute.
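
Concretely, labelers rank K outputs per prompt, and the RM is trained with the pairwise ranking loss given in the paper, where all of the $\binom{K}{2}$ comparisons from one prompt are treated as a single batch element:

```latex
\operatorname{loss}(\theta) =
  -\frac{1}{\binom{K}{2}}
  \,\mathbb{E}_{(x,\, y_w,\, y_l) \sim D}
  \Big[ \log \sigma \big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \Big]
```

Here r_theta(x, y) is the scalar reward for prompt x and completion y, y_w is the completion preferred by the labeler in the pair (y_w, y_l), sigma is the logistic function, and D is the dataset of human comparisons.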

Reinforcement learning (RL). We fine-tune the SFT model on our environment using PPO. The environment is a bandit environment that presents a random customer prompt and expects a response to the prompt. Given the prompt and response, it produces a reward determined by the reward model and ends the episode. In addition, we add a per-token KL penalty from the SFT model at each token to mitigate over-optimization of the reward model.
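
A minimal sketch of how that shaped reward could be assembled; the callables reward_model, policy_logprobs, and sft_logprobs and the coefficient beta are hypothetical placeholders (not the paper's implementation), and the KL penalty is summed into the episode-end reward here rather than distributed over timesteps.

```python
import torch

def shaped_reward(prompt: str, response: str,
                  reward_model, policy_logprobs, sft_logprobs,
                  beta: float = 0.02) -> torch.Tensor:
    """Bandit-episode reward: RM score minus a KL penalty against the SFT model.

    reward_model(prompt, response)    -> scalar tensor r_theta(x, y)         (hypothetical)
    policy_logprobs(prompt, response) -> per-token log pi_RL(y_t | x, y_<t)  (hypothetical)
    sft_logprobs(prompt, response)    -> per-token log pi_SFT(y_t | x, y_<t) (hypothetical)
    """
    r = reward_model(prompt, response)  # scalar reward from the RM at episode end
    kl = policy_logprobs(prompt, response) - sft_logprobs(prompt, response)
    # Penalizing divergence from the SFT model keeps the RL policy close to it
    # and mitigates over-optimization of the reward model.
    return r - beta * kl.sum()
```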

Procedure

We start with a pretrained language model, a distribution of prompts on which we want our model to produce aligned outputs, and a team of trained human labelers. We then apply the following three steps (Figure 2).


Overall Training Pipeline

Step 1: Collect demonstration data, and train a supervised policy. Our labelers provide demonstrations of the desired behavior on the input prompt distribution (see Section 3.2 for details on this distribution). We then fine-tune a pretrained GPT-3 model on this data using supervised learning.

Step 2: Collect comparison data, and train a reward model. We collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input. We then train a reward model to predict the human-preferred output.

Step 3: Optimize a policy against the reward model using PPO. We use the output of the RM as a scalar reward. We fine-tune the supervised policy to optimize this reward using the PPO algorithm (Schulman et al., 2017).
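
In the paper's notation, the full objective maximized in this step combines the RM reward, the KL penalty toward the SFT policy, and (in the "PPO-ptx" variant, used to reduce regressions on public NLP datasets) a pretraining language-modeling term weighted by gamma:

```latex
\text{objective}(\phi) =
  \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}
    \Big[ r_\theta(x, y)
      - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \Big]
  + \gamma \,\mathbb{E}_{x \sim D_{\mathrm{pretrain}}}
    \Big[ \log \pi_\phi^{\mathrm{RL}}(x) \Big]
```

Plain PPO corresponds to gamma = 0; pi_phi^RL is the learned policy and pi^SFT is the supervised fine-tuned model it is initialized from.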

Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy. In practice, most of our comparison data comes from our supervised policies, with some coming from our PPO policies.
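
The whole loop can be summarized as a short sketch; every callable argument below is a hypothetical stand-in for one of the stages above, and n_iterations is illustrative rather than the paper's actual schedule.

```python
# High-level control flow of the three-step pipeline (Figure 2); the stage
# functions are passed in as placeholders rather than implemented here.
def instructgpt_pipeline(pretrained_lm, prompts,
                         collect_demonstrations, train_sft,
                         collect_comparisons, train_rm, train_ppo,
                         n_iterations=2):
    demos = collect_demonstrations(prompts)            # Step 1: labeler demonstrations
    policy = train_sft(pretrained_lm, demos)           # supervised fine-tuning

    for _ in range(n_iterations):                      # Steps 2 and 3 can be iterated
        comparisons = collect_comparisons(policy, prompts)  # labelers rank sampled outputs
        reward_model = train_rm(policy, comparisons)        # Step 2: scalar reward model
        policy = train_ppo(policy, reward_model, prompts)   # Step 3: PPO against the RM
    return policy
```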


Evaluation

Evaluations on the API distribution

Our main metric is human preference ratings on a set of prompts drawn from the same source as our training distribution.

We also evaluate on prompts submitted to GPT-3 models on the API; these prompts are generally not in an 'instruction following' style, but are designed specifically for GPT-3 (see Table 3).


Evaluations on public NLP datasets

We evaluate on two types of public datasets:

one type captures aspects of language model safety, particularly truthfulness, toxicity, and bias;

the other captures zero-shot performance on traditional NLP tasks such as question answering, reading comprehension, and summarization.



Findings

Labelers prefer InstructGPT outputs over GPT-3 outputs.

Outputs from our 175B InstructGPT are preferred to 175B GPT-3 outputs 85 ± 3% of the time, and preferred 71 ± 4% of the time to few-shot 175B GPT-3.


InstructGPT models are more truthful than GPT-3.

On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers about twice as often as GPT-3.

InstructGPT shows improvements over GPT-3 in toxicity, but not bias.

We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure.

Our models generalize to the preferences of "held-out" labelers that did not produce any training data.

Public NLP datasets are not reflective of how our language models are used.

InstructGPT models show promising generalization to instructions outside of the RLHF finetuning distribution. We qualitatively probe InstructGPT's capabilities, and find that it is able to follow instructions for summarizing code, answer questions about code, and sometimes follows instructions in different languages, despite these instructions being very rare in the fine-tuning distribution.

Summary

Showing an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback.

Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning.

Collecting a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters.

InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets.

Further Discussion

1. The cost of increasing model alignment is modest relative to pretraining.

The results show that RLHF is very effective at making language models more helpful to users, more so than a 100x model size increase. This suggests that right now increasing investments in alignment of existing language models is more cost-effective than training larger models—at least for our customers' natural language task distribution.

2. We've seen some evidence that InstructGPT generalizes 'following instructions' to settings that we don't supervise it in, for example on non-English language tasks and code-related tasks.

3. We were able to mitigate most of the performance degradations introduced by our fine-tuning. If this was not the case, these performance degradations would constitute an alignment tax—an additional cost for aligning the model. Any technique with a high tax might not see adoption. To avoid incentives for future highly capable AI systems to remain unaligned with human intent, there is a need for alignment techniques that have low alignment tax. To this end, our results are good news for RLHF as a low-tax alignment technique.

4. We've validated alignment techniques from research in the real world. Alignment research has historically been rather abstract, focusing on either theoretical results, small synthetic domains, or training ML models on public NLP datasets. Our work provides grounding for alignment research in AI systems that are being used in production in the real world with customers. This enables an important feedback loop on the techniques' effectiveness and limitations.

Limitations

Methodology. The behavior of our InstructGPT models is determined in part by the human feedback obtained from our contractors. Some of the labeling tasks rely on value judgments that may be impacted by the identity of our contractors, their beliefs, cultural backgrounds, and personal history. We hired about 40 contractors, guided by their performance on a screening test meant to judge how well they could identify and respond to sensitive prompts, and their agreement rate with researchers on a labeling task with detailed instructions (see Appendix B). We kept our team of contractors small because this facilitates high-bandwidth communication with a smaller set of contractors who are doing the task full-time. However, this group is clearly not representative of the full spectrum of people who will use and be affected by our deployed models. As a simple example, our labelers are primarily English-speaking and our data consists almost entirely of English instructions.

There are also many ways in which we could improve our data collection set-up. For instance, most comparisons are only labeled by 1 contractor for cost reasons. Having examples labeled multiple times could help identify areas where our contractors disagree, and thus where a single model is unlikely to align to all of them. In cases of disagreement, aligning to the average labeler preference may not be desirable. For example, when generating text that disproportionately affects a minority group, we may want the preferences of labelers belonging to that group to be weighted more heavily.


Models. Our models are neither fully aligned nor fully safe; they still generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting. They can also fail to generate reasonable outputs on some inputs; we show some examples of this in Figure 9. Perhaps the greatest limitation of our models is that, in most cases, they follow the user's instruction, even if that could lead to harm in the real world. For example, when given a prompt instructing the models to be maximally biased, InstructGPT generates more toxic outputs than equivalently-sized GPT-3 models. We discuss potential mitigations in the following sections.
