The reward model is the foundation of reinforcement learning. If pretraining is memorizing the textbook and SFT is memorizing worked problems, then RLHF is studying with a teacher who grades your homework. The reward model is that teacher: it tells the language model (the policy model) which responses are good and which are bad, and the policy model then learns to move toward the good ones and away from the bad ones.
1. Training the reward model
(1) Model type
First, the reward model can be any language model. It does not have to match the policy model; it can even be a discriminative model such as BERT, because all it needs is the text-classification head, i.e. AutoModelForSequenceClassification.
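For example, a reward model can be loaded as a sequence-classification model with a single scalar output. A minimal sketch; the checkpoint name below is just a placeholder, any causal LM or BERT-style encoder works:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=1 gives a single scalar head, which is exactly what a reward model needs
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)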
(2) Data format
Next, let's look at the data format used to train the reward model.
The training data for a reward model consists of sample pairs: two different responses to the same prompt, with labels. The labels may come from user votes on an arena-style platform, or from evaluations by a stronger model such as GPT-4o. Each pair contains a good sample (chosen) and a bad sample (rejected). "Good" and "bad" here are relative: they can come from a ranking, where the higher-ranked response is chosen and the lower-ranked one is rejected, or from scores, where the higher-scoring response is chosen and the lower-scoring one is rejected. Scores are not required, but when available they can be used to compute a margin (see the loss computation below).
Note that both chosen and rejected here include the prompt. In some datasets, chosen and rejected contain only the response and the prompt lives in a separate column; in that case you need to concatenate the prompt onto each response before training, as sketched below.
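A minimal sketch of a preference pair, and of prepending a separate prompt column when needed. The column names prompt / chosen / rejected follow the common convention; your dataset's may differ:

from datasets import Dataset

pairs = Dataset.from_list([
    {
        "prompt": "What is the capital of France?",
        "chosen": " The capital of France is Paris.",
        "rejected": " France is a country in Europe.",
    },
])

def merge_prompt(row):
    # chosen/rejected fed to the trainer should already contain the prompt
    return {
        "chosen": row["prompt"] + row["chosen"],
        "rejected": row["prompt"] + row["rejected"],
    }

pairs = pairs.map(merge_prompt, remove_columns=["prompt"])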
(3) Loss computation
There is not much to say about the rest of RewardTrainer; the main thing to look at is the loss.
https://github.com/huggingface/trl/blob/main/trl/trainer/reward_trainer.py#L264
def compute_loss(
    self,
    model: Union[PreTrainedModel, nn.Module],
    inputs: Dict[str, Union[torch.Tensor, Any]],
    return_outputs=False,
) -> Union[torch.Tensor, Tuple[torch.Tensor, Dict[str, torch.Tensor]]]:
    if not self.use_reward_data_collator:
        warnings.warn(
            "The current compute_loss is implemented for RewardDataCollatorWithPadding,"
            " if you are using a custom data collator make sure you know what you are doing or"
            " implement your own compute_loss method."
        )
    rewards_chosen = model(
        input_ids=inputs["input_ids_chosen"],
        attention_mask=inputs["attention_mask_chosen"],
        return_dict=True,
    )["logits"]
    rewards_rejected = model(
        input_ids=inputs["input_ids_rejected"],
        attention_mask=inputs["attention_mask_rejected"],
        return_dict=True,
    )["logits"]
    # calculate loss, optionally modulate with margin
    if "margin" in inputs:
        loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected - inputs["margin"]).mean()
    else:
        loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()

    if self.args.center_rewards_coefficient is not None:
        loss += self.args.center_rewards_coefficient * torch.mean((rewards_chosen + rewards_rejected) ** 2)

    if return_outputs:
        return loss, {
            "rewards_chosen": rewards_chosen,
            "rewards_rejected": rewards_rejected,
        }
    return loss
The core of the loss is the pairwise ranking objective loss = -log σ(r_θ(x, y_chosen) - r_θ(x, y_rejected) - µ̂(x, y)), which corresponds to the code below:
if "margin" in inputs:
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected - inputs["margin"]).mean()
µ̂(x, y) denotes the margin function, i.e. inputs["margin"] in the code. In practice, many researchers prefer not to add the margin to the loss because the scores are highly subjective. If you do want to add it, you can construct the margin function yourself, as in the example below, where score_chosen and score_rejected are the scores of the sample pair.
def add_margin(row):
    # Assume you have score_chosen and score_rejected columns that you want to use to compute the margin
    return {'margin': row['score_chosen'] - row['score_rejected']}

dataset = dataset.map(add_margin)
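For intuition, a quick numerical check of how the margin tightens the pairwise loss (the reward values here are made up):

import torch
import torch.nn as nn

rewards_chosen = torch.tensor([2.0])
rewards_rejected = torch.tensor([1.0])
# without margin: -log sigmoid(2.0 - 1.0) ≈ 0.313
print(-nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean())
# with a margin of 0.5, the chosen reward must beat the rejected one by more
# to reach the same loss: -log sigmoid(2.0 - 1.0 - 0.5) ≈ 0.474
print(-nn.functional.logsigmoid(rewards_chosen - rewards_rejected - 0.5).mean())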
In addition, in many cases it is preferable for the reward model's outputs to have zero mean. This centering constraint is implemented by penalizing the squared sum of the chosen and rejected rewards, with center_rewards_coefficient as the penalty coefficient.
if self.args.center_rewards_coefficient is not None:
    loss += self.args.center_rewards_coefficient * torch.mean((rewards_chosen + rewards_rejected) ** 2)
2. Issues with how the reward model is used
https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L1069
I noticed that whenever TRL uses the reward model, it feeds token ids directly into the get_reward function.
def get_reward(
    model: torch.nn.Module, query_responses: torch.Tensor, pad_token_id: int, context_length: int
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Computes the reward logits and the rewards for a given model and query responses.

    Args:
        model (`torch.nn.Module`):
            The model used to compute the reward logits.
        query_responses (`torch.Tensor`):
            The tensor containing the query responses.
        pad_token_id (`int`):
            The token ID representing the pad token.
        context_length (`int`):
            The length of the context in the query responses.

    Returns:
        tuple:
            - `reward_logits` (`torch.Tensor`):
                The logits for the reward model.
            - `final_rewards` (`torch.Tensor`):
                The final rewards for each query response.
            - `sequence_lengths` (`torch.Tensor`):
                The lengths of the sequences in the query responses.
    """
    attention_mask = query_responses != pad_token_id
    position_ids = attention_mask.cumsum(1) - attention_mask.long()  # exclusive cumsum
    lm_backbone = getattr(model, model.base_model_prefix)
    input_ids = torch.masked_fill(query_responses, ~attention_mask, 0)
    output = lm_backbone(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        return_dict=True,
        output_hidden_states=True,
        use_cache=False,  # otherwise mistral-based RM would error out
    )
    reward_logits = model.score(output.hidden_states[-1])
    sequence_lengths = first_true_indices(query_responses[:, context_length:] == pad_token_id) - 1 + context_length
    # https://github.com/huggingface/transformers/blob/dc68a39c8111217683bf49a4912d0c9018bab33d/src/transformers/models/gpt2/modeling_gpt2.py#L1454
    return (
        reward_logits,
        reward_logits[
            torch.arange(reward_logits.size(0), device=reward_logits.device),
            sequence_lengths,
        ].squeeze(-1),
        sequence_lengths,
    )
This assumes that the reward model and the policy model share the same tokenizer, which clearly limits the range of reward models a user can choose from (this has also been raised in an issue). The fix is simple: before scoring, decode the token ids back into text with the policy model's tokenizer, strip the special tokens, then reassemble and tokenize the text again with the reward model's tokenizer.
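A hedged sketch of that workaround; the names policy_tokenizer, rm_tokenizer and get_reward_cross_tokenizer are assumptions for illustration, not part of TRL, and reward_model is a scalar-head AutoModelForSequenceClassification:

import torch

def get_reward_cross_tokenizer(reward_model, rm_tokenizer, policy_tokenizer, query_responses):
    # 1. decode the policy model's token ids back into text, dropping special tokens
    texts = policy_tokenizer.batch_decode(query_responses, skip_special_tokens=True)
    # 2. re-tokenize with the reward model's own tokenizer
    inputs = rm_tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(reward_model.device)
    # 3. score with the reward model
    with torch.no_grad():
        rewards = reward_model(**inputs).logits.squeeze(-1)
    return rewards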