The reward model is the foundation of reinforcement learning. If pretraining is memorizing the textbook and SFT is memorizing worked problems, then RLHF is studying with a teacher who grades your homework. The reward model is that teacher: it tells the language model (the policy model) which responses are good and which are bad, and the policy model then learns to move toward the good ones and away from the bad ones.
1. Training the reward model
(1) Model type
First, the reward model can be any language model. It does not have to match the policy model; it can even be a discriminative model such as BERT, because all it needs is the text-classification head, i.e. AutoModelForSequenceClassification.
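For example, a reward model can be loaded as a sequence-classification model with a single scalar output. A minimal sketch; the checkpoint name below is just a placeholder, any causal LM or BERT-style encoder works:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=1 gives a single scalar head, which is exactly what a reward model needs
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)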
(2) Data format
Next, let's look at the data format used to train the reward model.
The training data for a reward model consists of sample pairs: two different responses to the same prompt, with labels. The labels may come from user votes on an arena-style platform, or from evaluations by a stronger model such as GPT-4o. Each pair contains a good sample (chosen) and a bad sample (rejected). "Good" and "bad" here are relative: they can come from a ranking, where the higher-ranked response is chosen and the lower-ranked one is rejected, or from scores, where the higher-scoring response is chosen and the lower-scoring one is rejected. Scores are not required, but when available they can be used to compute a margin (see the loss computation below).
Note that both chosen and rejected here include the prompt. In some datasets, chosen and rejected contain only the response and the prompt lives in a separate column; in that case you need to concatenate the prompt onto each response before training, as sketched below.
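A minimal sketch of a preference pair, and of prepending a separate prompt column when needed. The column names prompt / chosen / rejected follow the common convention; your dataset's may differ:

from datasets import Dataset

pairs = Dataset.from_list([
    {
        "prompt": "What is the capital of France?",
        "chosen": " The capital of France is Paris.",
        "rejected": " France is a country in Europe.",
    },
])

def merge_prompt(row):
    # chosen/rejected fed to the trainer should already contain the prompt
    return {
        "chosen": row["prompt"] + row["chosen"],
        "rejected": row["prompt"] + row["rejected"],
    }

pairs = pairs.map(merge_prompt, remove_columns=["prompt"])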
(3) Loss computation
There is not much to say about the rest of RewardTrainer; the main thing to look at is the loss.
https://github.com/huggingface/trl/blob/main/trl/trainer/reward_trainer.py#L264
def compute_loss(
    self,
    model: Union[PreTrainedModel, nn.Module],
    inputs: Dict[str, Union[torch.Tensor, Any]],
    return_outputs=False,
) -> Union[torch.Tensor, Tuple[torch.Tensor, Dict[str, torch.Tensor]]]:
    if not self.use_reward_data_collator:
        warnings.warn(
            "The current compute_loss is implemented for RewardDataCollatorWithPadding,"
            " if you are using a custom data collator make sure you know what you are doing or"
            " implement your own compute_loss method."
        )
    rewards_chosen = model(
        input_ids=inputs["input_ids_chosen"],
        attention_mask=inputs["attention_mask_chosen"],
        return_dict=True,
    )["logits"]
    rewards_rejected = model(
        input_ids=inputs["input_ids_rejected"],
        attention_mask=inputs["attention_mask_rejected"],
        return_dict=True,
    )["logits"]
    # calculate loss, optionally modulate with margin
    if "margin" in inputs:
        loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected - inputs["margin"]).mean()
    else:
        loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()

    if self.args.center_rewards_coefficient is not None:
        loss += self.args.center_rewards_coefficient * torch.mean((rewards_chosen + rewards_rejected) ** 2)

    if return_outputs:
        return loss, {
            "rewards_chosen": rewards_chosen,
            "rewards_rejected": rewards_rejected,
        }
    return loss
The core of the loss is the pairwise ranking objective loss = -log σ(r_θ(x, y_chosen) - r_θ(x, y_rejected) - µ̂(x, y)), which corresponds to the code below:
if "margin" in inputs:
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected - inputs["margin"]).mean()
µ̂(x, y) denotes the margin function, i.e. inputs["margin"] in the code. In practice, many researchers prefer not to add the margin to the loss because the scores are highly subjective. If you do want to add it, you can construct the margin function yourself, as in the example below, where score_chosen and score_rejected are the scores of the sample pair.
def add_margin(row):
    # Assume you have score_chosen and score_rejected columns that you want to use to compute the margin
    return {'margin': row['score_chosen'] - row['score_rejected']}

dataset = dataset.map(add_margin)
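For intuition, a quick numerical check of how the margin tightens the pairwise loss (the reward values here are made up):

import torch
import torch.nn as nn

rewards_chosen = torch.tensor([2.0])
rewards_rejected = torch.tensor([1.0])
# without margin: -log sigmoid(2.0 - 1.0) ≈ 0.313
print(-nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean())
# with a margin of 0.5, the chosen reward must beat the rejected one by more
# to reach the same loss: -log sigmoid(2.0 - 1.0 - 0.5) ≈ 0.474
print(-nn.functional.logsigmoid(rewards_chosen - rewards_rejected - 0.5).mean())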
In addition, in many cases it is preferable for the reward model's outputs to have zero mean. This centering constraint is implemented by penalizing the squared sum of the chosen and rejected rewards, with center_rewards_coefficient as the penalty coefficient.
if self.args.center_rewards_coefficient is not None:
    loss += self.args.center_rewards_coefficient * torch.mean((rewards_chosen + rewards_rejected) ** 2)
2. Issues with how the reward model is used
https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L1069
I noticed that whenever TRL uses the reward model, it feeds token ids directly into the get_reward function.
def get_reward(
    model: torch.nn.Module, query_responses: torch.Tensor, pad_token_id: int, context_length: int
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Computes the reward logits and the rewards for a given model and query responses.

    Args:
        model (`torch.nn.Module`):
            The model used to compute the reward logits.
        query_responses (`torch.Tensor`):
            The tensor containing the query responses.
        pad_token_id (`int`):
            The token ID representing the pad token.
        context_length (`int`):
            The length of the context in the query responses.

    Returns:
        tuple:
            - `reward_logits` (`torch.Tensor`):
                The logits for the reward model.
            - `final_rewards` (`torch.Tensor`):
                The final rewards for each query response.
            - `sequence_lengths` (`torch.Tensor`):
                The lengths of the sequences in the query responses.
    """
    attention_mask = query_responses != pad_token_id
    position_ids = attention_mask.cumsum(1) - attention_mask.long()  # exclusive cumsum
    lm_backbone = getattr(model, model.base_model_prefix)
    input_ids = torch.masked_fill(query_responses, ~attention_mask, 0)
    output = lm_backbone(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        return_dict=True,
        output_hidden_states=True,
        use_cache=False,  # otherwise mistral-based RM would error out
    )
    reward_logits = model.score(output.hidden_states[-1])
    sequence_lengths = first_true_indices(query_responses[:, context_length:] == pad_token_id) - 1 + context_length
    # https://github.com/huggingface/transformers/blob/dc68a39c8111217683bf49a4912d0c9018bab33d/src/transformers/models/gpt2/modeling_gpt2.py#L1454
    return (
        reward_logits,
        reward_logits[
            torch.arange(reward_logits.size(0), device=reward_logits.device),
            sequence_lengths,
        ].squeeze(-1),
        sequence_lengths,
    )
This assumes that the reward model and the policy model share the same tokenizer, which clearly limits the range of reward models a user can choose from (this has also been raised in an issue). The fix is simple: before scoring, decode the token ids back into text with the policy model's tokenizer, strip the special tokens, then reassemble and tokenize the text again with the reward model's tokenizer.
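A hedged sketch of that workaround; the names policy_tokenizer, rm_tokenizer and get_reward_cross_tokenizer are assumptions for illustration, not part of TRL, and reward_model is a scalar-head AutoModelForSequenceClassification:

import torch

def get_reward_cross_tokenizer(reward_model, rm_tokenizer, policy_tokenizer, query_responses):
    # 1. decode the policy model's token ids back into text, dropping special tokens
    texts = policy_tokenizer.batch_decode(query_responses, skip_special_tokens=True)
    # 2. re-tokenize with the reward model's own tokenizer
    inputs = rm_tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(reward_model.device)
    # 3. score with the reward model
    with torch.no_grad():
        rewards = reward_model(**inputs).logits.squeeze(-1)
    return rewards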