Reward Functions

This module contains some useful reward functions, primarily intended for use with the GRPOTrainer.

Format rewards

think_format_reward

trl.rewards.think_format_reward

( completions: list, **kwargs ) → list[float]

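Parameters

completions (list[list[dict[str, str]]]): List of completions to evaluate. Each completion must be a list containing a single message, that is, a dictionary with the key "content" whose value is the completion text.
**kwargs: Additional keyword arguments. This function does not use them, but they are required in the function signature to ensure compatibility with trainers such as GRPOTrainer.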

Returns

list[float]

A list of rewards, one per completion: 1.0 if the completion matches the expected format, 0.0 otherwise.

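This reward function checks whether the reasoning process is enclosed within "<think>" and "</think>" tags. It returns a reward of 1.0 if the format is correct and 0.0 otherwise.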
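To make the check concrete, here is a minimal sketch of how a tag-based format reward could be written with a regular expression. It is not TRL's implementation: the name `sketch_think_format_reward` and the pattern are hypothetical, and the sketch assumes the completion must begin with "<think>" and contain a later "</think>". The real function may apply stricter or looser rules.

```python
import re

# Assumed rule (not taken from TRL): the text starts with "<think>" and a
# closing "</think>" appears somewhere after it. DOTALL lets "." span newlines.
_THINK_PATTERN = re.compile(r"^<think>.*?</think>", re.DOTALL)


def sketch_think_format_reward(completions, **kwargs):
    """Hypothetical stand-in for a think-format reward; ignores extra kwargs."""
    rewards = []
    for completion in completions:
        # Each completion is a one-message list; the text lives under "content".
        text = completion[0]["content"]
        rewards.append(1.0 if _THINK_PATTERN.match(text) else 0.0)
    return rewards
```

Because the pattern is anchored at the start of the text, a completion that opens the tag but never closes it receives 0.0 under this sketch.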

Example

>>> from trl.rewards import think_format_reward

>>> completions = [
...     [{"content": "<think>\nThis is my reasoning.\n</think>\nThis is my answer."}],
...     [{"content": "<think>\nThis is my reasoning.\nThis is my answer."}],
... ]
>>> think_format_reward(completions)
[1.0, 0.0]
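The **kwargs parameter matters when a trainer calls the reward function with more keyword arguments than the function needs. The sketch below illustrates the idea with a hypothetical reward function, `non_empty_reward`, that follows the same signature. The extra `prompts` argument in the call is an assumption made for illustration, not a statement of what GRPOTrainer passes.

```python
def non_empty_reward(completions, **kwargs):
    """Hypothetical reward: 1.0 for a non-blank completion, 0.0 otherwise."""
    # Extra keyword arguments are accepted and deliberately ignored.
    return [
        1.0 if completion[0]["content"].strip() else 0.0
        for completion in completions
    ]


# Simulate a caller that supplies an additional keyword argument.
rewards = non_empty_reward(
    completions=[[{"content": "hello"}], [{"content": "   "}]],
    prompts=["first prompt", "second prompt"],  # illustrative extra argument
)
```

Without **kwargs in the signature, the call above would raise a TypeError for the unexpected `prompts` argument.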
