Judges

TRL Judges is an experimental API and may change at any time.

TRL provides judges to easily compare two completions.

Make sure the required dependencies are installed by running:

pip install trl[judges]

Using the provided judges

TRL provides several judges out of the box. For example, you can use HfPairwiseJudge to compare two completions using a pre-trained model from the Hugging Face model hub:

from trl import HfPairwiseJudge

judge = HfPairwiseJudge()
judge.judge(
    prompts=["What is the capital of France?", "What is the biggest planet in the solar system?"],
    completions=[["Paris", "Lyon"], ["Saturn", "Jupiter"]],
)  # Outputs: [0, 1]

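Define your own judge

To define your own judge, we provide several base classes that you can subclass. For rank-based judges, subclass BaseRankJudge and implement the BaseRankJudge.judge() method. For pairwise judges, subclass BasePairwiseJudge and implement the BasePairwiseJudge.judge() method. To define a judge that fits neither category, subclass BaseJudge and implement the BaseJudge.judge() method.

As an example, let's define a pairwise judge that prefers shorter completions: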

from trl import BasePairwiseJudge

class PrefersShorterJudge(BasePairwiseJudge):
    def judge(self, prompts, completions, shuffle_order=False):
        # Return 0 when the first completion is shorter, 1 otherwise.
        return [0 if len(completion[0]) < len(completion[1]) else 1 for completion in completions]

You can then use this judge as follows:

judge = PrefersShorterJudge()
judge.judge(
    prompts=["What is the capital of France?", "What is the biggest planet in the solar system?"],
    completions=[["Paris", "The capital of France is Paris."], ["Jupiter is the biggest planet in the solar system.", "Jupiter"]],
)  # Outputs: [0, 1]
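PrefersShorterJudge accepts shuffle_order but never uses it, which is harmless for a length rule. For a judge whose verdict can depend on presentation order, the following is a minimal sketch of how shuffling could work, assuming each pair is flipped at random and the verdict is mapped back to the original indices. The names judge_with_shuffle and pick_first are hypothetical and not part of TRL.

```python
import random


def judge_with_shuffle(completions, pick_first, shuffle_order=True, seed=None):
    """Judge pairs, optionally flipping each pair at random to reduce position bias.

    `pick_first(a, b)` is a hypothetical callback that returns True when the
    completion shown first (a) is preferred over the one shown second (b).
    """
    rng = random.Random(seed)
    results = []
    for first, second in completions:
        # Decide per pair whether to present the completions in swapped order.
        flipped = shuffle_order and rng.random() < 0.5
        shown = (second, first) if flipped else (first, second)
        winner_shown = 0 if pick_first(*shown) else 1
        # Map the winner back to its index in the original, unshuffled pair.
        results.append(1 - winner_shown if flipped else winner_shown)
    return results


def prefers_shorter(a, b):
    return len(a) < len(b)


pairs = [
    ["Paris", "The capital of France is Paris."],
    ["Jupiter is the biggest planet in the solar system.", "Jupiter"],
]
print(judge_with_shuffle(pairs, prefers_shorter, shuffle_order=True, seed=0))  # [0, 1]
```

Because the length rule does not depend on order, the result is the same whether or not a pair is flipped; the shuffle only matters for judges, such as LLM-based ones, that may favor a position.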

Provided judges

PairRMJudge

class trl.PairRMJudge

( )

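LLM judge based on the PairRM model from AllenAI.

This judge uses the PairRM model to rank pairs of completions for given prompts. It is designed for pairwise comparison of language model outputs. The PairRM model is loaded with the llm-blender library and runs on the default Accelerator device.

Attributes:

blender (llm_blender.Blender): An instance of the Blender class from llm-blender.

Example: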

>>> from trl import PairRMJudge
>>> pairrm_judge = PairRMJudge()
>>> prompts = ["Translate 'hello' to French", "What's the capital of Japan?"]
>>> completions = [["Bonjour", "Salut"], ["Kyoto", "Tokyo"]]
>>> results = pairrm_judge.judge(prompts, completions)
>>> print(results)  # [0, 1] (the first completion is preferred for the first prompt, the second completion for the second prompt)

This class requires the llm-blender library to be installed. Install it with: pip install llm-blender.

judge

( prompts: list completions: list shuffle_order: bool = True return_scores: bool = False temperature: float = 1.0 ) → Union[list[int, float]]

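Parameters

prompts (list[str]) — List of prompts to judge.
completions (list[list[str]]) — List of completion pairs, one pair per prompt.
shuffle_order (bool, optional, defaults to True) — Whether to shuffle the order of the completions to avoid positional bias.
return_scores (bool, optional, defaults to False) — If True, return the probability scores of the first completion instead of ranks (i.e. a *soft-judge*).
temperature (float, optional, defaults to 1.0) — Temperature for scaling the logits when return_scores is True.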

Union[list[int, float]]

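If `return_scores` is `False`, returns a list of ranks (`0` or `1`), one per prompt, indicating which completion is preferred. If `return_scores` is `True`, returns the softmax probability of the first completion for each prompt.

Raises

ValueError

ValueError — If the number of completions per prompt is not exactly 2.

Judge the completion pairs for the given prompts using the PairRM model.

Note: Unlike llm-blender, ranks are 0-indexed (`0` means the first completion is preferred).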
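To make the return_scores and temperature parameters concrete, here is a minimal sketch of a temperature-scaled softmax over two raw scores. It illustrates the general technique only: the function names and example scores are hypothetical, and the actual PairRMJudge computation may differ in its details.

```python
import math


def first_completion_probability(score_first, score_second, temperature=1.0):
    """Softmax probability of the first completion given two raw scores."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Scale the scores, then subtract the max for numerical stability.
    scaled = [score_first / temperature, score_second / temperature]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    return exps[0] / sum(exps)


def to_rank(probability_first):
    """Hard decision: 0 if the first completion is preferred, else 1."""
    return 0 if probability_first >= 0.5 else 1


sharp = first_completion_probability(2.0, 0.0, temperature=1.0)
soft = first_completion_probability(2.0, 0.0, temperature=4.0)
print(round(sharp, 3), round(soft, 3))  # 0.881 0.622
print(to_rank(sharp))  # 0
```

A higher temperature pulls the probability toward 0.5 without changing which completion is preferred.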

HfPairwiseJudge

class trl.HfPairwiseJudge

( model = 'meta-llama/Meta-Llama-3-70B-Instruct' token: typing.Optional[str] = None system_prompt: typing.Optional[str] = None )

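Parameters

model (str, optional, defaults to "meta-llama/Meta-Llama-3-70B-Instruct") — Model to use for the judge.
token (str, optional) — Hugging Face API token to use for the huggingface_hub.InferenceClient.
system_prompt (str or None, optional, defaults to None) — The system prompt to use for the judge. If not provided, a default prompt is used. Note that the system prompt should contain the following placeholders: {prompt}, {response0}, and {response1}. Also, inference is called with max_tokens=1, so the system prompt should ask for a single-token response.

Pairwise judge based on the Hugging Face API with chat completion.

This judge is relevant for assessing the quality of chat models, where the completion is a response to a given prompt.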
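As an illustration of the system_prompt requirements, the sketch below builds a custom prompt with the three placeholders and checks that none is missing. The prompt wording and the check_placeholders helper are hypothetical, and filling the template with str.format is an assumption about how the placeholders are substituted.

```python
# Hypothetical custom prompt: it keeps the three required placeholders and
# asks for a one-token answer, since inference uses max_tokens=1.
CUSTOM_SYSTEM_PROMPT = (
    "You compare two answers to a user prompt.\n"
    "Prompt: {prompt}\n"
    "Answer 0: {response0}\n"
    "Answer 1: {response1}\n"
    "Reply with a single character, 0 or 1, naming the better answer."
)

REQUIRED_PLACEHOLDERS = ("{prompt}", "{response0}", "{response1}")


def check_placeholders(template):
    """Return the required placeholders that are missing from the template."""
    return [name for name in REQUIRED_PLACEHOLDERS if name not in template]


missing = check_placeholders(CUSTOM_SYSTEM_PROMPT)
if missing:
    raise ValueError(f"missing placeholders: {missing}")

# Assumption: the placeholders are filled in the style of str.format.
filled = CUSTOM_SYSTEM_PROMPT.format(
    prompt="What is the capital of France?",
    response0="Paris",
    response1="Lyon",
)
print(filled)
```

A prompt validated this way could then be passed as HfPairwiseJudge(system_prompt=CUSTOM_SYSTEM_PROMPT).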

OpenAIPairwiseJudge

class trl.OpenAIPairwiseJudge

( model = 'gpt-4-turbo-preview' system_prompt: typing.Optional[str] = None max_requests: typing.Optional[int] = 1000 )

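Parameters

model (str, optional, defaults to "gpt-4-turbo-preview") — Model to use for the judge.
system_prompt (str or None, optional, defaults to None) — The system prompt to use for the judge. If not provided, a default prompt is used. Note that the system prompt should contain the following placeholders: {prompt}, {response0}, and {response1}. Also, inference is called with max_tokens=1, so the system prompt should ask for a single-token response.
max_requests (int or None, optional, defaults to 1000) — Maximum number of requests to make to the OpenAI API. If set to None, there is no limit.

Judge based on the OpenAI API.

This judge is relevant for assessing the quality of chat models, where the completion is a response to a given prompt.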

AllTrueJudge

class trl.AllTrueJudge

( judges: list )

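Parameters

judges (list[BaseBinaryJudge]) — A list of BaseBinaryJudge instances whose decisions will be unified.

Unify the decisions of multiple BaseBinaryJudge instances.

Returns `1` only if all inner binary judges return `1`. If any judge returns `0`, it returns `0`. If any judge returns `-1`, indicating a failure in its process, this judge also returns `-1`.

Implements the Mixture of Judges as described in the CGPO paper.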
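The combination rule can be sketched in plain Python. This is an illustration of the rule as described here, not the library's code: the all_true helper is hypothetical, and giving `-1` precedence when one judge returns `0` and another returns `-1` is an assumption, since the description does not settle that case.

```python
def all_true(per_judge_labels):
    """Combine binary labels from several judges for each completion.

    `per_judge_labels` holds one list of labels per judge, all of equal length.
    """
    combined = []
    # zip(*...) regroups the labels so each tuple covers one completion.
    for labels in zip(*per_judge_labels):
        if any(label not in (0, 1, -1) for label in labels):
            raise ValueError(f"unexpected label in {labels}")
        if -1 in labels:
            combined.append(-1)  # assumption: a failure outranks a 0
        elif all(label == 1 for label in labels):
            combined.append(1)
        else:
            combined.append(0)
    return combined


# Two judges, four completions each.
print(all_true([[1, 1, 1, 0], [1, 0, -1, -1]]))  # [1, 0, -1, -1]
```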

Base classes

BaseJudge

class trl.BaseJudge

( )

Base class for judges. Subclasses of this class should implement the judge method.

BaseBinaryJudge

class trl.BaseBinaryJudge

( )

Base class for binary judges.

judge

( prompts: list completions: list gold_completions: typing.Optional[list[str]] = None shuffle_order: bool = True ) → list[int]

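Parameters

prompts (list[str]) — List of prompts.
completions (list[str]) — List of completions.
gold_completions (list[str], optional) — List of gold completions, if they exist.
shuffle_order (bool, optional, defaults to True) — Whether to shuffle the order of the completions to avoid positional bias.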

list[int]

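A list of binary labels:

1 indicates that the completion satisfies the evaluated constraint.
0 indicates that the completion does not satisfy the evaluated constraint.

Judge the completions for the given prompts. Used to assess whether a completion satisfies a constraint.

This base class should be used to implement binary evaluations as described in section 4.1.4 of the CGPO paper. It is relevant for assessing whether a prompt-completion pair satisfies a specific constraint.

Note: If the judge returns -1 for any prompt, the inner process used to compute the preference has failed. This can happen, for instance, if the underlying language model or rule-based constraint returns an invalid answer. In such cases, the caller should handle these invalid indices appropriately, possibly by implementing fallback logic or error handling.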
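A rule-based binary judge and one possible way to handle `-1` results might look like the sketch below. MaxWordsJudge and its word-count constraint are hypothetical examples. The snippet falls back to a bare stand-in class when BaseBinaryJudge cannot be imported from trl, so it runs without TRL installed, and the real base class may differ from that stand-in.

```python
try:
    from trl import BaseBinaryJudge
except ImportError:
    # Stand-in so the sketch runs without TRL; the real base class may differ.
    class BaseBinaryJudge:
        pass


class MaxWordsJudge(BaseBinaryJudge):
    """Hypothetical constraint: a completion may have at most `max_words` words."""

    def __init__(self, max_words=5):
        self.max_words = max_words

    def judge(self, prompts, completions, gold_completions=None, shuffle_order=True):
        labels = []
        for completion in completions:
            if not isinstance(completion, str):
                labels.append(-1)  # invalid input: signal a failed evaluation
            else:
                within_limit = len(completion.split()) <= self.max_words
                labels.append(1 if within_limit else 0)
        return labels


judge = MaxWordsJudge(max_words=5)
labels = judge.judge(
    prompts=["Capital of France?", "Largest planet?", "Smallest prime?"],
    completions=["Paris", "Jupiter is by far the largest planet of them all.", None],
)
# One possible fallback: treat failures (-1) as unmet constraints.
resolved = [0 if label == -1 else label for label in labels]
print(labels, resolved)  # [1, 0, -1] [1, 0, 0]
```

Treating `-1` as `0` is only one policy; a caller could instead retry the evaluation or drop the affected samples.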

BaseRankJudge

class trl.BaseRankJudge

( )

Base class for LLM ranking judges.

Example:

from trl import BaseRankJudge


class MyRankJudge(BaseRankJudge):
    def judge(self, prompts, completions, shuffle_order=True):
        return ...  # Your ranking logic here


judge = MyRankJudge()
judge.judge(
    prompts=["The capital of France is", "The capital of Germany is"],
    completions=[[" Paris", " Marseille", " Lyon"], [" Munich", " Berlin"]],
)  # [[0, 1, 2], [1, 0]]
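The example above leaves the ranking logic open. As one concrete and deliberately simple possibility, the sketch below ranks completions from shortest to longest. It assumes each inner list orders completion indices from best to worst. ShortestFirstRankJudge is a hypothetical name, and the snippet falls back to a bare stand-in class when BaseRankJudge cannot be imported from trl.

```python
try:
    from trl import BaseRankJudge
except ImportError:
    # Stand-in so the sketch runs without TRL; the real base class may differ.
    class BaseRankJudge:
        pass


class ShortestFirstRankJudge(BaseRankJudge):
    """Hypothetical judge that ranks shorter completions ahead of longer ones."""

    def judge(self, prompts, completions, shuffle_order=True):
        rankings = []
        for candidates in completions:
            # sorted() is stable, so completions of equal length keep their order.
            order = sorted(range(len(candidates)), key=lambda i: len(candidates[i]))
            rankings.append(order)
        return rankings


judge = ShortestFirstRankJudge()
rankings = judge.judge(
    prompts=["The capital of France is", "The capital of Germany is"],
    completions=[[" Paris", " Marseille", " Lyon"], [" Munich", " Berlin"]],
)
print(rankings)  # [[2, 0, 1], [0, 1]]
```

Length is a stand-in for a real quality signal here; an LLM-based rank judge would replace the sort key with model scores.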