Post-Training an LLM for Reasoning with GRPO in TRL

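Authored by: Sergio Paniego

In this notebook, we'll guide you through post-training a large language model (LLM) using Group Relative Policy Optimization (GRPO), a method introduced in the DeepSeekMath paper. GRPO is particularly effective for scaling test-time compute for extended reasoning, which makes it a strong fit for complex tasks such as mathematical problem-solving.

GRPO is a reinforcement learning (RL) post-training technique that was integrated into DeepSeek-R1's training pipeline. It appears to share similarities with the training procedures used in the latest OpenAI o1 and o3 models, though the exact alignment has not been confirmed. Unlike earlier techniques that relied on search-heuristic methods, GRPO relies exclusively on RL for post-training, enhancing the model's capacity to handle complex and nuanced tasks.

The GRPO technique is available through the TRL library. At the time of writing, the Hugging Face Science team is working to reproduce the full DeepSeek-R1 training process, which you can explore in their Open-R1 project. I highly recommend checking it out for a deeper look at the overall process.

In this notebook, we'll focus specifically on post-training with GRPO, though the final section provides additional resources on DeepSeek-R1 and its training procedure.

Below is a diagram illustrating how this training procedure works.

1. Install Dependencies

Let's start by installing the libraries we'll need for fine-tuning! 🚀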

!pip install  -U -q trl peft math_verify
# Tested with transformers==4.47.1, trl==0.14.0, datasets==3.2.0, peft==0.14.0, accelerate==1.2.1, math_verify==0.3.3

Authenticate with your Hugging Face account to save and share your model directly from this notebook 🗝️.

from huggingface_hub import notebook_login

notebook_login()

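2. Load Dataset 📁

These models excel at tasks that require complex reasoning. A prime example is mathematical problem-solving, which often requires multi-step reasoning to arrive at a correct solution.

For this project, we'll use the AI-MO/NuminaMath-TIR dataset. This is a reasoning-focused dataset containing math problems, their solutions, and detailed reasoning steps that explain how to get from the problem statement to the final solution.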

from datasets import load_dataset

dataset_id = "AI-MO/NuminaMath-TIR"
train_dataset, test_dataset = load_dataset(dataset_id, split=["train[:5%]", "test[:5%]"])

Let's check the structure of the dataset:

>>> print(train_dataset)

Dataset({
    features: ['problem', 'solution', 'messages'],
    num_rows: 3622
})

Let's check one sample:

>>> print(train_dataset[0])

{'problem': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$?  Express your answer as a common fraction.', 'solution': "To determine the coefficient of $x^2y^6$  in the expansion of $\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8$ , we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case, $a = \\frac{3}{5}x$ , $b = -\\frac{y}{2}$ , and $n = 8$ .\n\nWe are interested in the term that contains $x^2y^6$ . In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get $x^2$ , we need $8 - k = 2$ , thus $k = 6$ .\n\nSubstituting $k = 6$  into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{6} \\left(\\frac{3}{5}x\\right)^2 \\left(-\\frac{y}{2}\\right)^6\n\\]\n\nNow, we will compute each part of this expression.\n\n1. Calculate the binomial coefficient $\\binom{8}{6}$ .\n2. Compute $\\left(\\frac{3}{5}\\right)^2$ .\n3. Compute $\\left(-\\frac{y}{2}\\right)^6$ .\n4. Combine everything together to get the coefficient of $x^2y^6$ .\n\nLet's compute these in Python.\n```python\nfrom math import comb\n\n# Given values\nn = 8\nk = 6\n\n# Calculate the binomial coefficient\nbinom_coeff = comb(n, k)\n\n# Compute (3/5)^2\na_term = (3/5)**2\n\n# Compute (-1/2)^6\nb_term = (-1/2)**6\n\n# Combine terms to get the coefficient of x^2y^6\ncoefficient = binom_coeff * a_term * b_term\nprint(coefficient)\n```\n```output\n0.1575\n```\nThe coefficient of $x^2y^6$  in the expansion of $\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8$  is $0.1575$ . To express this as a common fraction, we recognize that:\n\n\\[ 0.1575 = \\frac{1575}{10000} = \\frac{63}{400} \\]\n\nThus, the coefficient can be expressed as:\n\n\\[\n\\boxed{\\frac{63}{400}}\n\\]", 'messages': [{'content': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$?  Express your answer as a common fraction.', 'role': 'user'}, {'content': "To determine the coefficient of $x^2y^6$  in the expansion of $\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8$ , we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case, $a = \\frac{3}{5}x$ , $b = -\\frac{y}{2}$ , and $n = 8$ .\n\nWe are interested in the term that contains $x^2y^6$ . In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get $x^2$ , we need $8 - k = 2$ , thus $k = 6$ .\n\nSubstituting $k = 6$  into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{6} \\left(\\frac{3}{5}x\\right)^2 \\left(-\\frac{y}{2}\\right)^6\n\\]\n\nNow, we will compute each part of this expression.\n\n1. Calculate the binomial coefficient $\\binom{8}{6}$ .\n2. Compute $\\left(\\frac{3}{5}\\right)^2$ .\n3. Compute $\\left(-\\frac{y}{2}\\right)^6$ .\n4. Combine everything together to get the coefficient of $x^2y^6$ .\n\nLet's compute these in Python.\n```python\nfrom math import comb\n\n# Given values\nn = 8\nk = 6\n\n# Calculate the binomial coefficient\nbinom_coeff = comb(n, k)\n\n# Compute (3/5)^2\na_term = (3/5)**2\n\n# Compute (-1/2)^6\nb_term = (-1/2)**6\n\n# Combine terms to get the coefficient of x^2y^6\ncoefficient = binom_coeff * a_term * b_term\nprint(coefficient)\n```\n```output\n0.1575\n```\nThe coefficient of $x^2y^6$  in the expansion of $\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8$  is $0.1575$ . To express this as a common fraction, we recognize that:\n\n\\[ 0.1575 = \\frac{1575}{10000} = \\frac{63}{400} \\]\n\nThus, the coefficient can be expressed as:\n\n\\[\n\\boxed{\\frac{63}{400}}\n\\]", 'role': 'assistant'}]}

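In the DeepSeek-R1 training procedure, a specific system prompt was used to generate a conversational pipeline that includes reasoning steps. We'll adapt our dataset to follow this approach, where the model is guided to first think through the problem and then present its answer.

The system prompt used is: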

A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user
with the answer. The reasoning process and answer are enclosed within <think> </think> and
<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: prompt. Assistant:

We'll modify our dataset to follow this conversational format, prompting the LLM to generate both the reasoning steps and the final answer.

SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)


def make_conversation(example):
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }


train_dataset = train_dataset.map(make_conversation)
test_dataset = test_dataset.map(make_conversation)

Let's take a look at an example:

>>> print(train_dataset[0]["prompt"])

[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>', 'role': 'system'}, {'content': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$?  Express your answer as a common fraction.', 'role': 'user'}]

We'll remove the messages and problem columns, as we only need the custom prompt column and solution to verify the generated answer.

>>> train_dataset = train_dataset.remove_columns(["messages", "problem"])
>>> print(train_dataset)

Dataset({
    features: ['solution', 'prompt'],
    num_rows: 3622
})

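3. Post-Training the Base Model Using GRPO

The diagram below highlights the main differences between PPO (Proximal Policy Optimization) and GRPO (Group Relative Policy Optimization), specifically the removal of the value model in GRPO. For more details on the key differences, you can refer to the full explanation here.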
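To make the removal of the value model concrete, here is a minimal sketch of a group-relative advantage: each completion's reward is normalized against the mean and standard deviation of the rewards sampled for the same prompt, so no learned baseline is needed. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are illustrative assumptions, not TRL's API.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; real implementations may differ (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, with made-up rewards
advantages = group_relative_advantages([2.0, 0.0, 1.0, 1.0])
print(advantages)
```

Completions that score above the group mean get a positive advantage and those below get a negative one. When every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.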

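3.1 Loading the Baseline Model

To begin, we'll load Qwen/Qwen2-0.5B-Instruct as the baseline model (the Policy Model in the diagram above). With only 0.5 billion parameters, it is lightweight and fits within the available resources. For better results, however, a larger alternative should be considered.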

import torch
from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

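3.2 Configuring LoRA

Next, we'll configure LoRA for model training. This technique lets us fine-tune the model efficiently with a reduced number of trainable parameters, enabling faster and more resource-efficient training.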

>>> from peft import LoraConfig, get_peft_model

>>> lora_config = LoraConfig(
...     task_type="CAUSAL_LM",
...     r=8,
...     lora_alpha=32,
...     lora_dropout=0.1,
...     target_modules=["q_proj", "v_proj"],
... )

>>> model = get_peft_model(model, lora_config)

>>> model.print_trainable_parameters()

trainable params: 540,672 || all params: 494,573,440 || trainable%: 0.1093
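As a sanity check on the numbers printed above, the trainable parameter count can be worked out by hand. The sketch below assumes the Qwen2-0.5B dimensions listed in its first lines (hidden size 896, 24 layers, 2 key/value heads of dimension 64); these are assumptions about the model config, not values stated in this notebook. LoRA adds two low-rank matrices per adapted linear layer, so each layer contributes `r * (in_features + out_features)` parameters.

```python
# Assumed Qwen2-0.5B dimensions (assumptions; check the model config to confirm)
hidden_size = 896
num_layers = 24
num_kv_heads = 2
head_dim = 64
rank = 8  # r in LoraConfig


def lora_params(in_features, out_features, r):
    # LoRA adds A (r x in_features) and B (out_features x r) per adapted linear layer
    return r * (in_features + out_features)


q_proj = lora_params(hidden_size, hidden_size, rank)
v_proj = lora_params(hidden_size, num_kv_heads * head_dim, rank)
trainable = num_layers * (q_proj + v_proj)
all_params = 494_573_440  # "all params" value printed above

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```

Under these assumptions the result agrees with the output of `print_trainable_parameters()`.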

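3.3 Loading Reward Functions

For the reward component of the system, we can use either pretrained reward models or reward functions defined directly in code. For training, the DeepSeek-R1 authors used an accuracy-based reward model that evaluates whether the response is correct, alongside a format-based reward that ensures the model places its reasoning process between <think> </think> tags. You can find more details here. We can simply define and implement these reward functions as generic Python functions.

In this case, we will use these reward functions:

Format enforcement: Ensures that the generation follows a specific format, using <think> </think> <answer> </answer> tags for reasoning.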

import re


def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]
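To see how this pattern behaves, the sketch below applies the same regular expression to three made-up completion strings. The helper `follows_format` and the sample strings are illustrative assumptions. Note that `.` does not match a newline by default, so reasoning that spans several lines is scored 0.0 by this pattern.

```python
import re

# Same pattern as in format_reward above
pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"


def follows_format(text):
    return 1.0 if re.match(pattern, text) else 0.0


samples = [
    "<think>2 + 2 = 4</think> <answer>4</answer>",  # well formed, single line
    "The answer is 4",  # no tags at all
    "<think>step 1\nstep 2</think><answer>4</answer>",  # reasoning spans two lines
]
scores = [follows_format(s) for s in samples]
print(scores)
```

If multi-line reasoning should also be rewarded, passing `re.DOTALL` to `re.match` is one option; whether that is desirable depends on the output format you want to encourage.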

Solution accuracy: Verifies whether the solution to the problem is correct.

from math_verify import LatexExtractionConfig, parse, verify


def accuracy_reward(completions, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    solutions = kwargs["solution"]
    completion_contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, solution in zip(completion_contents, solutions):
        gold_parsed = parse(solution, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        answer_parsed = parse(content, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        if len(gold_parsed) != 0:
            try:
                rewards.append(float(verify(answer_parsed, gold_parsed)))
            except Exception:
                rewards.append(0.0)
        else:
            rewards.append(1.0)
    return rewards
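For intuition about what the accuracy check has to do, here is a deliberately simplified stand-in that does not use math_verify: it pulls the contents of the last `\boxed{...}` out of each string and compares them as plain text. The helpers `extract_boxed` and `naive_accuracy` are hypothetical names introduced for illustration only.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, chars = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces


def naive_accuracy(completion, solution):
    gold = extract_boxed(solution)
    pred = extract_boxed(completion)
    if gold is None:
        return 1.0  # mirrors accuracy_reward: an unparseable gold answer is not penalized
    if pred is None:
        return 0.0
    return float(pred.replace(" ", "") == gold.replace(" ", ""))


solution = "Thus, the coefficient is \\boxed{\\frac{63}{400}}"
print(naive_accuracy("<answer>\\boxed{\\frac{63}{400}}</answer>", solution))
print(naive_accuracy("<answer>\\boxed{0.1575}</answer>", solution))
```

The second call scores 0.0 even though 0.1575 equals 63/400, because a text comparison cannot recognize equivalent forms of the same value. That limitation is the motivation for using a dedicated parser and verifier such as math_verify in the reward function above.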

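3.4 Configuring GRPO Training Parameters

Next, let's configure the training parameters for GRPO. We recommend experimenting with the max_completion_length, num_generations, and max_prompt_length parameters (refer to the image at the beginning for details about each of them).

To keep things simple, we'll start by training for just one epoch and reducing max_completion_length, num_generations, and max_prompt_length from their default values.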

from trl import GRPOConfig

# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO-test",
    learning_rate=1e-5,
    remove_unused_columns=False,  # to access the solution column in accuracy_reward
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    bf16=True,
    # Parameters that control the data preprocessing
    max_completion_length=64,  # default: 256
    num_generations=4,  # default: 8
    max_prompt_length=128,  # default: 512
    # Parameters related to reporting and saving
    report_to=["tensorboard"],
    logging_steps=10,
    push_to_hub=True,
    save_strategy="steps",
    save_steps=10,
)
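To get a feel for what these reductions buy, the rough arithmetic below compares an upper bound on the number of tokens handled per prompt under the reduced settings and under the defaults noted in the comments above. The helper `token_budget` is a hypothetical name, and the figures are upper bounds only: real prompts and completions are often shorter, and an implementation may process the shared prompt more efficiently than this estimate suggests.

```python
def token_budget(max_prompt_length, max_completion_length, num_generations):
    """Upper bound on tokens per sequence and per group of sampled completions."""
    per_sequence = max_prompt_length + max_completion_length
    return per_sequence, per_sequence * num_generations


reduced = token_budget(128, 64, 4)  # values used in this notebook
defaults = token_budget(512, 256, 8)  # defaults noted in the comments above
print(reduced, defaults)
```

Under this estimate the reduced settings handle roughly one eighth of the tokens per prompt, at the cost of truncating longer problems and leaving little room for extended reasoning.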

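3.5 Training the Model 🏃

Now, let's configure the trainer and start training the model!

In this case, we pass the two reward functions we previously defined to the trainer.

Below, you'll find a diagram of the training procedure we'll be reproducing, which comes from the Open-R1 project.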

from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model, reward_funcs=[format_reward, accuracy_reward], args=training_args, train_dataset=train_dataset
)
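The two reward functions passed above each return one score per completion. A minimal sketch of how such scores could be merged into a single reward is shown below, assuming they are simply summed with equal weight. That is an assumption about the trainer's internals, so check the TRL documentation for your version. The helper `combine_rewards` and the scores are made up for illustration.

```python
def combine_rewards(per_function_rewards):
    """Sum rewards across functions for each completion (assumed equal weights)."""
    return [sum(values) for values in zip(*per_function_rewards)]


# Made-up scores for four completions of one prompt
format_scores = [1.0, 1.0, 0.0, 0.0]
accuracy_scores = [1.0, 0.0, 1.0, 0.0]
totals = combine_rewards([format_scores, accuracy_scores])
print(totals)
```

With this scheme a completion that is both well formatted and correct scores highest, while one that satisfies only one criterion still earns partial credit.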

Time to train the model! 🎉

trainer.train()

Let's save the results 💾

trainer.save_model(training_args.output_dir)
trainer.push_to_hub(dataset_name=dataset_id)

Below, you can review the TensorBoard results for the training. They look promising!

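4. Check the Model Performance

We've kept things simple so far, but now let's check whether the model has learned to reason. We'll load the saved model and run an evaluation on a test sample.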

from transformers import AutoTokenizer

model_id = "sergiopaniego/Qwen2-0.5B-GRPO"
trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
trained_tokenizer = AutoTokenizer.from_pretrained(model_id)

Let's check one sample from the test set!

>>> print(test_dataset["prompt"][0])

[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>', 'role': 'system'}, {'content': "In 1988, a person's age was equal to the sum of the digits of their birth year. How old was this person?", 'role': 'user'}]

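We'll create a function to interact with the model. In addition to generating the answer, we'll measure the inference duration and count the number of generated tokens. This will give us insight into how much the model reasoned during generation.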

import time


def generate_with_reasoning(prompt):
    # Build the prompt from the dataset
    prompt = " ".join(entry["content"] for entry in prompt)

    # Tokenize and move to the same device as the model
    inputs = trained_tokenizer(prompt, return_tensors="pt").to(trained_model.device)

    # Generate text without gradients
    start_time = time.time()
    with torch.no_grad():
        output_ids = trained_model.generate(**inputs, max_length=500)
    end_time = time.time()

    # Decode and extract model response
    generated_text = trained_tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Get inference time
    inference_duration = end_time - start_time

    # Get number of generated tokens
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output_ids.shape[1] - num_input_tokens

    return generated_text, inference_duration, num_generated_tokens

Let's generate the answer for that test sample!

>>> prompt = test_dataset["prompt"][0]
>>> generated_text, inference_duration, num_generated_tokens = generate_with_reasoning(prompt)
>>> print(generated_text)

A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer> In 1988, a person's age was equal to the sum of the digits of their birth year. How old was this person?<think>
The reasoning process is that if the sum of the digits of the birth year is equal to the person's age, then the person must have been born in a given year.
</think>
<answer>
The answer is: 1988
</answer>

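The model already demonstrates the ability to generate the correct <think> and <answer> tags, even though the solution itself is incorrect.

Given the inference time and the number of generated tokens, this approach shows potential benefits: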

>>> print(f"Inference time: {inference_duration:.2f} seconds")
>>> print(f"Generated tokens: {num_generated_tokens}")

Inference time: 2.09 seconds
Generated tokens: 55
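The two figures above can be combined into a throughput estimate. This is a one-off measurement on a single sample, so treat it as indicative only.

```python
inference_duration = 2.09  # seconds, from the output above
num_generated_tokens = 55  # from the output above

tokens_per_second = num_generated_tokens / inference_duration
print(f"{tokens_per_second:.1f} tokens/s")
```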

Let's review the generated response to better visualize this behavior:

>>> prompt_text = " ".join(entry["content"] for entry in prompt)
>>> response_text = generated_text[len(prompt_text) :].strip()
>>> print(response_text)

<think>
The reasoning process is that if the sum of the digits of the birth year is equal to the person's age, then the person must have been born in a given year.
</think>
<answer>
The answer is: 1988
</answer>
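When inspecting many responses, it can help to separate the reasoning from the final answer programmatically. The sketch below does this with two regular expressions. The helper `split_reasoning` and the example string are hypothetical and not part of the original notebook.

```python
import re


def split_reasoning(response_text):
    """Return (reasoning, answer); either is None when its tag pair is missing."""
    think = re.search(r"<think>(.*?)</think>", response_text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response_text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )


# Made-up response in the expected format
example = "<think>\n1 + 1 = 2\n</think>\n<answer>\n2\n</answer>"
reasoning, answer = split_reasoning(example)
print(reasoning)
print(answer)
```

Here `re.DOTALL` is used so that tag contents spanning several lines are captured.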

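We observe that the model demonstrates some reasoning capabilities, although they are limited. This can be attributed to several factors: the use of a small model, a limited subset of the dataset, and a short training duration, all chosen to keep the process simple and practical for a notebook environment.

The complexity of the dataset also plays a role. Simplifying the problem might yield better results, as demonstrated here.

Despite these limitations, this technique shows great promise. The release of DeepSeek-R1 and the adoption of this training approach could lead to significant breakthroughs in the coming months!

5. Continuing Your Learning Journey 🧑‍🎓

As you can see, this is just the beginning of exploring the GRPO trainer and the DeepSeek R1 models. If you're eager to dive deeper, be sure to explore the following resources linked in the notebook, as well as these additional materials:

Happy learning and experimenting! 🚀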
