Post-training an LLM for reasoning with GRPO in TRL
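Authored by: Sergio Paniego
In this notebook, we'll walk through post-training a large language model (LLM) with Group Relative Policy Optimization (GRPO), a method introduced in the DeepSeekMath paper. GRPO is particularly effective at scaling test-time compute for extended reasoning, which makes it a good fit for complex tasks such as mathematical problem-solving.
GRPO is a reinforcement learning (RL) post-training technique that was integrated into DeepSeek-R1's training pipeline. It appears to share similarities with the training procedures used for OpenAI's o1 and o3 models, though the exact correspondence has not been confirmed. Unlike earlier techniques that relied on search-heuristic methods, GRPO uses RL exclusively for post-training, improving the model's ability to handle complex and nuanced tasks.
GRPO is available through the TRL library. At the time of writing, the Hugging Face Science team is working to reproduce the full DeepSeek-R1 training process, which you can explore in their Open-R1 project. I highly recommend checking it out for a deeper look at the overall process.
In this notebook, we'll focus specifically on post-training with GRPO; the last section lists additional resources on DeepSeek-R1 and its training procedure.
Below is a diagram illustrating how this training process works.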
1. Install dependencies
Let's start by installing the essential libraries we'll need for fine-tuning! 🚀
!pip install -U -q trl peft math_verify
# Tested with transformers==4.47.1, trl==0.14.0, datasets==3.2.0, peft==0.14.0, accelerate==1.2.1, math_verify==0.3.3
Authenticate with your Hugging Face account to save and share your model directly from this notebook 🗝️.
from huggingface_hub import notebook_login
notebook_login()
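2. Load dataset 📁
These models excel at tasks that require complex reasoning. A prime example is mathematical problem-solving, which often takes multi-step reasoning to reach a correct solution.
For this project, we'll use the AI-MO/NuminaMath-TIR dataset. It is a reasoning-focused dataset containing mathematical problems, their solutions, and detailed reasoning steps that explain how to get from the problem statement to the final solution.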
from datasets import load_dataset
dataset_id = "AI-MO/NuminaMath-TIR"
train_dataset, test_dataset = load_dataset(dataset_id, split=["train[:5%]", "test[:5%]"])
Let's check the structure of the dataset:
>>> print(train_dataset)
Dataset({ features: ['problem', 'solution', 'messages'], num_rows: 3622 })
Let's check one sample:
>>> print(train_dataset[0])
{'problem': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$? Express your answer as a common fraction.', 'solution': "To determine the coefficient of in the expansion of, we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case,,, and.\n\nWe are interested in the term that contains. In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get, we need, thus.\n\nSubstituting into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{6} \\left(\\frac{3}{5}x\\right)^2 \\left(-\\frac{y}{2}\\right)^6\n\\]\n\nNow, we will compute each part of this expression.\n\n1. Calculate the binomial coefficient.\n2. Compute.\n3. Compute.\n4. Combine everything together to get the coefficient of.\n\nLet's compute these in Python.\n```python\nfrom math import comb\n\n# Given values\nn = 8\nk = 6\n\n# Calculate the binomial coefficient\nbinom_coeff = comb(n, k)\n\n# Compute (3/5)^2\na_term = (3/5)**2\n\n# Compute (-1/2)^6\nb_term = (-1/2)**6\n\n# Combine terms to get the coefficient of x^2y^6\ncoefficient = binom_coeff * a_term * b_term\nprint(coefficient)\n```\n```output\n0.1575\n```\nThe coefficient of in the expansion of is. To express this as a common fraction, we recognize that:\n\n\\[ 0.1575 = \\frac{1575}{10000} = \\frac{63}{400} \\]\n\nThus, the coefficient can be expressed as:\n\n\\[\n\\boxed{\\frac{63}{400}}\n\\]", 'messages': [{'content': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$? 
Express your answer as a common fraction.', 'role': 'user'}, {'content': "To determine the coefficient of in the expansion of, we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case,,, and.\n\nWe are interested in the term that contains. In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get, we need, thus.\n\nSubstituting into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{6} \\left(\\frac{3}{5}x\\right)^2 \\left(-\\frac{y}{2}\\right)^6\n\\]\n\nNow, we will compute each part of this expression.\n\n1. Calculate the binomial coefficient.\n2. Compute.\n3. Compute.\n4. Combine everything together to get the coefficient of.\n\nLet's compute these in Python.\n```python\nfrom math import comb\n\n# Given values\nn = 8\nk = 6\n\n# Calculate the binomial coefficient\nbinom_coeff = comb(n, k)\n\n# Compute (3/5)^2\na_term = (3/5)**2\n\n# Compute (-1/2)^6\nb_term = (-1/2)**6\n\n# Combine terms to get the coefficient of x^2y^6\ncoefficient = binom_coeff * a_term * b_term\nprint(coefficient)\n```\n```output\n0.1575\n```\nThe coefficient of in the expansion of is. To express this as a common fraction, we recognize that:\n\n\\[ 0.1575 = \\frac{1575}{10000} = \\frac{63}{400} \\]\n\nThus, the coefficient can be expressed as:\n\n\\[\n\\boxed{\\frac{63}{400}}\n\\]", 'role': 'assistant'}]}
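During DeepSeek-R1 training, a specific system prompt was used to generate a conversational pipeline that includes reasoning steps. We'll adapt our dataset to follow this approach, in which the model is guided to first think through the problem and then present its answer.
The system prompt used is: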
A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user
with the answer. The reasoning process and answer are enclosed within <think> </think> and
<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: prompt. Assistant:
We'll modify our dataset to follow this conversational format, prompting the LLM to generate both the reasoning steps and the final answer.
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)


def make_conversation(example):
    # Build a chat-style prompt: the system prompt followed by the problem as the user turn
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }


train_dataset = train_dataset.map(make_conversation)
test_dataset = test_dataset.map(make_conversation)
Let's take a look at an example:
>>> print(train_dataset[0]["prompt"])
[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>', 'role': 'system'}, {'content': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$? Express your answer as a common fraction.', 'role': 'user'}]
We'll remove the `messages` and `problem` columns, as we only need the custom `prompt` column and `solution` to verify the generated answer.
>>> train_dataset = train_dataset.remove_columns(["messages", "problem"])
>>> print(train_dataset)
Dataset({ features: ['solution', 'prompt'], num_rows: 3622 })
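To see what these two steps do to a single record without downloading the dataset, here is a minimal plain-Python sketch. The record contents and the shortened `SYSTEM_PROMPT` placeholder are made up for illustration, and the dict merge only emulates how `Dataset.map` adds the returned column to each row.

```python
SYSTEM_PROMPT = "<system prompt text goes here>"  # placeholder, not the real prompt


def make_conversation(example):
    """Same shape as the notebook's function: returns only the new column."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }


# Hypothetical record with the same three fields as the dataset.
row = {"problem": "What is 2 + 2?", "solution": "\\boxed{4}", "messages": []}

# Dataset.map merges the returned dict into the row; emulate that with a dict merge.
mapped = {**row, **make_conversation(row)}

# remove_columns(["messages", "problem"]) then drops the fields we no longer need.
final = {k: v for k, v in mapped.items() if k not in ("messages", "problem")}

print(sorted(final))
print(final["prompt"][1])
```

Only `prompt` (the model input) and `solution` (the reference used by the accuracy reward) survive, which matches the column list printed above.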
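3. Post-training the base model using GRPO
The diagram below highlights the main differences between PPO (Proximal Policy Optimization) and GRPO (Group Relative Policy Optimization), specifically the removal of the value model in GRPO. For more detail on the key differences, you can refer to the full explanation here.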
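To make the "no value model" point concrete, here is a minimal sketch of the group-relative idea: the rewards of several completions sampled for the same prompt are normalized against their own group statistics, and the group mean acts as the baseline. The function name, the `eps` value, and the use of the population standard deviation are assumptions of this sketch, not TRL's exact implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions against their own group."""
    baseline = mean(rewards)   # the group mean replaces a learned value estimate
    spread = pstdev(rewards)   # population std; an assumption of this sketch
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical total rewards for 4 completions sampled from the same prompt.
advantages = group_relative_advantages([2.0, 0.0, 1.0, 1.0])
print([round(a, 2) for a in advantages])

# If every completion gets the same reward, the advantages are all zero,
# so that prompt contributes no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(flat)
```

Completions that score above their group's average get a positive advantage and are reinforced; those below average are discouraged.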
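3.1 Loading the baseline model
To begin, we'll load Qwen/Qwen2-0.5B-Instruct as the baseline model (the `Policy Model` in the diagram above). With only 0.5 billion parameters, it is lightweight and fits within the available resources. For better results, however, a larger alternative should be considered.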
import torch
from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
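3.2 Configuring LoRA
Next, we'll configure LoRA for model training. This technique lets us fine-tune the model efficiently by training far fewer parameters, which makes training faster and less resource-intensive.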
>>> from peft import LoraConfig, get_peft_model
>>> lora_config = LoraConfig(
...     task_type="CAUSAL_LM",
...     r=8,
...     lora_alpha=32,
...     lora_dropout=0.1,
...     target_modules=["q_proj", "v_proj"],
... )
>>> model = get_peft_model(model, lora_config)
>>> model.print_trainable_parameters()
trainable params: 540,672 || all params: 494,573,440 || trainable%: 0.1093
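The trainable-parameter count printed above can be checked by hand. The sketch below assumes the Qwen2-0.5B dimensions from the public model config (hidden size 896, 24 layers, 2 key/value heads of dimension 64); these numbers are not stated in this notebook, so treat them as assumptions. The helper `lora_params` is a hypothetical name.

```python
# Assumed Qwen2-0.5B dimensions (from the public model config, not from this notebook).
hidden_size = 896
num_layers = 24
num_kv_heads = 2
head_dim = 64
r = 8  # LoRA rank from the LoraConfig above


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) next to a frozen linear layer."""
    return rank * (in_features + out_features)


q_proj = lora_params(hidden_size, hidden_size, r)              # 896 -> 896
v_proj = lora_params(hidden_size, num_kv_heads * head_dim, r)  # 896 -> 128
trainable = num_layers * (q_proj + v_proj)

total = 494_573_440  # "all params" as printed above
print(f"{trainable:,} trainable ({100 * trainable / total:.4f}%)")
```

Under these assumptions the count comes out to 540,672, in agreement with `print_trainable_parameters()`.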
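3.3 Loading reward functions
For the reward component of the system, we can use either pretrained reward models or reward functions defined directly in code. For training, the DeepSeek-R1 authors used an accuracy-based reward model to evaluate whether the response is correct, alongside a format-based reward that ensures the model places its reasoning process between `<think> </think>` tags. You can find more details here. We can simply define and implement these reward functions as ordinary Python functions.
In this case, we'll use the following reward functions:
- Format enforcement: ensures that the generation follows a specific format, using `<think> </think> <answer> </answer>` tags for reasoning.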
import re


def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    # Note: without re.DOTALL, "." does not match newlines, so reasoning that spans several lines fails this check.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content) for content in completion_contents]
    rewards_list = [1.0 if match else 0.0 for match in matches]
    return rewards_list
- Solution accuracy: verifies whether the solution to the problem is correct.
from math_verify import LatexExtractionConfig, parse, verify


def accuracy_reward(completions, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    solutions = kwargs["solution"]
    completion_contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, solution in zip(completion_contents, solutions):
        gold_parsed = parse(solution, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        answer_parsed = parse(content, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        if len(gold_parsed) != 0:
            try:
                rewards.append(float(verify(answer_parsed, gold_parsed)))
            except Exception:
                # Treat any verification failure as an incorrect answer
                rewards.append(0.0)
        else:
            # The reference solution could not be parsed: give full reward so the example is not penalized
            rewards.append(1.0)
    return rewards
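To get a feel for the calling convention, here is a small self-contained sketch that runs the same format check on a few made-up completions. Each completion is a one-message conversation, which is why the function reads `completion[0]["content"]`. The sample strings are hypothetical.

```python
import re

PATTERN = r"^<think>.*?</think>\s*<answer>.*?</answer>$"


def format_reward(completions, **kwargs):
    """Same check as above: 1.0 if the whole completion matches the tag layout."""
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(PATTERN, content) else 0.0 for content in contents]


# Hypothetical completions, each wrapped as a one-message conversation.
completions = [
    [{"role": "assistant", "content": "<think>2 + 2 = 4</think> <answer>4</answer>"}],
    [{"role": "assistant", "content": "The answer is 4."}],
    [{"role": "assistant", "content": "<think>step 1\nstep 2</think><answer>4</answer>"}],
]

scores = format_reward(completions)
print(scores)
```

The third completion has the right tags but scores 0.0, because `.` does not match a newline unless `re.DOTALL` is passed. Whether to allow multi-line reasoning is a design choice; this sketch simply mirrors the pattern used in this notebook.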
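3.4 Configuring GRPO training parameters
Next, let's configure the training parameters for GRPO. We recommend experimenting with `max_completion_length`, `num_generations`, and `max_prompt_length` (refer to the image at the beginning for details about each of them).
To keep things simple, we'll train for just one epoch and reduce `max_completion_length`, `num_generations`, and `max_prompt_length` from their default values.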
from trl import GRPOConfig

# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO-test",
    learning_rate=1e-5,
    remove_unused_columns=False,  # to access the solution column in accuracy_reward
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    bf16=True,
    # Parameters that control the data preprocessing
    max_completion_length=64,  # default: 256
    num_generations=4,  # default: 8
    max_prompt_length=128,  # default: 512
    # Parameters related to reporting and saving
    report_to=["tensorboard"],
    logging_steps=10,
    push_to_hub=True,
    save_strategy="steps",
    save_steps=10,
)
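As a rough way to see why these three parameters dominate the cost, the sketch below computes an upper bound on the tokens processed per prompt, assuming every prompt and every completion reaches its length cap. The defaults are the ones quoted in the comments above; the helper name `tokens_per_prompt_group` is hypothetical.

```python
def tokens_per_prompt_group(max_prompt_length, max_completion_length, num_generations):
    """Upper bound on tokens for one prompt: each generation carries prompt + completion."""
    return num_generations * (max_prompt_length + max_completion_length)


# Values used in this notebook.
reduced = tokens_per_prompt_group(128, 64, 4)

# Default values quoted in the config comments above.
default = tokens_per_prompt_group(512, 256, 8)

print(reduced, default, default // reduced)
```

Under this assumption the reduced settings process at most one eighth of the tokens per prompt that the defaults would. The price is that a 64-token completion leaves little room for long reasoning.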
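3.5 Training the model 🏃
Now, let's configure the trainer and start training the model!
In this case, we pass the two reward functions we previously defined to the trainer.
Below is a diagram of the training procedure we'll be reproducing, which is sourced from the Open-R1 project.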
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, accuracy_reward],
    args=training_args,
    train_dataset=train_dataset,
)
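When several reward functions are passed, each one returns one score per completion, and the scores then need to be combined into a single reward. The sketch below assumes equal-weight summation, which is our understanding of how the trainer combines them. The helper name `combine_rewards` and the score values are hypothetical.

```python
def combine_rewards(rewards_per_func):
    """Sum the per-completion scores returned by each reward function."""
    return [sum(scores) for scores in zip(*rewards_per_func)]


# Hypothetical outputs of format_reward and accuracy_reward for 4 completions.
format_scores = [1.0, 1.0, 0.0, 0.0]
accuracy_scores = [1.0, 0.0, 1.0, 0.0]

total_rewards = combine_rewards([format_scores, accuracy_scores])
print(total_rewards)
```

With two binary rewards, a completion can score 0, 1, or 2, so a well-formatted but wrong answer still ranks above a badly formatted wrong one.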
Time to train the model! 🎉
trainer.train()
Let's save the results 💾
trainer.save_model(training_args.output_dir)
trainer.push_to_hub(dataset_name=dataset_id)
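Below, you can review the TensorBoard results for the training. They look promising!
4. Check the model performance
We've kept things simple so far, but now let's check whether the model has learned to reason. We'll load the saved model and run an evaluation on a test sample.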
from transformers import AutoTokenizer

model_id = "sergiopaniego/Qwen2-0.5B-GRPO"
trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
trained_tokenizer = AutoTokenizer.from_pretrained(model_id)
Let's check a sample from the test set!
>>> print(test_dataset["prompt"][0])
[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>', 'role': 'system'}, {'content': "In 1988, a person's age was equal to the sum of the digits of their birth year. How old was this person?", 'role': 'user'}]
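We'll create a function to interact with the model. In addition to generating the answer, we'll measure the inference duration and count the number of generated tokens. This will give us a sense of how much the model reasoned during generation.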
import time


def generate_with_reasoning(prompt):
    # Build the prompt from the dataset
    prompt = " ".join(entry["content"] for entry in prompt)

    # Tokenize and move to the same device as the model
    inputs = trained_tokenizer(prompt, return_tensors="pt").to(trained_model.device)

    # Generate text without gradients
    start_time = time.time()
    with torch.no_grad():
        output_ids = trained_model.generate(**inputs, max_length=500)
    end_time = time.time()

    # Decode and extract model response
    generated_text = trained_tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Get inference time
    inference_duration = end_time - start_time

    # Get number of generated tokens
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output_ids.shape[1] - num_input_tokens

    return generated_text, inference_duration, num_generated_tokens
Let's generate an answer for that test sample!
>>> prompt = test_dataset["prompt"][0]
>>> generated_text, inference_duration, num_generated_tokens = generate_with_reasoning(prompt)
>>> print(generated_text)
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer> In 1988, a person's age was equal to the sum of the digits of their birth year. How old was this person?The reasoning process is that if the sum of the digits of the birth year is equal to the person's age, then the person must have been born in a given year. The answer is: 1988
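The model already generates the correct `<think>` and `<answer>` tags, even though the solution itself is incorrect.
Given the inference time and the number of generated tokens, this approach shows potential benefits: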
>>> print(f"Inference time: {inference_duration:.2f} seconds")
>>> print(f"Generated tokens: {num_generated_tokens}")
Inference time: 2.09 seconds
Generated tokens: 55
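The two numbers can be combined into a throughput figure, which is easier to compare across runs than either value alone. This is a small arithmetic sketch using the values printed above; the variable `tokens_per_second` is introduced here for illustration.

```python
inference_duration = 2.09   # seconds, from the output above
num_generated_tokens = 55   # from the output above

# Average generation speed for this single sample.
tokens_per_second = num_generated_tokens / inference_duration
print(f"Throughput: {tokens_per_second:.1f} tokens/s")
```

A single sample is a noisy measurement, so averaging over several prompts would give a more reliable figure.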
Let's review the generated response to better visualize this behavior:
>>> prompt_text = " ".join(entry["content"] for entry in prompt)
>>> response_text = generated_text[len(prompt_text) :].strip()
>>> print(response_text)
The reasoning process is that if the sum of the digits of the birth year is equal to the person's age, then the person must have been born in a given year. The answer is: 1988
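To inspect the reasoning and the final answer separately, a helper along the following lines could split a tagged response. This is a minimal sketch: `split_reasoning` is a hypothetical name, and the sample string is made up rather than taken from the model output above.

```python
import re


def split_reasoning(text):
    """Return (reasoning, answer); a part is None when its tags are missing."""
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )


# Made-up response, not the model output shown above.
sample = "<think>\n17 + 25 = 42\n</think>\n<answer>42</answer>"
reasoning, answer = split_reasoning(sample)
print(repr(reasoning), repr(answer))
```

Here `re.DOTALL` is used so that reasoning spanning several lines is still captured. Returning `None` for a missing part makes it easy to count how often the model skips a tag.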
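We observe that the model shows some reasoning ability, although it is limited. This can be attributed to several factors: the use of a small model, a limited subset of the dataset, and a short training run chosen to keep the process simple and practical for a notebook environment.
The complexity of the dataset also plays a role. Simplifying the problem might yield better results, as demonstrated here.
Despite these limitations, the technique shows great promise. The release of DeepSeek-R1 and the adoption of this training approach could lead to significant breakthroughs in the coming months!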
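5. Continuing your learning journey 🧑‍🎓
As you can see, this is just the beginning of exploring the GRPO trainer and the DeepSeek-R1 models. If you're eager to dive deeper, be sure to explore the resources linked throughout the notebook, as well as these additional materials:
- DeepSeek-R1's repo
- DeepSeek-R1's paper
- Open reproduction of DeepSeek-R1
- GRPO TRL trainer
- Phil Schmid's DeepSeek-R1 blog post
- Phil Schmid's mini DeepSeek-R1 blog post
- Illustrated DeepSeek-R1
- The LM Book's DeepSeek-R1 article
Happy learning and experimenting! 🚀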