Post-Training LLMs for Reasoning with GRPO Using Unsloth

Community article, published August 4, 2025

Anshuman Mishra

shivance

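Note: this article was originally published on my personal blog.

It has been a while since I last blogged about a technical topic. Today is the day!

In 2025, LLMs have gained the ability to "reason" through complex math problems. This isn't magic; it is the result of specialized post-training. Pre-trained LLMs are knowledgeable, but they are not natural problem solvers. To make them good at complex tasks such as mathematical reasoning, we need to fine-tune them.

In this guide, we'll learn how to do exactly that. We'll use a clever reinforcement learning technique called GRPO, together with the speed-focused Unsloth library, to train a strong base model, Qwen3-1.7B-Base, to reason. Let's get started!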

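Base Models vs. Chat Models

What is a base model?

A base model is the raw, foundational LLM trained on a massive amount of text data. Its core ability is simply predicting the next word. It is knowledgeable, but on its own it does not know how to follow instructions or hold a conversation. Think of it as a brilliant but untamed knowledge engine.

What is a chat/instruct model?

A chat model (or instruction-tuned model) is a base model that has gone through a second stage of training. This alignment stage typically uses techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to teach the model to be helpful, harmless, and to follow user instructions in a conversational format. The process gives the model a particular "personality" and a strong preference for a certain response style.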

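What is GRPO and how does it work?

Group Relative Policy Optimization (GRPO) is a reinforcement learning (RL) technique designed to improve the reasoning abilities of language models efficiently. To understand its advantages, we first need to look at the method it improves upon: Proximal Policy Optimization (PPO).

The problem with PPO

Traditional RL fine-tuning with PPO is expensive because it requires four large models in GPU memory: the policy model (the one being trained), a reference model, a reward model, and a value model. The value model is also trainable; it estimates the expected long-term reward, but it adds significant complexity and memory overhead.

GRPO eliminates the value model entirely. This single change substantially lowers the compute requirements and makes advanced RL fine-tuning more accessible.

It replaces the complex value estimation with a three-step process based on group statistics:

Generate a group of outputs: instead of producing a single response, the policy model is asked to generate a group of different responses for a given prompt.
Compute rewards: each generated output is then scored by a reward function (or a separate reward model). For reasoning tasks, the reward may be based on correct formatting or mathematical accuracy.
Estimate advantages from the group: this is the key step. GRPO computes an "advantage" for each response, a signal that tells the model whether to reinforce or suppress that kind of output. It does this by normalizing each response's reward against the mean and standard deviation of the rewards across the whole group.

Computing the advantage

The advantage is computed with this simple formula: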

$\hat{A}_{i,t} = \frac{r_i - \text{mean}(r)}{\text{std}(r)}$

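where:

$r_i$ is the reward for a specific output.
$\text{mean}(r)$ is the mean reward of all outputs in the group.
$\text{std}(r)$ is the standard deviation of all rewards in the group.

The formula essentially asks: "How much better or worse is this particular response than the average of all the responses the model just generated for this prompt?" An output whose reward is well above the mean gets a large positive advantage, which strongly reinforces that reasoning path. An output below the mean gets a negative advantage, which penalizes it. The same advantage is shared by every token $t$ of output $i$, which is why the right-hand side does not depend on $t$. This group-based comparison provides a robust, on-the-fly baseline without a separate, memory-hungry value LLM.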

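A simple example of the advantage calculation in GRPO

Imagine the prompt is "What is 7 * 6?". The model generates a group of 3 responses, which are then scored.

ID	Response	Rationale	Reward
A	`<think>7*6 is 42</think><SOLUTION>42</SOLUTION>`	Correct format and answer	+4.0
B	`<think>7*6 is 41</think><SOLUTION>41</SOLUTION>`	Correct format, wrong answer	+1.0
C	`The answer is 42.`	Wrong format, "correct" answer	-2.0

GRPO then computes the group statistics:

Mean reward: (4.0 + 1.0 - 2.0) / 3 = 1.0
Standard deviation (population): sqrt((3.0^2 + 0.0^2 + (-3.0)^2) / 3) = sqrt(6) ≈ 2.45
Advantage of response A: (4.0 - 1.0) / 2.45 ≈ +1.22 (strongly reinforced)
Advantage of response B: (1.0 - 1.0) / 2.45 = 0 (neither reinforced nor penalized)
Advantage of response C: (-2.0 - 1.0) / 2.45 ≈ -1.22 (strongly penalized)

This lets the model learn to prefer the structure and accuracy of response A without a separate value model to make that judgment.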
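The arithmetic of this example can be reproduced in a few lines. This is a minimal sketch, not the trainer's actual code: the function name `group_advantages` is made up for illustration, the population standard deviation is an assumption (an implementation may use the sample standard deviation instead, which changes the scale but not the sign), and the small `eps` only guards against division by zero.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

rewards = [4.0, 1.0, -2.0]  # responses A, B, C from the example
advantages = group_advantages(rewards)
for name, adv in zip("ABC", advantages):
    print(f"Response {name}: advantage = {adv:+.2f}")
```

Response A ends up with a positive advantage, C with an equally large negative one, and B sits exactly at the group mean.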

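Why use Unsloth?

Unsloth is a library designed to make LLM fine-tuning faster and more memory-efficient. It offers:

Faster training: Unsloth can speed up training significantly, by 2x or more in some cases.
Lower memory usage: by shrinking the memory footprint, it allows larger models to be fine-tuned on consumer-grade hardware.
Ease of use: Unsloth provides a user-friendly API that simplifies the fine-tuning workflow.

Now let's dive into the code and the fine-tuning process.

The Post-Training Process

1. Setting up the environment

The first step is to install the required libraries. Commands are given below for a standard Python environment and for a Google Colab instance. The leading ! is notebook syntax; drop it when running in a regular shell.

Standard environment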

# For a standard environment
!pip install unsloth vllm

Google Colab

!pip install --no-deps unsloth vllm==0.8.5.post1
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer

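These commands install Unsloth for post-training, vLLM for fast inference, and other essential libraries such as PEFT (Parameter-Efficient Fine-Tuning), TRL (Transformer Reinforcement Learning), and Datasets.

2. Loading the model and preparing PEFT

Next, we load the Qwen3-1.7B-Base model with Unsloth's FastLanguageModel. We also configure it for PEFT using LoRA (Low-Rank Adaptation).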

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
lora_rank = 32

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-1.7B-Base",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.7, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank*2, # *2 speeds up training
    use_gradient_checkpointing = "unsloth", # Reduces memory usage
    random_state = 3407,
)
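To get a feel for why LoRA is cheap, it helps to count parameters. LoRA freezes a weight matrix of shape d_out x d_in and trains two small factors of shapes d_out x r and r x d_in in its place. The sketch below is plain arithmetic; the 2048 x 2048 layer size is a hypothetical value chosen for illustration, not a figure taken from the model's configuration.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank

d_in = d_out = 2048   # hypothetical layer size, for illustration only
rank = 32             # same value as lora_rank in the code above
full = d_in * d_out
lora = lora_param_count(d_in, d_out, rank)
print(f"full matrix: {full:,} parameters")
print(f"LoRA factors: {lora:,} parameters ({full // lora}x fewer)")
```

With lora_alpha set to twice the rank, as above, the low-rank update is scaled by lora_alpha / r = 2 before being added to the frozen weights.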

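3. Building a custom chat template for GRPO

Before training the model, we need to teach it how to structure its responses. We do this by defining a chat template. The template acts as a blueprint that guides the model to produce output in a consistent, predictable format, which is exactly what our reasoning task needs.

Our goal is a template that forces the model to "show its work" and then give a clear final answer. We'll build it in three steps:

Step 1: Define the structure with special tags

First, we define the special tags that will act as delimiters in the model's output. This makes the responses easy for our reward functions to parse later.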

# Define the special tokens that will structure the output
reasoning_start = "<think>"
reasoning_end   = "</think>"
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

The output format we want looks like this:

<think>
...the model's step-by-step reasoning goes here...
</think>
<SOLUTION>
...the model's final answer goes here...
</SOLUTION>

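Step 2: Create the system prompt and Jinja template

Next, we create the instructions for the model. There are two parts: a high-level system prompt that tells the model what to do, and a Jinja2 template that programmatically assembles the conversation for the model.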

# Create the system prompt that instructs the model on the format
system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""

# Define the Jinja2 template for the tokenizer
chat_template = \
    "{% if messages[0]['role'] == 'system' %}"\
        "{{ messages[0]['content'] + eos_token }}"\
        "{% set loop_messages = messages[1:] %}"\
    "{% else %}"\
        "{{ '{system_prompt}' + eos_token }}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ message['content'] }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ message['content'] + eos_token }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}{{ '{reasoning_start}' }}"\
    "{% endif %}"

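The Jinja template is the core logic. It walks through the conversation history and formats it correctly. The most important part is the last block: {% if add_generation_prompt %}. It tells the tokenizer to append our <think> tag automatically when it is the model's turn to speak, which kicks off the reasoning process.

Step 3: Apply the template to the tokenizer

Finally, we inject our custom system_prompt into the Jinja template and assign the finished template to our tokenizer. This makes our custom format the rule for all future conversations.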

# Inject our custom system prompt and starting token into the template
chat_template = chat_template\
    .replace("'{system_prompt}'",   f"'{system_prompt}'")\
    .replace("'{reasoning_start}'", f"'{reasoning_start}'")

# Finally, apply this template to the tokenizer
tokenizer.chat_template = chat_template

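By enforcing this structure, we make the process of rewarding the model for good reasoning both programmable and reliable. It is the foundation that our whole GRPO training strategy rests on.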
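To see what the template produces without loading a tokenizer, its branches can be mirrored in plain Python. This is only an inspection aid under stated assumptions: `render_chat` is a hypothetical helper, the `<|endoftext|>` EOS string is a placeholder (in practice it comes from tokenizer.eos_token), and the system prompt is shortened.

```python
def render_chat(messages, default_system, eos_token, reasoning_start,
                add_generation_prompt=False):
    """Pure-Python mirror of the Jinja template logic, for inspection only."""
    parts = []
    if messages and messages[0]["role"] == "system":
        parts.append(messages[0]["content"] + eos_token)
        loop_messages = messages[1:]
    else:
        # No system message supplied: fall back to the default one
        parts.append(default_system + eos_token)
        loop_messages = messages
    for message in loop_messages:
        if message["role"] == "user":
            parts.append(message["content"])
        elif message["role"] == "assistant":
            parts.append(message["content"] + eos_token)
    if add_generation_prompt:
        # Open the reasoning block so the model starts by thinking
        parts.append(reasoning_start)
    return "".join(parts)

prompt = render_chat(
    [{"role": "user", "content": "What is 7 * 6?"}],
    default_system="You are given a problem.",  # shortened stand-in for system_prompt
    eos_token="<|endoftext|>",                  # assumed placeholder value
    reasoning_start="<think>",
    add_generation_prompt=True,
)
print(prompt)
```

Because the prompt already ends with the opening <think> tag, the model's completion starts inside the reasoning block, so only the closing tag and the solution tags show up in the completion text.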

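4. Pre-fine-tuning for format

Before letting the model learn by trial and error with GRPO, we give it a head start with a short supervised fine-tuning (SFT) phase. Why? Reinforcement learning works best when the model already has a rough idea of what to do. If the base model has never seen our <think> format, it will generate random, unstructured text. Rewarding the rare cases where it happens to get the format right is extremely inefficient.

This SFT step acts as behavior cloning. We show the model a few hundred examples in exactly the format we want. That quickly teaches it the basic structure, so the GRPO phase that follows is more stable and can focus on improving the reasoning inside the format rather than on learning the format itself.

We use a small subset of NVIDIA's Open Math Reasoning dataset for this.

Loading and formatting the SFT dataset

First, we load the dataset with Hugging Face's datasets library. We filter it to keep only problems with numeric answers, which simplifies this pre-tuning step.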

from datasets import load_dataset
import pandas as pd
import numpy as np

dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
dataset = dataset.to_pandas()[
    ["expected_answer", "problem", "generated_solution"]
]

# Keep only samples where the answer is a number
is_number = pd.to_numeric(pd.Series(dataset["expected_answer"]), errors = "coerce").notnull()
dataset = dataset.iloc[np.where(is_number)[0]]

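Next, we write a function that reformats each row into the custom chat structure we defined earlier. It takes the existing reasoning trace and wraps it in our special tags (<think>, <SOLUTION>, and so on).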

def format_sft_dataset(x):
    # Reformat the existing solution to match our template
    thoughts = x["generated_solution"].replace("<think>", "").replace("</think>", "").strip()
    
    # Construct the final response format
    final_prompt = \
        reasoning_start + thoughts + reasoning_end + \
        solution_start + x["expected_answer"] + solution_end
    
    # Return the message structure
    return [
        {"role" : "system",    "content" : system_prompt},
        {"role" : "user",      "content" : x["problem"]},
        {"role" : "assistant", "content" : final_prompt},
    ]

dataset["Messages"] = dataset.apply(format_sft_dataset, axis = 1)

Finally, we convert the pandas DataFrame back into a Hugging Face Dataset object, which is what SFTTrainer expects.

from datasets import Dataset

dataset["text"] = tokenizer.apply_chat_template(
    dataset["Messages"].values.tolist(), tokenize = False
)
dataset = Dataset.from_pandas(dataset)

With the dataset ready, we can pass it to SFTTrainer.

from trl import SFTTrainer, SFTConfig

# `dataset` is the formatted SFT dataset built above: loaded from
# "unsloth/OpenMathReasoning-mini", filtered to numeric answers,
# and formatted with format_sft_dataset.

# Create the SFT Trainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset, # The formatted dataset
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 1, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 2, # Set this for 1 full training run.
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use "wandb" for Weights & Biases
    ),
)

# Start the pre-finetuning
trainer.train()

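5. Defining the reward system

With the model pre-tuned on the format, it's time to set up the main GRPO training loop. That starts with preparing our main dataset and defining the reward functions that will guide learning.

Loading the GRPO dataset

For the main RL stage, we'll use the open-r1/DAPO-Math-17k-Processed dataset. We map it to a simple structure: a prompt, which contains our system message and the user's question, and an answer, which is the expected solution.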

from datasets import load_dataset

# Load the main dataset for GRPO
dataset = load_dataset("open-r1/DAPO-Math-17k-Processed", "en", split = "train")

# Map it to our required format
dataset = dataset.map(lambda x: {
    "prompt" : [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": x["prompt"]},
    ],
    "answer": x["solution"],
})

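Reward functions

The heart of GRPO is scoring the outputs the model generates. We'll use four reward functions. GRPOTrainer sums the scores from all of them to get the final reward for each generation.

match_format_exactly: if the model's response follows our structure perfectly, we give it a large positive reward. We use a regular expression to check that the </think>, <SOLUTION>, and </SOLUTION> tags appear in the correct order. The opening <think> tag is already supplied by the generation prompt, so it is not part of the completion.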

import re

# We pre-compile regex for efficiency
solution_end_regex = r"</SOLUTION>[\s]{0,}" + "(?:" + re.escape(tokenizer.eos_token) + ")?"
match_format = re.compile(
    rf"{reasoning_end}.*?{solution_start}(.+?){solution_end_regex}",
    flags = re.MULTILINE | re.DOTALL
)

def match_format_exactly(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Match if format is seen exactly!
        if match_format.search(response) is not None: score += 3.0
        scores.append(score)
    return scores
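The pattern is easier to trust after running it on a couple of strings. The sketch below rebuilds the same regular expression with the tags written out literally; the EOS string `<|endoftext|>` is an assumed placeholder, since the article reads the real value from tokenizer.eos_token.

```python
import re

eos_token = "<|endoftext|>"  # assumed placeholder; normally tokenizer.eos_token
solution_end_regex = r"</SOLUTION>[\s]{0,}" + "(?:" + re.escape(eos_token) + ")?"
match_format = re.compile(
    rf"</think>.*?<SOLUTION>(.+?){solution_end_regex}",
    flags=re.MULTILINE | re.DOTALL,
)

samples = {
    "well-formed": "7*6 is 42</think><SOLUTION>42</SOLUTION>",
    "with EOS": "7*6 is 42</think>\n<SOLUTION>42</SOLUTION>\n<|endoftext|>",
    "missing tags": "The answer is 42.",
}
for label, text in samples.items():
    found = match_format.search(text)
    print(label, "->", found.group(1) if found else None)
```

The first two samples match and capture the text between the solution tags; the third has no tags, so the search returns None and the response earns no format reward.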

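match_format_approximately: if the format is not perfect, the model should still get partial credit. We check whether each required tag (</think>, <SOLUTION>, </SOLUTION>) appears exactly once, add a small reward for each one that does, and penalize the model for each tag that is missing or repeated. This encourages the model to at least attempt the format.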

def match_format_approximately(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Count how many keywords are seen - we penalize if too many!
        score += 0.5 if response.count(reasoning_end)   == 1 else -1.0
        score += 0.5 if response.count(solution_start)  == 1 else -1.0
        score += 0.5 if response.count(solution_end)    == 1 else -1.0
        scores.append(score)
    return scores
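A quick way to see the scale of this partial credit is to score a few hand-written responses. The helper below, `approx_format_score`, is a hypothetical single-response version of the same rule, written for illustration.

```python
def approx_format_score(response, tags=("</think>", "<SOLUTION>", "</SOLUTION>")):
    """+0.5 for each tag that appears exactly once, -1.0 for each that does not."""
    return sum(0.5 if response.count(tag) == 1 else -1.0 for tag in tags)

all_tags = approx_format_score("work</think><SOLUTION>42</SOLUTION>")
one_tag = approx_format_score("work</think>42")
no_tags = approx_format_score("42")
print(all_tags, one_tag, no_tags)
```

The score ranges from +1.5 when every tag appears once to -3.0 when none do, so the penalty for ignoring the format outweighs the bonus for following it.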

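check_answer: this reward function evaluates whether the answer is correct. We first try to extract the text inside the <SOLUTION> tags. If the format is wrong, we assign a penalty. If the format is right, we compare the extracted text with the ground-truth answer and give a high reward for an exact match. We also give partial credit when the ratio of the guessed value to the true value is close to 1 (within 10% or 20%).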


def check_answer(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_format.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        score = 0
        if guess is None:
            scores.append(-2.0)
            continue
        # Correct answer gets 5 points!
        if guess == true_answer:
            score += 5.0
        # Match if spaces are seen, but less reward
        elif guess.strip() == true_answer.strip():
            score += 3.5
        else:
            # We also reward it if the answer is close via ratios!
            # Ie if the answer is within some range, reward it!
            try:
                ratio = float(guess) / float(true_answer)
                if   ratio >= 0.9 and ratio <= 1.1: score += 2.0
                elif ratio >= 0.8 and ratio <= 1.2: score += 1.5
                else: score -= 2.5 # Penalize wrong answers
            except (ValueError, ZeroDivisionError):
                score -= 4.5 # Penalize
        scores.append(score)
    return scores
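The branching in check_answer is easier to follow for a single guess at a time. The function below, `answer_score`, is a hypothetical per-guess restatement of the same rules, written only to make the thresholds concrete.

```python
def answer_score(guess, true_answer):
    """Mirror of the scoring branches in check_answer for one extracted guess."""
    if guess is None:                        # format did not match
        return -2.0
    if guess == true_answer:                 # exact string match
        return 5.0
    if guess.strip() == true_answer.strip():  # match up to surrounding whitespace
        return 3.5
    try:
        ratio = float(guess) / float(true_answer)
    except (ValueError, ZeroDivisionError):  # not a number, or true answer is 0
        return -4.5
    if 0.9 <= ratio <= 1.1:
        return 2.0
    if 0.8 <= ratio <= 1.2:
        return 1.5
    return -2.5

for guess in ["42", " 42 ", "40", "35", "10", "forty-two", None]:
    print(repr(guess), "->", answer_score(guess, "42"))
```

Note the ordering of penalties: a wrong number costs less than text that cannot be parsed as a number at all.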

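check_numbers: we also give an extra reward specifically for numeric answers. We extract only the number from the solution, clean it up (for example, by removing commas), convert it to a float, and give a positive reward for an exact numeric match and a penalty otherwise.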

match_numbers = re.compile(
    solution_start + r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags = re.MULTILINE | re.DOTALL
)

# Module-level counters used to print a sample every few steps
PRINTED_TIMES = 0
PRINT_EVERY_STEPS = 5

def check_numbers(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [
        guess.group(1) if (guess := match_numbers.search(r)) is not None else None \
        for r in responses
    ]
    scores = []
    global PRINTED_TIMES, PRINT_EVERY_STEPS
    if PRINTED_TIMES % PRINT_EVERY_STEPS == 0:
        print(f"*****\nQ: {question}\nA: {answer[0]}\nR: {responses[0]}\nE: {extracted_responses[0]}\n*****")
    PRINTED_TIMES += 1

    for guess, true_answer in zip(extracted_responses, answer):
        if guess is None:
            scores.append(-2.5)
            continue
        try:
            true_num = float(true_answer.strip())
            guess_num = float(guess.strip().replace(",", ""))
            scores.append(3.5 if guess_num == true_num else -1.5)
        except ValueError:
            scores.append(0)
    return scores
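The number-extraction pattern can be tried on its own. In this sketch the opening tag is written out literally and `extract_number` is a hypothetical helper that combines the regex with the same cleanup steps (strip, drop commas, convert to float).

```python
import re

match_numbers = re.compile(
    r"<SOLUTION>" + r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags=re.MULTILINE | re.DOTALL,
)

def extract_number(response):
    """Return the first number after <SOLUTION> as a float, or None."""
    found = match_numbers.search(response)
    if found is None:
        return None
    try:
        return float(found.group(1).strip().replace(",", ""))
    except ValueError:  # e.g. a stray "." that is not a valid number
        return None

print(extract_number("<SOLUTION>1,234</SOLUTION>"))
print(extract_number("<SOLUTION>The result is -3.5</SOLUTION>"))
print(extract_number("no tags here, just 42"))
```

Unlike match_format, this pattern tolerates extra words inside the solution tags, which is why it can still reward a correct number when the exact string comparison fails.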

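By combining these reward functions, we get a comprehensive scoring system that encourages the model to produce responses that are both well structured and accurate.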
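Since the final reward is the sum of the individual scores, it can help to add them up by hand for a best case and a worst case. The sketch below uses the values the four functions would return for a perfectly formatted correct answer and for a bare "42" with no tags; `total_rewards` is a hypothetical helper, not part of TRL.

```python
def total_rewards(per_function_scores):
    """Sum scores column-wise: one total per completion."""
    return [sum(scores) for scores in zip(*per_function_scores)]

# One row per reward function, one column per completion
per_function_scores = [
    [3.0, 0.0],    # match_format_exactly
    [1.5, -3.0],   # match_format_approximately
    [5.0, -2.0],   # check_answer
    [3.5, -2.5],   # check_numbers
]
totals = total_rewards(per_function_scores)
print(totals)
```

The gap between the two totals is what GRPO turns into an advantage once the rewards are normalized within the group.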

6. Configuring and launching the GRPO trainer

Now we configure GRPOTrainer with our model, tokenizer, reward functions, and training arguments.

from trl import GRPOConfig, GRPOTrainer
from vllm import SamplingParams

# Define sampling parameters for vLLM
vllm_sampling_params = SamplingParams(
    min_p = 0.1,
    top_p = 1.0,
    top_k = -1,
    seed = 3407,
    stop = [tokenizer.eos_token],
    include_stop_str_in_output = True,
)

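# Token budgets. These two names are used below but never defined in this
# article; the values here are assumptions, so adjust them to your data.
max_prompt_length = 512
max_completion_length = max_seq_length - max_prompt_length

# Configure GRPO training
training_args = GRPOConfig(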
    vllm_sampling_params = vllm_sampling_params,
    temperature = 1.0,
    learning_rate = 5e-6,
    weight_decay = 0.01,
    warmup_ratio = 0.1,
    lr_scheduler_type = "linear",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase for smoother training
    num_generations = 4, # Decrease if out of memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    max_steps = 100, # Set to a higher number for a full run
    save_steps = 100,
    report_to = "none", # Can use "wandb"
    output_dir = "outputs",
)
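The configuration above takes max_prompt_length and max_completion_length. One way to choose them, sketched here as an assumption rather than the article's method, is to size the prompt budget so it covers most prompts and leave the rest of the sequence for the completion. The token counts below are made-up numbers, and `length_budget` is a hypothetical helper.

```python
def length_budget(prompt_token_counts, max_seq_length, quantile=0.9):
    """Cover roughly `quantile` of prompts; the remaining tokens go to the completion."""
    ordered = sorted(prompt_token_counts)
    index = int(quantile * (len(ordered) - 1))
    max_prompt_length = ordered[index] + 1
    max_completion_length = max_seq_length - max_prompt_length
    return max_prompt_length, max_completion_length

# Made-up prompt lengths in tokens, with one long outlier
counts = [80, 95, 110, 120, 150, 160, 175, 190, 201, 640]
budget = length_budget(counts, max_seq_length=2048)
print(budget)
```

Ignoring the longest prompts keeps a single outlier from eating into the space the model has for its reasoning; prompts over the budget would then need to be filtered out or truncated.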

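Key GRPOConfig parameters

num_generations: this is the "G" in GRPO, the size of the group of responses generated for each prompt. A larger group gives better statistics for the advantage calculation but uses more memory and compute. A value of 4 to 8 is a good starting point.
temperature: a higher temperature (such as 1.0) encourages the model to generate more diverse and creative responses for the group. This diversity is essential for exploration because it lets the model try different reasoning paths. A low temperature makes all the responses in a group too similar, which hinders learning.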
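The effect of a group with no diversity can be shown directly with the advantage formula. In this sketch, `group_advantages` is an illustrative helper that uses the population standard deviation plus a small `eps`; both choices are assumptions of the sketch, not the trainer's exact code.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group; eps avoids division by zero."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four near-identical responses that all earn the same reward
identical = group_advantages([3.0, 3.0, 3.0, 3.0])
# Four varied responses with different rewards
diverse = group_advantages([4.5, 3.0, -2.0, 3.0])
print(identical)
print(diverse)
```

When every response gets the same reward, every advantage is zero and the prompt contributes no learning signal, which is why sampling diversity within the group matters.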

# Initialize the trainer
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        match_format_exactly,
        match_format_approximately,
        check_answer,
        check_numbers,
    ],
    args = training_args,
    train_dataset = dataset,
)

# Start training!
trainer.train()

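GRPOTrainer fine-tunes the model using this configuration. The thing to watch during training is the reward column in the training logs, which should increase over time.

7. Inference with the fine-tuned model

After training, we can test our fine-tuned model. First we save the trained LoRA adapter, then we load it at inference time.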

# Save the LoRA adapter
model.save_lora("grpo_saved_lora")

# Prepare messages for inference
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "What is the sqrt of 101?"},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False,
)

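# Sampling settings for inference. `sampling_params` is not defined earlier
# in this article; the values here are assumptions, so tune them as needed.
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 2048,
)

# Generate text using the fine-tuned LoRA
output = model.fast_generate(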
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

print(output)

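The lora_request argument tells the model to use our fine-tuned LoRA adapter for this generation. The output should now follow the reasoning format we defined.

8. Saving and sharing your model

Finally, Unsloth provides convenient methods for saving your fine-tuned model in several formats.

Save as a merged model (float16 or 4-bit):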

# This combines the base model with the LoRA adapter into a single model.


# Merge to 16-bit
model.save_pretrained_merged(
    "model", 
    tokenizer, 
    save_method = "merged_16bit"
)

# Merge to 4-bit
model.save_pretrained_merged(
    "model", 
    tokenizer, 
    save_method = "merged_4bit"
)

Save only the LoRA adapter

# Save just the LoRA adapter weights, without merging them into the base model
model.save_pretrained_merged(
    "model", 
    tokenizer, 
    save_method = "lora"
)

Convert to GGUF for llama.cpp

# Save to 8-bit Q8_0 GGUF
model.save_pretrained_gguf("model", tokenizer)

# Save to q4_k_m GGUF
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

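Closing thoughts

That's it! You now know the basics of post-training. You've seen how a strong base model, a clever reinforcement learning technique, and an efficient library can be combined to build specialized models that are good at complex tasks.

Key takeaways

GRPO is an effective way to teach a model to reason, by rewarding responses that are correct and well structured.
Unsloth makes fine-tuning more accessible by increasing speed and reducing memory usage.
A well-defined reward system is the key to success with GRPO.
Pre-fine-tuning on the format simplifies the main RL training stage.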
