Post-training a VLM for reasoning with GRPO using TRL

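Authored by: Sergio Paniego

🚨 WARNING: This notebook is resource-intensive and requires substantial compute power. If you run it in Colab, it will use an A100 GPU.

In this recipe, we demonstrate how to post-train a Vision Language Model (VLM) with GRPO to add reasoning capabilities, using the Hugging Face ecosystem, specifically the Transformer Reinforcement Learning library (trl).

We'll fine-tune Qwen2.5-VL-3B-Instruct on a subset of the lmms-lab/multimodal-open-r1-8k-verified dataset. The dataset pairs images and problem descriptions with their solutions and the thinking process that leads to each solution. We'll use this data format, together with the GRPO reward functions, to teach the model how to reason its way to a solution.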

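1. Install dependencies

Let's start by installing the essential libraries for fine-tuning. We install trl from source because, at the time of writing, the VLM GRPO trainer had not yet been included in an official release.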

!pip install -U -q git+https://github.com/huggingface/trl.git peft math_verify qwen-vl-utils[decord]

Authenticate with your Hugging Face 🤗 account so you can save and share the trained model.

from huggingface_hub import login

login()

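2. Load dataset 📁

In this recipe, we use lmms-lab/multimodal-open-r1-8k-verified. The dataset contains 8k multimodal RL training examples focused on mathematical reasoning. The data was created with GPT-4o, and each example includes an image, problem, solution, original question, and original answer. It was created as part of this project.

For our case, where we want the model to learn to reason over images, we use image and problem as the input and solution as the output.

To keep this tutorial fast, we'll use only 5% of the dataset and split it into train and test sets. A real training run would use the full dataset.

Let's load and split the dataset.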

from datasets import load_dataset

dataset_id = "lmms-lab/multimodal-open-r1-8k-verified"
dataset = load_dataset(dataset_id, split="train[:5%]")

split_dataset = dataset.train_test_split(test_size=0.2, seed=42)

train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

Let's check the structure of the dataset.

>>> print(train_dataset)

Dataset({
    features: ['image', 'problem', 'solution', 'original_question', 'original_answer'],
    num_rows: 307
})
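
The 307 training rows are consistent with a 384-row slice split 80/20. As a rough check, the sketch below assumes that train_test_split rounds the test size up (n_test = ceil(test_size * n)) and gives the remainder to the training set. The 384-row slice size is inferred from the printed output, not stated in this recipe, and split_sizes is a hypothetical helper.

```python
import math


def split_sizes(n_rows: int, test_size: float = 0.2) -> tuple[int, int]:
    """Return (n_train, n_test) under the assumed ceil-based convention."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


# Search for slice sizes that would leave exactly 307 training rows.
candidates = [n for n in range(300, 500) if split_sizes(n)[0] == 307]
print(candidates)
print(split_sizes(384))
```

Under that assumption, the held-out test split would contain 77 examples.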

Let's inspect a sample.

print(train_dataset[0])

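In addition to the problem and image columns, we include a custom system prompt that tells the model how we want it to generate its output.

The system prompt is taken from DeepSeek R1. For more details, refer to this previous recipe.

We convert each dataset sample into a conversation containing the system prompt, one image, and the problem description, since this is the format the GRPO trainer expects.

We also set padding_side="left" so that completions generated during training follow directly after the prompt tokens. GRPO relies on this to compute token-level log-probabilities over each sampled completion correctly.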
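
To see why left padding matters for generation, the sketch below pads two prompts of different lengths on each side. It uses plain Python lists with made-up token ids and a hypothetical pad id of 0; it illustrates the idea and is not the processor's actual implementation.

```python
PAD = 0  # hypothetical pad token id, for illustration only


def pad_batch(sequences, side):
    """Pad every sequence to the longest length on the given side."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [PAD] * (width - len(seq))
        padded.append(fill + seq if side == "left" else seq + fill)
    return padded


prompts = [[11, 12, 13, 14], [21, 22]]  # made-up token ids

left = pad_batch(prompts, "left")
right = pad_batch(prompts, "right")

# With left padding every row ends with a real prompt token, so tokens
# generated next directly follow the prompt.
print(left)   # [[11, 12, 13, 14], [0, 0, 21, 22]]
# With right padding the shorter prompt ends in pad tokens, leaving a gap
# between the prompt and anything appended after it.
print(right)  # [[11, 12, 13, 14], [21, 22, 0, 0]]
```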

from transformers import AutoProcessor

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
processor = AutoProcessor.from_pretrained(model_id, use_fast=True, padding_side="left")

SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)


def make_conversation(example):
    conversation = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": example["problem"]},
            ],
        },
    ]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    return {
        "prompt": prompt,
        "image": example["image"],
    }


train_dataset = train_dataset.map(make_conversation)
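
apply_chat_template turns each conversation into a single ChatML-style string. The function below is a simplified, hypothetical re-implementation that covers only this one conversation shape (a system message plus a user message with an image and text). The real template ships with the processor and handles many more cases, so treat this purely as a reading aid.

```python
IMAGE_PLACEHOLDER = "<|vision_start|><|image_pad|><|vision_end|>"


def render_chatml(conversation):
    """Render a conversation in a ChatML-like layout (simplified sketch)."""
    parts = []
    for message in conversation:
        content = message["content"]
        if isinstance(content, list):
            # Multimodal content: image entries become a placeholder, text is kept as-is.
            content = "".join(
                IMAGE_PLACEHOLDER if item["type"] == "image" else item["text"]
                for item in content
            )
        parts.append(f"<|im_start|>{message['role']}\n{content}<|im_end|>\n")
    # Similar in spirit to add_generation_prompt=True: open the assistant turn.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is shown?"}]},
]
print(render_chatml(example))
```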

Let's look at a converted example.

>>> print(train_dataset[0]["prompt"])

<|im_start|>system
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer><|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Based on the image, determine the constant term after combining all the polynomial expressions representing the side lengths of the triangle. Choose the correct answer from the options provided.

Choices:
A. 3
B. 5
C. 8
D. 13<|im_end|>
<|im_start|>assistant

We'll remove the columns that we don't need for training.

train_dataset = train_dataset.remove_columns(["problem", "original_question", "original_answer"])

We can check that the columns are now gone.

>>> print(train_dataset)

Dataset({
    features: ['image', 'solution', 'prompt'],
    num_rows: 307
})

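3. Post-training the VLM using GRPO

The diagram accompanying this section highlights the main differences between PPO (Proximal Policy Optimization) and GRPO (Group Relative Policy Optimization), most notably the removal of the value model in GRPO. For more detail on the key differences, you can refer to this further explanation.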
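
Without a value model, GRPO derives each completion's advantage from the rewards of the other completions sampled for the same prompt. A minimal sketch of that group-relative normalization follows. The population standard deviation and the epsilon value are assumptions made for illustration, and trl's implementation may differ in such details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    baseline = mean(rewards)   # the group mean stands in for the value model
    spread = pstdev(rewards)   # population std; an assumption for this sketch
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four hypothetical completions for one prompt: two rewarded, two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# If every completion earns the same reward, there is no learning signal.
print(group_relative_advantages([1.0, 1.0]))
```

Completions that score above the group mean get a positive advantage and the rest a negative one; a group with identical rewards contributes nothing.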

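To implement the training pipeline, we use trl, Hugging Face's reinforcement learning library, which provides a streamlined interface and built-in support for key training algorithms. Here we use its GRPOConfig and GRPOTrainer classes. A key step in the process is defining custom reward functions, which steer the model's behavior and align it with our specific objectives.

But first, let's load the model. We use Qwen/Qwen2.5-VL-3B-Instruct, a capable VLM developed by Qwen. For better results, consider a model with more parameters.

Other examples of VLM projects that include reasoning capabilities are:

3.1 Loading the baseline model

Let's first load the baseline model. As noted above, it is Qwen/Qwen2.5-VL-3B-Instruct.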

import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

3.2 Configuring LoRA

We'll train the model with LoRA, so let's configure it first.

>>> from peft import LoraConfig, get_peft_model

>>> lora_config = LoraConfig(
...     task_type="CAUSAL_LM",
...     r=8,
...     lora_alpha=32,
...     lora_dropout=0.1,
...     target_modules=["q_proj", "v_proj"],
... )

>>> model = get_peft_model(model, lora_config)

>>> model.print_trainable_parameters()

trainable params: 1,843,200 || all params: 3,756,466,176 || trainable%: 0.0491
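
The trainable-parameter count can be reproduced by hand. A LoRA adapter on a linear layer of shape (d_in, d_out) adds r * (d_in + d_out) parameters. The dimensions below (36 decoder layers, hidden size 2048, a 256-dimensional v_proj output) are assumptions about the model's language-model configuration, not values given in this recipe. The sketch also assumes that only the language model contains modules named q_proj and v_proj.

```python
r = 8  # LoRA rank, from lora_config

# Assumed language-model dimensions (check model.config before relying on them).
num_layers = 36
hidden_size = 2048
q_out = 2048   # q_proj: hidden_size -> hidden_size
v_out = 256    # v_proj: hidden_size -> num_kv_heads * head_dim


def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = lora_params(hidden_size, q_out, r) + lora_params(hidden_size, v_out, r)
trainable = num_layers * per_layer
print(f"{trainable:,}")

all_params = 3_756_466_176  # total reported by print_trainable_parameters()
print(f"{100 * trainable / all_params:.4f}%")
```

Under these assumptions the result matches the printed figures: 1,843,200 trainable parameters, or about 0.0491% of the total.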

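3.3 Loading reward functions

For the reward component, we can use either pretrained reward models or reward functions defined directly in code. For training, the DeepSeek-R1 authors used an accuracy-based reward that evaluates whether the response is correct, together with a format-based reward that ensures the model places its reasoning process between <think> </think> tags. You can find more details here. We can define and implement these reward functions as plain Python functions.

Here we use the following reward functions, taken directly from the Open R1 implementation.

Format enforcement: ensures that the generation follows a specific format, using <think> </think> <answer> </answer> tags for the reasoning and the answer.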

import re


def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$"
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completions]
    rewards = [1.0 if match else 0.0 for match in matches]
    return rewards
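
A quick way to see what format_reward accepts is to run it on a few hand-written completions. The function is repeated here so the snippet runs on its own, and the sample strings are invented for illustration. Note that the pattern requires newlines around the tag contents, so a completion written on a single line, like the example layout in the system prompt, scores 0.0.

```python
import re


def format_reward(completions, **kwargs):
    """Same check as above: 1.0 if the completion matches the tag layout."""
    pattern = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$"
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completions]
    return [1.0 if match else 0.0 for match in matches]


samples = [
    "<think>\n2 + 3 = 5\n</think>\n<answer>\nB\n</answer>",  # expected layout
    "<think> 2 + 3 = 5 </think><answer> B </answer>",  # tags on one line
    "The answer is B.",  # no tags at all
]
print(format_reward(samples))  # [1.0, 0.0, 0.0]
```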

Solution accuracy: verifies whether the model's answer is correct by comparing it with the solution column in the dataset.

from math_verify import LatexExtractionConfig, parse, verify
from latex2sympy2_extended import NormalizationConfig
from typing import Optional


def accuracy_reward(completions: list[str], solution: list[str], **kwargs) -> list[Optional[float]]:
    """Reward function that checks if the completion matches the ground truth.
    - If both gold and prediction are parseable → use math verification.
    - If not parseable → compare as normalized text.
    """
    rewards = []

    for completion, sol in zip(completions, solution):
        try:
            gold_parsed = parse(sol, extraction_mode="first_match")
        except Exception as e:
            gold_parsed = []

        if len(gold_parsed) != 0:
            # Try parsing predicted answer too
            try:
                answer_parsed = parse(
                    completion,
                    extraction_config=[
                        LatexExtractionConfig(
                            normalization_config=NormalizationConfig(
                                nits=False,
                                malformed_operators=False,
                                basic_latex=True,
                                boxed="all",
                                units=True,
                            ),
                            boxed_match_priority=0,
                            try_extract_without_anchor=False,
                        )
                    ],
                    extraction_mode="first_match",
                )
                reward = float(verify(gold_parsed, answer_parsed))
            except Exception as e:
                print(f"verify failed: {e}, answer: {completion}, gold: {sol}")
                reward = None
        else:
            # fallback to text match
            reward = float(completion.strip().lower() == sol.strip().lower())

        rewards.append(reward)

    return rewards

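3.4 Configuring GRPO training parameters

Next, let's configure the training parameters for GRPO. We recommend experimenting with max_completion_length, num_generations, and max_prompt_length to find the best training combination.

The parameter values have been chosen to fit the hardware limits of a Google Colab session. To see the full potential for reward improvement, especially for the second objective (accuracy_reward), and to further improve the model's reasoning in real-world scenarios, a more ambitious setup is needed: a larger model, more generations per prompt, and a high-quality, diverse dataset.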

from trl import GRPOConfig

# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
    output_dir="Qwen2.5-VL-3B-Instruct-Thinking",
    learning_rate=1e-5,
    remove_unused_columns=False,  # to access the solution column in accuracy_reward
    num_train_epochs=1,
    bf16=True,
    # Parameters that control the data preprocessing
    per_device_train_batch_size=2,
    max_completion_length=1024,  # default: 256
    num_generations=2,  # default: 8
    max_prompt_length=2048,
    # Parameters related to reporting and saving
    report_to=["tensorboard"],
    logging_steps=10,
    push_to_hub=True,
    save_strategy="steps",
    save_steps=10,
)
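
The batch size and num_generations interact. The sketch below assumes, in line with trl's GRPO trainer at the time of writing, that the batch size counts completions rather than prompts and must therefore be divisible by num_generations. The single-device setting is also an assumption.

```python
per_device_train_batch_size = 2  # from training_args
num_generations = 2              # from training_args
num_devices = 1                  # assumption: one GPU, as in a Colab session
num_train_prompts = 307          # rows in train_dataset

completions_per_batch = per_device_train_batch_size * num_devices
# A batch has to hold whole groups of completions.
assert completions_per_batch % num_generations == 0

prompts_per_batch = completions_per_batch // num_generations
completions_per_epoch = num_train_prompts * num_generations
print(prompts_per_batch, completions_per_epoch)
```

Under these assumptions each batch covers a single prompt with two sampled completions, which is why raising num_generations also means raising the batch size.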

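3.5 Training the model 🏃

Now, let's configure the trainer and start training the model!

In addition to the model, the training arguments, and the dataset, we pass the trainer the two reward functions defined earlier.

The training procedure we reproduce is illustrated in a diagram taken from the Open-R1 project.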

from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    processing_class=processor,
    reward_funcs=[format_reward, accuracy_reward],
    args=training_args,
    train_dataset=train_dataset,
)

Time to train the model!

trainer.train()

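We can review the training metrics directly in TensorBoard on the model page (https://huggingface.co/sergiopaniego/Qwen2.5-VL-3B-Instruct-Thinking/tensorboard). The loss curve may look a bit odd, but the reward curves tell a clearer story: the model improves steadily, earning higher rewards as training progresses.

Now, let's save the results to our account 💾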

trainer.save_model(training_args.output_dir)
trainer.push_to_hub(dataset_name=dataset_id)

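4. Check the model performance

Now that the model is trained, we can check its performance for a qualitative evaluation.

We recommend restarting your session to free the resources used for training. After restarting, re-run the cells that define SYSTEM_PROMPT and test_dataset, since the code below uses both.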

trained_model_id = "sergiopaniego/Qwen2.5-VL-3B-Instruct-Thinking"

For the evaluation, we'll use the test split of the dataset. First, let's load the trained model and its processor.

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

trained_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    trained_model_id,
    torch_dtype="auto",
    device_map="auto",
)
trained_processor = AutoProcessor.from_pretrained(trained_model_id, use_fast=True, padding_side="left")

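We'll create a helper function to generate responses. It makes it easy to send a problem and an image and retrieve the model's response, which should include the reasoning trace and the final answer.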

import time
import torch
from qwen_vl_utils import process_vision_info


def generate_with_reasoning(problem, image):
    # Conversation setting for sending to the model
    conversation = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": problem},
            ],
        },
    ]
    prompt = trained_processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

    # Process images using the process_vision_info from qwen_vl_utils
    image_inputs, video_inputs = process_vision_info(conversation)

    inputs = trained_processor(
        text=[prompt],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(trained_model.device)

    # Generate text without gradients
    start_time = time.time()
    with torch.no_grad():
        output_ids = trained_model.generate(**inputs, max_new_tokens=500)
    end_time = time.time()

    # Decode and extract model response
    generated_text = trained_processor.decode(output_ids[0], skip_special_tokens=True)

    # Get inference time
    inference_duration = end_time - start_time

    # Get number of generated tokens
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output_ids.shape[1] - num_input_tokens

    return generated_text, inference_duration, num_generated_tokens

Let's try it out!

>>> generated_text, inference_duration, num_generated_tokens = generate_with_reasoning(
...     test_dataset[0]["problem"], test_dataset[0]["image"]
... )
>>> print(generated_text)

system
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>
user
Based on the image, determine the sine value of angle AOB if it measures 120 degrees. Choose the correct answer from the options provided.

Choices:
A. $\frac{\sqrt{3}}{2}$
B. $\frac{1}{2}$
C. $-\frac{\sqrt{3}}{2}$
D. $\sqrt{2}$
assistant
<think>
In a circle, the sine of an angle is equal to the ratio of the length of the side opposite the angle to the hypotenuse. In this case, since angle AOB is 120 degrees, we can use the properties of a 30-60-90 triangle to find the sine value. The sine of 120 degrees is equivalent to the sine of 60 degrees because 180 - 120 = 60. The sine of 60 degrees is $\frac{\sqrt{3}}{2}$. Therefore, the sine of angle AOB is $\frac{\sqrt{3}}{2}$.
</think>
<answer>
$\frac{\sqrt{3}}{2}$
</answer>

The answer follows the constraints we imposed during training through the reward functions: the model generates something like <think>reasoning</think><answer>solution</answer>. Let's check the actual solution to see whether the model is correct.

test_dataset[0]["solution"]
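
To compare the prediction with the reference programmatically, the tagged parts can be pulled out of the generated text with a regular expression. The helper below is a hypothetical convenience, not part of trl or the original notebook, and the sample string is invented. It takes the last match because the decoded text also contains the system prompt, which itself mentions the tags.

```python
import re


def extract_tag(text, tag):
    """Return the content of the last <tag>...</tag> pair in text, or None."""
    matches = re.findall(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return matches[-1].strip() if matches else None


sample = (
    "system\n... i.e., <think> reasoning process here </think><answer> answer here </answer>\n"
    "assistant\n<think>\nsin(120 deg) = sin(60 deg)\n</think>\n<answer>\nA\n</answer>"
)
print(extract_tag(sample, "think"))   # sin(120 deg) = sin(60 deg)
print(extract_tag(sample, "answer"))  # A
print(extract_tag("no tags here", "answer"))  # None
```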

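It seems the model has picked up some reasoning capability! Let's also check the inference time and the number of generated tokens for further insight into the model's behavior.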

>>> print(f"Inference time: {inference_duration:.2f} seconds")
>>> print(f"Generated tokens: {num_generated_tokens}")

Inference time: 11.03 seconds
Generated tokens: 163
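
These two numbers give a rough decoding throughput. The calculation below simply divides one by the other. It reflects a single run, so treat it as indicative only.

```python
inference_duration = 11.03   # seconds, from the output above
num_generated_tokens = 163   # from the output above

tokens_per_second = num_generated_tokens / inference_duration
print(f"Throughput: {tokens_per_second:.2f} tokens/second")  # Throughput: 14.78 tokens/second
```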

5. Continuing your learning journey 🧑‍🎓

The learning journey does not stop here!

If you're eager to learn more about GRPO, reasoning, or VLMs, we can recommend some materials.