Fine-Tuning a Vision Language Model with TRL and MPO

Authored by: Sergio Paniego

In this tutorial, we will demonstrate how to fine-tune a Vision Language Model (VLM) using Mixed Preference Optimization (MPO) from the Transformer Reinforcement Learning (TRL) library.

MPO is a training approach that combines multiple optimization objectives, introduced in the paper Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization. It is implemented as part of the Direct Preference Optimization (DPO) trainer and works by combining several loss functions with different weights, enabling more sophisticated optimization strategies.

We will fine-tune Qwen/Qwen2.5-VL-3B-Instruct, a small but highly capable VLM, on a preference dataset to help align the model with the desired outputs. Check out this blog post to learn more about preference optimization for vision language models.

The dataset we will use is HuggingFaceH4/rlaif-v_formatted, a specially formatted version of the RLAIF-V dataset. It contains prompt + image pairs together with a chosen and a rejected response for each sample. The end goal of the fine-tuning process is to train a model that consistently prefers the chosen answers over the rejected ones, thereby reducing hallucinations. To achieve this, several loss functions are combined.
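For orientation, each formatted sample follows roughly the layout sketched below (the values are invented and purely illustrative; only the structure matters, and we will inspect a real sample later with train_dataset[5]):

# Illustrative sample layout (made-up values; only the structure matters)
sample = {
    "images": ["<PIL.Image.Image>"],  # image(s) referenced by the prompt
    "prompt": [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What can you see in the image?"}]},
    ],
    "chosen": [{"role": "assistant", "content": [{"type": "text", "text": "A grounded, preferred description."}]}],
    "rejected": [{"role": "assistant", "content": [{"type": "text", "text": "A description containing hallucinated details."}]}],
}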

Figure: overview of fine-tuning a VLM with MPO (fine_tuning_vlm_mpo.png)

1. Install Dependencies

Let's start by installing the required dependencies.
We will install trl from source, because at the time of writing the MPO trainer is not yet included in an official release.

!pip install -U -q git+https://github.com/huggingface/trl.git bitsandbytes qwen-vl-utils==0.0.8

We will authenticate with the Hugging Face Hub using our account in order to upload and save the fine-tuned model.
You can generate your access token here.

from huggingface_hub import notebook_login

notebook_login()

2. Load the Dataset

For this tutorial, we will use HuggingFaceH4/rlaif-v_formatted, a specially formatted version of the RLAIF-V dataset.

In the paper that introduces MPO, the authors also present OpenGVLab/MMPR, a large-scale multimodal preference dataset built by combining samples with and without clear ground-truth labels.

For our educational use case we will stick with HuggingFaceH4/rlaif-v_formatted; however, to best reproduce the paper's results, we recommend exploring MMPR.
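If you want to experiment with MMPR, a minimal sketch is shown below. It is purely illustrative: the configuration and split names are assumptions, and the dataset may require additional preparation, so check the dataset card first.

from datasets import load_dataset

# Hypothetical usage: adjust the config/split names to match the MMPR dataset card
mmpr = load_dataset("OpenGVLab/MMPR", split="train")
print(mmpr)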
In this example, we will use a subset of the dataset.

from datasets import load_dataset

dataset_id = "HuggingFaceH4/rlaif-v_formatted"
train_dataset, test_dataset = load_dataset(dataset_id, split=["train[:5%]", "test[:1%]"])

Let's run a quick check to make sure the images are in RGB format. If they are not, we will convert them accordingly.

from PIL import Image


def ensure_rgb(example):
    # Convert the image to RGB if it's not already
    image = example["images"][0]
    if isinstance(image, Image.Image):
        if image.mode != "RGB":
            image = image.convert("RGB")
        example["images"] = [image]
    return example


# Apply the transformation to the dataset (change num_proc depending on the available compute)
train_dataset = train_dataset.map(ensure_rgb, num_proc=8)
test_dataset = test_dataset.map(ensure_rgb, num_proc=8)

Let's inspect a sample to understand its structure.
As we can see, each sample contains chosen, rejected, images, and prompt fields.
Our goal is to fine-tune the model with MPO so that it prefers the chosen answers.

train_dataset[5]

Let's take a look at the image for that particular sample.

>>> train_dataset[5]["images"][0]

3. Fine-Tune the Model Using TRL and MPO

As mentioned earlier, we will leverage trl, since the library provides everything we need to train with MPO while abstracting away complexity we don't need to deal with in this particular case.

The MPO trainer accepts a list of loss_type values. The DPO trainer documentation here provides the complete list of available loss functions.
As noted above, MPO is a special case of the DPO trainer, so we can use it by specifying a list of loss types along with their corresponding weights.

In the figure below, you can see the improvements reported in the MPO paper for the InternVL2-8B model when trained with this strategy.

Figure: performance improvements of InternVL2-8B with MPO, as reported in the MPO paper.

3.1 Load the Quantized Model for Training

Let's load the model. In this example, we will use Qwen/Qwen2.5-VL-3B-Instruct, a compact yet highly capable vision language model (VLM).

In the original MPO paper, the authors released a collection of checkpoints fine-tuned with this technique for InternVL2.5, another high-performing VLM.

We chose Qwen2.5-VL-3B-Instruct because of its straightforward integration with the transformers library, even though InternVL2.5 is the model originally used in the paper.

Figure: Qwen2.5-VL architecture (qwen2.5vl_arc.jpeg)

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)
processor = AutoProcessor.from_pretrained(model_id, use_fast=True)

3.2 Set Up QLoRA

Now that we have loaded the model and processor, let's set up QLoRA and the DPOConfig, where we will define the list of losses and their corresponding weights.
These configurations enable efficient fine-tuning and optimization tailored to our training objective.

>>> from peft import LoraConfig, get_peft_model

>>> # Configure LoRA
>>> peft_config = LoraConfig(
...     r=8,
...     lora_alpha=8,
...     lora_dropout=0.1,
...     target_modules=["down_proj", "o_proj", "k_proj", "q_proj", "gate_proj", "up_proj", "v_proj"],
...     use_dora=True,
...     init_lora_weights="gaussian",
... )

>>> # Apply PEFT model adaptation
>>> peft_model = get_peft_model(model, peft_config)

>>> # Print trainable parameters
>>> peft_model.print_trainable_parameters()
trainable params: 19,868,416 || all params: 3,774,491,392 || trainable%: 0.5264

3.3 Configure MPO via DPOConfig

To configure MPO training with DPOConfig, simply provide a list of loss types via the loss_type parameter. This can be passed either as a Python list or as a comma-separated string. In addition, you can optionally specify a matching loss_weights list to control the relative importance of each loss during optimization. If omitted, all losses default to a weight of 1.0.

For example, following the setup described in the original MPO paper, you can define:

loss_type = ["sigmoid", "bco_pair", "sft"]

loss_weights = [0.8, 0.2, 1.0]

This corresponds to the following:

MPO is defined as a combination of the preference loss (L_p), the quality loss (L_q), and the generation loss (L_g).
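In other words, the combined objective is a weighted sum of the three terms, with the weights taken from loss_weights:

L_MPO = w_p · L_p + w_q · L_q + w_g · L_g,  with (w_p, w_q, w_g) = (0.8, 0.2, 1.0) in the setup above.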

The selected loss_type values:

  • "sigmoid":来自原始DPO论文的 Sigmoid 损失。
  • "bco_pair":来自BCO论文的成对 BCO 损失。
  • "sft":负对数似然损失(标准监督微调损失)。

For more details on each available loss type and how it affects training, refer to the official documentation.

All other configuration options follow the standard DPOConfig format and can be adjusted according to your available compute resources.

from trl import DPOConfig

training_args = DPOConfig(
    output_dir="Qwen2.5-VL-3B-Instruct-trl-mpo-rlaif-v",
    loss_type=["sigmoid", "bco_pair", "sft"],  # Loss types to combine, as used in the MPO paper
    loss_weights=[0.8, 0.2, 1.0],  # Corresponding weights, as used in the MPO paper
    bf16=False,
    gradient_checkpointing=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    dataset_num_proc=1,  # tokenization will use 1 process
    dataloader_num_workers=8,  # data loading will use 8 workers
    logging_steps=10,
    report_to="tensorboard",
    push_to_hub=True,
    save_strategy="steps",
    save_steps=10,
    save_total_limit=1,
    eval_steps=10,  # Steps interval for evaluation
    eval_strategy="steps",
)

As we can see, setting up MPO is very similar to DPO, requiring only two additional parameters in the DPOConfig. Finally, we can initialize the DPOTrainer and start training the model.

from trl import DPOTrainer

trainer = DPOTrainer(
    model=peft_model,
    ref_model=None,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    processing_class=processor,
)
trainer.train()
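Because push_to_hub=True is set, the trainer pushes the LoRA adapter to the Hub during training. If you also want the processor available in the same repository, a minimal sketch follows; the repo id is an assumption derived from output_dir and should be adjusted to your own username and repo name.

# Optional: push the processor to the Hub repo created by the trainer (hypothetical repo id)
processor.push_to_hub("Qwen2.5-VL-3B-Instruct-trl-mpo-rlaif-v")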

4. Test the Fine-Tuned Model

We have fine-tuned the model using MPO. Now, let's evaluate it on a sample to see how it performs in practice.

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

trained_model_id = "sergiopaniego/Qwen2.5-VL-3B-Instruct-trl-mpo-rlaif-v"
model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
trained_model = PeftModel.from_pretrained(base_model, trained_model_id).eval()

trained_processor = AutoProcessor.from_pretrained(model_id, use_fast=True)

test_dataset[0]

>>> test_dataset[0]["images"][0]

from qwen_vl_utils import process_vision_info


def generate_text_from_sample(model, processor, sample, max_new_tokens=1024, device="cuda"):
    # Disable gradient checkpointing and re-enable the KV cache for faster inference
    model.gradient_checkpointing_disable()
    model.config.use_cache = True

    # Attach the image to the prompt so the chat template and vision utilities can access it
    sample["prompt"][0]["content"][0]["image"] = sample["images"][0]

    # Prepare the text input by applying the chat template
    text_input = processor.apply_chat_template(sample["prompt"], add_generation_prompt=True)

    image_inputs, _ = process_vision_info(sample["prompt"])
    inputs = processor(
        text=[text_input],
        images=image_inputs,
        videos=None,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(device)

    # Inference: Generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    trimmed_generated_ids = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

    output_text = processor.batch_decode(
        trimmed_generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    return output_text[0]

We will generate outputs from both the pre-trained model and the fine-tuned model to highlight the differences between them.
An interesting extension is to compare the MPO outputs with those of the same model fine-tuned with DPO alone; we'll leave that experiment for you to explore!
As a starting point, a minimal sketch of the DPO-only configuration is shown below.
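The sketch assumes everything else stays exactly as in the MPO configuration above; the output_dir name is hypothetical, and the default "sigmoid" loss corresponds to standard DPO.

# DPO-only baseline: the single "sigmoid" loss is standard DPO (no bco_pair/sft terms)
dpo_only_args = DPOConfig(
    output_dir="Qwen2.5-VL-3B-Instruct-trl-dpo-rlaif-v",  # hypothetical repo name
    loss_type="sigmoid",
    # ... keep the remaining arguments identical to the MPO configuration above
)

Now let's generate and compare the outputs of the pre-trained and MPO fine-tuned models: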

>>> pretrained_output = generate_text_from_sample(model, processor, test_dataset[0])
>>> print("\n\n>>> Pretrained model output:\n\n")
>>> print(pretrained_output)
>>> trained_output = generate_text_from_sample(trained_model, trained_processor, test_dataset[0])
>>> print("\n\n>>> Fine tuned model output:\n\n")
>>> print(trained_output)
>>> Pretrained model output:


The image depicts a modern high-speed train at a station platform. The train has a sleek, aerodynamic design with a streamlined front and a yellow nose. The body of the train is primarily white, with red and blue accents along its side. The windows are rectangular and evenly spaced, providing a clear view of the interior.

The train is on a set of tracks that are elevated above the platform, which is indicated by the yellow safety line painted along the edge of the platform. The platform itself appears to be made of concrete and is equipped with a metal railing for safety. 

In the background, there are several elements that provide context to the setting. There are multiple power lines and poles running parallel to the tracks, suggesting that this is an electrified railway system. The sky is clear with a few scattered clouds, indicating fair weather conditions. Additionally, there are some greenery and possibly other structures or buildings visible in the distance, though they are not the main focus of the image.


>>> Fine tuned model output:


The image depicts a modern high-speed train, likely a bullet train, positioned on a railway track. The train has a sleek, aerodynamic design with a streamlined front and a predominantly white body. It features a distinctive color scheme with red and blue accents along its sides, which are characteristic of certain high-speed rail services.

Key features of the train include:

1. **Color Scheme**: The train is primarily white with red and blue accents. The red sections are located on the sides, while the blue sections are more prominent on the front and sides.
2. **Design**: The train has a futuristic design with a pointed nose and large windows, which are typical for high-speed trains to improve aerodynamics and visibility.
3. **Windows**: The train has multiple windows along its side, allowing passengers to see outside during travel.
4. **Front Window**: The front of the train has a large, transparent window that provides a clear view of the tracks ahead.
5. **Headlights**: The train has two headlights at the front, which are essential for visibility during nighttime or low-light conditions.
6. **Platform**: The train is stopped at a platform, indicating it is either arriving or departing from a station.
7. **Railway Track**: The train is on a standard gauge railway track, suggesting it is designed for use on conventional tracks rather than high-speed lines.
8. **Surroundings**: The background shows a clear sky with some clouds, and there are some buildings and structures visible, possibly part of a cityscape or urban area.

Overall, the image captures a modern, high-speed train in a stationary position, highlighting its design and color scheme, as well as its surroundings.

From the outputs, we can already observe a clear difference in the model's response style after training.
The MPO fine-tuning is now complete!
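If you later want a standalone checkpoint that does not depend on PEFT, a minimal sketch for merging the LoRA adapter into the base model is shown below; the local output directory is a placeholder of our choosing.

# Merge the LoRA adapter into the base weights and save a standalone checkpoint
merged_model = trained_model.merge_and_unload()
merged_model.save_pretrained("Qwen2.5-VL-3B-Instruct-mpo-merged")  # hypothetical output directory
trained_processor.save_pretrained("Qwen2.5-VL-3B-Instruct-mpo-merged")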

5. Continue Your Learning Journey 🧑‍🎓️

This is not the end of your learning journey! If you enjoyed this content and want to dive deeper into MPO, trl, or vision language models, check out the following resources:
