开源 AI 食谱文档

在消费级 GPU 上使用 TRL 和直接偏好优化 (DPO) 微调 SmolVLM

开源 AI 食谱

加入 Hugging Face 社区

并获得增强的文档体验

在模型、数据集和 Spaces 上进行协作

通过加速推理获得更快的示例

切换文档主题

开始使用

在消费级 GPU 上使用 TRL 和直接偏好优化 (DPO) 微调 SmolVLM

作者: Sergio Paniego

在本指南中，我们将指导您如何使用**Transformer 强化学习 (TRL)** 库，通过**直接偏好优化 (DPO)** 来微调一个**小巧的 🤏 视觉语言模型 (VLM)**，以展示即使在消费级 GPU 上，您也可以根据特定需求定制 VLM。

我们将使用**偏好数据集**来微调 SmolVLM，以帮助模型与期望的输出保持一致。SmolVLM 是一款性能高、内存效率高的模型，是完成此项任务的理想选择。如果您对语言或视觉语言模型的**偏好优化**还不熟悉，可以查看这篇博客进行深入了解。

我们将使用的数据集是 HuggingFaceH4/rlaif-v_formatted，其中包含成对的**`提示 + 图像`**，以及每对的**`选择`**和**`拒绝`**答案。此次微调过程的目标是使模型始终偏好数据集中的**选择答案**，从而减少幻觉。

本 Notebook 已在 **NVIDIA L4 GPU** 上测试通过。

1. 安装依赖

让我们先安装微调所需的基本库吧！🚀

!pip install  -U -q transformers trl datasets bitsandbytes peft accelerate
# Tested with transformers==4.46.3, trl==0.12.2, datasets==3.2.0, bitsandbytes==0.45.0, peft==0.14.0, accelerate==1.2.0

!pip install -q flash-attn --no-build-isolation

使用您的 Hugging Face 账户进行身份验证，以便直接从本 Notebook 保存和分享您的模型 🗝️。

from huggingface_hub import notebook_login

notebook_login()

2. 加载数据集 📁

我们将使用 HuggingFaceH4/rlaif-v_formatted 数据集，其中提供了成对的**`提示 + 图像`**，以及每对的**`选择`**和**`拒绝`**答案。这种结构化格式非常适合使用**直接偏好优化 (DPO)** 进行模型训练。

该数据集已经为此任务预先格式化。如果您使用自定义数据集，则需要将其预处理成相同的格式。

在此示例中，我们将使用数据集的一个子集来演示该过程。然而，在实际场景中，您应使用完整的数据集以获得更好的性能。

from datasets import load_dataset

dataset_id = "HuggingFaceH4/rlaif-v_formatted"
train_dataset, test_dataset = load_dataset(dataset_id, split=["train[:6%]", "test[:1%]"])

我们将确保所有图像都格式化为 RGB

from PIL import Image


def ensure_rgb(example):
    # Convert the image to RGB if it's not already
    image = example["images"][0]
    if isinstance(image, Image.Image):
        if image.mode != "RGB":
            image = image.convert("RGB")
        example["images"] = [image]
    return example


# Apply the transformation to the dataset
train_dataset = train_dataset.map(ensure_rgb, num_proc=32)
test_dataset = test_dataset.map(ensure_rgb, num_proc=32)

让我们浏览一个数据集中的示例，以便更好地了解其结构和我们正在处理的数据类型。

train_dataset[20]

>>> train_dataset[20]["images"][0]

3. 使用 TRL 微调模型

3.1 加载量化模型以进行训练 ⚙️

首先，让我们使用 bitsandbytes 加载 SmolVLM-Instruct 模型的量化版本，并加载处理器。我们将使用 SmolVLM-Instruct。

import torch
from transformers import Idefics3ForConditionalGeneration, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-Instruct"

from transformers import BitsAndBytesConfig

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    _attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id)

3.2 设置 QLoRA 和 DPOConfig 🚀

在这一步中，我们将为我们的训练设置配置 QLoRA。QLoRA 是一种强大的微调技术，旨在减少内存占用，使得即使在有限的硬件上也能高效地微调大型模型。

QLoRA 在传统的 **LoRA** (Low-Rank Adaptation) 基础上，引入了适配器权重的量化。这一增强显著降低了内存使用量并加快了训练速度，使其成为资源受限环境的理想选择。

>>> from peft import LoraConfig, get_peft_model

>>> # Configure LoRA
>>> peft_config = LoraConfig(
...     r=8,
...     lora_alpha=8,
...     lora_dropout=0.1,
...     target_modules=["down_proj", "o_proj", "k_proj", "q_proj", "gate_proj", "up_proj", "v_proj"],
...     use_dora=True,
...     init_lora_weights="gaussian",
... )

>>> # Apply PEFT model adaptation
>>> peft_model = get_peft_model(model, peft_config)

>>> # Print trainable parameters
>>> peft_model.print_trainable_parameters()

trainable params: 11,269,248 || all params: 2,257,542,128 || trainable%: 0.4992

接下来，我们将使用 `DPOConfig` 配置训练选项。

from trl import DPOConfig

training_args = DPOConfig(
    output_dir="smolvlm-instruct-trl-dpo-rlaif-v",
    bf16=True,
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,
    num_train_epochs=5,
    dataset_num_proc=8,  # tokenization will use 8 processes
    dataloader_num_workers=8,  # data loading will use 8 workers
    logging_steps=10,
    report_to="tensorboard",
    push_to_hub=True,
    save_strategy="steps",
    save_steps=10,
    save_total_limit=1,
    eval_steps=10,  # Steps interval for evaluation
    eval_strategy="steps",
)

我们将使用 TRL 库中的 DPOTrainer 类为**直接偏好优化 (DPO)** 定义训练参数。

**DPO** 使用带标签的偏好数据来引导模型生成符合偏好的响应。TRL 的 DPOTrainer 会在训练前**对数据集进行分词**并将其保存到磁盘。这个过程可能会消耗大量磁盘空间，具体取决于用于训练的数据量。请做好相应规划以避免存储空间不足。

这一步可能需要一些时间，所以请放松并享受这个过程！😄

from trl import DPOTrainer

trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    peft_config=peft_config,
    processing_class=processor,
)

开始训练模型！🎉

trainer.train()

让我们保存结果 💾

trainer.save_model(training_args.output_dir)

4. 测试微调后的模型 🔍

在微调完我们的视觉语言模型 (VLM) 后，是时候评估它的性能了！在本节中，我们将使用 HuggingFaceH4/rlaif-v_formatted 数据集中的示例来测试模型。让我们深入了解结果，评估模型与偏好响应的一致性如何！🚀

在开始之前，让我们清理一下 GPU 内存，以确保流畅和最佳的性能。🧹

>>> import gc
>>> import time


>>> def clear_memory():
...     # Delete variables if they exist in the current global scope
...     if "inputs" in globals():
...         del globals()["inputs"]
...     if "model" in globals():
...         del globals()["model"]
...     if "processor" in globals():
...         del globals()["processor"]
...     if "trainer" in globals():
...         del globals()["trainer"]
...     if "peft_model" in globals():
...         del globals()["peft_model"]
...     if "bnb_config" in globals():
...         del globals()["bnb_config"]
...     time.sleep(2)

...     # Garbage collection and clearing CUDA memory
...     gc.collect()
...     time.sleep(2)
...     torch.cuda.empty_cache()
...     torch.cuda.synchronize()
...     time.sleep(2)
...     gc.collect()
...     time.sleep(2)

...     print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
...     print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")


>>> clear_memory()

GPU allocated memory: 1.64 GB
GPU reserved memory: 2.01 GB

我们将使用与之前相同的流程重新加载基础模型。

model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
)

processor = AutoProcessor.from_pretrained(model_id)

我们将训练好的适配器附加到预训练模型上。该适配器包含了训练期间进行的微调调整，使基础模型能够在保持其核心参数不变的情况下利用新知识。通过集成适配器，我们在不改变其原始结构的情况下增强了模型的能力。

adapter_path = "sergiopaniego/smolvlm-instruct-trl-dpo-rlaif-v"
model.load_adapter(adapter_path)

让我们在一个未见过的样本上评估模型。

test_dataset[20]

>>> test_dataset[20]["images"][0]

让我们创建一个通用函数，可以用不同的样本调用，以简化测试过程。这个函数将使我们能够高效地评估模型在多个示例上的性能，而无需为每个示例重写代码。通过使用这个可重用的函数，我们可以快速评估模型在各种输入下的表现。

def generate_text_from_sample(model, processor, sample, max_new_tokens=1024, device="cuda"):
    # Prepare the text input by applying the chat template
    text_input = processor.apply_chat_template(sample["prompt"], add_generation_prompt=True)

    image_inputs = []
    image = sample["images"][0]
    if image.mode != "RGB":
        image = image.convert("RGB")
    image_inputs.append([image])

    # Prepare the inputs for the model
    model_inputs = processor(
        text=text_input,
        images=image_inputs,
        return_tensors="pt",
    ).to(
        device
    )  # Move inputs to the specified device

    # Generate text with the model
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Trim the generated ids to remove the input ids
    trimmed_generated_ids = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)]

    # Decode the output text
    output_text = processor.batch_decode(
        trimmed_generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    return output_text[0]  # Return the first decoded output text

现在，我们准备好调用函数并评估模型了！🚀

output = generate_text_from_sample(model, processor, test_dataset[20])
output

该模型现在能够根据提供的图像和提示生成响应。对于这样的任务，将您的模型的性能与基准进行比较是很有用的，以了解它改进了多少，以及它与其他选项的对比情况。有关此比较的更多信息和详细信息，请查看这篇文章。

💻 我开发了一个用于测试模型的示例应用程序，您可以在这里找到它。

由于这里我们只用数据集的一个子集进行了一个示例训练，对于 Space，我使用了官方的 Hugging Face DPO 微调模型。您可以轻松地将其与另一个展示预训练模型的 Space 进行比较，该 Space 可在这里找到。

from IPython.display import IFrame

IFrame(src="https://sergiopaniego-smolvlm-trl-dpo-rlaif-v.hf.space", width=1000, height=800)

5. 继续学习之旅 🧑‍🎓️

通过这些资源扩展您对视觉语言模型及相关工具的知识。

Cookbook 中的多模态指南： 发现多模态模型的实用指南，包括检索增强生成 (RAG) 流程和微调。我们已经发布了一篇关于使用 SFT 和 TRL 微调 smol VLM 的指南，它与本指南完美互补——请查阅以获取更多细节。
TRL 社区教程： 探索丰富的教程集，深入了解 TRL 的复杂性及其在实际应用中的使用。

您也可以重新访问使用 Hugging Face 生态系统 (TRL) 微调视觉语言模型 (Qwen2-VL-7B) 中的“继续学习之旅”部分。

这些资源将帮助您深化在多模态学习领域的知识和专业技能。

< > 在 GitHub 上更新

←Smol 多模态 RAG，在 Colab 免费版 GPU 上使用 ColSmolVLM 和 SmolVLM 进行构建使用视觉语言模型从图像或文档中进行结构化生成→