Fine-Tuning a Vision Language Model (Qwen2-VL-7B) with the Hugging Face Ecosystem (TRL)

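Authored by: Sergio Paniego

🚨 WARNING: This notebook is resource-intensive and requires substantial computational power. If you run it in Colab, it will use an A100 GPU.

In this recipe, we'll demonstrate how to fine-tune a Vision Language Model (VLM) using the Hugging Face ecosystem, specifically the Transformer Reinforcement Learning library (TRL).

🌟 Model & Dataset Overview

We'll fine-tune the Qwen2-VL-7B model on the ChartQA dataset. This dataset pairs images of various chart types with question-answer pairs, which makes it a good fit for improving the model's visual question-answering capabilities.

📖 Additional Resources

If you're interested in more VLM applications, check out:

Multimodal Retrieval-Augmented Generation (RAG) Recipe: where I guide you through building a RAG system using Document Retrieval (ColPali) and Vision Language Models (VLMs).
Phil Schmid's tutorial: an excellent deep dive into fine-tuning multimodal LLMs with TRL.
Merve Noyan's smol-vision repository: a collection of engaging notebooks on cutting-edge vision and multimodal AI topics.

1. Install Dependencies

Let's start by installing the essential libraries we'll need for fine-tuning! 🚀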

!pip install  -U -q git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/trl.git datasets bitsandbytes peft qwen-vl-utils wandb accelerate
# Tested with transformers==4.53.0.dev0, trl==0.20.0.dev0, datasets==3.6.0, bitsandbytes==0.46.0, peft==0.15.2, qwen-vl-utils==0.0.11, wandb==0.20.1, accelerate==1.8.1

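We'll also need to install an earlier version of PyTorch, because the latest version has an issue that currently prevents this notebook from running correctly. You can learn more about the issue here, and consider updating to the latest version once it's resolved.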

!pip install -q torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 --extra-index-url https://download.pytorch.org/whl/cu121

You'll need to authenticate with your Hugging Face account to save and share your model directly from this notebook.

from huggingface_hub import notebook_login

notebook_login()

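2. Load Dataset 📁

In this section, we'll load the HuggingFaceM4/ChartQA dataset. It contains chart images along with related questions and answers, making it well suited for training on visual question-answering tasks.

Next, we'll define a system message for the VLM. We want the model to act as an expert in analyzing chart images and to give concise answers to questions about them.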

system_message = """You are a Vision Language Model specialized in interpreting visual data from chart images.
Your task is to analyze the provided chart image and respond to queries with concise answers, usually a single word, number, or short phrase.
The charts include a variety of types (e.g., line charts, bar charts) and contain colors, labels, and text.
Focus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary."""

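We'll format the dataset into a chat structure. Each conversation consists of the system message, followed by the image and the user's query, and finally the answer to the query.

💡 For more usage tips for this model, check out the Model Card.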

def format_data(sample):
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_message}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": sample["image"],
                },
                {
                    "type": "text",
                    "text": sample["query"],
                },
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": sample["label"][0]}],
        },
    ]
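As a quick sanity check, the sketch below applies the same formatting logic to a made-up sample. The sample dict, the placeholder string standing in for the PIL image, and the shortened system message are assumptions for illustration only; real ChartQA rows carry an actual image object.

```python
# Illustrative only: a hypothetical sample shaped like a ChartQA row.
system_message = "You are a Vision Language Model specialized in chart images."  # shortened stand-in


def format_data(sample):
    # Same layout as the function above: system turn, user turn (image + query), assistant turn
    return [
        {"role": "system", "content": [{"type": "text", "text": system_message}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": sample["image"]},
                {"type": "text", "text": sample["query"]},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": sample["label"][0]}]},
    ]


fake_sample = {
    "image": "<PIL.Image placeholder>",  # a real row holds a PIL image here
    "query": "What is the highest value in the chart?",
    "label": ["42"],  # labels are stored as a list; only the first entry is used
}

conversation = format_data(fake_sample)
roles = [turn["role"] for turn in conversation]
print(roles)
print(conversation[2]["content"][0]["text"])
```

Each formatted sample is therefore a three-element list, which is why later cells index into it by position.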

For educational purposes, we'll load only 10% of each split in the dataset. In a real-world use case, you would typically load all the samples.

from datasets import load_dataset

dataset_id = "HuggingFaceM4/ChartQA"
train_dataset, eval_dataset, test_dataset = load_dataset(dataset_id, split=["train[:10%]", "val[:10%]", "test[:10%]"])

Let's take a look at the structure of the dataset. It includes an image, a query, a label (the answer), and a fourth feature that we'll discard.

train_dataset

Now, let's format the data using the chat structure. This sets up the conversations in the form the model expects.

train_dataset = [format_data(sample) for sample in train_dataset]
eval_dataset = [format_data(sample) for sample in eval_dataset]
test_dataset = [format_data(sample) for sample in test_dataset]

train_dataset[200]

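3. Load Model and Check Performance! 🤔

Now that we've loaded the dataset, let's load the model and evaluate its performance on a sample from the dataset. We'll use Qwen/Qwen2-VL-7B-Instruct, a Vision Language Model (VLM) capable of understanding both visual data and text.

If you're exploring alternatives, consider these open-source options:

Meta AI's Llama-3.2-11B-Vision
Mistral AI's Pixtral-12B
Allen AI's Molmo-7B-D-0924

You can also check leaderboards such as the WildVision Arena or the OpenVLM Leaderboard to find the best-performing VLMs.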

[Figure: Qwen2-VL architecture]

import torch
from transformers import Qwen2VLForConditionalGeneration, Qwen2VLProcessor

model_id = "Qwen/Qwen2-VL-7B-Instruct"

Next, we'll load the model and the processor to prepare for inference.

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = Qwen2VLProcessor.from_pretrained(model_id)

To evaluate the model's performance, we'll use a sample from the dataset. First, let's look at the internal structure of this sample.

train_dataset[0]

We'll use the sample without the system message to assess the VLM's raw understanding. Here's the input we'll use:

train_dataset[0][1:2]
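The `[1:2]` slice keeps only the user turn, so the model sees neither the system message nor the reference answer. A minimal sketch with a hypothetical conversation (the turn contents are placeholders, not dataset values):

```python
# Hypothetical three-turn conversation in the same layout produced by format_data
conversation = [
    {"role": "system", "content": [{"type": "text", "text": "system prompt"}]},
    {"role": "user", "content": [{"type": "text", "text": "question"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "answer"}]},
]

user_only = conversation[1:2]   # user turn only: no system message, no reference answer
with_system = conversation[:2]  # system + user turns: still no reference answer

user_only_roles = [turn["role"] for turn in user_only]
with_system_roles = [turn["role"] for turn in with_system]
print(user_only_roles)    # ['user']
print(with_system_roles)  # ['system', 'user']
```

The `[:2]` variant is the one used in section 6 of this recipe, where the system message is included in the prompt.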

Now, let's look at the chart that corresponds to the sample. Can you answer the query based on the visual information?

>>> train_dataset[0][1]["content"][0]["image"]

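Let's create a function that takes the model, processor, and sample as inputs and generates the model's answer. This streamlines inference and makes it easy to evaluate the VLM's performance.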

from qwen_vl_utils import process_vision_info


def generate_text_from_sample(model, processor, sample, max_new_tokens=1024, device="cuda"):
    # Prepare the text input by applying the chat template
    text_input = processor.apply_chat_template(
        sample[1:2], tokenize=False, add_generation_prompt=True  # Use the sample without the system message
    )

    # Process the visual input from the sample
    image_inputs, _ = process_vision_info(sample)

    # Prepare the inputs for the model
    model_inputs = processor(
        text=[text_input],
        images=image_inputs,
        return_tensors="pt",
    ).to(
        device
    )  # Move inputs to the specified device

    # Generate text with the model
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Trim the generated ids to remove the input ids
    trimmed_generated_ids = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)]

    # Decode the output text
    output_text = processor.batch_decode(
        trimmed_generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    return output_text[0]  # Return the first decoded output text

# Example of how to call the method with sample:
output = generate_text_from_sample(model, processor, train_dataset[0])
output
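The trimming step in `generate_text_from_sample` exists because `generate` returns the prompt tokens followed by the newly generated tokens. A minimal sketch of that step, using plain lists with made-up token IDs in place of tensors:

```python
# Token IDs below are made up; plain lists stand in for tensors.
prompt_ids = [[101, 7, 8, 9], [101, 5, 6, 4]]             # one row per prompt in the batch
generated = [[101, 7, 8, 9, 30, 31], [101, 5, 6, 4, 40]]  # each row is prompt + new tokens

# Drop the first len(prompt) tokens from each output row, as in generate_text_from_sample
new_tokens = [out[len(inp):] for inp, out in zip(prompt_ids, generated)]
print(new_tokens)  # [[30, 31], [40]]
```

Only the trimmed rows are passed to `batch_decode`, so the decoded string contains the answer without the echoed prompt.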

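While the model successfully retrieves the correct visual information, it struggles to answer the question accurately. This suggests that fine-tuning may be the key to improving its performance. Let's proceed with the fine-tuning process!

Remove Model and Clean GPU

Before training the model in the next section, let's clear the current variables and clean the GPU to free up resources.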

import gc
import time


def clear_memory():
    # Delete variables if they exist in the current global scope
    if "inputs" in globals():
        del globals()["inputs"]
    if "model" in globals():
        del globals()["model"]
    if "processor" in globals():
        del globals()["processor"]
    if "trainer" in globals():
        del globals()["trainer"]
    if "peft_model" in globals():
        del globals()["peft_model"]
    if "bnb_config" in globals():
        del globals()["bnb_config"]
    time.sleep(2)

    # Garbage collection and clearing CUDA memory
    gc.collect()
    time.sleep(2)
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    time.sleep(2)
    gc.collect()
    time.sleep(2)

    print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")


clear_memory()

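4. Fine-Tune the Model Using TRL

4.1 Load the Quantized Model for Training ⚙️

Next, we'll load the quantized model using bitsandbytes. If you want to learn more about quantization, check out this blog post or this one.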

from transformers import BitsAndBytesConfig

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=bnb_config
)
processor = Qwen2VLProcessor.from_pretrained(model_id)
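To get a feel for why 4-bit loading matters, here is a back-of-the-envelope estimate of the memory taken by the weights alone. It is a rough sketch: it uses the total parameter count printed in section 4.2, and it ignores quantization constants, any layers kept in higher precision, activations, gradients, and optimizer state.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


total_params = 8_293_898_752  # "all params" figure printed in section 4.2

bf16_gib = weight_memory_gib(total_params, 16)  # 2 bytes per parameter
nf4_gib = weight_memory_gib(total_params, 4)    # 0.5 bytes per parameter

print(f"bf16 weights:  ~{bf16_gib:.1f} GiB")
print(f"4-bit weights: ~{nf4_gib:.1f} GiB")
```

Actual usage will be higher than the 4-bit figure for the reasons listed above, so treat these numbers as a lower bound rather than a measurement.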

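4.2 Set Up QLoRA and SFTConfig 🚀

Next, we'll configure QLoRA for our training setup. QLoRA enables efficient fine-tuning of large language models while significantly reducing the memory footprint compared to full fine-tuning. Standard LoRA reduces memory by training small low-rank adapter matrices on top of frozen weights. QLoRA goes a step further by keeping the frozen base model weights in 4-bit quantized form (as loaded above with bitsandbytes), while the LoRA adapters themselves are trained in higher precision. This lowers memory requirements considerably, usually with little loss in quality.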

>>> from peft import LoraConfig, get_peft_model

>>> # Configure LoRA
>>> peft_config = LoraConfig(
...     lora_alpha=16,
...     lora_dropout=0.05,
...     r=8,
...     bias="none",
...     target_modules=["q_proj", "v_proj"],
...     task_type="CAUSAL_LM",
... )

>>> # Apply PEFT model adaptation
>>> peft_model = get_peft_model(model, peft_config)

>>> # Print trainable parameters
>>> peft_model.print_trainable_parameters()

trainable params: 2,523,136 || all params: 8,293,898,752 || trainable%: 0.0304
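The trainable parameter count can be reproduced by hand. The sketch below assumes the language model has a hidden size of 3584, 28 decoder layers, and 4 key/value heads of dimension 128, and that only the language model contains modules named `q_proj` and `v_proj`. None of these values are stated in this recipe, so treat them as assumptions; under them, the arithmetic matches the printed output.

```python
# Assumed Qwen2-VL-7B language-model dimensions (not stated in this recipe)
hidden_size = 3584
num_layers = 28
num_kv_heads = 4
head_dim = 128
r = 8  # LoRA rank from peft_config


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds A (rank x in_features) and B (out_features x rank) per adapted linear layer
    return rank * (in_features + out_features)


q_proj = lora_params(hidden_size, hidden_size, r)              # q_proj: hidden -> hidden
v_proj = lora_params(hidden_size, num_kv_heads * head_dim, r)  # v_proj: hidden -> kv_heads * head_dim

trainable = num_layers * (q_proj + v_proj)
all_params = 8_293_898_752  # "all params" from the output above
trainable_pct = 100 * trainable / all_params

print(f"{trainable:,}")          # 2,523,136
print(f"{trainable_pct:.4f}%")   # 0.0304%
```

Adding more entries to `target_modules` or raising `r` increases this count proportionally.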

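We'll use Supervised Fine-Tuning (SFT) to improve the model's performance on the task at hand. To do so, we'll define the training arguments with the SFTConfig class from the TRL library. SFT lets us provide labeled data so the model learns to generate more accurate responses for the inputs it receives. This tailors the model to our use case, improving how it understands and answers visual queries.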

from trl import SFTConfig

# Configure training arguments
training_args = SFTConfig(
    output_dir="qwen2-7b-instruct-trl-sft-ChartQA",  # Directory to save the model
    num_train_epochs=3,  # Number of training epochs
    per_device_train_batch_size=4,  # Batch size for training
    per_device_eval_batch_size=4,  # Batch size for evaluation
    gradient_accumulation_steps=8,  # Steps to accumulate gradients
    gradient_checkpointing=True,  # Enable gradient checkpointing for memory efficiency
    # Optimizer and scheduler settings
    optim="adamw_torch_fused",  # Optimizer type
    learning_rate=2e-4,  # Learning rate for training
    lr_scheduler_type="constant",  # Type of learning rate scheduler
    # Logging and evaluation
    logging_steps=10,  # Steps interval for logging
    eval_steps=10,  # Steps interval for evaluation
    eval_strategy="steps",  # Strategy for evaluation
    save_strategy="steps",  # Strategy for saving the model
    save_steps=20,  # Steps interval for saving
    metric_for_best_model="eval_loss",  # Metric to evaluate the best model
    greater_is_better=False,  # Whether higher metric values are better
    load_best_model_at_end=True,  # Load the best model after training
    # Mixed precision and gradient settings
    bf16=True,  # Use bfloat16 precision
    tf32=True,  # Use TensorFloat-32 precision
    max_grad_norm=0.3,  # Maximum norm for gradient clipping
    warmup_ratio=0.03,  # Ratio of total steps for warmup
    # Hub and reporting
    push_to_hub=True,  # Whether to push model to Hugging Face Hub
    report_to="wandb",  # Reporting tool for tracking metrics
    # Gradient checkpointing settings
    gradient_checkpointing_kwargs={"use_reentrant": False},  # Options for gradient checkpointing
    # Dataset configuration
    dataset_text_field="",  # Text field in dataset
    dataset_kwargs={"skip_prepare_dataset": True},  # Additional dataset options
    # max_seq_length=1024  # Maximum sequence length for input
)

training_args.remove_unused_columns = False  # Keep unused columns in dataset
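A few of these settings interact, so it can help to work out the implied numbers. The sketch below is an estimate only: the GPU count and the number of training examples are hypothetical placeholders, and the step count is a rough approximation of what the trainer will report.

```python
import math

# Values copied from the SFTConfig above
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_train_epochs = 3
eval_steps, save_steps = 10, 20

num_gpus = 1                # assumption: a single GPU, as in Colab
num_train_examples = 2_000  # hypothetical; use len(train_dataset) in practice

# Samples that contribute to each optimizer update
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_train_examples / effective_batch)  # rough estimate
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch)               # 32
print(steps_per_epoch, total_steps)  # 63 189

# With load_best_model_at_end=True, checkpoints must line up with evaluations
assert save_steps % eval_steps == 0
```

With these values, each optimizer update sees 32 samples, the model is evaluated every 10 updates, and a checkpoint is written every 20.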

4.3 Training the Model 🏃

We'll log our training progress with Weights & Biases (W&B). Let's connect the notebook to W&B to capture key information during training.

import wandb

wandb.init(
    project="qwen2-7b-instruct-trl-sft-ChartQA",  # change this
    name="qwen2-7b-instruct-trl-sft-ChartQA",  # change this
    config=training_args,
)

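We need a collator function to retrieve and batch the data correctly during training. This function formats the dataset inputs so that they are structured properly for the model. Let's define it below.

👉 Check out the official TRL example scripts for more details.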

# Create a data collator to encode text and image pairs
def collate_fn(examples):
    # Get the texts and images, and apply the chat template
    texts = [
        processor.apply_chat_template(example, tokenize=False) for example in examples
    ]  # Prepare texts for processing
    image_inputs = [process_vision_info(example)[0] for example in examples]  # Process the images to extract inputs

    # Tokenize the texts and process the images
    batch = processor(
        text=texts, images=image_inputs, return_tensors="pt", padding=True
    )  # Encode texts and images into tensors

    # The labels are the input_ids, and we mask the padding tokens in the loss computation
    labels = batch["input_ids"].clone()  # Clone input IDs for labels
    labels[labels == processor.tokenizer.pad_token_id] = -100  # Mask padding tokens in labels

    # Ignore the image token index in the loss computation (model specific)
    if isinstance(processor, Qwen2VLProcessor):  # Check if the processor is Qwen2VLProcessor
        image_tokens = [151652, 151653, 151655]  # Specific image token IDs for Qwen2VLProcessor
    else:
        image_tokens = [processor.tokenizer.convert_tokens_to_ids(processor.image_token)]  # Convert image token to ID

    # Mask image token IDs in the labels
    for image_token_id in image_tokens:
        labels[labels == image_token_id] = -100  # Mask image token IDs in labels

    batch["labels"] = labels  # Add labels to the batch

    return batch  # Return the prepared batch
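The label-masking logic in `collate_fn` can be illustrated without tensors. In this sketch, plain lists stand in for `batch["input_ids"]`, the pad token ID is an assumed stand-in for `processor.tokenizer.pad_token_id`, and the text token IDs are made up; the three image-related IDs are the ones hard-coded in the collator above.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

PAD_TOKEN_ID = 151643                       # assumption: stand-in for processor.tokenizer.pad_token_id
IMAGE_TOKEN_IDS = {151652, 151653, 151655}  # image-related IDs used in collate_fn


def mask_labels(input_ids):
    """Copy input_ids into labels, replacing padding and image tokens with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == PAD_TOKEN_ID or tok in IMAGE_TOKEN_IDS else tok for tok in row]
        for row in input_ids
    ]


# Made-up batch of two sequences, the second one padded
batch_input_ids = [
    [151652, 151655, 151655, 151653, 900, 901, 902],
    [151652, 151655, 151653, 800, 801, 151643, 151643],
]
labels = mask_labels(batch_input_ids)
print(labels[0])  # [-100, -100, -100, -100, 900, 901, 902]
print(labels[1])  # [-100, -100, -100, 800, 801, -100, -100]
```

Note that this masking excludes only padding and image tokens, so the system and user text still contribute to the loss alongside the assistant's answer.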

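Now we'll define the SFTTrainer, a wrapper around the transformers.Trainer class that inherits its attributes and methods. When given a PeftConfig object, this class simplifies fine-tuning by initializing the PeftModel correctly. Using SFTTrainer lets us manage the training workflow efficiently and keeps fine-tuning of our Vision Language Model straightforward.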

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collate_fn,
    peft_config=peft_config,
    processing_class=processor.tokenizer,
)

Time to train the model! 🎉

trainer.train()

Let's save the results 💾

trainer.save_model(training_args.output_dir)

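5. Testing the Fine-Tuned Model 🔍

Now that we've fine-tuned our Vision Language Model (VLM), it's time to evaluate its performance! In this section, we'll test the model on examples from the ChartQA dataset to see how well it answers questions based on chart images. Let's dive in and explore the results! 🚀

Let's clean up the GPU memory first so the reloaded model has room 🧹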

clear_memory()

We'll reload the base model using the same pipeline as before.

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = Qwen2VLProcessor.from_pretrained(model_id)

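We'll attach the trained adapter to the pre-trained model. The adapter contains the adjustments learned during fine-tuning, allowing the base model to use the new knowledge without changing its core parameters. By integrating the adapter, we extend the model's capabilities while keeping its original structure intact.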

adapter_path = "sergiopaniego/qwen2-7b-instruct-trl-sft-ChartQA"
model.load_adapter(adapter_path)
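Conceptually, a LoRA adapter leaves each frozen weight matrix W untouched and adds a scaled low-rank update, giving an effective weight of W + (lora_alpha / r) · B · A. The toy sketch below illustrates this with tiny made-up matrices in pure Python. It uses the standard LoRA formulation and is not a description of what `load_adapter` does internally.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


lora_alpha, r = 16, 2  # r is tiny here for readability; the recipe uses r=8
scaling = lora_alpha / r

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen base weight (3 x 3), made up
B = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]]                 # adapter matrix B (3 x r), made up
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]                   # adapter matrix A (r x 3), made up

delta = matmul(B, A)  # low-rank update with the same shape as W
W_effective = [
    [w + scaling * d for w, d in zip(w_row, d_row)]
    for w_row, d_row in zip(W, delta)
]

for row in W_effective:
    print([round(value, 3) for value in row])
```

The base matrix `W` is never modified, which is why the same base checkpoint can be reused with different adapters.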

We'll use the earlier sample from the dataset, the one the model initially struggled to answer correctly.

train_dataset[0][:2]

>>> train_dataset[0][1]["content"][0]["image"]

output = generate_text_from_sample(model, processor, train_dataset[0])
output

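Because this sample comes from the training set, the model has already seen it during training, so a correct answer here may reflect memorization rather than generalization. To get a fuller picture of the model's performance, we'll also evaluate it on an unseen sample.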

test_dataset[10][:2]

>>> test_dataset[10][1]["content"][0]["image"]

output = generate_text_from_sample(model, processor, test_dataset[10])
output
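Checking single samples by eye does not scale, so a small scoring helper can be useful when looping over the test split. The sketch below is a hypothetical scoring rule and not part of the original recipe: numeric answers may deviate by a relative tolerance, and text answers must match exactly, ignoring case. The function names and the 5% tolerance are assumptions.

```python
def relaxed_match(prediction: str, label: str, tolerance: float = 0.05) -> bool:
    """Hypothetical rule: numbers may deviate by `tolerance`; text must match (case-insensitive)."""
    pred, gold = prediction.strip().rstrip("%"), label.strip().rstrip("%")
    try:
        pred_num, gold_num = float(pred), float(gold)
    except ValueError:
        # At least one side is not numeric: fall back to string comparison
        return pred.lower() == gold.lower()
    if gold_num == 0:
        return pred_num == 0
    return abs(pred_num - gold_num) / abs(gold_num) <= tolerance


def accuracy(predictions, labels):
    """Fraction of predictions that satisfy relaxed_match."""
    matches = [relaxed_match(p, l) for p, l in zip(predictions, labels)]
    return sum(matches) / len(matches)


# Made-up predictions and reference answers
preds = ["42", "Blue", "3.1", "no"]
golds = ["42", "blue", "3.0", "Yes"]
score = accuracy(preds, golds)
print(score)  # 0.75
```

In this notebook, predictions would come from `generate_text_from_sample(model, processor, sample)` and reference answers from `sample[2]["content"][0]["text"]`, following the structure built by `format_data`.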

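The model has learned to respond to queries in the format used by the dataset. We've achieved our goal! 🎉✨

💻 I've developed an example application to test the model, which you can find here. You can easily compare it with another Space featuring the pre-trained model, available here.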

from IPython.display import IFrame

IFrame(src="https://sergiopaniego-qwen2-vl-7b-trl-sft-chartqa.hf.space", width=1000, height=800)

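6. Compare Fine-Tuned Model vs. Base Model + Prompting 📊

We've seen that fine-tuning a VLM is a valuable option for adapting it to our specific needs. Another approach worth considering is to rely on prompting directly or to implement a RAG system, which is covered in another recipe.

Fine-tuning a VLM requires significant amounts of data and computational resources, which can incur costs. In contrast, we can try prompting to see whether we can achieve similar results without the overhead of fine-tuning.

Let's clean up the GPU memory once more before loading the model 🧹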

>>> clear_memory()

GPU allocated memory: 0.02 GB
GPU reserved memory: 0.27 GB

🏗️ First, we'll load the baseline model following the same pipeline as before.

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = Qwen2VLProcessor.from_pretrained(model_id)

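📜 In this case, we'll use the previous sample again, but this time we'll include the system message, as shown below. This gives the model additional context, which may improve the accuracy of its response.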

train_dataset[0][:2]

Let's see how it performs!

text = processor.apply_chat_template(train_dataset[0][:2], tokenize=False, add_generation_prompt=True)

image_inputs, _ = process_vision_info(train_dataset[0])

inputs = processor(
    text=[text],
    images=image_inputs,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

output_text[0]

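💡 As we can see, the pre-trained model generates the correct answer when given the additional system message, without any training. Depending on the use case, this approach can be a viable alternative to fine-tuning.

7. Continuing the Learning Journey 🧑‍🎓️

To further develop your understanding and skills in working with multimodal models, check out the following resources:

These resources will help you deepen your knowledge and skills in multimodal learning.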
