Preference Optimization for Vision Language Models with TRL

Published July 10, 2024

Quentin Gallouédec

qgallouedec

Shengyi Costa Huang

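Training models to understand and predict human preferences can be incredibly complex. Traditional methods such as supervised fine-tuning often require assigning a specific label to each data point, which is not cost-effective for nuanced tasks. Preference optimization is an alternative approach that can simplify this process and yield more accurate results. By comparing and ranking candidate answers rather than assigning fixed labels, it lets models capture the subtleties of human judgment more effectively.

Preference optimization is widely used for fine-tuning language models, but it can also be applied to vision language models (VLMs). We are excited to announce that the **TRL library now supports direct preference optimization (DPO) for VLMs**. This article walks you through training a VLM with TRL and DPO.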

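Preference dataset

Preference optimization requires data that captures user preferences. In the binary choice setting, each example consists of a prompt and two candidate answers: one chosen and one rejected. The model's task is to learn to prefer the chosen answer over the rejected one. For example, you need samples like the following: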

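❔ Question: How many families?

❌ Rejected: The image does not provide any information about families.
✅ Chosen: The image shows a Union Organization table setup with 18,000 families.

Note that the chosen answer is not necessarily correct. Here, the chosen response "18,000 families" is still wrong, but it is less wrong than the rejected one.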
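
To make "prefer the chosen answer over the rejected one" concrete, here is a minimal sketch of the per-pair DPO objective in plain Python. It assumes the summed log-probabilities of each answer under the trained model and the reference model are already available; the function name and the value beta=0.1 are illustrative choices, not taken from this post.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair, from sequence log-probabilities."""
    # How much more the trained model favors each answer than the reference does
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) written as log(1 + exp(-logits))
    return math.log1p(math.exp(-logits))


# Before training, the model equals the reference, so the loss is log(2)
untrained = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# After the model shifts probability toward the chosen answer, the loss drops
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(untrained, improved)
```

The loss only depends on how the trained model moves relative to the reference, which is why a "less wrong" chosen answer is enough to provide a training signal.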

For this blog post, we will use openbmb/RLAIF-V-Dataset, which contains over 83,000 annotated rows. Let's take a closer look at the dataset:

>>> from datasets import load_dataset
>>> dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train[:1%]")
>>> sample = dataset[1]
>>> sample["image"].show()
>>> sample["question"]
'how many families?'
>>> sample["rejected"]
'The image does not provide any information about families.'
>>> sample["chosen"]
'The image shows a Union Organization table setup with 18,000 families.'

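Our model requires both text and images as input, so the first step is to format the dataset to fit this requirement. The data should be structured as a conversation between a user and an assistant: the user provides a prompt that includes an image and a question, and the assistant responds with an answer. Here is how this formatting can be done: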

from datasets import features
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)

def format(example):
    # Prepare the input for the chat template
    prompt = [
        {
            "role": "user",
            "content": [{"type": "image"}, {"type": "text", "text": example["question"]}],
        },
    ]
    chosen = [
        {
            "role": "assistant",
            "content": [{"type": "text", "text": example["chosen"]}],
        },
    ]
    rejected = [
        {
            "role": "assistant",
            "content": [{"type": "text", "text": example["rejected"]}],
        },
    ]
    # Apply the chat template
    prompt = processor.apply_chat_template(prompt, tokenize=False)
    chosen = processor.apply_chat_template(chosen, tokenize=False)
    rejected = processor.apply_chat_template(rejected, tokenize=False)
    # Resize the image to ensure it fits within the maximum allowable
    # size of the processor to prevent OOM errors.
    max_size = processor.image_processor.size["longest_edge"]
    example["image"].thumbnail((max_size, max_size))
    return {"images": [example["image"]], "prompt": prompt, "chosen": chosen, "rejected": rejected}

# Apply the formatting function to the dataset,
# remove columns to end up with only "images", "prompt", "chosen", "rejected" columns
dataset = dataset.map(format, remove_columns=dataset.column_names)

# Make sure the images are decoded, which prevents storing them as raw bytes.
# More info here https://github.com/huggingface/blog/pull/2148#discussion_r1667400478
f = dataset.features
f["images"] = features.Sequence(features.Image(decode=True))  # to avoid bytes
dataset = dataset.cast(f)

Our dataset is now formatted. Let's have a look at an example:

>>> dataset[1]
{'images': [<PIL.JpegImagePlugin.JpegImageFile image mode=L size=980x812 at 0x154505570>],
 'prompt': 'User:<image>how many families?<end_of_utterance>\n',
 'rejected': 'Assistant: The image does not provide any information about families.<end_of_utterance>\n',
 'chosen': 'Assistant: The image shows a Union Organization table setup with 18,000 families.<end_of_utterance>\n'}

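Warm up your GPUs, the dataset is ready for training!

Training

For the sake of the example, we will train the Idefics2-8b model, but note that the DPO implementation in TRL supports other models such as Llava 1.5 and PaliGemma. More information is in the section "Finetuning Llava 1.5, PaliGemma, and others". Before looking at the training process, we first make sure everything fits in memory.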

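How much memory do I need?

I have a GPU with 80 GB of VRAM. Is that enough to train my Idefics2-8b model? Here are the calculation steps for a rough estimate of the memory needed.

Let $N$ be the number of parameters and $P$ the precision in bytes per parameter. The following components must fit in memory at the same time:

Model to train: $N \times P$
Reference model: the reference model is the same as the model to train, so it also requires $N \times P$
Gradients: we train the whole model, and each parameter needs a gradient, so this requires $N \times P$
Optimizer states: we use AdamW, which keeps two states per parameter, so this requires $2 \times N \times P$

Idefics2-8b has 8 billion parameters, and we use float32 precision, which takes 4 bytes per value. The total memory required is therefore: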

Component	Calculation	Memory
Model to train	$8 \times 10^9 \times 4$	32 GB
Reference model	$8 \times 10^9 \times 4$	32 GB
Gradients	$8 \times 10^9 \times 4$	32 GB
Optimizer states	$2 \times 8 \times 10^9 \times 4$	64 GB
Total		160 GB
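
The arithmetic in the table can be reproduced with a few lines of Python. This is a minimal sketch: the function name is hypothetical, 1 GB is taken as 10^9 bytes, and activations are ignored, as in the estimate itself.

```python
def full_finetuning_memory_gb(num_params, bytes_per_param):
    """Rough memory estimate (in GB) for full fine-tuning with DPO and AdamW."""
    sizes = {
        "model": num_params * bytes_per_param,
        "reference_model": num_params * bytes_per_param,
        "gradients": num_params * bytes_per_param,
        # AdamW keeps two states (first and second moments) per parameter
        "optimizer_states": 2 * num_params * bytes_per_param,
    }
    sizes["total"] = sum(sizes.values())
    return {name: size / 1e9 for name, size in sizes.items()}


# 8 billion parameters in float32 (4 bytes each)
estimate = full_finetuning_memory_gb(8_000_000_000, 4)
for name, gigabytes in estimate.items():
    print(f"{name}: {gigabytes:.0f} GB")
```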

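This is far beyond my GPU's memory capacity. Fortunately, techniques such as quantization and LoRA can significantly reduce the memory requirements and make training feasible. Let's see how.

Quantization

Quantization is a technique that reduces the precision of a model's weights and activations. Switching from float32 to bfloat16 halves the storage needed per parameter, from 4 bytes to 2 bytes. This saves memory and speeds up computation with minimal loss in model quality. To load the model in bfloat16 precision: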

import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16)

bfloat16 precision can also be applied to the optimizer by setting bf16=True in the training arguments:

from transformers import TrainingArguments

training_args = TrainingArguments(..., bf16=True)

LoRA

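LoRA is a method that reduces the number of trainable parameters by learning pairs of rank-decomposition matrices while keeping the original weights frozen. This significantly reduces the storage requirements for LLMs adapted to specific tasks. LoRA is integrated into PEFT, and you can set it up in no time: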

  from transformers import AutoModelForVision2Seq
+ from peft import get_peft_model, LoraConfig

  model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")
+ peft_config = LoraConfig(target_modules="all-linear")
+ model = get_peft_model(model, peft_config)

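PEFT acts as a wrapper (called an adapter) around the model. The adapter is trained while the inner model stays frozen. How much does LoRA reduce the number of trainable parameters?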

>>> model.print_trainable_parameters()
trainable params: 55,348,736 || all params: 8,458,116,848 || trainable%: 0.6543860411799315

It reduces the number of trainable parameters from 8 billion to 55 million, a huge difference that significantly lowers the memory requirements.
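
To see where a reduction of this size comes from, consider a single linear layer. LoRA replaces the update to a weight matrix of shape (out_features, in_features) with two small matrices of rank r. The sketch below counts parameters for one such layer; the 4096 x 4096 layer size and rank 8 are illustrative assumptions, not the actual dimensions of Idefics2-8b, and the function names are hypothetical.

```python
def dense_params(in_features, out_features):
    """Number of weights in a full linear layer (bias ignored)."""
    return in_features * out_features


def lora_params(in_features, out_features, rank):
    """Number of trainable weights LoRA adds to that layer.

    Matrix A has shape (rank, in_features) and matrix B has shape
    (out_features, rank); the frozen base weight is not counted.
    """
    return rank * in_features + out_features * rank


full = dense_params(4096, 4096)
adapter = lora_params(4096, 4096, rank=8)
print(f"full layer: {full:,} weights")
print(f"LoRA adapter: {adapter:,} weights ({100 * adapter / full:.2f}% of the layer)")
```

Summed over every adapted layer, ratios of this order are consistent with the sub-1% trainable fraction reported by print_trainable_parameters().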

The new memory requirements after quantization and LoRA

Now that we have reduced the memory requirements, let's recalculate the memory needed:

Component	Calculation	Memory
Model to train	$8 \mathrm{G} \times 2$	16 GB
Reference model	$8 \mathrm{G} \times 2$	16 GB
Gradients	$55 \mathrm{M} \times 2$	0.1 GB
Optimizer states	$2 \times 55 \mathrm{M} \times 2$	0.2 GB
Total		32.3 GB

This time, we need about 32 GB of memory to fine-tune our Idefics2-8b model, which is much more reasonable and fits on my GPU!
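
The same recalculation can be written as a short sketch. It assumes, as in the estimate, that gradients and optimizer states are only needed for the trainable LoRA parameters, that 1 GB is 10^9 bytes, and that activations are ignored; the function name is hypothetical.

```python
def lora_finetuning_memory_gb(num_params, num_trainable, bytes_per_param=2):
    """Rough memory estimate (in GB) for DPO with LoRA in bfloat16."""
    sizes = {
        "model": num_params * bytes_per_param,
        "reference_model": num_params * bytes_per_param,
        # Only the LoRA parameters receive gradients and AdamW states
        "gradients": num_trainable * bytes_per_param,
        "optimizer_states": 2 * num_trainable * bytes_per_param,
    }
    sizes["total"] = sum(sizes.values())
    return {name: size / 1e9 for name, size in sizes.items()}


# 8 billion frozen parameters, about 55 million trainable ones, 2 bytes each
lora_estimate = lora_finetuning_memory_gb(8_000_000_000, 55_000_000)
for name, gigabytes in lora_estimate.items():
    print(f"{name}: {gigabytes:.2f} GB")
```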

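For more information on optimizing memory usage with LoRA and QLoRA, refer to the PEFT documentation or Google's LoRA and QLoRA recommendations for LLMs.

What about the batch size?

Our memory calculation is not exact because it does not account for activations. Activations are the intermediate outputs of the network's layers, and their memory requirements depend on the model structure and the batch size. Calculating the memory needed for activations precisely is challenging, so we rely on empirical observation.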

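To choose an appropriate training batch size (per_device_train_batch_size), start with your desired batch size (e.g., 64). This will likely result in an out-of-memory (OOM) error. If it does, halve the batch size and double the number of gradient accumulation steps (gradient_accumulation_steps) to keep the same effective batch size. Repeat until training fits in your GPU's memory. In our case, we ended up with a batch size of 2 and 32 gradient accumulation steps.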
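
The halving procedure can be sketched as a small generator that lists the settings to try in order. The function name is hypothetical, and the sketch considers a single device; with several GPUs, the effective batch size is additionally multiplied by the number of devices.

```python
def batch_size_candidates(target_batch_size, min_batch_size=1):
    """Yield (per_device_train_batch_size, gradient_accumulation_steps) pairs.

    Each step halves the per-device batch size and doubles the accumulation
    steps, so their product (the effective batch size) stays constant.
    """
    per_device = target_batch_size
    accumulation_steps = 1
    while per_device >= min_batch_size:
        yield per_device, accumulation_steps
        per_device //= 2
        accumulation_steps *= 2


candidates = list(batch_size_candidates(64))
for per_device, accumulation_steps in candidates:
    print(f"per_device_train_batch_size={per_device}, "
          f"gradient_accumulation_steps={accumulation_steps}")
```

Try each pair in turn and keep the first one that trains without an OOM error.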

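Another optimization is gradient checkpointing (gradient_checkpointing), which reduces the memory needed for activations. This technique trades compute for memory by recomputing parts of the network during the backward pass. Enable it by setting gradient_checkpointing=True in the training arguments.

Putting everything together: the complete training script

Now that we have set up the model, the dataset, and the training parameters, we are ready to train. Here is how to put everything together in a script, including a few extra elements that speed up processing, such as dataset_num_proc and dataloader_num_workers: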

# dpo_idefics2-8b.py
from datasets import features, load_dataset
from transformers import AutoModelForVision2Seq, AutoProcessor
import torch
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig


def main():
    # Load the model and processor
    model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16)
    processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)

    # Load the dataset
    dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train")

    def format(example):
        # Prepare the input for the chat template
        prompt = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": example["question"]}]}]
        chosen = [{"role": "assistant", "content": [{"type": "text", "text": example["chosen"]}]}]
        rejected = [{"role": "assistant", "content": [{"type": "text", "text": example["rejected"]}]}]
        # Apply the chat template
        prompt = processor.apply_chat_template(prompt, tokenize=False)
        chosen = processor.apply_chat_template(chosen, tokenize=False)
        rejected = processor.apply_chat_template(rejected, tokenize=False)
        # Resize the image to half the processor's maximum allowable size
        # to limit memory use and prevent OOM errors.
        max_size = processor.image_processor.size["longest_edge"] // 2
        example["image"].thumbnail((max_size, max_size))
        return {"images": [example["image"]], "prompt": prompt, "chosen": chosen, "rejected": rejected}

    # Apply the formatting function to the dataset
    dataset = dataset.map(format, remove_columns=dataset.column_names, num_proc=32)

    # Make sure the images are decoded, which prevents storing them as raw bytes.
    # More info here https://github.com/huggingface/blog/pull/2148#discussion_r1667400478
    f = dataset.features
    f["images"] = features.Sequence(features.Image(decode=True))
    dataset = dataset.cast(f)

    # Train the model
    training_args = DPOConfig(
        output_dir="idefics2-8b-dpo",
        bf16=True,
        gradient_checkpointing=True,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=32,
        num_train_epochs=1,
        dataset_num_proc=32,  # tokenization will use 32 processes
        dataloader_num_workers=32,  # data loading will use 32 workers
        logging_steps=10,
    )
    trainer = DPOTrainer(
        model,
        ref_model=None,  # not needed when using peft
        args=training_args,
        train_dataset=dataset,
        tokenizer=processor,
        peft_config=LoraConfig(target_modules="all-linear"),
    )

    trainer.train()


if __name__ == "__main__":
    main()

Let's run it and wait... 🚀

accelerate launch dpo_idefics2-8b.py

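Results

A few hours later, training is complete. Let's take a look at the training curves:

In DPO, we focus on several metrics to assess the quality of the training:

Accuracy: the percentage of training samples for which the model is more likely to output the chosen answer than the rejected one. We can see the accuracy increasing, which is a positive sign.
Rewards: rewards are related to the probability that an answer is chosen. For more details, see Section 5 of the DPO paper. We expect the reward for the chosen answer to be higher than for the rejected answer. To verify this, we look at the reward margin, which is the difference between the rewards of the chosen and rejected answers. The increasing reward margin observed here is also a good sign.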
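
As a rough illustration of how these two metrics relate, the sketch below computes them from per-sample log-probabilities. It assumes the implicit DPO reward, beta times the difference between the trained model's and the reference model's log-probability of an answer. The function name, the value beta=0.1, and the numbers are made up for the example.

```python
def dpo_reward_metrics(policy_chosen, policy_rejected,
                       ref_chosen, ref_rejected, beta=0.1):
    """Return (accuracy, mean reward margin) over a batch of preference pairs."""
    # Implicit reward: beta * (log-prob under trained model - log-prob under reference)
    chosen_rewards = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected_rewards = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    # A pair counts as correct when the chosen answer gets the higher reward
    accuracy = sum(m > 0 for m in margins) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return accuracy, mean_margin


accuracy, mean_margin = dpo_reward_metrics(
    policy_chosen=[-8.0, -9.0, -12.0],
    policy_rejected=[-12.0, -9.5, -10.0],
    ref_chosen=[-10.0, -10.0, -10.0],
    ref_rejected=[-10.0, -10.0, -10.0],
)
print(f"accuracy={accuracy:.2f}, mean reward margin={mean_margin:.3f}")
```

In this toy batch, two of the three pairs rank the chosen answer higher, so the accuracy is 2/3 while the mean margin stays positive.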

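Evaluation

Inference

With the model trained, the next step is to evaluate it on a few examples. This gives us a sense of how well the model has learned and how good its predictions are. Here is a script that loads the trained adapter and generates a response for a test example: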

from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b").to("cuda")
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model.load_adapter("HuggingFaceH4/idefics2-8b-dpo-rlaif-v-v0.3")  # <-- Load the adapter we've just trained

# Process
user_message = ...
image_path = ...
data = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": user_message}]}]
prompts = processor.apply_chat_template(data, add_generation_prompt=True)  # add_generation_prompt=True to end the prompt with "Assistant:"
images = [Image.open(image_path)]
inputs = processor(text=prompts, images=images, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
response_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response_text)

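The openbmb/RLAIF-V-Dataset is designed to reduce hallucinations. But does fine-tuning actually reduce them? To find out, we can use the AMBER benchmark, a dataset specifically designed to evaluate hallucinations in VLMs. We report the results of Idefics2 and Idefics2+DPO on the discriminative task, alongside other models for reference.

Model	Accuracy	F1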
GPT-4o	88.8	91.6
Idefics2+DPO	85.9	89.4
Idefics2	85.8	89.1
GPT-4v	83.4	87.4
MiniGemini	82.6	87.6
LLaVA-NeXT	81.4	85.4
QWEN-VL	81.9	86.4
LURE	73.5	77.7
OPERA	75.2	78.3
Less-is-more	72.4	75.8
VCD	71.8	74.9

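Overall, the fine-tuned model appears to hallucinate slightly less: accuracy rises from 85.8 to 85.9 and F1 from 89.1 to 89.4. The training seems to have been successful!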
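
For readers less familiar with the two reported metrics, here is a minimal sketch of how accuracy and F1 can be computed for yes/no answers. It is not the AMBER evaluation script: the choice of "yes" as the positive class, the function name, and the toy data are assumptions made for illustration.

```python
def accuracy_and_f1(predictions, labels, positive="yes"):
    """Compute accuracy and F1 for binary yes/no answers."""
    pairs = list(zip(predictions, labels))
    true_pos = sum(p == positive and l == positive for p, l in pairs)
    false_pos = sum(p == positive and l != positive for p, l in pairs)
    false_neg = sum(p != positive and l == positive for p, l in pairs)
    accuracy = sum(p == l for p, l in pairs) / len(pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Toy data: the third prediction is a false "yes"
acc, f1 = accuracy_and_f1(
    predictions=["yes", "no", "yes", "no"],
    labels=["yes", "no", "no", "no"],
)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```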

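Here are some cherry-picked examples to illustrate the model's behavior:

Question	Idefics2	Idefics2+DPO
Are there two ships in this image?	Yes	No
Is the ground uneven in this image?	No	Yes
Is there one shovel in this image?	Yes	No

Try it yourself and see how the model performs on your own examples!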

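Finetuning Llava 1.5, PaliGemma, and others

At the time of writing, the DPO implementation in TRL supports Idefics2, Llava 1.5, and PaliGemma, and support for more models is in progress. The easiest way to fine-tune these models is to use the example script provided in the TRL repository. For example, to fine-tune PaliGemma, you can use the following command: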

accelerate launch examples/scripts/dpo_visual.py \
    --dataset_name HuggingFaceH4/rlaif-v_formatted \
    --model_name_or_path google/paligemma-3b-pt-224 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 32 \
    --dataset_num_proc 32 \
    --output_dir dpo_paligemma_rlaif-v \
    --bf16 \
    --torch_dtype bfloat16 \
    --gradient_checkpointing \
    --use_peft \
    --lora_target_modules=all-linear

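You can find a detailed walkthrough of PaliGemma fine-tuning in the smol-vision project.

🚀🚀 You now have everything you need to fine-tune your own VLM with DPO. Share your findings, models, and datasets with the community!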
