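Practical exercise: fine-tune a model with GRPO

Now that you've seen the theory, let's put it into practice! In this exercise, you'll fine-tune a model with GRPO.

This exercise was written by LLM fine-tuning expert @mlabonne.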

Install dependencies

First, let's install the dependencies for this exercise.

!pip install -qqq datasets==3.2.0 transformers==4.47.1 trl==0.14.0 peft==0.14.0 accelerate==1.2.1 bitsandbytes==0.45.2 wandb==0.19.7 --progress-bar off
!pip install -qqq flash-attn --no-build-isolation --progress-bar off

Now we'll import the necessary libraries.

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GRPOConfig, GRPOTrainer

Import and log in to Weights & Biases

Weights & Biases is a tool for logging and monitoring experiments. We'll use it to log our fine-tuning process.

import wandb

wandb.login()

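You can do this exercise without logging in to Weights & Biases, but logging in is recommended so you can track your experiments and interpret the results.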
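If you prefer to skip the login, one option is to switch the W&B client off before training starts. The sketch below assumes the `WANDB_MODE` environment variable that the wandb client reads; it is one possible setup, not the only way to do it.

```python
import os

# Assumption: the wandb client honours WANDB_MODE.
# "disabled" turns logging calls into no-ops; "offline" keeps runs on disk
# so they can be uploaded later.
os.environ["WANDB_MODE"] = "disabled"

print(os.environ["WANDB_MODE"])
```

Run this before any wandb call, and skip the `wandb.login()` cell. Another route, assuming the standard `report_to` training argument, is to pass `report_to="none"` when building the training configuration.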

Load the dataset

Now, let's load the dataset. In this case, we'll use the mlabonne/smoltldr dataset, which contains a list of short stories.

dataset = load_dataset("mlabonne/smoltldr")
print(dataset)

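Load model

Now, let's load the model.

For this exercise, we'll use the SmolLM2-135M model.

This is a small 135M-parameter model that runs on limited hardware. That makes it ideal for learning, but it is not the most powerful model available. If you have access to more powerful hardware, you can try fine-tuning a larger model such as SmolLM2-1.7B.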

model_id = "HuggingFaceTB/SmolLM-135M-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

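Load LoRA

Now, let's load the LoRA configuration. We'll use LoRA to reduce the number of trainable parameters, which in turn reduces the memory needed to fine-tune the model.

If you're not familiar with LoRA, you can read more about it in Chapter 11.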

# Load LoRA
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
)
model = get_peft_model(model, lora_config)
# print_trainable_parameters() prints its summary itself and returns None
model.print_trainable_parameters()

Total trainable parameters: 135M
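To get a feel for how LoRA shrinks the trainable parameter count, here is a small arithmetic sketch. It assumes the usual LoRA construction, in which a frozen weight of shape (d_out, d_in) gets two trainable low-rank matrices of rank r, and the update is scaled by lora_alpha / r. The layer size below is hypothetical and chosen only for illustration.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical layer size, for illustration only.
d_in, d_out = 576, 576
r, lora_alpha = 16, 32  # values from the LoraConfig above

full = d_in * d_out                       # parameters in the frozen weight
added = lora_param_count(d_in, d_out, r)  # trainable LoRA parameters
scaling = lora_alpha / r                  # multiplier applied to the low-rank update

print(f"full weight: {full}, LoRA adapter: {added} ({added / full:.1%})")
print(f"scaling factor: {scaling}")
```

Because `target_modules="all-linear"` attaches an adapter to every linear layer, the total is the sum of this quantity over all of those layers.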

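Define the reward function

As mentioned in the previous section, GRPO can use any reward function to improve the model. In this case, we'll use a simple reward function that encourages the model to generate text that is 50 characters long. The function scores each completion with len(completion), so for plain-string completions the target is measured in characters rather than tokens.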

# Reward function
ideal_length = 50


def reward_len(completions, **kwargs):
    return [-abs(ideal_length - len(completion)) for completion in completions]
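To check what the reward function returns, it helps to call it on a few hand-made strings. The sketch below repeats the function so it runs on its own, then adds a hypothetical variant, `make_length_reward`, that takes any length-counting callable; the whitespace split is only a stand-in for a real tokenizer.

```python
ideal_length = 50


def reward_len(completions, **kwargs):
    # Same function as above: 0 at exactly ideal_length, more negative further away.
    return [-abs(ideal_length - len(completion)) for completion in completions]


samples = ["x" * 50, "x" * 40, "x" * 70]
print(reward_len(samples))


def make_length_reward(count_units, target):
    """Hypothetical helper: build a reward from any length-counting callable."""

    def reward(completions, **kwargs):
        return [-abs(target - count_units(c)) for c in completions]

    return reward


# Whitespace splitting stands in for a tokenizer here (an assumption for the demo).
reward_words = make_length_reward(lambda text: len(text.split()), target=5)
print(reward_words(["one two three four five", "one two"]))
```

The best possible reward is 0, which is why the reward curve is expected to approach 0 from below during training.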

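Define the training arguments

Now, let's define the training arguments. We'll use the GRPOConfig class to define them in the usual transformers style.

If this is your first time defining training arguments, see the TrainingArguments class for more information, or Chapter 2 for a detailed introduction.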

# Training arguments
training_args = GRPOConfig(
    output_dir="GRPO",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    max_prompt_length=512,
    max_completion_length=96,
    num_generations=8,
    optim="adamw_8bit",
    num_train_epochs=1,
    bf16=True,
    report_to=["wandb"],
    remove_unused_columns=False,
    logging_steps=1,
)
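The `num_generations=8` setting means each prompt gets 8 sampled completions, and GRPO compares their rewards with each other instead of using a separate value model. The following is a minimal sketch of that group-relative normalization, assuming the common (reward - mean) / (std + eps) form. The completion lengths are made up, and the exact standard-deviation convention and epsilon are assumptions rather than the trainer's exact internals.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical character lengths of 8 completions sampled for one prompt.
lengths = [50, 48, 60, 35, 90, 52, 41, 75]
rewards = [-abs(50 - n) for n in lengths]
advantages = group_advantages(rewards)

for n, r, a in zip(lengths, rewards, advantages):
    print(f"len={n:3d} reward={r:4d} advantage={a:+.2f}")
```

Completions closer to the target length than the group average get a positive advantage and are reinforced; the rest get a negative one.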

Now we can initialize the trainer with the model, reward function, dataset, and training arguments, and start training.

# Trainer
trainer = GRPOTrainer(
    model=model,
    reward_funcs=[reward_len],
    args=training_args,
    train_dataset=dataset["train"],
)

# Train model
wandb.init(project="GRPO")
trainer.train()

Training takes around 1 hour on a single A10G GPU, which is available on Google Colab or via Hugging Face Spaces.

Push the model to the Hub during training

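If we set the push_to_hub argument to True and the hub_model_id argument to a valid model name, the model will be pushed to the Hugging Face Hub while we're training. This is useful if you want to start vibe testing the model right away!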
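As a sketch of what that configuration could look like, the extra settings can be collected in a dictionary and unpacked into GRPOConfig. This assumes the standard TrainingArguments field names `push_to_hub` and `hub_model_id`; the repository name is a placeholder to replace with your own.

```python
# Extra keyword arguments for GRPOConfig (fields inherited from TrainingArguments).
hub_kwargs = {
    "push_to_hub": True,
    # Placeholder repo ID: replace with your own namespace and model name.
    "hub_model_id": "your-username/SmolGRPO-135M",
}

# Usage, together with the other arguments defined earlier:
# training_args = GRPOConfig(output_dir="GRPO", ..., **hub_kwargs)
print(hub_kwargs)
```

Pushing requires being authenticated with the Hub using a token that has write access.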

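Interpret training results

GRPOTrainer logs the reward from your reward function, the loss, and a range of other metrics.

We'll focus on the reward from the reward function and the loss.

As you can see, the reward from the reward function moves closer to 0 as the model learns. This is a good sign that the model is learning to generate text of the correct length.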

Figure: Reward from reward function

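You might notice that the loss starts at zero and then increases during training, which may seem counterintuitive. This behavior is expected in GRPO and follows directly from the mathematical formulation of the algorithm. The loss in GRPO is proportional to the KL divergence, the term that limits how far the policy can move from the original policy. As training progresses, the model learns to generate text that better matches the reward function, which moves it further from its initial policy. This growing divergence shows up as a rising loss value, which actually indicates that the model is successfully adapting to optimize the reward function.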
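A small numeric sketch makes the "starts at zero" behavior concrete. It assumes the per-token KL estimator commonly used with GRPO, exp(d) - d - 1 with d = ref_logp - logp. The log-probabilities below are made-up values, not numbers from this training run.

```python
import math


def per_token_kl(logp: float, ref_logp: float) -> float:
    """Non-negative KL estimate between the policy and the reference for one token."""
    d = ref_logp - logp
    return math.exp(d) - d - 1.0


# At step 0 the policy equals the reference model, so the penalty is exactly zero.
start = per_token_kl(logp=-1.2, ref_logp=-1.2)

# Later the policy assigns a different probability to the same token (made-up value).
later = per_token_kl(logp=-0.4, ref_logp=-1.2)

print(f"KL at start: {start:.4f}")
print(f"KL after drifting: {later:.4f}")
```

The estimate is zero only when the two log-probabilities match and grows as they move apart in either direction, which is the pattern seen in the loss curve.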

Figure: Loss

Save and publish the model

Let's share the model with the community!

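# Merge the LoRA adapter into the base weights before uploading
merged_model = trainer.model.merge_and_unload()
merged_model.push_to_hub(
    "SmolGRPO-135M", private=False, tags=["GRPO", "Reasoning-Course"]
)
# Upload the tokenizer too so the repo can be loaded by name later
tokenizer.push_to_hub("SmolGRPO-135M", private=False)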

Generate text

🎉 You've successfully fine-tuned a model with GRPO! Now, let's use the model to generate some text.

First, we'll define a really long document!

prompt = """
# A long document about the Cat

The cat (Felis catus), also referred to as the domestic cat or house cat, is a small 
domesticated carnivorous mammal. It is the only domesticated species of the family Felidae.
Advances in archaeology and genetics have shown that the domestication of the cat occurred
in the Near East around 7500 BC. It is commonly kept as a pet and farm cat, but also ranges
freely as a feral cat avoiding human contact. It is valued by humans for companionship and
its ability to kill vermin. Its retractable claws are adapted to killing small prey species
such as mice and rats. It has a strong, flexible body, quick reflexes, and sharp teeth,
and its night vision and sense of smell are well developed. It is a social species,
but a solitary hunter and a crepuscular predator. Cat communication includes
vocalizations—including meowing, purring, trilling, hissing, growling, and grunting—as
well as body language. It can hear sounds too faint or too high in frequency for human ears,
such as those made by small mammals. It secretes and perceives pheromones.
"""

messages = [
    {"role": "user", "content": prompt},
]

Now we can generate text with the model.

# Generate text
from transformers import pipeline

# Use your full Hub repo ID here (the repo is created under your own namespace)
generator = pipeline("text-generation", model="SmolGRPO-135M")

## Or use the model and tokenizer we defined earlier
# generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

generate_kwargs = {
    "max_new_tokens": 256,
    "do_sample": True,
    "temperature": 0.5,
    "min_p": 0.1,
}

# Unpack the settings so they reach the pipeline as individual generation arguments
generated_text = generator(messages, **generate_kwargs)

print(generated_text)
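The printed result is a nested structure, so it can help to pull out just the assistant's reply and check its length against the target the model was trained toward. The sketch below assumes the shape the text-generation pipeline returns for chat-style input (a list with one dict whose "generated_text" holds the whole conversation). The sample data is hand-written to mimic that shape and is not real model output; `extract_reply` is a hypothetical helper.

```python
def extract_reply(outputs):
    """Return the assistant text from a chat-style text-generation result."""
    conversation = outputs[0]["generated_text"]
    return conversation[-1]["content"]


# Hand-written sample with the assumed output shape (not real model output).
sample_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "# A long document about the Cat ..."},
            {"role": "assistant", "content": "The cat is a small domesticated carnivore."},
        ]
    }
]

reply = extract_reply(sample_outputs)
print(reply)
print(f"{len(reply)} characters")
```

With the real pipeline output, `extract_reply(generated_text)` would return the model's summary, whose length you can compare with `ideal_length`.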

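Conclusion

In this chapter, we saw how to fine-tune a model with GRPO. We also saw how to interpret the training results and generate text with the model.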
