Make LLM Fine-tuning 2x Faster with Unsloth and 🤗 TRL

Published January 10, 2024


Daniel (Unsloth)

danielhanchen


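Pulling your hair out because LLM fine-tuning is taking forever? In this post, we introduce a lightweight tool developed by the community that makes LLM fine-tuning go super fast!

Before diving into Unsloth, it may help to read our QLoRA blog post, or to be familiar with LLM fine-tuning using the 🤗 PEFT library.

Unsloth - 2x faster, 40% less memory, 0% accuracy degradation

Unsloth is a lightweight library for faster LLM fine-tuning that is fully compatible with the Hugging Face ecosystem (Hub, transformers, PEFT, TRL). The library is actively developed by the Unsloth team (Daniel and Michael) and the open source community. It supports most NVIDIA GPUs, from the GTX 1070 all the way up to the H100, and can be used with the entire trainer suite from the TRL library (SFTTrainer, DPOTrainer, PPOTrainer). At the time of writing, Unsloth supports the Llama (CodeLlama, Yi, etc.) and Mistral architectures.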

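Unsloth works by overwriting some parts of the modeling code with optimized operations. By manually deriving the backpropagation steps and rewriting all PyTorch modules into Triton kernels, Unsloth can both reduce memory usage and make fine-tuning faster. Crucially, accuracy degradation is 0% with respect to normal QLoRA, because no approximations are made in the optimized code.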
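To make the idea of a hand-derived backward pass concrete, here is a toy sketch in plain Python. It is not Unsloth's code and involves no Triton. The SwiGLU-style gate is chosen here purely as an illustrative assumption, and all function names are hypothetical. The point is only that an analytically derived gradient is exact: it agrees with a numerical estimate up to finite-difference error, so swapping autograd for a manual derivation introduces no approximation.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_forward(gate, up):
    """Scalar SwiGLU-style gate: silu(gate) * up."""
    return silu(gate) * up


def swiglu_backward(gate, up, grad_out):
    """Hand-derived gradients of swiglu_forward w.r.t. gate and up."""
    sig = 1.0 / (1.0 + math.exp(-gate))
    # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
    d_silu = sig * (1.0 + gate * (1.0 - sig))
    grad_gate = grad_out * up * d_silu
    grad_up = grad_out * silu(gate)
    return grad_gate, grad_up


def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


gate, up = 0.7, -1.3  # arbitrary test point
grad_gate, grad_up = swiglu_backward(gate, up, grad_out=1.0)
num_gate = numerical_grad(lambda g: swiglu_forward(g, up), gate)
num_up = numerical_grad(lambda u: swiglu_forward(gate, u), up)

print(f"gate: manual={grad_gate:.8f} numerical={num_gate:.8f}")
print(f"up:   manual={grad_up:.8f} numerical={num_up:.8f}")
```

A real kernel would fuse these steps over whole tensors and reuse intermediate values instead of storing them, which is where memory and time savings of this kind typically come from.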

Benchmarking

1 A100 40GB	Dataset	🤗 Hugging Face	🤗 + Flash Attention 2	🦥 Unsloth	🦥 VRAM reduction
Code Llama 34b	Slim Orca	1x	1.01x	1.94x	-22.7%
Llama-2 7b	Slim Orca	1x	0.96x	1.87x	-39.3%
Mistral 7b	Slim Orca	1x	1.17x	1.88x	-65.9%
Tiny Llama 1.1b	Alpaca	1x	1.55x	2.74x	-57.8%
DPO with Zephyr	Ultra Chat	1x	1.24x	1.88x	-11.6%

Free Colab T4	Dataset	🤗 Hugging Face	🤗 + PyTorch 2.1.1	🦥 Unsloth	🦥 VRAM reduction
Llama-2 7b	OASST	1x	1.19x	1.95x	-43.3%
Mistral 7b	Alpaca	1x	1.07x	1.56x	-13.7%
Tiny Llama 1.1b	Alpaca	1x	2.06x	3.87x	-73.8%
DPO with Zephyr	Ultra Chat	1x	1.09x	1.55x	-18.6%

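Unsloth was benchmarked across 59 runs using 4 datasets on Tesla T4 and A100 Google Colab instances. QLoRA was applied to all linear layers (attention and MLP) with a rank of 16, and gradient checkpointing was on. Tested against the latest Transformers release (4.36), which natively integrates SDPA when PyTorch 2.1.1 is installed, Unsloth is up to 2.7x faster and uses up to 74% less memory. We also tested Unsloth on a free Google Colab instance (low RAM, 1 T4 GPU, PyTorch 2.1.0 CUDA 12.1). All 59 Jupyter notebooks are provided for full reproducibility, and more information is available in Unsloth's benchmarking details.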
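The tables report speed as a multiple of the Hugging Face baseline, which can be easier to reason about as time saved. The short sketch below converts the Unsloth column of the A100 table into a percentage reduction in wall-clock time. It assumes each run performs the same amount of work as its baseline, so that a speedup of s means the run takes 1/s of the baseline time; the helper name is made up for this illustration.

```python
def time_saved_percent(speedup):
    """Percent reduction in wall-clock time for a given speedup multiple."""
    return (1.0 - 1.0 / speedup) * 100.0


# Unsloth speedups copied from the A100 40GB table above.
a100_speedups = {
    "Code Llama 34b": 1.94,
    "Llama-2 7b": 1.87,
    "Mistral 7b": 1.88,
    "Tiny Llama 1.1b": 2.74,
    "DPO with Zephyr": 1.88,
}

for name, speedup in a100_speedups.items():
    saved = time_saved_percent(speedup)
    print(f"{name}: {speedup:.2f}x -> {saved:.1f}% less wall-clock time")
```

Note that a 2x speedup halves the training time, so gains beyond 2x yield progressively smaller absolute savings.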

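How do I use Unsloth?

To use Unsloth, simply load your model with FastLanguageModel.from_pretrained! Currently, Unsloth supports Llama and Mistral type architectures (Yi, Deepseek, TinyLlama, Llamafied Qwen). Please open a GitHub issue if you want support for others! Also, on the latest Transformers main branch, you can now load pre-quantized 4-bit models directly! This makes downloading models 4x faster and reduces memory fragmentation by around 500MB, which lets you fit larger batches! We provide a few pre-quantized models for your convenience, including unsloth/llama-2-7b-bnb-4bit, unsloth/llama-2-13b-bnb-4bit, unsloth/mistral-7b-bnb-4bit and unsloth/codellama-34b-bnb-4bit.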
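The 4x download figure follows from simple arithmetic on bits per weight. The sketch below is a back-of-the-envelope estimate only: it assumes a round 7 billion parameters, and it ignores quantization constants and any layers kept in higher precision, so real checkpoint sizes will differ somewhat.

```python
def weight_size_gb(num_params, bits_per_param):
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_7b = 7e9  # assumption: round parameter count for a "7b" model

fp16_gb = weight_size_gb(params_7b, bits_per_param=16)
int4_gb = weight_size_gb(params_7b, bits_per_param=4)
ratio = fp16_gb / int4_gb

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
print(f"download size ratio: ~{ratio:.0f}x")
```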

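You will need to pass the intended maximum sequence length to from_pretrained. Unsloth performs RoPE scaling internally, so larger maximum sequence lengths are automatically supported. Otherwise, the API is pretty much the same as transformers' from_pretrained, except that FastLanguageModel.from_pretrained also returns the model tokenizer for convenience.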

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Supports Llama, Mistral - replace this!
    max_seq_length = 2048, # Supports RoPE Scaling internally, so choose any!
    load_in_4bit = True,
)
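The post only says that RoPE scaling happens internally, without naming the scheme. As a rough illustration, the sketch below assumes linear position interpolation, where positions are divided by a factor so that a longer sequence maps back into the range the model was trained on. The function names are hypothetical and are not part of Unsloth's API; 4096 is Llama-2's native context length.

```python
def linear_rope_scaling_factor(requested_max_seq_length, native_max_positions):
    """Factor by which positions are compressed; 1.0 means no scaling needed."""
    return max(1.0, requested_max_seq_length / native_max_positions)


def rope_angles(position, head_dim, base=10000.0, scaling_factor=1.0):
    """Rotation angles for one position under (optionally scaled) RoPE."""
    scaled_position = position / scaling_factor
    return [
        scaled_position / (base ** (2 * i / head_dim))
        for i in range(head_dim // 2)
    ]


# Assumption: a model trained on 4096 positions, asked to handle 8192.
factor = linear_rope_scaling_factor(8192, native_max_positions=4096)

# With scaling, position 8190 gets the same angles that position 4095
# had in the unscaled model, so it stays inside the trained range.
scaled = rope_angles(8190, head_dim=128, scaling_factor=factor)
unscaled = rope_angles(4095, head_dim=128)
print(factor, scaled == unscaled)
```

When the requested length fits inside the native context, the factor stays at 1.0 and the angles are unchanged.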

Once the model has been loaded, use FastLanguageModel.get_peft_model to attach adapters in order to perform QLoRA fine-tuning.

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
)
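To get a feel for how small these adapters are, the sketch below counts the trainable LoRA parameters that r = 16 adds across the seven target modules. It assumes Llama-2 7b's published dimensions (hidden size 4096, MLP intermediate size 11008, 32 layers); other models, including Mistral, have different shapes, and the helper name is made up for this example. Each adapted linear layer gains two low-rank matrices, of shape (r, in_features) and (out_features, r).

```python
def lora_params(in_features, out_features, r):
    """Trainable parameters LoRA adds to one linear layer (no bias)."""
    return r * (in_features + out_features)


# Assumption: Llama-2 7b dimensions.
hidden, intermediate, num_layers, r = 4096, 11008, 32, 16

# (in_features, out_features) for each targeted projection.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters come to roughly 40 million parameters, well under 1% of the base model's roughly 7 billion weights, which stay frozen in 4-bit.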

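Once adapters are attached, you can use the model directly within any class from the HF ecosystem, such as the SFTTrainer from TRL!

Unsloth + TRL integration

To use Unsloth with the TRL library, simply pass the Unsloth model into SFTTrainer or DPOTrainer! The trained model is fully compatible with the Hugging Face ecosystem, so you can push the final model to the Hub and use transformers for inference out of the box!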

import torch

from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

from unsloth import FastLanguageModel

max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get dataset
dataset = load_dataset("imdb", split="train")

# Load the model (a pre-quantized Mistral checkpoint here; Llama works too)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Supports Llama, Mistral - replace this!
    max_seq_length = max_seq_length,
    dtype = None, # None lets Unsloth auto-detect the dtype
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
      per_device_train_batch_size = 2,
      gradient_accumulation_steps = 4,
      warmup_steps = 10,
      max_steps = 60,
      fp16 = not torch.cuda.is_bf16_supported(),
      bf16 = torch.cuda.is_bf16_supported(),
      logging_steps = 1,
      output_dir = "outputs",
      optim = "adamw_8bit",
      seed = 3407,
    ),
)
trainer.train()
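It helps to check how much data this short run actually sees. The sketch below reuses the values from the TrainingArguments above and assumes a single GPU and no sequence packing, so that one dataset row counts as one training example. The IMDB train split has 25,000 rows.

```python
# Values copied from the TrainingArguments in the example above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60

num_gpus = 1  # assumption: single-GPU run, e.g. one Colab T4
imdb_train_examples = 25_000  # size of the IMDB train split

# Examples consumed per optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
examples_seen = effective_batch_size * max_steps
fraction_of_epoch = examples_seen / imdb_train_examples

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
print(f"fraction of one epoch: {fraction_of_epoch:.2%}")
```

In other words, max_steps = 60 makes this a quick smoke test rather than a full fine-tune; for a real run you would raise max_steps or train by epochs instead.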

Reproducible Jupyter Notebooks

We are sharing fully reproducible Jupyter notebooks below for anyone who wants to try out Unsloth with SFTTrainer on a free-tier Google Colab instance.

Llama 7b free Tesla T4 Colab example

Mistral 7b free Tesla T4 Colab example

CodeLlama 34b A100 Colab example

Zephyr DPO replication T4 Colab example
