DeepSpeed

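DeepSpeed is a library designed for speed and scale in the distributed training of large models with billions of parameters. At its core is the Zero Redundancy Optimizer (ZeRO), which shards optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3) across data-parallel processes. This drastically reduces memory usage per GPU. To unlock even more memory efficiency, ZeRO-Offload reduces GPU compute and memory by leveraging CPU resources during optimization.

🤗 Accelerate supports all of these features, and you can use them together with 🤗 PEFT.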
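To make the effect of the three ZeRO stages concrete, here is a rough sketch of the per-GPU model-state memory they imply. It assumes full fine-tuning with mixed-precision Adam (2 bytes per parameter for 16-bit weights, 2 bytes for 16-bit gradients, 12 bytes for fp32 optimizer states) and ignores activations and buffers. The function name is made up for this illustration and is not part of DeepSpeed. With LoRA, gradients and optimizer states exist only for the adapter weights, so the real numbers are considerably smaller.

```python
def zero_model_state_gb(n_params: float, n_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory in GB for a given ZeRO stage.

    Assumption: mixed-precision Adam, every parameter trainable.
    """
    params = 2.0 * n_params   # 16-bit weights
    grads = 2.0 * n_params    # 16-bit gradients
    optim = 12.0 * n_params   # fp32 master weights + Adam momentum + variance

    if stage >= 1:            # ZeRO-1 shards the optimizer states
        optim /= n_gpus
    if stage >= 2:            # ZeRO-2 additionally shards the gradients
        grads /= n_gpus
    if stage >= 3:            # ZeRO-3 additionally shards the parameters
        params /= n_gpus
    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_model_state_gb(7.5e9, 64, stage)
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Each stage only changes which of the three terms is divided by the number of GPUs, which is why stage 3 is the one that lets model size scale with the size of the cluster.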

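Compatibility with bitsandbytes quantization + LoRA

The table below summarizes the compatibility between PEFT's LoRA, the bitsandbytes library, and the DeepSpeed ZeRO stages with respect to fine-tuning. DeepSpeed ZeRO-1 and ZeRO-2 have no effect at inference time, because stage 1 shards only the optimizer states and stage 2 shards the optimizer states and gradients, neither of which exists during inference.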

DeepSpeed stage	Is it compatible?
Zero-1	🟢
Zero-2	🟢
Zero-3	🟢

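For DeepSpeed Stage 3 + QLoRA, please refer to the section "Use PEFT QLoRA and DeepSpeed with ZeRO3 for finetuning large models on multiple GPUs" below.

To confirm these observations, we ran the official SFT (Supervised Fine-tuning) example scripts of the Transformers Reinforcement Learning (TRL) library using QLoRA + PEFT and an Accelerate DeepSpeed config. We ran these experiments on 2 NVIDIA T4 GPUs.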

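Use PEFT and DeepSpeed with ZeRO3 for finetuning large models on multiple devices and multiple nodes

This section of the guide will help you learn how to use our DeepSpeed training script to perform SFT. You'll configure the script to do SFT (supervised fine-tuning) of the Llama-70B model with LoRA and ZeRO-3 on 8 H100 80GB GPUs on a single machine. You can scale this to multiple machines by changing the Accelerate config.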

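Configuration

Start by running the following command to create a DeepSpeed configuration file with 🤗 Accelerate. The --config_file flag allows you to save the configuration file to a specific location; otherwise it is saved as a default_config.yaml file in the 🤗 Accelerate cache.

The configuration file is used to set the default options when you launch the training script.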

accelerate config --config_file deepspeed_config.yaml

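You'll be asked a few questions about your setup, and the answers configure the following arguments. In this example, you'll use ZeRO-3, so make sure you pick those options.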

`zero_stage`: [0] Disabled, [1] optimizer state partitioning, [2] optimizer+gradient state partitioning and [3] optimizer+gradient+parameter partitioning
`gradient_accumulation_steps`: Number of training steps to accumulate gradients before averaging and applying them. Pass the same value here as you pass on the command line; otherwise you will encounter a mismatch error.
`gradient_clipping`: Enable gradient clipping with value. Leave this unset, because you will pass it via command-line arguments.
`offload_optimizer_device`: [none] Disable optimizer offloading, [cpu] offload optimizer to CPU, [nvme] offload optimizer to NVMe SSD. Only applicable with ZeRO >= Stage-2. Set this to `none`, as we don't want to enable offloading.
`offload_param_device`: [none] Disable parameter offloading, [cpu] offload parameters to CPU, [nvme] offload parameters to NVMe SSD. Only applicable with ZeRO Stage-3. Set this to `none`, as we don't want to enable offloading.
`zero3_init_flag`: Decides whether to enable `deepspeed.zero.Init` for constructing massive models. Only applicable with ZeRO Stage-3. Set this to `True`.
`zero3_save_16bit_model`: Decides whether to save 16-bit model weights when using ZeRO Stage-3. Set this to `True`.
`mixed_precision`: `no` for FP32 training, `fp16` for FP16 mixed-precision training and `bf16` for BF16 mixed-precision training. Set this to `bf16`.

Once this is done, the corresponding config should look like the one below. You can find it in the configs folder as deepspeed_config.yaml:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 4
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
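Because a mismatch between the config file and the command-line arguments only surfaces once training starts, it can help to check the values up front. The sketch below is a hypothetical helper, not part of 🤗 Accelerate: it takes the config as a plain dictionary (copied by hand from the YAML above, trimmed to the relevant keys) and reports anything that disagrees with the ZeRO-3, no-offload setup described in this section.

```python
# Relevant keys from the YAML config above, written out as a Python dict.
accelerate_config = {
    "deepspeed_config": {
        "gradient_accumulation_steps": 4,
        "offload_optimizer_device": "none",
        "offload_param_device": "none",
        "zero3_init_flag": True,
        "zero3_save_16bit_model": True,
        "zero_stage": 3,
    },
    "distributed_type": "DEEPSPEED",
    "mixed_precision": "bf16",
    "num_machines": 1,
    "num_processes": 8,
}


def check_zero3_config(config, cli_gradient_accumulation_steps):
    """Return a list of problems; an empty list means the config is consistent."""
    ds = config.get("deepspeed_config", {})
    problems = []
    if config.get("distributed_type") != "DEEPSPEED":
        problems.append("distributed_type should be DEEPSPEED")
    if ds.get("zero_stage") != 3:
        problems.append("zero_stage should be 3")
    for key in ("offload_optimizer_device", "offload_param_device"):
        if ds.get(key) != "none":
            problems.append(f"{key} should be 'none' for this example")
    if ds.get("gradient_accumulation_steps") != cli_gradient_accumulation_steps:
        problems.append("gradient_accumulation_steps differs from the CLI value")
    if config.get("mixed_precision") not in ("no", "fp16", "bf16"):
        problems.append("mixed_precision must be 'no', 'fp16' or 'bf16'")
    return problems


if __name__ == "__main__":
    print(check_zero3_config(accelerate_config, cli_gradient_accumulation_steps=4))
```

Returning a list of messages rather than raising on the first problem makes it easy to see every inconsistency in one pass.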

Launch command

The launch command is available in run_peft_deepspeed.sh and is also shown below:

accelerate launch --config_file "configs/deepspeed_config.yaml"  train.py \
--seed 100 \
--model_name_or_path "meta-llama/Llama-2-70b-hf" \
--dataset_name "smangrul/ultrachat-10k-chatml" \
--chat_template_format "chatml" \
--add_special_tokens False \
--append_concat_token False \
--splits "train,test" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--eval_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--packing True \
--learning_rate 1e-4 \
--lr_scheduler_type "cosine" \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--output_dir "llama-sft-lora-deepspeed" \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 4 \
--gradient_checkpointing True \
--use_reentrant False \
--dataset_text_field "content" \
--use_flash_attn True \
--use_peft_lora True \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.1 \
--lora_target_modules "all-linear" \
--use_4bit_quantization False

Notice that we are using LoRA with rank=8 and alpha=16, targeting all linear layers. We are passing the DeepSpeed config file and fine-tuning the 70B Llama model on a subset of the ultrachat dataset.
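The batch-related flags in the launch command combine into a single effective batch size per optimizer step. The short sketch below works that out from the values in the command and the 8 processes in the config. The token count is an upper bound that assumes packing fills every sequence to max_seq_len, and the LoRA scale assumes the standard alpha / r scaling. The function name is illustrative only.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_processes):
    """Number of sequences that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_processes


# Values taken from the launch command and the Accelerate config above.
sequences_per_step = effective_batch_size(
    per_device_batch_size=8, gradient_accumulation_steps=4, num_processes=8
)
max_tokens_per_step = sequences_per_step * 2048  # --max_seq_len 2048
lora_scale = 16 / 8                              # --lora_alpha / --lora_r

if __name__ == "__main__":
    print(sequences_per_step, max_tokens_per_step, lora_scale)
```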

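The important parts

Let's dive a little deeper into the script so you can see what's going on and understand how it works.

The first thing to know is that the script uses DeepSpeed for distributed training because the DeepSpeed config has been passed. The SFTTrainer class handles all the heavy lifting of creating the PEFT model from the PEFT config that is passed to it. After that, when you call trainer.train(), SFTTrainer internally uses 🤗 Accelerate to prepare the model, optimizer and trainer, using the DeepSpeed config to create the DeepSpeed engine, which is then trained. The main code snippet is below: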

# trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
)
trainer.accelerator.print(f"{trainer.model}")

# train
checkpoint = None
if training_args.resume_from_checkpoint is not None:
    checkpoint = training_args.resume_from_checkpoint
trainer.train(resume_from_checkpoint=checkpoint)

# saving final model
trainer.save_model()

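Memory usage

In the above example, the memory consumed per GPU is 64 GB (80% of the 80 GB available), as seen in the screenshot below:

[Figure: GPU memory usage for the training run]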

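Use PEFT QLoRA and DeepSpeed with ZeRO3 for finetuning large models on multiple GPUs

In this section, we will look at how to use QLoRA and DeepSpeed Stage-3 to fine-tune a 70B Llama model on 2X40GB GPUs. For this, we first need bitsandbytes>=0.43.3, accelerate>=1.0.1, transformers>4.44.2, trl>0.11.4 and peft>0.13.0. We need to set zero3_init_flag to true when using the Accelerate config. Below is the config, which can be found in deepspeed_config_z3_qlora.yaml: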

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The launch command is given below and is available in run_peft_qlora_deepspeed_stage3.sh:

accelerate launch --config_file "configs/deepspeed_config_z3_qlora.yaml"  train.py \
--seed 100 \
--model_name_or_path "meta-llama/Llama-2-70b-hf" \
--dataset_name "smangrul/ultrachat-10k-chatml" \
--chat_template_format "chatml" \
--add_special_tokens False \
--append_concat_token False \
--splits "train,test" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--eval_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--packing True \
--learning_rate 1e-4 \
--lr_scheduler_type "cosine" \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--output_dir "llama-sft-qlora-dsz3" \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 2 \
--gradient_checkpointing True \
--use_reentrant True \
--dataset_text_field "content" \
--use_flash_attn True \
--use_peft_lora True \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.1 \
--lora_target_modules "all-linear" \
--use_4bit_quantization True \
--use_nested_quant True \
--bnb_4bit_compute_dtype "bfloat16" \
--bnb_4bit_quant_storage_dtype "bfloat16"

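Notice the new argument being passed, bnb_4bit_quant_storage_dtype, which denotes the data type used for packing the 4-bit parameters. For example, when it is set to bfloat16, which is 16 bits wide, 16/4 = 4 4-bit params are packed together after quantization.

In terms of training code, the important code changes are: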
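To illustrate what packing means here, the sketch below stores four 4-bit values in one 16-bit storage element and recovers them again. It is a conceptual illustration in pure Python only: the function names are invented, and the actual memory layout used by bitsandbytes is not shown in this guide and may differ.

```python
def pack_4bit(values, storage_bits=16):
    """Pack 4-bit integers (0..15) into words of `storage_bits` bits each."""
    per_word = storage_bits // 4          # 16-bit storage holds 4 values
    if len(values) % per_word != 0:
        raise ValueError("number of values must be a multiple of values per word")
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for slot, value in enumerate(values[start:start + per_word]):
            if not 0 <= value <= 15:
                raise ValueError("each value must fit in 4 bits")
            word |= value << (4 * slot)   # lowest slot goes into the lowest 4 bits
        words.append(word)
    return words


def unpack_4bit(words, storage_bits=16):
    """Inverse of pack_4bit."""
    per_word = storage_bits // 4
    return [(word >> (4 * slot)) & 0xF for word in words for slot in range(per_word)]


if __name__ == "__main__":
    packed = pack_4bit([1, 2, 3, 4, 5, 6, 7, 8])
    print([hex(w) for w in packed], unpack_4bit(packed))
```

The point of choosing the storage type is that the packed tensor then has an ordinary floating-point dtype, which is what allows it to be sharded alongside the other model tensors.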

...

bnb_config = BitsAndBytesConfig(
    load_in_4bit=args.use_4bit_quantization,
    bnb_4bit_quant_type=args.bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=args.use_nested_quant,
+   bnb_4bit_quant_storage=quant_storage_dtype,
)

...

model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    quantization_config=bnb_config,
    trust_remote_code=True,
    attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
+   torch_dtype=quant_storage_dtype or torch.float32,
)

Notice that the torch_dtype for AutoModelForCausalLM is the same as the bnb_4bit_quant_storage data type. That's it. Everything else is handled by Trainer and TRL.

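Memory usage

In the above example, the memory consumed per GPU is 36.6 GB. Therefore, what took 8X80GB GPUs with DeepSpeed Stage 3 + LoRA, and a couple of 80GB GPUs with DDP + QLoRA, now requires only 2X40GB GPUs. This makes fine-tuning of large models more accessible.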
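A back-of-the-envelope calculation shows why the 4-bit setup fits on far fewer GPUs. The sketch below estimates only the share of the frozen base weights that each GPU holds under ZeRO-3, assuming a round 70e9 parameters. It ignores quantization constants, adapter weights, optimizer states, activations and buffers, which account for the rest of the measured usage, and the function name is illustrative.

```python
def sharded_weight_gb(n_params, bits_per_param, n_gpus):
    """Per-GPU memory in GB for base weights sharded evenly across GPUs."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / n_gpus / 1e9


N_PARAMS = 70e9  # assumption: round figure for a "70B" model

lora_bf16_8_gpus = sharded_weight_gb(N_PARAMS, bits_per_param=16, n_gpus=8)
qlora_4bit_2_gpus = sharded_weight_gb(N_PARAMS, bits_per_param=4, n_gpus=2)

if __name__ == "__main__":
    print(f"bf16 weights on 8 GPUs:  {lora_bf16_8_gpus:.1f} GB per GPU")
    print(f"4-bit weights on 2 GPUs: {qlora_4bit_2_gpus:.1f} GB per GPU")
```

Under these assumptions the per-GPU weight share is the same in both setups: quantizing to 4 bits shrinks the weights fourfold, which offsets using a quarter as many GPUs.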

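Use PEFT and DeepSpeed with ZeRO3 and CPU Offloading for finetuning large models on a single GPU

This section of the guide will help you learn how to use our DeepSpeed training script. You'll configure the script to train a large model for conditional generation with ZeRO-3 and CPU offload.

💡 To help you get started, check out our example training scripts for causal language modeling and conditional generation. You can adapt these scripts for your own applications, or even use them out of the box if your task is similar to the one in the scripts.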

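Configuration

The configuration file is used to set the default options when you launch the training script. Run the following command to create it with 🤗 Accelerate: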

accelerate config --config_file ds_zero3_cpu.yaml

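You'll be asked a few questions about your setup, and the answers configure the following arguments. In this example, you'll use ZeRO-3 along with CPU offload, so make sure you pick those options.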

`zero_stage`: [0] Disabled, [1] optimizer state partitioning, [2] optimizer+gradient state partitioning and [3] optimizer+gradient+parameter partitioning
`gradient_accumulation_steps`: Number of training steps to accumulate gradients before averaging and applying them.
`gradient_clipping`: Enable gradient clipping with value.
`offload_optimizer_device`: [none] Disable optimizer offloading, [cpu] offload optimizer to CPU, [nvme] offload optimizer to NVMe SSD. Only applicable with ZeRO >= Stage-2.
`offload_param_device`: [none] Disable parameter offloading, [cpu] offload parameters to CPU, [nvme] offload parameters to NVMe SSD. Only applicable with ZeRO Stage-3.
`zero3_init_flag`: Decides whether to enable `deepspeed.zero.Init` for constructing massive models. Only applicable with ZeRO Stage-3.
`zero3_save_16bit_model`: Decides whether to save 16-bit model weights when using ZeRO Stage-3.
`mixed_precision`: `no` for FP32 training, `fp16` for FP16 mixed-precision training and `bf16` for BF16 mixed-precision training.

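An example configuration file might look like the following. The most important thing to notice is that zero_stage is set to 3, and that offload_optimizer_device and offload_param_device are set to cpu.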

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
use_cpu: false
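For readers more familiar with DeepSpeed's own JSON config, the sketch below shows roughly how the deepspeed_config fields above correspond to native DeepSpeed keys such as zero_optimization, offload_optimizer and offload_param. The helper is hypothetical and only illustrates the correspondence. The translation that 🤗 Accelerate performs internally may set additional fields.

```python
def to_deepspeed_dict(ds):
    """Map Accelerate-style deepspeed_config fields to a DeepSpeed-style dict."""
    zero = {"stage": ds["zero_stage"]}
    # Offload sections are only added when a device other than "none" is chosen.
    if ds.get("offload_optimizer_device", "none") != "none":
        zero["offload_optimizer"] = {"device": ds["offload_optimizer_device"]}
    if ds.get("offload_param_device", "none") != "none":
        zero["offload_param"] = {"device": ds["offload_param_device"]}
    if ds["zero_stage"] == 3:
        zero["stage3_gather_16bit_weights_on_model_save"] = ds.get(
            "zero3_save_16bit_model", False
        )
    config = {
        "zero_optimization": zero,
        "gradient_accumulation_steps": ds.get("gradient_accumulation_steps", 1),
    }
    if "gradient_clipping" in ds:
        config["gradient_clipping"] = ds["gradient_clipping"]
    return config


# The deepspeed_config section from the YAML above.
ds_zero3_cpu = {
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "offload_optimizer_device": "cpu",
    "offload_param_device": "cpu",
    "zero3_init_flag": True,
    "zero3_save_16bit_model": True,
    "zero_stage": 3,
}

if __name__ == "__main__":
    import json
    print(json.dumps(to_deepspeed_dict(ds_zero3_cpu), indent=2))
```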

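The important parts

Let's dive a little deeper into the script so you can see what's going on and understand how it works.

Within the main function, the script creates an Accelerator class to initialize all the necessary requirements for distributed training.

💡 Feel free to change the model and dataset inside the main function. If your dataset format is different from the one in the script, you may also need to write your own preprocessing function.

The script also creates a configuration for the 🤗 PEFT method you're using, which in this case is LoRA. The LoraConfig specifies the task type and important parameters such as the dimension of the low-rank matrices, the matrices' scaling factor, and the dropout probability of the LoRA layers. If you want to use a different 🤗 PEFT method, make sure you replace LoraConfig with the appropriate class.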

 def main():
+    accelerator = Accelerator()
     model_name_or_path = "facebook/bart-large"
     dataset_name = "twitter_complaints"
+    peft_config = LoraConfig(
         task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
     )

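Throughout the script, you'll see the main_process_first and wait_for_everyone functions, which help control and synchronize when processes are executed.

The get_peft_model() function takes a base model and the peft_config you prepared earlier to create a PeftModel: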

  model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
+ model = get_peft_model(model, peft_config)

Pass all the relevant training objects to 🤗 Accelerate's prepare, which makes sure everything is ready for training:

model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler = accelerator.prepare(
    model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler
)

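The next bit of code checks whether the DeepSpeed plugin is used in the Accelerator and, if the plugin exists, whether we are using ZeRO-3. This conditional flag is passed to generate during inference so that the GPUs stay in sync while the model parameters are sharded: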

is_ds_zero_3 = False
if getattr(accelerator.state, "deepspeed_plugin", None):
    is_ds_zero_3 = accelerator.state.deepspeed_plugin.zero_stage == 3
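As a sketch of how such a flag is typically consumed, the helper below builds the keyword arguments for a generate call and sets synced_gpus from the ZeRO-3 check. The helper names are made up for this illustration, and the stand-in state objects exist only so the snippet runs without a distributed setup. In the real script the state comes from accelerator.state.

```python
from types import SimpleNamespace


def is_zero3(state):
    """True when a DeepSpeed plugin is present and configured for ZeRO stage 3."""
    plugin = getattr(state, "deepspeed_plugin", None)
    return plugin is not None and plugin.zero_stage == 3


def generation_kwargs(state, **kwargs):
    """Add synced_gpus so every rank keeps calling forward while weights are sharded."""
    return {**kwargs, "synced_gpus": is_zero3(state)}


# Stand-ins for accelerator.state in three situations (illustrative only).
zero3_state = SimpleNamespace(deepspeed_plugin=SimpleNamespace(zero_stage=3))
zero2_state = SimpleNamespace(deepspeed_plugin=SimpleNamespace(zero_stage=2))
plain_state = SimpleNamespace()

if __name__ == "__main__":
    print(generation_kwargs(zero3_state, max_new_tokens=10))
    print(generation_kwargs(plain_state, max_new_tokens=10))
```

With ZeRO-3, each forward pass gathers parameter shards from all ranks, so a rank that finishes generating early must keep participating until the others are done. That is the reason the flag depends on the stage.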

Inside the training loop, the usual loss.backward() is replaced by 🤗 Accelerate's backward, which uses the correct backward() method based on your configuration:

  for epoch in range(num_epochs):
      with TorchTracemalloc() as tracemalloc:
          model.train()
          total_loss = 0
          for step, batch in enumerate(tqdm(train_dataloader)):
              outputs = model(**batch)
              loss = outputs.loss
              total_loss += loss.detach().float()
+             accelerator.backward(loss)
              optimizer.step()
              lr_scheduler.step()
              optimizer.zero_grad()

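That is it! The rest of the script handles the training loop, evaluation, and even pushes the model to the Hub for you.

Training

Run the following command to launch the training script. Earlier, you saved the configuration file to ds_zero3_cpu.yaml, so you'll need to pass the path to the launcher with the --config_file argument like this: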

accelerate launch --config_file ds_zero3_cpu.yaml examples/peft_lora_seq2seq_accelerate_ds_zero3_offload.py

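You'll see some output logs that track memory usage during training. Once training has completed, the script returns the accuracy and compares the predictions to the labels: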

GPU Memory before entering the train : 1916
GPU Memory consumed at the end of the train (end-begin): 66
GPU Peak Memory consumed during the train (max-begin): 7488
GPU Total Peak Memory consumed during the train (max): 9404
CPU Memory before entering the train : 19411
CPU Memory consumed at the end of the train (end-begin): 0
CPU Peak Memory consumed during the train (max-begin): 0
CPU Total Peak Memory consumed during the train (max): 19411
epoch=4: train_ppl=tensor(1.0705, device='cuda:0') train_epoch_loss=tensor(0.0681, device='cuda:0')
100%|████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:27<00:00,  3.92s/it]
GPU Memory before entering the eval : 1982
GPU Memory consumed at the end of the eval (end-begin): -66
GPU Peak Memory consumed during the eval (max-begin): 672
GPU Total Peak Memory consumed during the eval (max): 2654
CPU Memory before entering the eval : 19411
CPU Memory consumed at the end of the eval (end-begin): 0
CPU Peak Memory consumed during the eval (max-begin): 0
CPU Total Peak Memory consumed during the eval (max): 19411
accuracy=100.0
eval_preds[:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']
dataset['train'][label_column][:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']
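The numbers in the log are related in a simple way, which the sketch below checks using the values printed above. It assumes the memory figures are in megabytes and that the reported perplexity is the exponential of the epoch loss. Both are readings of the log rather than something stated in this guide.

```python
import math


def total_peak(before, peak_delta):
    """Total peak memory = memory before the phase + peak increase during it."""
    return before + peak_delta


# Values copied from the training and evaluation logs above.
train_total_peak = total_peak(before=1916, peak_delta=7488)
eval_total_peak = total_peak(before=1982, peak_delta=672)

# Perplexity recomputed from the logged (rounded) epoch loss.
train_ppl = math.exp(0.0681)

if __name__ == "__main__":
    print(train_total_peak, eval_total_peak, round(train_ppl, 4))
```

The recomputed totals match the "Total Peak Memory" lines in the log, and the perplexity agrees with the logged train_ppl to four decimal places.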

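Caveats

1. Merging when using PEFT and DeepSpeed is currently unsupported and will raise an error.
2. When using CPU offloading, the major gains from using PEFT to shrink the optimizer states and gradients to those of the adapter weights are realized in CPU RAM, and there are no savings in GPU memory.
3. DeepSpeed Stage 3 and QLoRA, when used with CPU offloading, lead to more GPU memory usage than with CPU offloading disabled.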
