Online Batch Size Adaptation in the Hugging Face Trainer

Community Article · Published February 24, 2025

🎉 What's New

This repository provides a minimal extension of Hugging Face's Trainer that supports dynamically changing the batch size at every training step.

https://github.com/ifiaposto/hf_trainer_with_online_batch_size

🎯 Motivation

Training with a dynamically adjustable batch size has been shown to be beneficial in several ways:

  1. Improved training efficiency: empirical studies show that gradually increasing the batch size, combined with learning-rate decay, attains the convergence benefits of small batch sizes while exploiting the throughput advantages of large, multi-GPU batches [1]. This training scheme has been revisited in cutting-edge large language models such as DeepSeek-V2 [2].
  2. Support for advanced learning algorithms that adaptively mix multiple training data sources. In this setting, the number of examples drawn from each data stream may not be known in advance and should be balanced dynamically, based on training metrics, to optimize their effect on the loss function.
    • In multi-task learning, the data sources correspond to different tasks [3].
    • In continual learning, knowledge retention must be balanced against generalization to new data by effectively mixing new and past training examples [4,5].

🛠️ Installation

The repository is built with Python 3.12.9. You can install the required dependencies with:

pip install -r requirements.txt

The exact CUDA environment is listed below:

nvidia-cublas-cu12       12.4.5.8
nvidia-cuda-cupti-cu12   12.4.127
nvidia-cuda-nvrtc-cu12   12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12        9.1.0.70
nvidia-cufft-cu12        11.2.1.3
nvidia-curand-cu12       10.3.5.147
nvidia-cusolver-cu12     11.6.1.9
nvidia-cusparse-cu12     12.3.1.170
nvidia-cusparselt-cu12   0.6.2
nvidia-nccl-cu12         2.21.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.4.127

🚀 Quick Start

Using the trainer

You only need to define a callable batch-size scheduler and pass it as an argument to the trainer:

from functools import partial

#   < ------- Write your own batch size scheduler ------- >
def custom_batch_size_scheduler(step: int,
                                batch_size: int,
                                interval=5,
                                increment=1):
    """
        step: current optimization step, provided by the trainer.
        batch_size: current batch size, provided by the trainer.
    """

    if step % interval == 0 and step > 0:
        return batch_size + increment
    return batch_size


# Extended trainer using the adaptive batch sampler.
trainer = AdaptiveBatchSizeTrainer(
    batch_size_scheduler=partial(custom_batch_size_scheduler,
                                 interval=5,
                                 increment=1),
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=collate_fn,
)
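
Training is then launched as with the standard Trainer:

trainer.train()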

AdaptiveBatchSizeTrainer uses Trainer's standard TrainingArguments. The full list of training arguments for the run script can be found in the repository.

The repository currently controls the length of optimization only through the num_train_epochs argument, which corresponds to the number of passes the model makes over the training dataset. Note that if your batch-size scheduler operates online, the number of optimization steps may not be known in advance. To avoid overriding Trainer's _inner_training_loop and to stay consistent, max_steps should be set to -1.
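
For illustration, a minimal argument setup consistent with the above could look as follows (a sketch with arbitrary values; all fields shown are standard TrainingArguments):

from transformers import TrainingArguments

# Optimization length is governed by num_train_epochs; max_steps stays at -1
# because the number of optimization steps is not known in advance when the
# batch size changes online.
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=10,
    max_steps=-1,
    per_device_train_batch_size=1,  # starting batch size for the scheduler to adjust
    logging_steps=1,
)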

Understanding the logs

AdaptiveBatchSizeTrainer extends and modifies Trainer's training logs to account for epochs of varying length:

  • train_coverage: the percentage of training examples seen in the current epoch. Note that, depending on the batch-size scheduler you use, it may not increase linearly.
  • epoch: no longer updated according to the number of optimization steps performed, but according to the total percentage of training examples seen up to the current optimization step. This is equivalent to summing train_coverage across epochs (see the sketch below).
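
To make the bookkeeping concrete, here is a small sketch (not the repository's actual code) that reproduces the train_coverage and epoch values logged in the first eight steps of the multi-GPU demo below, assuming a dataset of 10 examples split across 2 GPUs:

# Global (across-GPU) number of examples consumed at each optimization step,
# taken from the multi-GPU demo output below.
dataset_size = 10
global_batch_sizes = [2, 2, 2, 2, 2, 4, 4, 2]

seen_total = 0
for step, bs in enumerate(global_batch_sizes, start=1):
    seen_total += bs
    epoch = seen_total / dataset_size                      # cumulative coverage across epochs
    seen_in_epoch = seen_total % dataset_size or dataset_size
    train_coverage = seen_in_epoch / dataset_size          # fraction of the current epoch seen
    print(f"step {step}: train_coverage={train_coverage:.1f}, epoch={epoch:.1f}")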

🧐 Demo

Below are a few example commands for running the test code with different single-/multi-GPU training configurations and settings of dataloader_drop_last (which drops the last batch when too few training examples remain). We will inspect the output of the last one.

Single GPU, drop the last batch.
python -m torch.distributed.run --nproc-per-node=1 --master_port=29619 -m test_replay_dataloader \
    --output_dir ./results \
    --logging_dir ./logs \
    --logging_steps 1 \
    --save_strategy no \
    --num_train_epochs 10 \
    --per_device_train_batch_size 1 \
    --max_steps -1 \
    --dataloader_drop_last
Multi-GPU, drop the last batch.
python -m torch.distributed.run --nproc-per-node=2 --master_port=29619 -m test_replay_dataloader \
    --output_dir ./results \
    --logging_dir ./logs \
    --logging_steps 1 \
    --save_strategy no \
    --num_train_epochs 10 \
    --per_device_train_batch_size 1 \
    --max_steps -1 \
    --dataloader_drop_last
Single GPU, keep the last batch.
python -m torch.distributed.run --nproc-per-node=1 --master_port=29619 -m test_replay_dataloader \
    --output_dir ./results \
    --logging_dir ./logs \
    --logging_steps 1 \
    --save_strategy no \
    --num_train_epochs 10 \
    --per_device_train_batch_size 1 \
    --max_steps -1 \
    --dataloader_drop_last False
    
Multi-GPU, keep the last batch.
python -m torch.distributed.run --nproc-per-node=2 --master_port=29619 -m test_replay_dataloader \
    --output_dir ./results \
    --logging_dir ./logs \
    --logging_steps 1 \
    --save_strategy no \
    --num_train_epochs 10 \
    --per_device_train_batch_size 1 \
    --max_steps -1 \
    --dataloader_drop_last False
    
  • Notice that every 5 steps (epochs 1.4, 3.6, etc.), the local batch size increases by one according to the scheduler.

  • Since dataloader_drop_last=False, the actual batch size may differ from the scheduler's batch size when train_coverage=1.0.

    Multi-GPU output, with the last batch kept.
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 1) local batch size 1,  indices tensor([1], device='cuda:1')                                                                                  
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 0) local batch size 1,  indices tensor([4], device='cuda:0')                                                                                  
    [rank0]:[W224 16:51:33.619610762 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
    [rank1]:[W224 16:51:33.629522783 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
    {'loss': 0.5852, 'grad_norm': 11.941976547241211, 'learning_rate': 4.9500000000000004e-05, 'train coverage': 0.2, 'epoch': 0.2}                                                                             
      1%|█▋                                                                                                                                                                     | 1/100 [00:00<00:52,  1.90it/s$
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 0) local batch size 1,  indices tensor([7], device='cuda:0')                                                                                  
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 1) local batch size 1,  indices tensor([5], device='cuda:1')                                                                                  
    {'loss': 0.7744, 'grad_norm': 7.989934921264648, 'learning_rate': 4.9e-05, 'train coverage': 0.4, 'epoch': 0.4}                                                                                             
      2%|███▎                                                                                                                                                                   | 2/100 [00:00<00:28,  3.38it/s$
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 0) local batch size 1,  indices tensor([3], device='cuda:0')                                                                                  
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 1) local batch size 1,  indices tensor([9], device='cuda:1')                                                                                  
    {'loss': 0.6889, 'grad_norm': 3.215988874435425, 'learning_rate': 4.85e-05, 'train coverage': 0.6, 'epoch': 0.6}                                                                                            
      3%|█████                                                                                                                                                                  | 3/100 [00:00<00:23,  4.18it/s$
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 0) local batch size 1,  indices tensor([0], device='cuda:0')                                                                                  
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 1) local batch size 1,  indices tensor([8], device='cuda:1')                                                                                  
    {'loss': 0.7219, 'grad_norm': 5.315413475036621, 'learning_rate': 4.8e-05, 'train coverage': 0.8, 'epoch': 0.8}                                                                                             
      4%|██████▋                                                                                                                                                                | 4/100 [00:01<00:20,  4.73it/s$
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 1) local batch size 1,  indices tensor([2], device='cuda:1')                                                                                  
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 0) local batch size 1,  indices tensor([6], device='cuda:0')                                                                                  
    {'loss': 0.7349, 'grad_norm': 9.186591148376465, 'learning_rate': 4.75e-05, 'train coverage': 1.0, 'epoch': 1.0}                                                                                            
      5%|████████▎                                                                                                                                                              | 5/100 [00:01<00:17,  5.46it/s$
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 1) local batch size 2,  indices tensor([6, 2], device='cuda:1')                                                                               
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 0) local batch size 2,  indices tensor([5, 1], device='cuda:0')                                                                               
    {'loss': 0.6667, 'grad_norm': 7.5692901611328125, 'learning_rate': 4.7e-05, 'train coverage': 0.4, 'epoch': 1.4}                                                                                            
      6%|██████████                                                                                                                                                             | 6/100 [00:01<00:14,  6.41it/s$
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 1) local batch size 2,  indices tensor([8, 3], device='cuda:1')                                                                               
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 0) local batch size 2,  indices tensor([0, 9], device='cuda:0')                                                                               
    {'loss': 0.6908, 'grad_norm': 1.293290615081787, 'learning_rate': 4.6500000000000005e-05, 'train coverage': 0.8, 'epoch': 1.8}                                                                              
      7%|███████████▋                                                                                                                                                           | 7/100 [00:01<00:14,  6.41it/s$
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 0) local batch size 1,  indices tensor([7], device='cuda:0')
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 1) local batch size 1,  indices tensor([4], device='cuda:1')
    {'loss': 0.6941, 'grad_norm': 0.7848922610282898, 'learning_rate': 4.600000000000001e-05, 'train coverage': 1.0, 'epoch': 2.0}
      8%|█████████████▎                                                                                                                                                         | 8/100 [00:01<00:11,  8.29it/s$
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 0) local batch size 2,  indices tensor([8, 1], device='cuda:0')
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 1) local batch size 2,  indices tensor([7, 5], device='cuda:1')
    {'loss': 0.6856, 'grad_norm': 9.592278480529785, 'learning_rate': 4.55e-05, 'train coverage': 0.4, 'epoch': 2.4}
      9%|███████████████                                                                                                                                                        | 9/100 [00:01<00:11,  7.86it/s$
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 0) local batch size 2,  indices tensor([6, 0], device='cuda:0')
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 1) local batch size 2,  indices tensor([9, 4], device='cuda:1')
    {'loss': 0.6908, 'grad_norm': 0.9208756685256958, 'learning_rate': 4.5e-05, 'train coverage': 0.8, 'epoch': 2.8}
     10%|████████████████▌                                                                                                                                                     | 10/100 [00:01<00:11,  7.94it/s$
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 0) local batch size 1,  indices tensor([2], device='cuda:0')
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 1) local batch size 1,  indices tensor([3], device='cuda:1')
    {'loss': 0.7796, 'grad_norm': 20.710803985595703, 'learning_rate': 4.4500000000000004e-05, 'train coverage': 1.0, 'epoch': 3.0}
     11%|██████████████████▎                                                                                                                                                   | 11/100 [00:01<00:10,  8.41it/s]
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 0) local batch size 3,  indices tensor([6, 3, 8], device='cuda:0')
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 1) local batch size 3,  indices tensor([0, 7, 5], device='cuda:1')
    {'loss': 0.5507, 'grad_norm': 9.137839317321777, 'learning_rate': 4.4000000000000006e-05, 'train coverage': 0.6, 'epoch': 3.6}
     12%|███████████████████▉                                                                                                                                                  | 12/100 [00:01<00:10,  8.41it/s]
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 1) local batch size 2,  indices tensor([9, 4], device='cuda:1')
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 0) local batch size 2,  indices tensor([1, 2], device='cuda:0')
    {'loss': 0.8429, 'grad_norm': 9.582053184509277, 'learning_rate': 4.35e-05, 'train coverage': 1.0, 'epoch': 4.0}
     13%|█████████████████████▌                                                                                                                                                | 13/100 [00:01<00:09,  9.06it/s]
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 0) local batch size 3,  indices tensor([0, 9, 7], device='cuda:0')
    compute loss Current GPU device: Tesla V100-SXM2-32GB (Device 1) local batch size 3,  indices tensor([4, 6, 3], device='cuda:1')
    {'loss': 0.7207, 'grad_norm': 5.453138828277588, 'learning_rate': 4.3e-05, 'train coverage': 0.6, 'epoch': 4.6}
    

🤓 Solution Overview

Challenges

💡 I preferred a minimally invasive extension of the HF Trainer. Therefore, instead of rewriting existing methods to support the feature, I rely on callbacks to reset the batch size and to determine training termination.

💡 The extension supports per-step rather than per-epoch batch-size adaptation for greater flexibility. As a result, the DataLoader's length is unknown at the beginning of each epoch.

💡 Because the DataLoader's length is variable, logging and callbacks must be handled carefully and/or redefined.

💡 Batch prefetching should currently be disabled to ensure that batch-size updates are applied synchronously.
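
In stock PyTorch, batches are prefetched only by DataLoader worker processes, so one simple way to satisfy this constraint (an assumption about the setup, not a flag specific to this repository) is to keep data loading in the main process:

from transformers import TrainingArguments

# dataloader_num_workers=0 (the Trainer default) loads each batch in the main
# process, so no batch is materialized ahead of a pending batch-size update.
training_args = TrainingArguments(output_dir="./results", dataloader_num_workers=0)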

Main Components

📌 AdaptiveBatchSampler inherits from PyTorch's BatchSampler. It overrides the iterator and adds a method for updating the batch size (rough sketches follow this list).

📌 AdaptiveBatchSizeDataLoader inherits from PyTorch's DataLoader. It wraps AdaptiveBatchSampler and adds a set_epoch method that re-seeds the sampler at the start of each epoch.

📌 AdaptiveBatchSizeTrainer inherits from Hugging Face's Trainer. It

  • uses AdaptiveBatchSizeDataLoader and handles distributed data-parallel training,
  • invokes the batch-size scheduler through a callback,
  • handles the revised/extended logging.
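
A rough sketch of the sampler-side idea (illustrative only, not the repository's actual implementation; the method name update_batch_size is an assumption):

from torch.utils.data import BatchSampler


class AdaptiveBatchSampler(BatchSampler):
    """BatchSampler whose batch size may change between yielded batches."""

    def update_batch_size(self, batch_size: int):
        self.batch_size = batch_size

    def __iter__(self):
        batch = []
        for idx in self.sampler:
            batch.append(idx)
            # self.batch_size is re-read for every batch, so an update made by
            # the trainer between optimization steps takes effect immediately.
            if len(batch) >= self.batch_size:
                yield batch
                batch = []
        if batch and not self.drop_last:
            yield batch


And a sketch of how the scheduler could be invoked through a callback after each optimization step (again illustrative; the actual trainer wires this up internally):

from transformers import TrainerCallback


class BatchSizeSchedulerCallback(TrainerCallback):
    """Applies a user-provided scheduler to the adaptive batch sampler after each step."""

    def __init__(self, scheduler, batch_sampler):
        self.scheduler = scheduler
        self.batch_sampler = batch_sampler

    def on_step_end(self, args, state, control, **kwargs):
        new_batch_size = self.scheduler(step=state.global_step,
                                        batch_size=self.batch_sampler.batch_size)
        self.batch_sampler.update_batch_size(new_batch_size)
        return control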

Future Optimizations

✖️ Support transformers versions >= 4.45.2.

✖️ Make the extension compatible with accelerate or support preemptive prefetching. Currently, accelerate prefetches one batch by default.

📚 References

[1] Devarakonda, A., Naumov, M. and Garland, M., 2017. Adabatch: Adaptive batch sizes for training deep neural networks. arXiv preprint arXiv:1712.02029.

[2] Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D. and Yang, D., 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.

[3] Li, Z., Deng, Y., Zhong, P., Razaviyayn, M. and Mirrokni, V., 2025. PiKE: Adaptive Data Mixing for Multi-Task Learning Under Low Gradient Conflicts. arXiv preprint arXiv:2502.06244.

[4] Chaudhry, A., Rohrbach, M., Elhoseiny, M., Ajanthan, T., Dokania, P.K., Torr, P.H. and Ranzato, M.A., 2019. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486.

[5] Wu, T., Luo, L., Li, Y.F., Pan, S., Vu, T.T. and Haffari, G., 2024. Continual learning for large language models: A survey. arXiv preprint arXiv:2402.01364.

💐 Buy Me Flowers (or Cite)

If you use this repository in your research, please cite it using the following BibTeX entry:

@software{adaptive_batch_size_hf_trainer,
  title        = {Online Batch Size Adaptation in Hugging Face Trainer},
  author       = {Apostolopoulou, Ifigeneia},
  howpublished = {\url{https://github.com/ifiaposto/hf_trainer_with_online_batch_size}},
  year         = {2025},
}
