Compilation

Overview

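PyTorch 2.0 introduced torch.compile, a feature that makes PyTorch code run faster by JIT-compiling it into optimized kernels. Key features of torch.compile include:

Performance improvement: significantly speeds up model execution by optimizing the computation graph.
Ease of use: requires only minimal code changes, making it easy to adopt.
Compatibility: works seamlessly with existing PyTorch code and models.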

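When used with Accelerate, torch.compile integrates smoothly into distributed training workflows, so you benefit from distributed execution and compilation optimizations at the same time.

The first execution of compiled code typically takes longer because it includes compilation time, but subsequent runs are much faster. To get the best performance in different scenarios, torch.compile offers several modes, such as "default", "reduce-overhead" (which uses CUDA graphs to further reduce overhead), and "max-autotune" (which performs extensive autotuning to find the best kernels for your model).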
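The warm-up cost described above can be observed directly. The following is a minimal sketch, not a benchmark: the TinyMLP model and the timed helper are hypothetical names made up for illustration, and backend="eager" is an assumption chosen so the snippet runs without a GPU or a C++ toolchain (it traces the graph but skips Inductor's code generation). Switching to the default "inductor" backend with a mode such as "max-autotune" is what would yield real speedups.

```python
import time

import torch
from torch import nn


class TinyMLP(nn.Module):
    """Hypothetical toy model used only for this illustration."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


torch.manual_seed(0)
model = TinyMLP().eval()
x = torch.randn(8, 64)

# Assumption: the "eager" backend keeps the sketch portable; the default is "inductor".
compiled = torch.compile(model, backend="eager")

with torch.no_grad():
    expected = model(x)
    first_out, first_s = timed(compiled, x)    # includes graph capture / compilation
    second_out, second_s = timed(compiled, x)  # reuses the already-compiled graph

print(f"first call: {first_s:.4f}s, second call: {second_s:.4f}s")
```

The exact timings depend on hardware and PyTorch version, so no numbers are quoted here; the point is only that the first call carries the one-time compilation cost.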

Using torch.compile with Accelerate

Accelerate provides TorchDynamoPlugin, which makes it straightforward to integrate torch.compile into your training scripts.

from accelerate import Accelerator
from accelerate.utils import TorchDynamoPlugin

# Configure the compilation backend
dynamo_plugin = TorchDynamoPlugin(
    backend="inductor",  # Options: "inductor", "aot_eager", "aot_nvfuser", etc.
    mode="default",      # Options: "default", "reduce-overhead", "max-autotune"
    fullgraph=True,
    dynamic=False
)

# Initialize accelerator with the plugin
accelerator = Accelerator(dynamo_plugin=dynamo_plugin)
# This will apply torch.compile to your model
model = accelerator.prepare(model)

It is compatible with all other Accelerate features and plugins, including mixed precision and distributed training (DDP, FSDP, DeepSpeed).

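Regional compilation

Instead of trying to compile the whole model, which usually presents a very large optimization problem space, regional compilation targets repeated blocks of the same class and compiles them sequentially so that they hit the compiler's cache. For example, in GPT2LMHeadModel, the repeated block/class is GPT2Block, which can be accessed as model.transformer.h[0]. The rest of the model (e.g. model.lm_head) is compiled separately.

This reduces the compilation overhead and cold-start time of models such as LLMs and Transformers in general. See https://pytorch.ac.cn/tutorials/recipes/regional_compilation.html for more details.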
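The idea can be sketched by hand with plain torch.compile. This is an illustration of the technique under stated assumptions, not Accelerate's actual implementation: Block and TinyStack are hypothetical stand-ins for something like GPT2Block and GPT2LMHeadModel, and backend="eager" is assumed only so the snippet runs without a GPU or a C++ toolchain.

```python
import torch
from torch import nn


class Block(nn.Module):
    """Hypothetical repeated block, standing in for something like GPT2Block."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.relu(self.ff(self.norm(x)))


class TinyStack(nn.Module):
    """Hypothetical model: a stack of identical blocks followed by a head."""

    def __init__(self, dim: int = 32, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return self.head(x)


torch.manual_seed(0)
model = TinyStack().eval()
x = torch.randn(2, 32)

with torch.no_grad():
    expected = model(x)  # reference output before any compilation

# Compile each repeated block on its own, one after another. Because all blocks
# share the same class, later blocks can benefit from what was cached for the first.
for i in range(len(model.blocks)):
    model.blocks[i] = torch.compile(model.blocks[i], backend="eager")

# Compile the non-repeated remainder of the model separately.
model.head = torch.compile(model.head, backend="eager")

with torch.no_grad():
    actual = model(x)
```

Compiling the blocks individually keeps each compilation unit small, which is where the cold-start savings come from; the trade-off is that the compiler cannot optimize across block boundaries.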

How to use regional compilation

Enable it by setting use_regional_compilation=True in the TorchDynamoPlugin configuration.

# Configure the compilation backend
dynamo_plugin = TorchDynamoPlugin(
    use_regional_compilation=True,
    ... # other parameters
)
# Initialize accelerator with the plugin
accelerator = Accelerator(dynamo_plugin=dynamo_plugin)
# This will apply compile_regions to your model
model = accelerator.prepare(model)

You can also use the accelerate.utils.compile_regions utility directly, the same way you would use torch.compile.

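Benefits of regional compilation

We ran extensive benchmarks comparing full compilation and regional compilation using the torch.compile feature in PyTorch. The full results are available in the accelerate repository. The key findings from our benchmarks are:

Comparable performance: regional compilation delivers speedups similar to full compilation, especially for larger models.
Faster compilation: regional compilation significantly reduces the time needed to compile a model, making it a more efficient choice for deployment.
Batch size impact: the performance difference between compilation strategies shrinks as batch size grows, which suggests that compilation overhead matters less in those cases.
Model size considerations: the benefits of regional compilation are more pronounced for larger models, where the savings in compilation time can be substantial.
Practical application: for real-world use, regional compilation is a practical way to reduce training cold-start time, especially when working with large models.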

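Conclusion

Both full and regional compilation can significantly speed up your models. Regional compilation offers a practical balance between compilation time and runtime performance, especially when training large models with large batch sizes.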
