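Multi-GPU debugging

Distributed training can be tricky because you have to make sure the correct CUDA version is used across your whole system. You may also run into communication problems between GPUs, and your model may underflow or overflow.

This guide covers how to debug these issues, especially those related to DeepSpeed and PyTorch.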

DeepSpeed CUDA

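DeepSpeed compiles CUDA C++, which can be a source of errors when building PyTorch extensions that require CUDA. These errors depend on how CUDA is installed on your system. This section focuses on PyTorch built with CUDA 10.2.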

pip install deepspeed

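For any other installation issues, please open an issue with the DeepSpeed team.

Non-identical toolkits

PyTorch comes with its own CUDA toolkit, but to use DeepSpeed with PyTorch, you need an identical version of CUDA installed system-wide. For example, if you installed PyTorch with cudatoolkit==10.2 in your Python environment, then you also need CUDA 10.2 installed system-wide.

The exact location varies by system, but /usr/local/cuda-10.2 is the most common location on many Unix systems. Once CUDA is set up correctly and added to your PATH environment variable, you can find the installation location with the following command.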

which nvcc
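To compare the system toolkit with the one PyTorch was built against, the same lookup can be scripted. The sketch below is only illustrative: the helper names are hypothetical, and it assumes `nvcc --version` prints a line containing `release X.Y`, which is the usual format.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_text):
    """Extract the 'major.minor' release from `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", version_text)
    return match.group(1) if match else None


def system_cuda_release():
    """Return the release reported by the first nvcc on PATH, or None if absent."""
    nvcc = shutil.which("nvcc")  # same lookup as `which nvcc`
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True, check=False)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("system nvcc release:", system_cuda_release())
    # If PyTorch is installed, compare with: import torch; print(torch.version.cuda)
```

If the two versions differ, the build is likely picking up a toolkit other than the one PyTorch expects.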

Multiple toolkits

You may also have more than one CUDA toolkit installed on your system.

/usr/local/cuda-10.2
/usr/local/cuda-11.0

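Package installers typically set the paths to whichever version was installed last. If a package build fails because it can't find the right CUDA version (even though it is installed), configure the PATH and LD_LIBRARY_PATH environment variables to point to the correct path.

First, take a look at the contents of these environment variables.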

echo $PATH
echo $LD_LIBRARY_PATH

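PATH lists the locations of executables, and LD_LIBRARY_PATH lists the locations searched for shared libraries. Earlier entries take priority over later ones, and : separates multiple entries. To make the build find a specific CUDA toolkit, insert the correct path at the front of the list. The commands below prepend to the existing values rather than overwrite them.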

# adjust the version and full path if needed
export PATH=/usr/local/cuda-10.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH

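You should also check that the directories you assign actually exist. The lib64 subdirectory contains various CUDA .so objects (such as libcudart.so). It is unlikely that your system names them differently, but check the actual names and adjust accordingly.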
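The existence check and the prepend logic can be sketched in Python as follows. The helper names `prepend_path` and `missing_dirs` are hypothetical, and the toolkit path is the example location used in this guide. The script only prints the values; it does not change the environment of the shell that launched it.

```python
import os


def prepend_path(current, new_entry):
    """Return `current` with `new_entry` first, dropping empty and duplicate entries."""
    entries = [e for e in current.split(os.pathsep) if e and e != new_entry]
    return os.pathsep.join([new_entry] + entries)


def missing_dirs(cuda_home):
    """Return the expected toolkit subdirectories that do not exist."""
    expected = [os.path.join(cuda_home, "bin"), os.path.join(cuda_home, "lib64")]
    return [d for d in expected if not os.path.isdir(d)]


cuda_home = "/usr/local/cuda-10.2"  # adjust the version and full path if needed
print("missing directories:", missing_dirs(cuda_home))
print("PATH would become:", prepend_path(os.environ.get("PATH", ""), os.path.join(cuda_home, "bin")))
```

An empty list of missing directories means both bin and lib64 were found under the chosen toolkit root.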

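Older versions

Sometimes, older versions of CUDA refuse to build with newer compilers. For example, you might have gcc-9 while CUDA wants gcc-7. Installing the latest CUDA toolkit usually enables support for newer compilers.

You can also install an older compiler version alongside the one you currently use (or it may already be installed but not used by default, so the build system can't see it). To resolve this, create symlinks that make the older compiler visible to the build system.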

# adjust the path to your system
sudo ln -s /usr/bin/gcc-7 /usr/local/cuda-10.2/bin/gcc
sudo ln -s /usr/bin/g++-7 /usr/local/cuda-10.2/bin/g++

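Prebuild

If you still have trouble installing DeepSpeed, or if you build DeepSpeed at runtime, try prebuilding the DeepSpeed modules before installing them. Run the commands below to build DeepSpeed locally.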

git clone https://github.com/deepspeedai/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--global-option="build_ext" --global-option="-j8" --no-cache -v \
--disable-pip-version-check 2>&1 | tee build.log

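Add the DS_BUILD_AIO=1 parameter to the build command to use NVMe offload. Make sure the libaio-dev package is installed on your system.

Next, specify your GPU's architecture by editing the TORCH_CUDA_ARCH_LIST variable (NVIDIA publishes a complete list of its GPUs and their corresponding architectures). To check which architectures your PyTorch build supports, run the following command.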

python -c "import torch; print(torch.cuda.get_arch_list())"

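Find the architecture of your GPU by querying its compute capability, for example with torch.cuda.get_device_capability().

Same GPUs: if all GPUs are the same model, querying any one of them is enough.

Specific GPU: to query one particular GPU, restrict visibility to it first, for example by setting CUDA_VISIBLE_DEVICES=0.

If you get 8, 6, then you can set TORCH_CUDA_ARCH_LIST="8.6". For multiple GPUs with different architectures, list them like TORCH_CUDA_ARCH_LIST="6.1;8.6".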
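As a rough illustration, the value for TORCH_CUDA_ARCH_LIST can be assembled from the capability tuples of all visible GPUs. The helper names below are hypothetical; the only PyTorch calls used are `torch.cuda.is_available`, `torch.cuda.device_count`, and `torch.cuda.get_device_capability`, and the sketch falls back to an empty list when PyTorch or CUDA is unavailable.

```python
def format_arch_list(capabilities):
    """Turn (major, minor) tuples into a TORCH_CUDA_ARCH_LIST value."""
    unique = sorted(set(capabilities))  # drop duplicates from identical GPUs
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def detect_capabilities():
    """Query every visible GPU; returns [] if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_capability(i) for i in range(torch.cuda.device_count())]


print(format_arch_list([(8, 6)]))          # 8.6
print(format_arch_list([(8, 6), (6, 1)]))  # 6.1;8.6
print(format_arch_list(detect_capabilities()))
```

Run the detection on the machine that will do the training, since the build machine may have different GPUs.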

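You can also leave TORCH_CUDA_ARCH_LIST unset, in which case the build program queries the architecture of the GPU on the build machine. That may not match the GPU on the target machine, which is why it is better to specify the correct architecture explicitly.

To train on multiple machines with the same setup, build a binary wheel as shown below.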

git clone https://github.com/deepspeedai/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \
python setup.py build_ext -j8 bdist_wheel

This command generates a binary wheel that looks something like dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl. Install this wheel locally or on another machine.

pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl

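Communication

Distributed training involves communication between processes and/or nodes, which can be a source of errors.

Download the script below to diagnose network issues, then run it to test GPU communication. The example command tests how two GPUs communicate. Adjust the --nproc_per_node and --nnodes parameters to match your system.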

wget https://raw.githubusercontent.com/huggingface/transformers/main/scripts/distributed/torch-distributed-gpu-test.py
python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py

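The script prints an OK status if both GPUs can communicate and allocate memory. Take a closer look at the diagnostic script for more details and for how to run it in a SLURM environment.

Add the NCCL_DEBUG=INFO environment variable to report more NCCL-related debugging information.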

NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
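If you launch the test from a Python wrapper rather than a shell, the same command and environment can be assembled as in this sketch. `build_launch` is a hypothetical helper; the flags and the NCCL_DEBUG variable are the ones used in the commands above.

```python
import os
import sys


def build_launch(script, nproc_per_node=2, nnodes=1, nccl_debug=False):
    """Return (command, environment) for a torch.distributed.run launch."""
    cmd = [
        sys.executable, "-m", "torch.distributed.run",
        "--nproc_per_node", str(nproc_per_node),
        "--nnodes", str(nnodes),
        script,
    ]
    env = dict(os.environ)  # copy so the current process is left untouched
    if nccl_debug:
        env["NCCL_DEBUG"] = "INFO"
    return cmd, env


cmd, env = build_launch("torch-distributed-gpu-test.py", nccl_debug=True)
print(" ".join(cmd[1:]))
# To actually run it: import subprocess; subprocess.run(cmd, env=env, check=True)
```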

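Underflow and overflow detection

Activations or weights that are inf or nan, or a loss of NaN, usually point to an underflow or overflow problem. To detect these issues, activate the DebugUnderflowOverflow module by setting debug="underflow_overflow" in TrainingArguments, or import the module and add it to your own training loop or another trainer class.

Trainer: pass debug="underflow_overflow" when you create TrainingArguments.

PyTorch training loop: import DebugUnderflowOverflow from transformers.debug_utils and wrap your model with it, as in DebugUnderflowOverflow(model).

The DebugUnderflowOverflow module inserts hooks into the model to test the input and output variables, and the corresponding model weights, after each forward call. If inf or nan is detected in at least one element of the activations or weights, the module prints a report like the one shown below.

The example below is from fp16 mixed precision training with google/mt5-small.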

Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min  abs max  metadata
                  encoder.block.1.layer.1.DenseReluDense.dropout Dropout
0.00e+00 2.57e+02 input[0]
0.00e+00 2.85e+02 output
[...]
                  encoder.block.2.layer.0 T5LayerSelfAttention
6.78e-04 3.15e+03 input[0]
2.65e-04 3.42e+03 output[0]
             None output[1]
2.25e-01 1.00e+04 output[2]
                  encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
                  encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
                  encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
                  encoder.block.2.layer.1.DenseReluDense.dropout Dropout
0.00e+00 8.76e+03 input[0]
0.00e+00 9.74e+03 output
                  encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00      inf output

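At the beginning of the report, you can see which batch number the error occurred in. In this case, it occurred in the first batch (batch_number=0).

Each frame describes the module it reports on. For example, the frame below inspected encoder.block.2.layer.1.layer_norm, which is the layer norm in layer 1 of block 2 of the encoder (the indices in the module path are zero-based). The forward call is to T5LayerNorm.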

                  encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
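To make the two numeric columns concrete, here is a small pure-Python sketch of how one row could be computed from a flat list of values. It is not the library's implementation (which operates on tensors), and the helper names are hypothetical. It assumes a non-empty list without nan when computing the minimum and maximum.

```python
import math


def abs_min_max(values):
    """Return the smallest and largest absolute value in a non-empty list."""
    magnitudes = [abs(v) for v in values]
    return min(magnitudes), max(magnitudes)


def has_inf_or_nan(values):
    """True if any element is inf, -inf, or nan."""
    return any(math.isinf(v) or math.isnan(v) for v in values)


def format_row(values, label):
    """Format one report row: abs min, abs max, then the metadata label."""
    lo, hi = abs_min_max(values)
    return f"{lo:8.2e} {hi:8.2e} {label}"


print(format_row([8.69e-02, -4.18e-01], "weight"))
print(format_row([0.0, math.inf], "output"))
print(has_inf_or_nan([0.0, math.inf]))
```

The sign of each value is discarded, so a row only tells you how small and how large the magnitudes are, which is what matters for underflow and overflow.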

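The last frame reports on the Dropout.forward function. It was called through the dropout attribute of encoder.block.2.layer.1, on the output of DenseReluDense. You can see that the overflow (inf) occurred during the first batch, in layer 1 of block 2 of the encoder. The largest absolute input element was 6.27e+04.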

                  encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00      inf output

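The T5DenseGatedGeluDense.forward function outputs activations with an absolute maximum of 6.27e+04, which is close to the fp16 maximum of 65504 (about 6.55e+04). In the next step, Dropout zeroes some elements and rescales the remaining ones by 1/(1-p), which pushes the absolute maximum above the fp16 limit and causes the overflow.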
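The arithmetic can be checked with a short sketch. The dropout probability is not stated in the report; p = 0.1 is an assumption, chosen because the earlier DenseReluDense.dropout frame grows from 8.76e+03 to 9.74e+03, a ratio close to 1/(1 - 0.1). The helper names are hypothetical.

```python
FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def dropout_scale(p):
    """Factor applied to surviving elements by dropout in training mode."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return 1.0 / (1.0 - p)


def overflows_fp16(abs_max, p):
    """True if abs_max, once rescaled by dropout, exceeds the fp16 maximum."""
    return abs_max * dropout_scale(p) > FP16_MAX


observed_ratio = 9.74e3 / 8.76e3  # from the DenseReluDense.dropout frame
assumed_p = 0.1  # assumption, inferred from the ratio above

print(f"observed ratio {observed_ratio:.3f} vs 1/(1-p) {dropout_scale(assumed_p):.3f}")
print(f"6.27e+04 after dropout: {6.27e4 * dropout_scale(assumed_p):.2e}")
print("overflows fp16:", overflows_fp16(6.27e4, assumed_p))
```

Under that assumption, the rescaled maximum is roughly 6.97e+04, which is above 65504 and therefore becomes inf in fp16.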

Now that you know where the error occurs, you can investigate the modeling code in modeling_t5.py.

class T5DenseGatedGeluDense(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
        self.dropout = nn.Dropout(config.dropout_rate)
        self.gelu_act = ACT2FN["gelu_new"]

    def forward(self, hidden_states):
        hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
        hidden_linear = self.wi_1(hidden_states)
        hidden_states = hidden_gelu * hidden_linear
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.wo(hidden_states)
        return hidden_states

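One solution is to go back a few steps, to before the values started growing too large, and switch to fp32 so the numbers don't overflow when multiplied or summed. Another possible solution is to temporarily disable mixed precision training (amp).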

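import torch

def forward(self, hidden_states):
    # self._forward is assumed to hold the original forward body shown above
    if torch.is_autocast_enabled():
        # run this module outside autocast so its ops are not cast to fp16
        with torch.autocast(device_type="cuda", enabled=False):
            # inputs produced under autocast may already be fp16, so upcast them
            return self._forward(hidden_states.float())
    else:
        return self._forward(hidden_states)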

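The report only returns the inputs and outputs of full frames, so you may also want to analyze the intermediate values of any forward function. Add the detect_overflow function after the forward calls to track inf or nan values in the intermediate forwarded_states.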

from torch import nn
from transformers.debug_utils import detect_overflow

class T5LayerFF(nn.Module):
    [...]

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)
        detect_overflow(forwarded_states, "after layer_norm")
        forwarded_states = self.DenseReluDense(forwarded_states)
        detect_overflow(forwarded_states, "after DenseReluDense")
        return hidden_states + self.dropout(forwarded_states)

Finally, you can configure the number of frames printed by DebugUnderflowOverflow.

from transformers.debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)

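Batch tracing

DebugUnderflowOverflow can also trace the absolute minimum and maximum values in each batch with the underflow and overflow detection disabled. This is useful for identifying where errors occur in the model.

The example below shows how to trace the minimum and maximum values in batches 1 and 3 (batches are zero-indexed).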

debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3])

                  *** Starting batch number=1 ***
abs min  abs max  metadata
                  shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.47e+04 input[0]
5.36e-05 7.92e+02 output
[...]
                  decoder.dropout Dropout
1.60e-07 2.27e+01 input[0]
0.00e+00 2.52e+01 output
                  decoder T5Stack
     not a tensor output
                  lm_head Linear
1.01e-06 7.92e+02 weight
0.00e+00 1.11e+00 input[0]
6.06e-02 8.39e+01 output
                   T5ForConditionalGeneration
     not a tensor output

                  *** Starting batch number=3 ***
abs min  abs max  metadata
                  shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.78e+04 input[0]
5.36e-05 7.92e+02 output
[...]

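DebugUnderflowOverflow reports a large number of frames, which makes debugging easier. Once you know where the problem occurs, say batch 150, you can focus the trace on batches 149 and 150 and compare where the numbers start to diverge.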
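One way to compare two traces is to line up the abs max values frame by frame and flag the first large jump. The sketch below assumes you have copied the values by hand into lists of (name, abs max) pairs; `first_divergence` and the 10x threshold are hypothetical choices, and nan values are not handled. The sample numbers are the shared Embedding rows from the batch 1 and batch 3 traces above.

```python
def first_divergence(frames_a, frames_b, ratio=10.0):
    """Return the first row name whose abs max differs by more than `ratio` times.

    frames_a and frames_b are lists of (name, abs_max) in forward order.
    Returns None if no row diverges.
    """
    for (name_a, max_a), (name_b, max_b) in zip(frames_a, frames_b):
        if name_a != name_b:
            raise ValueError(f"row mismatch: {name_a} vs {name_b}")
        lo, hi = sorted((abs(max_a), abs(max_b)))
        if hi == lo:
            continue  # identical values, including two infs
        if lo == 0.0 or hi / lo > ratio:
            return name_a
    return None


batch_1 = [("shared input[0]", 2.47e4), ("shared output", 7.92e2)]
batch_3 = [("shared input[0]", 2.78e4), ("shared output", 7.92e2)]
print(first_divergence(batch_1, batch_3))  # None: no row differs by more than 10x
```

A lower threshold catches drift earlier but also flags ordinary batch-to-batch variation.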

You can also abort the trace after a certain batch number, for example batch 3.

debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3)
