
Training on Intel CPUs

How CPU training optimization works

Accelerate fully supports training on Intel CPUs; all you need to do is enable it through your config.
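
Besides the config file, the same behavior can be forced programmatically. The sketch below is illustrative (not part of the original example) and only relies on the cpu flag of the Accelerator constructor:

from accelerate import Accelerator

# cpu=True forces CPU execution even if a GPU / Apple Silicon device is
# available, mirroring the "CPU only" answer given to accelerate config.
accelerator = Accelerator(cpu=True)
print(accelerator.device)  # -> cpu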

Scenario 1: Accelerating non-distributed CPU training

Run accelerate config on your machine:

$ accelerate config
-----------------------------------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU / Apple Silicon device is available)? [yes/NO]:yes
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16

This will generate a config file that will be used automatically to set the default options correctly when you run

accelerate launch my_script.py --args_to_my_script
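
For reference, here is a minimal sketch of what such a script might contain; my_script.py is just the placeholder name from the command above, and the model and data below are stand-ins:

# my_script.py -- minimal sketch of an Accelerate training script
import torch
from accelerate import Accelerator


def main():
    # Picks up the device / mixed-precision settings from the generated config
    accelerator = Accelerator()

    # Stand-in model, optimizer, and data
    model = torch.nn.Linear(128, 2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    dataset = torch.utils.data.TensorDataset(
        torch.randn(64, 128), torch.randint(0, 2, (64,))
    )
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

    # prepare() places everything on the configured device (CPU here) and
    # wraps the dataloader for distributed training when applicable
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()


if __name__ == "__main__":
    main()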

For example, here is how you would run the NLP example examples/nlp_example.py (from the root of the repo) with the default_config.yaml file generated by accelerate config:

compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
Then launch training with

accelerate launch examples/nlp_example.py

[!CAUTION] accelerator.prepare can currently only handle simultaneously preparing multiple models (and no optimizer) OR a single model-optimizer pair for training. Other attempts (e.g., two model-optimizer pairs) will raise a verbose error. To work around this limitation, consider separately using accelerator.prepare for each model-optimizer pair.
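
A minimal sketch of that workaround, using two hypothetical model-optimizer pairs prepared in separate calls:

import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Two hypothetical model-optimizer pairs
model_a = torch.nn.Linear(8, 8)
model_b = torch.nn.Linear(8, 8)
optimizer_a = torch.optim.SGD(model_a.parameters(), lr=0.1)
optimizer_b = torch.optim.SGD(model_b.parameters(), lr=0.1)

# Passing both pairs to a single prepare() call raises an error, so each
# pair gets its own call instead:
model_a, optimizer_a = accelerator.prepare(model_a, optimizer_a)
model_b, optimizer_b = accelerator.prepare(model_b, optimizer_b)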

Scenario 2: Accelerating distributed CPU training

We use Intel oneCCL for communication, combined with the Intel® MPI library to deliver flexible, efficient, scalable cluster messaging on Intel® architecture. You can refer to the installation guide here.

Run accelerate config on your machine (node0):

$ accelerate config
-----------------------------------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
multi-CPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 4
-----------------------------------------------------------------------------------------------------------------------------------------------------------
What is the rank of this machine?
0
What is the IP address of the machine that will host the main process? 36.112.23.24
What is the port you will use to communicate with the main process? 29500
Are all the machines on the same local network? Answer `no` if nodes are on the cloud and/or on different network hosts [YES/no]: yes
Do you want accelerate to launch mpirun? [yes/NO]: yes
Please enter the path to the hostfile to use with mpirun [~/hostfile]: ~/hostfile
Enter the number of oneCCL worker threads [1]: 1
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
How many processes should be used for distributed training? [1]:16
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16

For example, here is how you would run the NLP example examples/nlp_example.py (from the root of the repo) with IPEX enabled for distributed CPU training.

The default_config.yaml below was generated by accelerate config:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_CPU
downcast_bf16: 'no'
machine_rank: 0
main_process_ip: 36.112.23.24
main_process_port: 29500
main_training_function: main
mixed_precision: bf16
mpirun_config:
  mpirun_ccl: '1'
  mpirun_hostfile: /home/user/hostfile
num_machines: 4
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true

Set the following environment variables and use Intel MPI to launch the training.

On node0, you need to create a configuration file that contains the IP address of each node (for example, hostfile) and pass the path of that configuration file as an argument.

If you chose to have Accelerate launch mpirun, ensure that the location of your hostfile matches the path in the config.

$ cat hostfile
xxx.xxx.xxx.xxx #node0 ip
xxx.xxx.xxx.xxx #node1 ip
xxx.xxx.xxx.xxx #node2 ip
xxx.xxx.xxx.xxx #node3 ip

Before executing the accelerate launch command, you need to source the setvars.sh of the oneCCL bindings to set up your Intel MPI environment properly. Note that both the python script and the environment need to be available on all of the machines being used for multi-CPU training.

oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh

accelerate launch examples/nlp_example.py

You can also directly launch distributed training with the mpirun command. Run the following command on node0: DDP with 16 processes (4 per node, as set by -ppn 4) will be enabled on node0, node1, node2, and node3 with BF16 mixed precision. When using this method, the python script, the python environment, and the accelerate config file all need to be available on all of the machines used for multi-CPU training.

oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh
export CCL_WORKER_COUNT=1
export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip
export CCL_ATL_TRANSPORT=ofi
mpirun -f hostfile -n 16 -ppn 4 accelerate launch examples/nlp_example.py
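
As a quick sanity check of the resulting topology, a few lines at the top of the training script can print the state of each process. The sketch below only uses standard Accelerator attributes and, for the config above, should report 16 processes on CPU with bf16:

from accelerate import Accelerator

accelerator = Accelerator()
print(
    f"process {accelerator.process_index + 1}/{accelerator.num_processes} "
    f"on {accelerator.device}, mixed precision: {accelerator.mixed_precision}"
)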