Intel® Extension for PyTorch

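IPEX is optimized for CPUs with AVX-512 or above, and is functionally supported on CPUs with only AVX2. It is therefore expected to bring a performance benefit on Intel CPU generations with AVX-512 or above. CPUs with only AVX2 (for example, AMD CPUs or older Intel CPUs) may also run faster under IPEX, but this is not guaranteed. IPEX provides performance optimizations for CPU training with both Float32 and BFloat16. The following sections focus on BFloat16.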
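As a quick way to see which of these categories a Linux machine falls into, the sketch below parses the `flags` line of /proc/cpuinfo-style text. It is only an illustration: the helper names `cpu_flags` and `ipex_cpu_tier` and the returned labels are made up here and are not part of IPEX or Accelerate. The flag names `avx2` and `avx512f` are the ones the Linux kernel reports.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def ipex_cpu_tier(flags):
    """Hypothetical helper: classify a CPU by the instruction sets named above."""
    if "avx512f" in flags:
        return "avx512"      # the tier IPEX is optimized for
    if "avx2" in flags:
        return "avx2"        # functionally supported, speedup not guaranteed
    return "below_avx2"      # not covered by the text above


# Sample text standing in for the contents of /proc/cpuinfo.
sample = "processor : 0\nflags : fpu sse2 avx avx2 avx512f avx512_bf16\n"
print(ipex_cpu_tier(cpu_flags(sample)))
```

On a real Linux host, the same check would read the file directly, for example `cpu_flags(open("/proc/cpuinfo").read())`.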

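The low-precision data type BFloat16 is natively supported on 3rd Generation Xeon® Scalable processors (aka Cooper Lake) with the AVX512 instruction set, and will be supported on the next generation of Intel® Xeon® Scalable processors through the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set, which boosts performance further. Auto Mixed Precision for the CPU backend has been enabled since PyTorch-1.10. At the same time, support for Auto Mixed Precision with BFloat16 on CPU and BFloat16 optimization of operators has been broadly enabled in Intel® Extension for PyTorch, and partially upstreamed to the PyTorch master branch. Users can get better performance and a better user experience with IPEX Auto Mixed Precision.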
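To make the BFloat16 format concrete: it keeps the sign bit and the 8 exponent bits of Float32 and shortens the mantissa to 7 bits, so a BFloat16 value corresponds to the upper 16 bits of a Float32. The sketch below emulates this in pure Python by truncation (real conversions usually round to nearest rather than truncate). `to_bf16_bits` and `from_bf16_bits` are illustrative names, not PyTorch or IPEX functions.

```python
import struct


def to_bf16_bits(x):
    """Truncate a float to BFloat16 by keeping the top 16 bits of its Float32 encoding."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def from_bf16_bits(bits16):
    """Expand 16 BFloat16 bits back to a float by zero-filling the low mantissa bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return value


for x in (1.0, 3.14159, 1e30):
    print(x, "->", from_bf16_bits(to_bf16_bits(x)))
```

The round trip shows the trade-off: the range of Float32 is preserved (1e30 survives), while only two to three significant decimal digits remain (3.14159 becomes 3.140625).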

IPEX installation:

IPEX releases follow PyTorch releases. To install via pip:

PyTorch version	IPEX version
2.0	2.0.0
1.13	1.13.0
1.12	1.12.300
1.11	1.11.200
1.10	1.10.100

pip install intel_extension_for_pytorch==<version_name> -f https://developer.intel.com/ipex-whl-stable-cpu

See the IPEX documentation for more installation approaches.
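The version pairing above can be turned into a small lookup. The sketch below only encodes the rows of the table and the wheel index from the pip command; `ipex_pip_command` is a hypothetical helper name, and PyTorch versions outside the table are rejected rather than guessed.

```python
# PyTorch release -> matching IPEX release, copied from the table above.
IPEX_FOR_TORCH = {
    "2.0": "2.0.0",
    "1.13": "1.13.0",
    "1.12": "1.12.300",
    "1.11": "1.11.200",
    "1.10": "1.10.100",
}
WHEEL_INDEX = "https://developer.intel.com/ipex-whl-stable-cpu"


def ipex_pip_command(torch_version):
    """Build the pip command for the IPEX release matching a PyTorch version string."""
    # Keep only "major.minor", dropping the patch number and local tags such as "+cpu".
    major_minor = ".".join(torch_version.split("+")[0].split(".")[:2])
    ipex_version = IPEX_FOR_TORCH.get(major_minor)
    if ipex_version is None:
        raise ValueError(f"no IPEX release listed for PyTorch {major_minor}")
    return f"pip install intel_extension_for_pytorch=={ipex_version} -f {WHEEL_INDEX}"


print(ipex_pip_command("1.13.1+cpu"))
```

In practice the argument would typically be `torch.__version__` from the environment you are installing into.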

How it works for training optimization on CPU

Accelerate has integrated IPEX; all you need to do is enable it through the config.

Scenario 1: Accelerating non-distributed CPU training

Run accelerate config on your machine:

$ accelerate config
-----------------------------------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU / Apple Silicon device is available)? [yes/NO]:yes
Do you want to use Intel PyTorch Extension (IPEX) to speed up training on CPU? [yes/NO]:yes
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16

This generates a config file that is used automatically to set the default options when you run:

accelerate launch my_script.py --args_to_my_script

For example, here is how to run the NLP example examples/nlp_example.py (from the root of the repo) with IPEX enabled. The default_config.yaml generated by accelerate config looks like this:

compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
downcast_bf16: 'no'
ipex_config:
  ipex: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true

accelerate launch examples/nlp_example.py

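Scenario 2: Accelerating distributed CPU training. We use Intel oneCCL for communication, combined with the Intel® MPI library, to deliver flexible, efficient, scalable cluster messaging on Intel® architecture. Refer to the oneCCL and Intel® MPI installation guide for setup instructions.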

Run accelerate config on your machine (node0):

$ accelerate config
-----------------------------------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
multi-CPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 4
-----------------------------------------------------------------------------------------------------------------------------------------------------------
What is the rank of this machine?
0
What is the IP address of the machine that will host the main process? 36.112.23.24
What is the port you will use to communicate with the main process? 29500
Are all the machines on the same local network? Answer `no` if nodes are on the cloud and/or on different network hosts [YES/no]: yes
Do you want to use Intel PyTorch Extension (IPEX) to speed up training on CPU? [yes/NO]:yes
Do you want accelerate to launch mpirun? [yes/NO]: yes
Please enter the path to the hostfile to use with mpirun [~/hostfile]: ~/hostfile
Enter the number of oneCCL worker threads [1]: 1
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
How many processes should be used for distributed training? [1]:16
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16

For example, here is how to run the NLP example examples/nlp_example.py (from the root of the repo) with IPEX enabled for distributed CPU training.

The default_config.yaml generated by accelerate config:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_CPU
downcast_bf16: 'no'
ipex_config:
  ipex: true
machine_rank: 0
main_process_ip: 36.112.23.24
main_process_port: 29500
main_training_function: main
mixed_precision: bf16
mpirun_config:
  mpirun_ccl: '1'
  mpirun_hostfile: /home/user/hostfile
num_machines: 4
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true

Set the following environment variables and launch training with Intel MPI.

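On node0, create a configuration file that lists the IP address of each node (for example, hostfile) and pass the path of that file as an argument. If you chose to have Accelerate launch mpirun, make sure the location of your hostfile matches the path in the config.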

$ cat hostfile
xxx.xxx.xxx.xxx #node0 ip
xxx.xxx.xxx.xxx #node1 ip
xxx.xxx.xxx.xxx #node2 ip
xxx.xxx.xxx.xxx #node3 ip
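If the node list already exists in a script, a hostfile with one address per line can be generated instead of written by hand. This is a minimal sketch: `write_hostfile` is a hypothetical helper, and the addresses are placeholders from the 192.0.2.0/24 documentation range, not real nodes.

```python
import ipaddress
import tempfile
from pathlib import Path


def write_hostfile(path, node_ips):
    """Write one validated IP address per line, node0 first, and return the path."""
    # ipaddress.ip_address raises ValueError for anything that is not an IP address.
    lines = [str(ipaddress.ip_address(ip)) for ip in node_ips]
    path = Path(path).expanduser()
    path.write_text("\n".join(lines) + "\n")
    return path


# Placeholder addresses; replace them with the real IPs of node0..node3.
nodes = ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"]
hostfile = write_hostfile(Path(tempfile.mkdtemp()) / "hostfile", nodes)
print(hostfile.read_text(), end="")
```

Validating each entry up front surfaces a typo in an address before mpirun tries to reach the node. For real use, write to the path given in the Accelerate config (for example ~/hostfile) rather than a temporary directory.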

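When Accelerate launches mpirun, source the oneCCL bindings setvars.sh to set up your Intel MPI environment, then run your script with accelerate launch. Note that the Python script and environment must exist on all machines used for multi-CPU training.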

oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh

accelerate launch examples/nlp_example.py

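Otherwise, if you chose not to have Accelerate launch mpirun, run the following commands on node0. They start 16 DDP processes across node0, node1, node2, and node3 (4 per node) with BF16 mixed precision. With this method, the Python script, the Python environment, and the Accelerate config file must exist on all machines used for multi-CPU training.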

oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh
export CCL_WORKER_COUNT=1
export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip
export CCL_ATL_TRANSPORT=ofi
mpirun -f hostfile -n 16 -ppn 4 accelerate launch examples/nlp_example.py
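The `-n` and `-ppn` values in the mpirun line are tied to the config: `-n` is the total number of processes (num_processes: 16) and `-ppn` is the number started on each node (16 processes over 4 machines gives 4). The sketch below derives the command from those two numbers. `mpirun_command` is a hypothetical helper, and requiring an even split across machines is an assumption of this sketch, not a stated requirement of Accelerate or Intel MPI.

```python
import shlex


def mpirun_command(num_processes, num_machines, script, hostfile="hostfile"):
    """Assemble the mpirun invocation, deriving -ppn from the total process count."""
    # Assumption of this sketch: every machine runs the same number of processes.
    if num_processes % num_machines != 0:
        raise ValueError("num_processes must divide evenly across machines")
    per_node = num_processes // num_machines
    args = [
        "mpirun", "-f", hostfile,
        "-n", str(num_processes),   # total ranks across all nodes
        "-ppn", str(per_node),      # ranks started on each node
        "accelerate", "launch", script,
    ]
    return shlex.join(args)


print(mpirun_command(num_processes=16, num_machines=4, script="examples/nlp_example.py"))
```

With the values from the config above, this reproduces the mpirun line shown in the commands.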
