Export a model to Inferentia

Summary

Exporting a PyTorch model to a Neuron model is as simple as running:

optimum-cli export neuron \
  --model bert-base-uncased \
  --sequence_length 128 \
  --batch_size 1 \
  bert_neuron/

Check out the help for more options:

optimum-cli export neuron --help

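Why compile to a Neuron model?

AWS provides two generations of the Inferentia accelerator, purpose-built for machine learning inference with higher throughput, lower latency, and lower cost: inf2 (NeuronCore-v2) and inf1 (NeuronCore-v1).

In production, to deploy 🤗 Transformers models on Neuron devices, you first need to compile your model and export it to a serialized format before running inference. Through ahead-of-time (AOT) compilation with the Neuron Compiler (neuronx-cc or neuron-cc), your model is converted to a serialized and optimized TorchScript module.

Although pre-compilation avoids overhead during inference, a compiled Neuron model has some limitations:

The input shapes and data types used during compilation cannot be changed.
Neuron models are specialized for each hardware and SDK version, which means:
- Models compiled with Neuron cannot be executed in non-Neuron environments.
- Models compiled for inf1 (NeuronCore-v1) are not compatible with inf2 (NeuronCore-v2), and vice versa.
- Models compiled for one SDK version are (generally) not compatible with other SDK versions.

In this guide, we'll show you how to export your models to serialized models optimized for Neuron devices.

🤗 Optimum supports the Neuron export by leveraging configuration objects. These configuration objects come ready-made for a number of model architectures and are designed to be easily extendable to other architectures.

To check the supported architectures, go to the configuration reference page.

Exporting a model to Neuron using the CLI

To export a 🤗 Transformers model to Neuron, you first need to install some extra dependencies:

For Inf2: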

pip install optimum-neuron[neuronx]

For Inf1:

pip install optimum-neuron[neuron]

The Optimum Neuron export can be used through the Optimum command line:

optimum-cli export neuron --help

usage: optimum-cli export neuron [-h] -m MODEL [--task TASK] [--atol ATOL] [--cache_dir CACHE_DIR] [--trust-remote-code]
                                 [--compiler_workdir COMPILER_WORKDIR] [--disable-validation] [--auto_cast {none,matmul,all}]
                                 [--auto_cast_type {bf16,fp16,tf32}] [--dynamic-batch-size] [--num_cores NUM_CORES] [--unet UNET]
                                 [--output_hidden_states] [--output_attentions] [--batch_size BATCH_SIZE]
                                 [--sequence_length SEQUENCE_LENGTH] [--num_beams NUM_BEAMS] [--num_choices NUM_CHOICES]
                                 [--num_channels NUM_CHANNELS] [--width WIDTH] [--height HEIGHT]
                                 [--num_images_per_prompt NUM_IMAGES_PER_PROMPT] [-O1 | -O2 | -O3]
                                 output

optional arguments:
  -h, --help            show this help message and exit
  -O1                   Enables the core performance optimizations in the compiler, while also minimizing compile time.
  -O2                   [Default] Provides the best balance between model performance and compile time.
  -O3                   May provide additional model execution performance but may incur longer compile times and higher host
                        memory usage during model compilation.

Required arguments:
  -m MODEL, --model MODEL
                        Model ID on huggingface.co or path on disk to load model from.
  output                Path indicating the directory where to store generated Neuronx compiled TorchScript model.

Optional arguments:
  --task TASK           The task to export the model for. If not specified, the task will be auto-inferred based on the model.
                        Available tasks depend on the model, but are among: ['audio-classification', 'audio-frame-
                        classification', 'audio-xvector', 'automatic-speech-recognition', 'conversational', 'depth-estimation',
                        'feature-extraction', 'fill-mask', 'image-classification', 'image-segmentation', 'image-to-image',
                        'image-to-text', 'mask-generation', 'masked-im', 'multiple-choice', 'object-detection', 'question-
                        answering', 'semantic-segmentation', 'text-to-audio', 'text-generation', 'text2text-generation', 'text-
                        classification', 'token-classification', 'zero-shot-image-classification', 'zero-shot-object-detection',
                        'stable-diffusion', 'stable-diffusion-xl'].
  --atol ATOL           If specified, the absolute difference tolerance when validating the model. Otherwise, the default atol
                        for the model will be used.
  --cache_dir CACHE_DIR
                        Path indicating where to store cache.
  --trust-remote-code   Allow to use custom code for the modeling hosted in the model repository. This option should only be set
                        for repositories you trust and in which you have read the code, as it will execute on your local machine
                        arbitrary code present in the model repository.
  --compiler_workdir COMPILER_WORKDIR
                        Path indicating the directory where to store intermediary files generated by Neuronx compiler.
  --disable-validation  Whether to disable the validation of inference on neuron device compared to the outputs of original
                        PyTorch model on CPU.
  --auto_cast {none,matmul,all}
                        Whether to cast operations from FP32 to lower precision to speed up the inference. Can be `"none"`,
                        `"matmul"` or `"all"`.
  --auto_cast_type {bf16,fp16,tf32}
                        The data type to cast FP32 operations to when auto-cast mode is enabled. Can be `"bf16"`, `"fp16"` or
                        `"tf32"`.
  --dynamic-batch-size  Enable dynamic batch size for neuron compiled model. If this option is enabled, the input batch size can
                        be a multiple of the batch size during the compilation, but it comes with a potential tradeoff in terms
                        of latency.
  --num_cores NUM_CORES
                        The number of cores on which the model should be deployed (text-generation only).
  --unet UNET           UNet model ID on huggingface.co or path on disk to load model from. This will replace the unet in the
                        original Stable Diffusion pipeline.
  --output_hidden_states
                        Whether or not for the traced model to return the hidden states of all layers.
  --output_attentions   Whether or not for the traced model to return the attentions tensors of all attention layers.

Input shapes:
  --batch_size BATCH_SIZE
                        Batch size that the Neuronx-cc compiler exported model will be able to take as input.
  --sequence_length SEQUENCE_LENGTH
                        Sequence length that the Neuronx-cc compiler exported model will be able to take as input.
  --num_beams NUM_BEAMS
                        Number of beams for beam search that the Neuronx-cc compiler exported model will be able to take as
                        input.
  --num_choices NUM_CHOICES
                        Only for the multiple-choice task. Num choices that the Neuronx-cc compiler exported model will be able
                        to take as input.
  --num_channels NUM_CHANNELS
                        Image tasks only. Number of channels that the Neuronx-cc compiler exported model will be able to take as
                        input.
  --width WIDTH         Image tasks only. Width that the Neuronx-cc compiler exported model will be able to take as input.
  --height HEIGHT       Image tasks only. Height that the Neuronx-cc compiler exported model will be able to take as input.
  --num_images_per_prompt NUM_IMAGES_PER_PROMPT
                        Stable diffusion only. Number of images per prompt that the Neuronx-cc compiler exported model will be
                        able to take as input.

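Exporting standard (non-LLM) models

Most models on the Hugging Face Hub can be exported directly with torch trace, then converted to serialized and optimized TorchScript modules.

NEFF: Neuron Executable File Format, the binary executable that runs on Neuron devices.

When exporting a model, two sets of export arguments must be passed:

compiler_args are optional arguments for the compiler. They usually control how the compiler trades off inference performance (latency and throughput) against accuracy.
input_shapes are the mandatory static shape information that you need to send to the Neuron compiler.

Type the following command to see all export arguments: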

optimum-cli export neuron -h

Exporting a standard NLP model can be done as follows:

optimum-cli export neuron --model distilbert-base-uncased-distilled-squad \
                          --batch_size 1 --sequence_length 16 \
                          --auto_cast matmul --auto_cast_type fp16 \
                          distilbert_base_uncased_squad_neuron/

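Here the exported model has static input shapes of (1, 16), and the compiler arguments specify that matmul operations must be performed with float16 precision for faster inference.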
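If you script your exports, the same command can be assembled from Python. The following is a minimal sketch, not part of Optimum: build_export_command is a hypothetical helper that only arranges flags listed in the help output above, keeping the mandatory input shapes separate from the optional compiler arguments.

```python
import shlex


def build_export_command(model, output_dir, batch_size, sequence_length,
                         auto_cast=None, auto_cast_type=None, task=None):
    """Assemble an `optimum-cli export neuron` argument list (hypothetical helper)."""
    cmd = ["optimum-cli", "export", "neuron", "--model", model]
    if task is not None:
        # Needed for local checkpoints, where the task cannot be inferred.
        cmd += ["--task", task]
    # input_shapes: mandatory static shapes.
    cmd += ["--batch_size", str(batch_size), "--sequence_length", str(sequence_length)]
    # compiler_args: optional precision settings.
    if auto_cast is not None:
        cmd += ["--auto_cast", auto_cast]
    if auto_cast_type is not None:
        cmd += ["--auto_cast_type", auto_cast_type]
    cmd.append(output_dir)
    return cmd


cmd = build_export_command(
    "distilbert-base-uncased-distilled-squad",
    "distilbert_base_uncased_squad_neuron/",
    batch_size=1,
    sequence_length=16,
    auto_cast="matmul",
    auto_cast_type="fp16",
)
print(shlex.join(cmd))
# On a machine with optimum-neuron installed, the export could then be run with:
# import subprocess; subprocess.run(cmd, check=True)
```

Returning a list rather than a string avoids shell quoting issues when the command is passed to subprocess.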

After the export, you should see the following logs, which validate the model on Neuron devices by comparing it with the PyTorch model on CPU:

Validating Neuron model...
        -[✓] Neuron model output names match reference model (last_hidden_state)
        - Validating Neuron Model output "last_hidden_state":
                -[✓] (1, 16, 32) matches (1, 16, 32)
                -[✓] all values close (atol: 0.0001)
The Neuronx export succeeded and the exported model was saved at: distilbert_base_uncased_squad_neuron/

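This exports a Neuron-compiled TorchScript module of the checkpoint defined by the --model argument.

As you can see, the task was detected automatically. This is possible because the model is on the Hub. For local models, you need to provide the --task argument; otherwise the export defaults to the model architecture without any task-specific head.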

optimum-cli export neuron --model local_path --task question-answering --batch_size 1 --sequence_length 16 --dynamic-batch-size distilbert_base_uncased_squad_neuron/

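Note that providing the --task argument for a model on the Hub disables the automatic task detection. The resulting model.neuron file can then be loaded and run on Neuron devices.

For each model architecture, you can find the list of supported tasks via the TasksManager class (optimum.exporters.tasks.TasksManager). For example, for DistilBERT, for the Neuron export, we have: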

>>> from optimum.exporters.tasks import TasksManager
>>> from optimum.exporters.neuron.model_configs import *  # Register neuron specific configs to the TasksManager

>>> distilbert_tasks = list(TasksManager.get_supported_tasks_for_model_type("distilbert", "neuron").keys())
>>> print(distilbert_tasks)
['feature-extraction', 'fill-mask', 'multiple-choice', 'question-answering', 'text-classification', 'token-classification']

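You can then pass one of these tasks to the --task argument of the optimum-cli export neuron command shown above.

Once exported, the neuron model can be used for inference directly with the NeuronModelForXXX class: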

>>> from transformers import AutoTokenizer
>>> from optimum.neuron import NeuronModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english_neuron/")
>>> model = NeuronModelForSequenceClassification.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english_neuron/")

>>> inputs = tokenizer("Hamilton is considered to be the best musical of human history.", return_tensors="pt")
>>> logits = model(**inputs).logits
>>> print(model.config.id2label[logits.argmax().item()])
'POSITIVE'

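As you can see, there is no need to pass the neuron arguments used during the export: they are saved in the config.json file and restored automatically by the NeuronModelForXXX class.

Note that inputs are always padded to the shapes used at compilation time, and padding adds computation overhead. Choose static shapes that are higher than the shapes of the inputs you will feed to the model during inference, but not much higher.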
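To make the padding trade-off concrete, here is a small sketch that measures how much of the compiled shape would be spent on padding for a sample of input lengths. Both helpers are hypothetical, and rounding the suggested length up to a multiple of 8 is an arbitrary illustrative choice rather than a Neuron requirement.

```python
import math


def padding_overhead(input_lengths, sequence_length):
    """Fraction of computed token positions that are padding."""
    if any(n > sequence_length for n in input_lengths):
        raise ValueError("an input is longer than the compiled sequence_length")
    total = sequence_length * len(input_lengths)
    return (total - sum(input_lengths)) / total


def suggest_sequence_length(input_lengths, multiple_of=8):
    """Smallest multiple of `multiple_of` that covers the longest observed input."""
    return math.ceil(max(input_lengths) / multiple_of) * multiple_of


# Token counts observed on representative inputs (made-up numbers).
observed = [9, 12, 14, 11]

print(padding_overhead(observed, 16))    # 0.28125
print(padding_overhead(observed, 128))   # 0.91015625
print(suggest_sequence_length(observed)) # 16
```

With these made-up lengths, compiling for 128 tokens would spend about 91% of the positions on padding, against about 28% when compiling for 16.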

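Exporting Stable Diffusion to Neuron

With the Optimum CLI, you can compile components of the Stable Diffusion pipeline to gain acceleration on Neuron devices during inference.

So far, we support the export of the following components in the pipeline:

CLIP text encoder
U-Net
VAE encoder
VAE decoder

These blocks are chosen because they represent the bulk of the compute in the pipeline, and performance benchmarking has shown that running them on Neuron yields a significant performance benefit.

Besides, feel free to tweak the compilation configuration to find the best trade-off between performance and accuracy for your use case. By default, we suggest casting FP32 matrix multiplication operations to BF16, which offers good performance with a moderate sacrifice of accuracy. Check out the guide in the AWS Neuron documentation to better understand your compilation options.

Exporting a Stable Diffusion checkpoint can be done using the CLI: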

optimum-cli export neuron --model stabilityai/stable-diffusion-2-1-base \
  --task stable-diffusion \
  --batch_size 1 \
  --height 512 `# height in pixels of generated image, eg. 512, 768` \
  --width 512 `# width in pixels of generated image, eg. 512, 768` \
  --num_images_per_prompt 4 `# number of images to generate per prompt, defaults to 1` \
  --auto_cast matmul `# cast only matrix multiplication operations` \
  --auto_cast_type bf16 `# cast operations from FP32 to BF16` \
  sd_neuron/

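Exporting Stable Diffusion XL to Neuron

As with Stable Diffusion, you can use the Optimum CLI to compile components of the SDXL pipeline for inference on Neuron devices.

We support the export of the following components in the pipeline to boost speed:

Text encoder
Second text encoder
U-Net (three times larger than the UNet in the Stable Diffusion pipeline)
VAE encoder
VAE decoder

Stable Diffusion XL works especially well with image sizes between 768 and 1024 pixels.

Exporting an SDXL checkpoint can be done using the CLI: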

optimum-cli export neuron --model stabilityai/stable-diffusion-xl-base-1.0 \
  --task stable-diffusion-xl \
  --batch_size 1 \
  --height 1024 `# height in pixels of generated image, eg. 768, 1024` \
  --width 1024 `# width in pixels of generated image, eg. 768, 1024` \
  --num_images_per_prompt 4 `# number of images to generate per prompt, defaults to 1` \
  --auto_cast matmul `# cast only matrix multiplication operations` \
  --auto_cast_type bf16 `# cast operations from FP32 to BF16` \
  sd_neuron/

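Exporting LLMs to Neuron

LLM models are not exported using Torch tracing. Instead, they are converted directly into Neuron graphs, into which the transformers checkpoint weights can be loaded.

As with standard NLP models, you need to specify static parameters when exporting an LLM model:

batch_size is the number of input sequences that the model will accept. Defaults to 1.
sequence_length is the maximum number of tokens in an input sequence. Defaults to max_position_embeddings (n_positions for older models).
auto_cast_type specifies the format used to encode the weights. It can be one of fp32 (float32), fp16 (float16), or bf16 (bfloat16). Defaults to fp32.
num_cores is the number of neuron cores used when instantiating the model. Each neuron core has 16 GB of memory, which means that bigger models need to be split across multiple cores. Defaults to 1.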

optimum-cli export neuron --model meta-llama/Meta-Llama-3-8B \
  --batch_size 1 \
  --sequence_length 4096 \
  --auto_cast_type fp16 `# cast operations from BF16 to FP16` \
  --num_cores 2 \
  llama3_neuron/
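The num_cores guidance can be turned into a rough lower bound. The sketch below is a back-of-the-envelope estimate under stated assumptions: it counts the weights only (no KV cache, activations, or runtime overhead), it reads "16 GB per core" as 16 GiB, and min_cores_for_weights is a hypothetical helper. The parameter count is a round illustrative number, not the exact size of any model.

```python
import math

# Bytes used to store one parameter for each auto_cast_type listed above.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

# Assumption: the 16 GB per neuron core mentioned above, taken as 16 GiB.
CORE_MEMORY_BYTES = 16 * 1024**3


def min_cores_for_weights(num_parameters, auto_cast_type="fp32",
                          core_memory_bytes=CORE_MEMORY_BYTES):
    """Lower bound on num_cores needed just to hold the weights."""
    weight_bytes = num_parameters * BYTES_PER_PARAM[auto_cast_type]
    return max(1, math.ceil(weight_bytes / core_memory_bytes))


print(min_cores_for_weights(8_000_000_000, "fp32"))  # 2
print(min_cores_for_weights(8_000_000_000, "fp16"))  # 1
```

Because the estimate ignores everything except the weights, treat the result as a floor: an 8-billion-parameter model in fp16 nearly fills a single core on its own, which is consistent with the command above requesting two cores.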

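An important restriction is that LLM models can only be exported on Neuron platforms, because they are tailored to fit the actual devices during the export.

Exporting an LLM model can take much longer than exporting a standard model (sometimes more than one hour).

As explained before, the neuron model parameters are static. In particular, this means that during inference:

The batch_size of the inputs should not exceed the batch_size used during the export.
The length of the input sequences should be lower than the sequence_length used during the export.
The maximum number of tokens (input + generated) cannot exceed the sequence_length used during the export.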
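These three constraints can be checked before calling the model. The following is a minimal sketch in plain Python; check_generation_request is a hypothetical helper, not part of optimum-neuron, and it simply restates the rules above as comparisons.

```python
def check_generation_request(input_batch_size, input_length, max_new_tokens,
                             export_batch_size, export_sequence_length):
    """Return a list of problems; an empty list means the request fits."""
    problems = []
    if input_batch_size > export_batch_size:
        problems.append("batch_size exceeds the exported batch_size")
    if input_length >= export_sequence_length:
        problems.append("input length is not lower than the exported sequence_length")
    if input_length + max_new_tokens > export_sequence_length:
        problems.append("input + generated tokens exceed the exported sequence_length")
    return problems


# A model exported with --batch_size 1 --sequence_length 4096.
print(check_generation_request(1, 100, 200, 1, 4096))   # []
print(check_generation_request(2, 4000, 200, 1, 4096))  # two problems reported

# Tokens left for generation once a 100-token prompt is accounted for.
print(4096 - 100)  # 3996
```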

Once exported, neuron models can simply be reloaded using the NeuronModelForCausalLM class. As with the original transformers models, use generate() instead of forward() to generate text sequences.

import torch
from transformers import AutoTokenizer
-from transformers import AutoModelForCausalLM
+from optimum.neuron import NeuronModelForCausalLM

# Load the model: the Neuron version reloads a previously exported checkpoint
-model = AutoModelForCausalLM.from_pretrained("gpt2")
+model = NeuronModelForCausalLM.from_pretrained("./gpt2-neuron")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token_id = tokenizer.eos_token_id

tokens = tokenizer("I really wish ", return_tensors="pt")
with torch.inference_mode():
    sample_output = model.generate(
        **tokens,
        do_sample=True,
        min_length=128,
        max_length=256,
        temperature=0.7,
    )
    outputs = [tokenizer.decode(tok) for tok in sample_output]
    print(outputs)

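Generation is highly configurable. Please refer to https://huggingface.co/docs/transformers/generation_strategies for details.

Note that:

For each model architecture, default values are provided for all parameters, but values passed to the generate method take precedence.
Generation parameters can be stored in a generation_config.json file. When such a file is present in the model directory, it is parsed to set the default parameters (values passed to the generate method still take precedence).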
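The precedence order described above can be pictured as three layers merged in turn. This is an illustrative sketch only, not the library's implementation: resolve_generation_params is a hypothetical function, and the parameter values are made up.

```python
import json
import tempfile
from pathlib import Path


def resolve_generation_params(architecture_defaults, model_dir, call_kwargs):
    """Merge generation parameters, lowest to highest precedence."""
    params = dict(architecture_defaults)              # 1. architecture defaults
    config_file = Path(model_dir) / "generation_config.json"
    if config_file.is_file():
        params.update(json.loads(config_file.read_text()))  # 2. file in the model directory
    params.update(call_kwargs)                        # 3. values passed to generate()
    return params


with tempfile.TemporaryDirectory() as model_dir:
    # Simulate a model directory that ships a generation_config.json file.
    (Path(model_dir) / "generation_config.json").write_text(
        json.dumps({"temperature": 0.9, "max_length": 128})
    )
    resolved = resolve_generation_params(
        {"do_sample": False, "temperature": 1.0, "max_length": 20},
        model_dir,
        {"temperature": 0.7},
    )

print(resolved)  # {'do_sample': False, 'temperature': 0.7, 'max_length': 128}
```

Here max_length comes from the file, temperature from the call, and do_sample from the architecture defaults.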

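Exporting a model to Neuron programmatically via NeuronModel

As an alternative to optimum-cli, you can also export a model to Neuron from your own Python script or notebook with the optimum.neuron.NeuronModelForXXX model classes.

Here is an example: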

>>> from optimum.neuron import NeuronModelForSequenceClassification

>>> input_shapes = {"batch_size": 1, "sequence_length": 64}  # mandatory shapes
>>> model = NeuronModelForSequenceClassification.from_pretrained(
...   "distilbert-base-uncased-finetuned-sst-2-english", export=True, **input_shapes
... )

# Save the model
>>> model.save_pretrained("./distilbert-base-uncased-finetuned-sst-2-english_neuron/")

# Push the neuron model to HF Hub
>>> model.push_to_hub(
...     "a_local_path_for_compiled_neuron_model", repository_id="my-neuron-repo", use_auth_token=True
... )

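This example can be adapted to other model types, using the same export parameters as optimum-cli.

Exporting neuron models using NeuronX TGI

The NeuronX TGI image includes not only the NeuronX runtime, but also all the packages and tools required to export Neuron models.

Use the following command to export a model to Neuron using a TGI image: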

docker run --entrypoint optimum-cli \
       -v $(pwd)/data:/data \
       --privileged \
       ghcr.io/huggingface/neuronx-tgi:latest \
       export neuron \
       --model <organization>/<model> \
       --batch_size 1 \
       --sequence_length 4096 \
       --auto_cast_type fp16 \
       --num_cores 2 \
       /data/<neuron_model_path>

The exported model will be saved under ./data/<neuron_model_path>.
