Export a model to ONNX with optimum.exporters.onnx

Summary

Exporting a model to ONNX is as simple as

optimum-cli export onnx --model gpt2 gpt2_onnx/

Check out the help for more options:

optimum-cli export onnx --help

Why use ONNX?

If you need to deploy 🤗 Transformers or 🤗 Diffusers models in production environments, we recommend exporting them to a serialized format that can be loaded and executed on specialized runtimes and hardware. In this guide, we will show you how to export these models to ONNX (Open Neural Network eXchange).

ONNX is an open standard that defines a common set of operators and a common file format to represent deep learning models in a wide variety of frameworks, including PyTorch and TensorFlow. When a model is exported to the ONNX format, these operators are used to construct a computational graph (often called an intermediate representation) which represents the flow of data through the neural network.

By exposing a graph with standardized operators and data types, ONNX makes it easy to switch between frameworks. For example, a model trained in PyTorch can be exported to ONNX format and then imported in TensorRT or OpenVINO.

Once exported, a model can be optimized for inference via techniques such as graph optimization and quantization. Check out the optimum.onnxruntime subpackage to optimize and run ONNX models!
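
As an illustration, a minimal sketch of post-export graph optimization with the ORTOptimizer from optimum.onnxruntime could look as follows (this assumes a model already exported to the distilbert_base_uncased_squad_onnx/ directory, as shown later in this guide):

from optimum.onnxruntime import ORTModelForQuestionAnswering, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Load an already-exported ONNX model (the directory name is an assumption for this sketch).
model = ORTModelForQuestionAnswering.from_pretrained("distilbert_base_uncased_squad_onnx")

# Run basic and extended graph optimizations and save the optimized model to a new directory.
optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(optimization_level=2)
optimizer.optimize(save_dir="distilbert_base_uncased_squad_onnx_optimized", optimization_config=optimization_config)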

🤗 Optimum provides support for the ONNX export by leveraging configuration objects. These configuration objects come ready-made for a number of model architectures, and are designed to be easily extendable to other architectures.

To check the supported architectures, go to the configuration reference page.

Exporting a model to ONNX using the CLI

To export a 🤗 Transformers or 🤗 Diffusers model to ONNX, you first need to install some extra dependencies:

pip install optimum[exporters]

The Optimum ONNX export can be used through the Optimum command line:

optimum-cli export onnx --help

usage: optimum-cli <command> [<args>] export onnx [-h] -m MODEL [--task TASK] [--monolith] [--device DEVICE] [--opset OPSET] [--atol ATOL]
                                                  [--framework {pt,tf}] [--pad_token_id PAD_TOKEN_ID] [--cache_dir CACHE_DIR] [--trust-remote-code]
                                                  [--no-post-process] [--optimize {O1,O2,O3,O4}] [--batch_size BATCH_SIZE]
                                                  [--sequence_length SEQUENCE_LENGTH] [--num_choices NUM_CHOICES] [--width WIDTH] [--height HEIGHT]
                                                  [--num_channels NUM_CHANNELS] [--feature_size FEATURE_SIZE] [--nb_max_frames NB_MAX_FRAMES]
                                                  [--audio_sequence_length AUDIO_SEQUENCE_LENGTH]
                                                  output

optional arguments:
  -h, --help            show this help message and exit

Required arguments:
  -m MODEL, --model MODEL
                        Model ID on huggingface.co or path on disk to load model from.
  output                Path indicating the directory where to store generated ONNX model.

Optional arguments:
  --task TASK           The task to export the model for. If not specified, the task will be auto-inferred based on the model. Available tasks depend on the model, but are among: ['default', 'fill-mask', 'text-generation', 'text2text-generation', 'text-classification', 'token-classification', 'multiple-choice', 'object-detection', 'question-answering', 'image-classification', 'image-segmentation', 'masked-im', 'semantic-segmentation', 'automatic-speech-recognition', 'audio-classification', 'audio-frame-classification', 'automatic-speech-recognition', 'audio-xvector', 'image-to-text', 'zero-shot-object-detection', 'image-to-image', 'inpainting', 'text-to-image']. For decoder models, use `xxx-with-past` to export the model using past key values in the decoder.
  --monolith            Force to export the model as a single ONNX file. By default, the ONNX exporter may break the model in several ONNX files, for example for encoder-decoder models where the encoder should be run only once while the decoder is looped over.
  --device DEVICE       The device to use to do the export. Defaults to "cpu".
  --opset OPSET         If specified, ONNX opset version to export the model with. Otherwise, the default opset will be used.
  --atol ATOL           If specified, the absolute difference tolerance when validating the model. Otherwise, the default atol for the model will be used.
  --framework {pt,tf}   The framework to use for the ONNX export. If not provided, will attempt to use the local checkpoint's original framework or what is available in the environment.
  --pad_token_id PAD_TOKEN_ID
                        This is needed by some models, for some tasks. If not provided, will attempt to use the tokenizer to guess it.
  --cache_dir CACHE_DIR
                        Path indicating where to store cache.
  --trust-remote-code   Allows to use custom code for the modeling hosted in the model repository. This option should only be set for repositories you trust and in which you have read the code, as it will execute on your local machine arbitrary code present in the model repository.
  --no-post-process     Allows to disable any post-processing done by default on the exported ONNX models. For example, the merging of decoder and decoder-with-past models into a single ONNX model file to reduce memory usage.
  --optimize {O1,O2,O3,O4}
                        Allows to run ONNX Runtime optimizations directly during the export. Some of these optimizations are specific to ONNX Runtime, and the resulting ONNX will not be usable with other runtime as OpenVINO or TensorRT. Possible options:
                            - O1: Basic general optimizations
                            - O2: Basic and extended general optimizations, transformers-specific fusions
                            - O3: Same as O2 with GELU approximation
                            - O4: Same as O3 with mixed precision (fp16, GPU-only, requires `--device cuda`)

Exporting a checkpoint can be done as follows:

optimum-cli export onnx --model distilbert-base-uncased-distilled-squad distilbert_base_uncased_squad_onnx/

You should see the following logs (along with potential logs from PyTorch / TensorFlow that were hidden here for clarity):

Automatic task detection to question-answering.
Framework not specified. Using pt to export the model.
Using framework PyTorch: 1.12.1

Validating ONNX model...
        -[✓] ONNX model output names match reference model (start_logits, end_logits)
        - Validating ONNX Model output "start_logits":
                -[✓] (2, 16) matches (2, 16)
                -[✓] all values close (atol: 0.0001)
        - Validating ONNX Model output "end_logits":
                -[✓] (2, 16) matches (2, 16)
                -[✓] all values close (atol: 0.0001)
All good, model saved at: distilbert_base_uncased_squad_onnx/model.onnx

This exports an ONNX graph of the checkpoint defined by the --model argument. As you can see, the task was automatically detected. This was possible because the model is hosted on the Hub.

For local models, the --task argument needs to be provided, otherwise it will default to the model architecture without any task-specific head:

optimum-cli export onnx --model local_path --task question-answering distilbert_base_uncased_squad_onnx/

Note that providing the --task argument for a model that is on the Hub will disable the automatic task detection.
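
The same export can also be run from Python with main_export, which is used later in this guide for more advanced cases. A minimal sketch, assuming the same checkpoint as above:

from optimum.exporters.onnx import main_export

# Equivalent to the CLI export above; for Hub models the task is auto-inferred.
main_export(
    "distilbert-base-uncased-distilled-squad",
    output="distilbert_base_uncased_squad_onnx/",
)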

The resulting model.onnx file can then be run on one of the many accelerators that support the ONNX standard. For example, we can load and run the model with ONNX Runtime using the optimum.onnxruntime package as follows:

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert_base_uncased_squad_onnx")
>>> model = ORTModelForQuestionAnswering.from_pretrained("distilbert_base_uncased_squad_onnx")
>>> inputs = tokenizer("What am I using?", "Using DistilBERT with ONNX Runtime!", return_tensors="pt")
>>> outputs = model(**inputs)

Printing the outputs would give:

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-4.7652, -1.0452, -7.0409, -4.6864, -4.0277, -6.2021, -4.9473,  2.6287,
          7.6111, -1.2488, -2.0551, -0.9350,  4.9758, -0.7707,  2.1493, -2.0703,
         -4.3232, -4.9472]]), end_logits=tensor([[ 0.4382, -1.6502, -6.3654, -6.0661, -4.1482, -3.5779, -0.0774, -3.6168,
         -1.8750, -2.8910,  6.2582,  0.5425, -3.7699,  3.8232, -1.5073,  6.2311,
          3.3604, -0.0772]]), hidden_states=None, attentions=None)

As you can see, converting a model to ONNX does not mean leaving the Hugging Face ecosystem. You end up with a similar API as regular 🤗 Transformers models!
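
For instance, the model and tokenizer loaded above should also plug into the usual Transformers pipeline API; a minimal sketch:

>>> from transformers import pipeline

>>> onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
>>> onnx_qa(question="What am I using?", context="Using DistilBERT with ONNX Runtime!")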

It is also possible to export the model to ONNX directly from the ORTModelForQuestionAnswering class by doing the following:

>>> model = ORTModelForQuestionAnswering.from_pretrained("distilbert-base-uncased-distilled-squad", export=True)

For more information, check out the optimum.onnxruntime documentation page on this topic.

The process is identical for TensorFlow checkpoints on the Hub. For example, we can export a pure TensorFlow checkpoint from the Keras organization as follows:

optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_squad_onnx/

Exporting a model to be used with Optimum's ORTModel

Models exported through optimum-cli export onnx can be used directly in ORTModel. This is especially useful for encoder-decoder models, where the export splits the encoder and decoder into two .onnx files, since the encoder is usually run only once while the decoder may be run several times in autoregressive generation tasks.
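
For instance, an encoder-decoder checkpoint such as t5-small (used here only as an illustration; the output directory name is an assumption for this sketch) would be exported and reloaded along the following lines:

optimum-cli export onnx --model t5-small t5_small_onnx/

and

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForSeq2SeqLM

>>> tokenizer = AutoTokenizer.from_pretrained("t5_small_onnx")
>>> model = ORTModelForSeq2SeqLM.from_pretrained("t5_small_onnx")
>>> inputs = tokenizer("translate English to French: Hello, how are you?", return_tensors="pt")
>>> gen_tokens = model.generate(**inputs)
>>> print(tokenizer.batch_decode(gen_tokens, skip_special_tokens=True))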

Exporting a model using past keys/values in the decoder

When exporting a decoder model used for generation, it can be useful to encapsulate in the exported ONNX the reuse of past keys and values. This avoids recomputing the same intermediate activations during generation.

In the ONNX export, the past keys/values are reused by default. This behavior corresponds to --task text2text-generation-with-past, --task text-generation-with-past, or --task automatic-speech-recognition-with-past. If for any purpose you would like to disable the export with past keys/values reuse, you need to pass the task text2text-generation, text-generation or automatic-speech-recognition explicitly to optimum-cli export onnx.
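
For example, to disable the past key/values reuse when exporting GPT-2 (the output directory name is just an assumption for this sketch):

optimum-cli export onnx --model gpt2 --task text-generation gpt2_onnx_no_cache/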

A model exported with past keys/values can be reused directly in Optimum's ORTModel:

optimum-cli export onnx --model gpt2 gpt2_onnx/

and

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("./gpt2_onnx/")
>>> model = ORTModelForCausalLM.from_pretrained("./gpt2_onnx/")
>>> inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")
>>> gen_tokens = model.generate(**inputs)
>>> print(tokenizer.batch_decode(gen_tokens))
# prints ['My name is Arthur and I live in the United States of America. I am a member of the']

Selecting a task

Specifying a --task should not be necessary in most cases when exporting a model from the Hugging Face Hub.

However, in case you need to check which tasks are supported for a given model architecture, we have got you covered. First, you can check the list of supported tasks for both PyTorch and TensorFlow here.

For each model architecture, you can find the list of supported tasks via the TasksManager. For example, for DistilBERT, for the ONNX export, we have:

>>> from optimum.exporters.tasks import TasksManager

>>> distilbert_tasks = list(TasksManager.get_supported_tasks_for_model_type("distilbert", "onnx").keys())
>>> print(distilbert_tasks)
['default', 'fill-mask', 'text-classification', 'multiple-choice', 'token-classification', 'question-answering']

You can then pass one of these tasks to the --task argument in the optimum-cli export onnx command, as mentioned above.
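
For example, to export a DistilBERT checkpoint fine-tuned on SST-2 for the text-classification task (the output directory name is an assumption for this sketch):

optimum-cli export onnx --model distilbert-base-uncased-finetuned-sst-2-english --task text-classification distilbert_sst2_onnx/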

Custom export of Transformers models

Customize the export of official Transformers models

Optimum allows advanced users a finer-grained control over the configuration of the ONNX export. This is especially useful if you would like to export models with different keyword arguments, for example using output_attentions=True or output_hidden_states=True.

To support these use cases, main_export supports two arguments: model_kwargs and custom_onnx_configs, which are used in the following fashion:

  • model_kwargs allows to override some of the default arguments passed to the model's forward, in practice calling model(**reference_model_inputs, **model_kwargs).
  • custom_onnx_configs should be a Dict[str, OnnxConfig], mapping the submodel name (usually model, encoder_model, decoder_model, or decoder_with_past_model) to a custom ONNX configuration for that submodel.

A complete example allowing to export a model with output_attentions=True is given below.

from optimum.exporters.onnx import main_export
from optimum.exporters.onnx.model_configs import WhisperOnnxConfig
from transformers import AutoConfig

from optimum.exporters.onnx.base import ConfigBehavior
from typing import Dict

class CustomWhisperOnnxConfig(WhisperOnnxConfig):
    # Extend the default Whisper ONNX config so that the attention maps are part of the exported outputs.
    @property
    def outputs(self) -> Dict[str, Dict[int, str]]:
        common_outputs = super().outputs

        if self._behavior is ConfigBehavior.ENCODER:
            for i in range(self._config.encoder_layers):
                common_outputs[f"encoder_attentions.{i}"] = {0: "batch_size"}
        elif self._behavior is ConfigBehavior.DECODER:
            for i in range(self._config.decoder_layers):
                common_outputs[f"decoder_attentions.{i}"] = {
                    0: "batch_size",
                    2: "decoder_sequence_length",
                    3: "past_decoder_sequence_length + 1"
                }
            for i in range(self._config.decoder_layers):
                common_outputs[f"cross_attentions.{i}"] = {
                    0: "batch_size",
                    2: "decoder_sequence_length",
                    3: "encoder_sequence_length_out"
                }

        return common_outputs

    @property
    def torch_to_onnx_output_map(self):
        if self._behavior is ConfigBehavior.ENCODER:
            # The encoder export uses WhisperEncoder that returns the key "attentions"
            return {"attentions": "encoder_attentions"}
        else:
            return {}

model_id = "openai/whisper-tiny.en"
config = AutoConfig.from_pretrained(model_id)

custom_whisper_onnx_config = CustomWhisperOnnxConfig(
        config=config,
        task="automatic-speech-recognition",
)

encoder_config = custom_whisper_onnx_config.with_behavior("encoder")
decoder_config = custom_whisper_onnx_config.with_behavior("decoder", use_past=False)
decoder_with_past_config = custom_whisper_onnx_config.with_behavior("decoder", use_past=True)

custom_onnx_configs={
    "encoder_model": encoder_config,
    "decoder_model": decoder_config,
    "decoder_with_past_model": decoder_with_past_config,
}

main_export(
    model_id,
    output="custom_whisper_onnx",
    no_post_process=True,
    model_kwargs={"output_attentions": True},
    custom_onnx_configs=custom_onnx_configs
)

For tasks that require only a single ONNX file (e.g. encoder-only), the exported model with custom inputs/outputs can then be used with the class optimum.onnxruntime.ORTModelForCustomTasks for inference with ONNX Runtime on CPU or GPU.
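
A minimal sketch of what this could look like, assuming an encoder-only model with custom outputs was exported (together with its tokenizer) to a hypothetical custom_onnx/ directory:

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForCustomTasks

>>> model = ORTModelForCustomTasks.from_pretrained("custom_onnx")
>>> tokenizer = AutoTokenizer.from_pretrained("custom_onnx")
>>> inputs = tokenizer("ONNX Runtime with custom outputs", return_tensors="pt")
>>> outputs = model(**inputs)
>>> print(outputs.keys())  # the custom outputs declared in the ONNX config appear here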

Customize the export of Transformers models with custom modeling

Optimum supports the export of Transformers models with custom modeling that use trust_remote_code=True, which are not officially supported in the Transformers library but are usable with its features such as pipelines and generation.

Examples of such models are THUDM/chatglm2-6b and mosaicml/mpt-30b.

To export custom models, a dictionary custom_onnx_configs needs to be passed to main_export(), with the ONNX config definition for all the subparts of the model to export (for example, encoder and decoder subparts). The example below allows to export a mosaicml/mpt-7b model:

from optimum.exporters.onnx import main_export

from transformers import AutoConfig

from optimum.exporters.onnx.config import TextDecoderOnnxConfig
from optimum.utils import NormalizedTextConfig, DummyPastKeyValuesGenerator
from typing import Dict


class MPTDummyPastKeyValuesGenerator(DummyPastKeyValuesGenerator):
    """
    MPT swaps the two last dimensions for the key cache compared to usual transformers
    decoder models, thus the redefinition here.
    """
    def generate(self, input_name: str, framework: str = "pt"):
        past_key_shape = (
            self.batch_size,
            self.num_attention_heads,
            self.hidden_size // self.num_attention_heads,
            self.sequence_length,
        )
        past_value_shape = (
            self.batch_size,
            self.num_attention_heads,
            self.sequence_length,
            self.hidden_size // self.num_attention_heads,
        )
        return [
            (
                self.random_float_tensor(past_key_shape, framework=framework),
                self.random_float_tensor(past_value_shape, framework=framework),
            )
            for _ in range(self.num_layers)
        ]

class CustomMPTOnnxConfig(TextDecoderOnnxConfig):
    DUMMY_INPUT_GENERATOR_CLASSES = (MPTDummyPastKeyValuesGenerator,) + TextDecoderOnnxConfig.DUMMY_INPUT_GENERATOR_CLASSES
    DUMMY_PKV_GENERATOR_CLASS = MPTDummyPastKeyValuesGenerator

    DEFAULT_ONNX_OPSET = 14  # aten::tril operator requires opset>=14
    NORMALIZED_CONFIG_CLASS = NormalizedTextConfig.with_args(
        hidden_size="d_model",
        num_layers="n_layers",
        num_attention_heads="n_heads"
    )

    def add_past_key_values(self, inputs_or_outputs: Dict[str, Dict[int, str]], direction: str):
        """
        Adapted from https://github.com/huggingface/optimum/blob/v1.9.0/optimum/exporters/onnx/base.py#L625
        """
        if direction not in ["inputs", "outputs"]:
            raise ValueError(f'direction must either be "inputs" or "outputs", but {direction} was given')

        if direction == "inputs":
            decoder_sequence_name = "past_sequence_length"
            name = "past_key_values"
        else:
            decoder_sequence_name = "past_sequence_length + 1"
            name = "present"

        for i in range(self._normalized_config.num_layers):
            inputs_or_outputs[f"{name}.{i}.key"] = {0: "batch_size", 3: decoder_sequence_name}
            inputs_or_outputs[f"{name}.{i}.value"] = {0: "batch_size", 2: decoder_sequence_name}


model_id = "/home/fxmarty/hf_internship/optimum/tiny-mpt-random-remote-code"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)

onnx_config = CustomMPTOnnxConfig(
    config=config,
    task="text-generation",
    use_past_in_inputs=False,
    use_present_in_outputs=True,
)
onnx_config_with_past = CustomMPTOnnxConfig(config, task="text-generation", use_past=True)

custom_onnx_configs = {
    "decoder_model": onnx_config,
    "decoder_with_past_model": onnx_config_with_past,
}

main_export(
    model_id,
    output="mpt_onnx",
    task="text-generation-with-past",
    trust_remote_code=True,
    custom_onnx_configs=custom_onnx_configs,
    no_post_process=True,
)

Moreover, the advanced argument fn_get_submodels to main_export allows to customize how the submodels are extracted in case the model needs to be exported into several submodels. An example of such a function can be consulted here (link to the relevant code in utils.py once merged).
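
As a rough, hypothetical illustration of the expected shape of such a function (this sketch assumes fn_get_submodels receives the loaded model and must return a dictionary whose keys match those of custom_onnx_configs; check the linked code for the exact contract):

# Hypothetical sketch only: the exact contract of fn_get_submodels is defined in Optimum's exporter utilities.
def fn_get_submodels(model):
    # For a decoder-only model, both ONNX files are typically traced from the same underlying model.
    return {
        "decoder_model": model,
        "decoder_with_past_model": model,
    }

main_export(
    model_id,
    output="mpt_onnx",
    task="text-generation-with-past",
    trust_remote_code=True,
    custom_onnx_configs=custom_onnx_configs,
    fn_get_submodels=fn_get_submodels,
    no_post_process=True,
)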
