Optimum documentation

Export your model

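You are viewing the main version, which requires installation from source. If you would like a regular pip install, check out the latest stable version (v1.23.1).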

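To export a model hosted on the Hub, you can use our Space. After conversion, a repository is pushed to your namespace; this repository can be either public or private.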

Using the CLI

To export your model to the OpenVINO IR format, use the CLI:

optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B ov_model/

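The model argument can be either the model ID of a model hosted on the Hub or a path to a model stored locally. For a local model, you need to specify the task the model should be loaded for before export; it must be one of the supported tasks.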

optimum-cli export openvino --model local_llama --task text-generation-with-past ov_model/
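When the export is driven from a script, the two commands above can be assembled programmatically. The following is a minimal sketch, not part of Optimum: `build_export_command` is a hypothetical helper that only uses the `--model`, `--task`, and `--weight-format` flags shown on this page, and it prints the command instead of running it.

```python
import shlex


def build_export_command(model, output_dir, task=None, weight_format=None):
    """Assemble an `optimum-cli export openvino` argument list (hypothetical helper)."""
    cmd = ["optimum-cli", "export", "openvino", "--model", model]
    if task is not None:
        # Required for local checkpoints, where the task cannot be inferred from the Hub.
        cmd += ["--task", task]
    if weight_format is not None:
        cmd += ["--weight-format", weight_format]
    cmd.append(output_dir)  # positional `output` argument comes last
    return cmd


hub_cmd = build_export_command("meta-llama/Meta-Llama-3-8B", "ov_model/")
local_cmd = build_export_command(
    "local_llama", "ov_model/", task="text-generation-with-past"
)

print(shlex.join(hub_cmd))
print(shlex.join(local_cmd))
```

To actually run the export, the list can be passed to `subprocess.run(local_cmd, check=True)`, assuming `optimum-cli` is installed and on the `PATH`.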

Check the help for more options:

optimum-cli export openvino --help

usage: optimum-cli export openvino [-h] -m MODEL [--task TASK] [--framework {pt,tf}] [--trust-remote-code] [--weight-format {fp32,fp16,int8,int4}]
                                   [--library {transformers,diffusers,timm,sentence_transformers}] [--cache_dir CACHE_DIR] [--pad-token-id PAD_TOKEN_ID] [--ratio RATIO] [--sym]
                                   [--group-size GROUP_SIZE] [--dataset DATASET] [--all-layers] [--awq] [--scale-estimation] [--sensitivity-metric SENSITIVITY_METRIC] [--num-samples NUM_SAMPLES]
                                   [--disable-stateful] [--disable-convert-tokenizer]
                                   output

optional arguments:
  -h, --help            show this help message and exit

Required arguments:
  --model MODEL         Model ID on huggingface.co or path on disk to load model from.

  output                Path indicating the directory where to store the generated OV model.

Optional arguments:
  --task TASK           The task to export the model for. If not specified, the task will be auto-inferred based on the model. Available tasks depend on the model, but are among: ['image-segmentation',
                        'feature-extraction', 'mask-generation', 'audio-classification', 'conversational', 'stable-diffusion-xl', 'question-answering', 'sentence-similarity', 'text2text-generation',
                        'masked-im', 'automatic-speech-recognition', 'fill-mask', 'image-to-text', 'text-generation', 'zero-shot-object-detection', 'multiple-choice', 'object-detection', 'stable-
                        diffusion', 'audio-xvector', 'text-to-audio', 'zero-shot-image-classification', 'token-classification', 'image-classification', 'depth-estimation', 'image-to-image', 'audio-
                        frame-classification', 'semantic-segmentation', 'text-classification']. For decoder models, use `xxx-with-past` to export the model using past key values in the decoder.
  --framework {pt,tf}   The framework to use for the export. If not provided, will attempt to use the local checkpoints original framework or what is available in the environment.
  --trust-remote-code   Allows to use custom code for the modeling hosted in the model repository. This option should only be set for repositories you trust and in which you have read the code, as it
                        will execute on your local machine arbitrary code present in the model repository.
  --weight-format {fp32,fp16,int8,int4}
                        The weight format of the exported model.
  --library {transformers,diffusers,timm,sentence_transformers}
                        The library used to load the model before export. If not provided, will attempt to infer the local checkpoints library.
  --cache_dir CACHE_DIR
                        The path to a directory in which the downloaded model should be cached if the standard cache should not be used.
  --pad-token-id PAD_TOKEN_ID
                        This is needed by some models, for some tasks. If not provided, will attempt to use the tokenizer to guess it.
  --ratio RATIO         A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit quantization. If set to 0.8, 80% of the layers will be quantized to int4 while
                        20% will be quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size and inference latency. Default value is 1.0.
  --sym                 Whether to apply symmetric quantization
  --group-size GROUP_SIZE
                        The group size to use for int4 quantization. Recommended value is 128 and -1 will results in per-column quantization.
  --dataset DATASET     The dataset used for data-aware compression or quantization with NNCF. You can use the one from the list ['wikitext2','c4','c4-new'] for language models or
                        ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'] for diffusion models.
  --all-layers          Whether embeddings and last MatMul layers should be compressed to INT4. If not provided an weight compression is applied, they are compressed to INT8.
  --awq                 Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs, but requires additional time for tuning weights on a calibration dataset. To run AWQ,
                        please also provide a dataset argument. Note: it is possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped.
  --scale-estimation    Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between the original and compressed layers. Providing a dataset is required to run scale
                        estimation. Please note, that applying scale estimation takes additional memory and time.
  --sensitivity-metric SENSITIVITY_METRIC
                        The sensitivity metric for assigning quantization precision to layers. Can be one of the following: ['weight_quantization_error', 'hessian_input_activation',
                        'mean_activation_variance', 'max_activation_variance', 'mean_activation_magnitude'].
  --num-samples NUM_SAMPLES
                        The maximum number of samples to take from the dataset for quantization.
  --disable-stateful    Disable stateful converted models, stateless models will be generated instead. Stateful models are produced by default when this key is not used. In stateful models all kv-cache
                        inputs and outputs are hidden in the model and are not exposed as model inputs and outputs. If --disable-stateful option is used, it may result in sub-optimal inference
                        performance. Use it when you intentionally want to use a stateless model, for example, to be compatible with existing OpenVINO native inference code that expects kv-cache inputs
                        and outputs in the model.
  --disable-convert-tokenizer
                        Do not add converted tokenizer and detokenizer OpenVINO models.

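You can also apply fp16, 8-bit, or 4-bit weight-only quantization to the linear, convolutional, and embedding layers when exporting your model by setting --weight-format to fp16, int8, or int4, respectively.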

optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int8 ov_model/

For more information on the quantization parameters, see the documentation.

By default, models with more than 1 billion parameters are exported to the OpenVINO format with 8-bit weights. You can disable this with --weight-format fp32.
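To get a feel for why large models default to 8-bit weights, the rough arithmetic can be sketched as follows. This is a back-of-the-envelope estimate under stated assumptions: it counts only the raw weight bits, ignores quantization scales, zero points, and activations, and treats an 8B-parameter model as exactly 8 billion weights. The `average_bits` function additionally assumes all layers are the same size when modelling the `--ratio` option. Both function names are illustrative.

```python
# Bits used to store one weight in each --weight-format option.
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_size_gb(num_params, weight_format):
    """Approximate size of the weights in gigabytes (10^9 bytes)."""
    return num_params * BITS_PER_WEIGHT[weight_format] / 8 / 1e9


def average_bits(ratio):
    """Average bits per weight when a `ratio` share is int4 and the rest is int8.

    Assumes equally sized layers, which real models do not guarantee.
    """
    return ratio * 4 + (1 - ratio) * 8


num_params = 8_000_000_000  # assumed round figure for an 8B model
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: ~{weight_size_gb(num_params, fmt):.0f} GB")

print(f"--ratio 0.8 -> ~{average_bits(0.8):.1f} bits per weight")
```

Under these assumptions, going from fp32 to int8 cuts the weight storage by a factor of four, and a ratio of 0.8 lands between pure int4 and pure int8.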

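Decoder models

For models with a decoder, reuse of past keys and values is enabled by default. This avoids recomputing the same intermediate activations at every generation step. To export the model without past keys and values, remove the -with-past suffix when specifying the task.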

With K-V cache                            Without K-V cache
text-generation-with-past                 text-generation
text2text-generation-with-past            text2text-generation
automatic-speech-recognition-with-past    automatic-speech-recognition
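The naming rule in the table is purely a suffix convention, so switching between the two variants can be sketched in a few lines. The helpers below are hypothetical and only manipulate the task string; they do not check whether a given task actually supports a K-V cache.

```python
PAST_SUFFIX = "-with-past"


def without_past(task):
    """Return the task name for exporting without K-V cache reuse."""
    if task.endswith(PAST_SUFFIX):
        return task[: -len(PAST_SUFFIX)]
    return task


def with_past(task):
    """Return the task name for exporting with K-V cache reuse."""
    if task.endswith(PAST_SUFFIX):
        return task
    return task + PAST_SUFFIX


for task in ("text-generation", "text2text-generation", "automatic-speech-recognition"):
    print(f"{with_past(task)} <-> {without_past(with_past(task))}")
```

The resulting string is what would be passed to `--task` on the command line.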

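Diffusion models

When Stable Diffusion models are exported to the OpenVINO format, they are decomposed into separate components that are combined again during inference:

  • Text encoder
  • U-Net
  • VAE encoder
  • VAE decoder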

To export your Stable Diffusion XL model to the OpenVINO IR format with the CLI, run:

optimum-cli export openvino --model stabilityai/stable-diffusion-xl-base-1.0 ov_sdxl/
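Once the export above has finished, the components in the output directory can be loaded back as a single pipeline. The sketch below assumes that optimum-intel with OpenVINO support and diffusers are installed, and that the export was written to `ov_sdxl/`; the `generate_image` wrapper and its default prompt are illustrative, not part of the library.

```python
def generate_image(model_dir="ov_sdxl", prompt="a photo of a sailboat at sunset"):
    """Load an exported SDXL pipeline and generate one image (illustrative wrapper)."""
    # Imported lazily so that defining this function does not require optimum-intel.
    from optimum.intel import OVStableDiffusionXLPipeline

    # Loads the exported components from the directory and reassembles them.
    pipeline = OVStableDiffusionXLPipeline.from_pretrained(model_dir)
    # The pipeline returns an object whose `images` attribute is a list of PIL images.
    return pipeline(prompt).images[0]
```

Calling `generate_image().save("sailboat.png")` would then write the result to disk; generation time depends heavily on the device.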

When loading your model

You can also load your PyTorch checkpoint and convert it to the OpenVINO format on the fly by setting export=True when loading your model.

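To save the resulting model, use the save_pretrained() method, which writes both the BIN and XML files describing the graph. It is useful to save the tokenizer to the same directory so that the tokenizer used with the model can be loaded easily later.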

- from transformers import AutoModelForCausalLM
+ from optimum.intel import OVModelForCausalLM
  from transformers import AutoTokenizer

  model_id = "meta-llama/Meta-Llama-3-8B"
- model = AutoModelForCausalLM.from_pretrained(model_id)
+ model = OVModelForCausalLM.from_pretrained(model_id, export=True)
  tokenizer = AutoTokenizer.from_pretrained(model_id)

  save_directory = "ov_model"
  model.save_pretrained(save_directory)
  tokenizer.save_pretrained(save_directory)

After loading your model

from transformers import AutoModelForCausalLM
from optimum.exporters.openvino import export_from_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
export_from_model(model, output="ov_model", task="text-generation-with-past")

Once the model is exported, you can load your OpenVINO model by replacing the AutoModelForXxx class with the corresponding OVModelForXxx class.
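As a concrete instance of that class swap, the exported causal language model can be loaded and used for generation roughly as follows. This is a sketch that assumes optimum-intel and transformers are installed, and that both the model and its tokenizer were saved to `ov_model`; the `generate_text` wrapper and its parameter defaults are illustrative.

```python
def generate_text(prompt, model_dir="ov_model", max_new_tokens=20):
    """Load an exported OpenVINO model and generate a continuation (illustrative wrapper)."""
    # Imported lazily so that defining this function does not require optimum-intel.
    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    # No export=True here: the directory already contains the OpenVINO IR files.
    model = OVModelForCausalLM.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `print(generate_text("OpenVINO is"))` would print the prompt followed by up to 20 generated tokens.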
