Export your model
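To export a model hosted on the Hub, you can use our Space. After conversion, a repository is pushed under your namespace; it can be either public or private.
Using the CLI
To export your model to the OpenVINO IR format with the CLI: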
optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B ov_model/
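To export a private or gated model, you can run `huggingface-cli login` to log in permanently, or set the environment variable `HF_TOKEN` to a token that has access to the model. See the authentication documentation for more information.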
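If the export is driven from a Python script, one possible way to pass the token is to set `HF_TOKEN` only for the child process instead of exporting it in the shell. The sketch below is an illustration, not part of Optimum: the helper names `export_env` and `export_gated_model` are hypothetical, and it assumes `optimum-cli` is available on the `PATH`.

```python
import os
import subprocess
from typing import Dict


def export_env(token: str) -> Dict[str, str]:
    """Return a copy of the current environment with HF_TOKEN set (hypothetical helper)."""
    env = dict(os.environ)
    env["HF_TOKEN"] = token
    return env


def export_gated_model(model_id: str, output_dir: str, token: str) -> None:
    """Run the export CLI so that the token is visible only to the child process."""
    command = ["optimum-cli", "export", "openvino", "--model", model_id, output_dir]
    # check=True raises CalledProcessError if the export fails.
    subprocess.run(command, env=export_env(token), check=True)
```

Because `export_env` works on a copy, the token does not leak into the parent process environment.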
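The model argument can be either the ID of a model hosted on the Hub or a path to a model stored locally. For local models, you need to specify the task the model should be loaded for before export, chosen from the list of supported tasks.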
optimum-cli export openvino --model local_llama --task text-generation-with-past ov_model/
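The rule above (a local path needs an explicit task, a Hub ID does not) can be sketched as a small command builder. This is only an illustration: `build_export_command` is a hypothetical helper, and treating "any existing directory" as a local model is an assumption made for the example.

```python
import shlex
from pathlib import Path
from typing import List, Optional


def build_export_command(model: str, output_dir: str, task: Optional[str] = None) -> List[str]:
    """Build the argument list for `optimum-cli export openvino` (hypothetical helper)."""
    # Assumption for this sketch: an existing directory is a local model.
    if Path(model).is_dir() and task is None:
        raise ValueError("A local model needs an explicit --task, e.g. text-generation-with-past")
    command = ["optimum-cli", "export", "openvino", "--model", model]
    if task is not None:
        command += ["--task", task]
    command.append(output_dir)
    return command


# Print the command for a Hub model ID; nothing is executed here.
print(shlex.join(build_export_command("meta-llama/Meta-Llama-3-8B", "ov_model/")))
```

The returned list can be passed directly to `subprocess.run(..., check=True)`.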
Check out the help for more options (`optimum-cli export openvino --help`):
usage: optimum-cli export openvino [-h] -m MODEL [--task TASK] [--framework {pt,tf}] [--trust-remote-code]
                                   [--weight-format {fp32,fp16,int8,int4,mxfp4,nf4,cb4}]
                                   [--quant-mode {int8,f8e4m3,f8e5m2,nf4_f8e4m3,nf4_f8e5m2,cb4_f8e4m3,int4_f8e4m3,int4_f8e5m2}]
                                   [--library {transformers,diffusers,timm,sentence_transformers,open_clip}]
                                   [--cache_dir CACHE_DIR] [--pad-token-id PAD_TOKEN_ID] [--ratio RATIO] [--sym]
                                   [--group-size GROUP_SIZE] [--backup-precision {none,int8_sym,int8_asym}]
                                   [--dataset DATASET] [--all-layers] [--awq] [--scale-estimation] [--gptq]
                                   [--lora-correction] [--sensitivity-metric SENSITIVITY_METRIC]
                                   [--quantization-statistics-path QUANTIZATION_STATISTICS_PATH]
                                   [--num-samples NUM_SAMPLES] [--disable-stateful] [--disable-convert-tokenizer]
                                   [--smooth-quant-alpha SMOOTH_QUANT_ALPHA]
                                   output

optional arguments:
  -h, --help            show this help message and exit

Required arguments:
  -m MODEL, --model MODEL
                        Model ID on huggingface.co or path on disk to load model from.
  output                Path indicating the directory where to store the generated OV model.

Optional arguments:
  --task TASK           The task to export the model for. If not specified, the task will be auto-inferred based on
                        the model. Available tasks depend on the model, but are among: ['image-to-image',
                        'image-segmentation', 'inpainting', 'sentence-similarity', 'text-to-audio', 'image-to-text',
                        'automatic-speech-recognition', 'token-classification', 'text-to-image',
                        'audio-classification', 'feature-extraction', 'semantic-segmentation', 'masked-im',
                        'audio-xvector', 'audio-frame-classification', 'text2text-generation', 'multiple-choice',
                        'depth-estimation', 'image-classification', 'fill-mask', 'zero-shot-object-detection',
                        'object-detection', 'question-answering', 'zero-shot-image-classification',
                        'mask-generation', 'text-generation', 'text-classification']. For decoder models, use
                        'xxx-with-past' to export the model using past key values in the decoder.
  --framework {pt,tf}   The framework to use for the export. If not provided, will attempt to use the local
                        checkpoint's original framework or what is available in the environment.
  --trust-remote-code   Allows to use custom code for the modeling hosted in the model repository. This option
                        should only be set for repositories you trust and in which you have read the code, as it
                        will execute on your local machine arbitrary code present in the model repository.
  --weight-format {fp32,fp16,int8,int4,mxfp4,nf4,cb4}
                        The weight format of the exported model. Option 'cb4' represents a codebook with 16 fixed
                        fp8 values in E4M3 format.
  --quant-mode {int8,f8e4m3,f8e5m2,nf4_f8e4m3,nf4_f8e5m2,cb4_f8e4m3,int4_f8e4m3,int4_f8e5m2}
                        Quantization precision mode. This is used for applying full model quantization including
                        activations.
  --library {transformers,diffusers,timm,sentence_transformers,open_clip}
                        The library used to load the model before export. If not provided, will attempt to infer
                        the local checkpoint's library
  --cache_dir CACHE_DIR
                        The path to a directory in which the downloaded model should be cached if the standard
                        cache should not be used.
  --pad-token-id PAD_TOKEN_ID
                        This is needed by some models, for some tasks. If not provided, will attempt to use the
                        tokenizer to guess it.
  --ratio RATIO         A parameter used when applying 4-bit quantization to control the ratio between 4-bit and
                        8-bit quantization. If set to 0.8, 80% of the layers will be quantized to int4 while 20%
                        will be quantized to int8. This helps to achieve better accuracy at the sacrifice of the
                        model size and inference latency. Default value is 1.0. Note: If dataset is provided, and
                        the ratio is less than 1.0, then data-aware mixed precision assignment will be applied.
  --sym                 Whether to apply symmetric quantization. This argument is related to integer-typed
                        --weight-format and --quant-mode options. In case of full or mixed quantization
                        (--quant-mode) symmetric quantization will be applied to weights in any case, so only
                        activation quantization will be affected by --sym argument. For weight-only quantization
                        (--weight-format) --sym argument does not affect backup precision. Examples:
                        (1) --weight-format int8 --sym => int8 symmetric quantization of weights;
                        (2) --weight-format int4 => int4 asymmetric quantization of weights;
                        (3) --weight-format int4 --sym --backup-precision int8_asym => int4 symmetric quantization
                        of weights with int8 asymmetric backup precision;
                        (4) --quant-mode int8 --sym => weights and activations are quantized to int8 symmetric
                        data type;
                        (5) --quant-mode int8 => activations are quantized to int8 asymmetric data type,
                        weights -- to int8 symmetric data type;
                        (6) --quant-mode int4_f8e5m2 --sym => activations are quantized to f8e5m2 data type,
                        weights -- to int4 symmetric data type.
  --group-size GROUP_SIZE
                        The group size to use for quantization. Recommended value is 128 and -1 uses per-column
                        quantization.
  --backup-precision {none,int8_sym,int8_asym}
                        Defines a backup precision for mixed-precision weight compression. Only valid for 4-bit
                        weight formats. If not provided, backup precision is int8_asym. 'none' stands for original
                        floating-point precision of the model weights, in this case weights are retained in their
                        original precision without any quantization. 'int8_sym' stands for 8-bit integer symmetric
                        quantization without zero point. 'int8_asym' stands for 8-bit integer asymmetric
                        quantization with zero points per each quantization group.
  --dataset DATASET     The dataset used for data-aware compression or quantization with NNCF. For language models
                        you can use the one from the list ['auto','wikitext2','c4','c4-new']. With 'auto' the
                        dataset will be collected from model's generations. For diffusion models it should be on of
                        ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'].
                        For visual language models the dataset must be set to 'contextual'. Note: if none of the
                        data-aware compression algorithms are selected and ratio parameter is omitted or equals
                        1.0, the dataset argument will not have an effect on the resulting model.
  --all-layers          Whether embeddings and last MatMul layers should be compressed to INT4. If not provided an
                        weight compression is applied, they are compressed to INT8.
  --awq                 Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs.
                        If dataset is provided, a data-aware activation-based version of the algorithm will be
                        executed, which requires additional time. Otherwise, data-free AWQ will be applied which
                        relies on per-column magnitudes of weights instead of activations. Note: it is possible
                        that there will be no matching patterns in the model to apply AWQ, in such case it will be
                        skipped.
  --scale-estimation    Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between
                        the original and compressed layers. Providing a dataset is required to run scale
                        estimation. Please note, that applying scale estimation takes additional memory and time.
  --gptq                Indicates whether to apply GPTQ algorithm that optimizes compressed weights in a layer-wise
                        fashion to minimize the difference between activations of a compressed and original layer.
                        Please note, that applying GPTQ takes additional memory and time.
  --lora-correction     Indicates whether to apply LoRA Correction algorithm. When enabled, this algorithm
                        introduces low-rank adaptation layers in the model that can recover accuracy after weight
                        compression at some cost of inference latency. Please note, that applying LoRA Correction
                        algorithm takes additional memory and time.
  --sensitivity-metric SENSITIVITY_METRIC
                        The sensitivity metric for assigning quantization precision to layers. It can be one of
                        the following: ['weight_quantization_error', 'hessian_input_activation',
                        'mean_activation_variance', 'max_activation_variance', 'mean_activation_magnitude'].
  --quantization-statistics-path QUANTIZATION_STATISTICS_PATH
                        Directory path to dump/load data-aware weight-only quantization statistics. This is useful
                        when running data-aware quantization multiple times on the same model and dataset to avoid
                        recomputing statistics. This option is applicable exclusively for weight-only
                        quantization. Please note that the statistics depend on the dataset, so if you change the
                        dataset, you should also change the statistics path to avoid confusion.
  --num-samples NUM_SAMPLES
                        The maximum number of samples to take from the dataset for quantization.
  --disable-stateful    Disable stateful converted models, stateless models will be generated instead. Stateful
                        models are produced by default when this key is not used. In stateful models all kv-cache
                        inputs and outputs are hidden in the model and are not exposed as model inputs and
                        outputs. If --disable-stateful option is used, it may result in sub-optimal inference
                        performance. Use it when you intentionally want to use a stateless model, for example, to
                        be compatible with existing OpenVINO native inference code that expects KV-cache inputs
                        and outputs in the model.
  --disable-convert-tokenizer
                        Do not add converted tokenizer and detokenizer OpenVINO models.
  --smooth-quant-alpha SMOOTH_QUANT_ALPHA
                        SmoothQuant alpha parameter that improves the distribution of activations before MatMul
                        layers and reduces quantization error. Valid only when activations quantization is
                        enabled.
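The help text for `--sym` and `--backup-precision` distinguishes symmetric quantization (no zero point) from asymmetric quantization (with a zero point). The following pure-Python sketch illustrates the textbook versions of both schemes on a single group of weights. It is an assumption-laden illustration, not NNCF's actual implementation; the function names and the example weights are made up for this purpose.

```python
from typing import List, Tuple


def quantize_symmetric(weights: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric scheme: one scale, no zero point, integer range centred on 0."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero group: any scale reproduces the input
    quantized = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def quantize_asymmetric(weights: List[float], bits: int = 8) -> Tuple[List[int], float, int]:
    """Asymmetric scheme: a scale plus a zero point, unsigned integer range."""
    qmax = 2 ** bits - 1  # 255 for 8 bits
    low = min(min(weights), 0.0)   # keep 0.0 exactly representable
    high = max(max(weights), 0.0)
    scale = (high - low) / qmax
    if scale == 0.0:
        scale = 1.0
    zero_point = round(-low / scale)
    quantized = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return quantized, scale, zero_point


def max_abs_error(original: List[float], restored: List[float]) -> float:
    return max(abs(a - b) for a, b in zip(original, restored))


weights = [-0.1, 0.0, 0.5, 1.0]  # hypothetical group, skewed toward positive values

q_sym, s_sym = quantize_symmetric(weights)
restored_sym = [q * s_sym for q in q_sym]

q_asym, s_asym, zp = quantize_asymmetric(weights)
restored_asym = [(q - zp) * s_asym for q in q_asym]

print("symmetric  max error:", max_abs_error(weights, restored_sym))
print("asymmetric max error:", max_abs_error(weights, restored_asym))
```

For a skewed group like this one, the asymmetric scheme spreads its 256 levels over the actual value range, so its reconstruction error is smaller, at the cost of storing a zero point.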
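You can also apply fp16, 8-bit, or 4-bit weight-only quantization to the Linear, Convolutional, and Embedding layers when exporting your model by setting `--weight-format` to `fp16`, `int8`, or `int4`, respectively.
Export with INT8 weight compression: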
optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int8 ov_model/
Export with INT4 weight compression:
optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int4 ov_model/
Export with INT4 weight compression and the data-free AWQ algorithm:
optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int4 --awq ov_model/
Export with INT4 weight compression and the data-aware AWQ and Scale Estimation algorithms:
optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B \
--weight-format int4 --awq --scale-estimation --dataset wikitext2 ov_model/
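For more information on the quantization parameters, check out the documentation.
By default, models with more than 1 billion parameters are exported to the OpenVINO format with 8-bit weights. You can disable this with `--weight-format fp32`.
Besides weight-only quantization, you can also apply full model quantization, including activations, by setting `--quant-mode` to the preferred precision. This quantizes both the weights and the activations of Linear, Convolutional, and some other layers to the selected mode. See the example below.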
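To get a feel for what the weight format means for model size, here is a back-of-the-envelope estimate. It is a rough sketch under stated assumptions: every weight is stored at the nominal bit width, scales and zero points are ignored, and for `int4` the share given by `--ratio` is stored in 4 bits while the rest uses an 8-bit backup precision. The function names and the 8-billion parameter count are illustrative.

```python
def average_bits(weight_format: str, ratio: float = 1.0) -> float:
    """Approximate bits per weight; ratio only matters for int4."""
    bits = {"fp32": 32.0, "fp16": 16.0, "int8": 8.0, "int4": 4.0}[weight_format]
    if weight_format == "int4":
        # `ratio` of the weights in 4 bits, the remainder in 8-bit backup precision.
        return ratio * 4.0 + (1.0 - ratio) * 8.0
    return bits


def estimated_weight_gb(num_params: float, weight_format: str, ratio: float = 1.0) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * average_bits(weight_format, ratio) / 8 / 1e9


num_params = 8e9  # assumed order of magnitude for an 8B-parameter model
for fmt in ("fp32", "fp16", "int8", "int4"):
    print(fmt, round(estimated_weight_gb(num_params, fmt), 1), "GB")
print("int4, ratio 0.8:", round(estimated_weight_gb(num_params, "int4", ratio=0.8), 1), "GB")
```

Real exported files differ somewhat, since quantization metadata and layers kept at other precisions are not counted here.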
optimum-cli export openvino -m openai/whisper-large-v3-turbo --quant-mode int8 --dataset librispeech --num-samples 32 --smooth-quant-alpha 0.9 ./whisper-large-v3-turbo
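Default quantization configs
For some models, we maintain a set of default quantization configs (link). To apply the default 4-bit weight-only quantization, provide `--weight-format int4` without any additional arguments. For int8 weight and activation quantization, provide `--quant-mode int8`. For example: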
optimum-cli export openvino -m microsoft/Phi-4-mini-instruct --weight-format int4 ./Phi-4-mini-instruct
or
optimum-cli export openvino -m openai/clip-vit-base-patch16 --quant-mode int8 ./clip-vit-base-patch16
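Decoder models
For models with a decoder, we enable the reuse of past keys and values by default. This avoids recomputing the same intermediate activations at each generation step. To export the model without this feature, remove the `-with-past` suffix when specifying the task.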
With K-V cache | Without K-V cache |
---|---|
text-generation-with-past | text-generation |
text2text-generation-with-past | text2text-generation |
automatic-speech-recognition-with-past | automatic-speech-recognition |
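The benefit of reusing past keys and values can be illustrated with a simple count of how many token positions the decoder has to process. This is a toy model under simplifying assumptions (greedy generation, one new token per step, every processed position costs the same); the function names are hypothetical.

```python
def positions_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Without a K-V cache, every step re-processes the whole sequence so far."""
    return sum(prompt_len + step for step in range(new_tokens))


def positions_with_cache(prompt_len: int, new_tokens: int) -> int:
    """With a K-V cache, the prompt is processed once, then one position per step."""
    if new_tokens == 0:
        return 0
    return prompt_len + (new_tokens - 1)


prompt_len, new_tokens = 8, 4
print("without cache:", positions_without_cache(prompt_len, new_tokens))
print("with cache:   ", positions_with_cache(prompt_len, new_tokens))
```

The gap grows quadratically with the number of generated tokens, which is why the `-with-past` variants are the default for decoder models.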
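Diffusion models
When Stable Diffusion models are exported to the OpenVINO format, they are decomposed into different components that are later combined during inference:
- Text encoder(s)
- U-Net
- VAE encoder
- VAE decoder
To export your Stable Diffusion XL model to the OpenVINO IR format with the CLI, you can do as follows: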
optimum-cli export openvino --model stabilityai/stable-diffusion-xl-base-1.0 ov_sdxl/
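After such an export, it can be useful to check which components actually ended up on disk. The sketch below walks an export directory and lists every IR graph it finds. It makes no assumption about the subdirectory names; it only assumes that each component is stored as an `.xml` file with a matching `.bin` file. `list_ir_components` is a hypothetical helper, not an Optimum function.

```python
from pathlib import Path
from typing import Dict, List


def list_ir_components(export_dir: str) -> Dict[str, List[str]]:
    """Map each subdirectory to the IR .xml files that have a matching .bin file."""
    root = Path(export_dir)
    components: Dict[str, List[str]] = {}
    for xml_file in sorted(root.rglob("*.xml")):
        # An IR model is an .xml graph description plus a .bin weights file.
        if xml_file.with_suffix(".bin").exists():
            key = xml_file.parent.relative_to(root).as_posix()
            components.setdefault(key, []).append(xml_file.name)
    return components


if __name__ == "__main__":
    # "ov_sdxl/" is the output directory used in the export command above.
    for folder, files in list_ir_components("ov_sdxl/").items():
        print(folder, files)
```

If the directory does not exist yet, the function simply returns an empty dictionary.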
You can also apply hybrid quantization during model export. For example:
optimum-cli export openvino --model stabilityai/stable-diffusion-xl-base-1.0 \
--weight-format int8 --dataset conceptual_captions ov_sdxl/
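For more information on hybrid quantization, see this Jupyter notebook.
When loading your model
You can also load your PyTorch checkpoint and convert it to the OpenVINO format on the fly by setting `export=True` when loading your model.
To easily save the resulting model, you can use the `save_pretrained()` method, which saves both the BIN and XML files describing the graph. Save the tokenizer to the same directory so that the tokenizer matching the model can be loaded easily.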
- from transformers import AutoModelForCausalLM
+ from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
model_id = "meta-llama/Meta-Llama-3-8B"
- model = AutoModelForCausalLM.from_pretrained(model_id)
+ model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
save_directory = "ov_model"
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
After loading your model
from transformers import AutoModelForCausalLM
from optimum.exporters.openvino import export_from_model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
export_from_model(model, output="ov_model", task="text-generation-with-past")
Once the model is exported, you can load your OpenVINO model by replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.
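As a minimal sketch of that last step, the snippet below loads a previously exported causal language model and generates a short completion. It assumes `optimum-intel` and `transformers` are installed and that the model and tokenizer were saved to the same directory (such as `ov_model` in the examples above). The helper `ov_class_name` is hypothetical and only illustrates the naming rule; not every `AutoModelForXxx` class necessarily has an `OVModelForXxx` counterpart.

```python
def ov_class_name(auto_class_name: str) -> str:
    """Apply the AutoModelForXxx -> OVModelForXxx naming rule (hypothetical helper)."""
    prefix = "AutoModelFor"
    if not auto_class_name.startswith(prefix):
        raise ValueError(f"Expected a name starting with {prefix!r}, got {auto_class_name!r}")
    return "OVModelFor" + auto_class_name[len(prefix):]


def generate_with_exported_model(model_dir: str, prompt: str, max_new_tokens: int = 20) -> str:
    """Load an exported OpenVINO causal LM from disk and generate a completion."""
    # Imported lazily so the naming helper works without optimum-intel installed.
    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    model = OVModelForCausalLM.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


print(ov_class_name("AutoModelForCausalLM"))
```

Calling `generate_with_exported_model("ov_model", "Hello")` would then run inference on the exported model without the `export=True` conversion step.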