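Export your model
To export a model hosted on the Hub, you can use our Space. After conversion, a repository is pushed under your namespace; it can be either public or private.
Using the CLI
To export your model to the OpenVINO IR format, use the CLI: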
optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B ov_model/
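The model argument can be either the ID of a model hosted on the Hub or a path to a model stored locally. For local models, you need to specify the task the model should be loaded for before export, chosen from the list of supported tasks.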
optimum-cli export openvino --model local_llama --task text-generation-with-past ov_model/
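When exporting several models from a script, it can help to assemble the command programmatically. The following is a minimal sketch and not part of Optimum: build_export_command is a hypothetical helper that uses only the --model, --task, and --weight-format flags listed in the CLI help, and it returns the argument list without running anything.

```python
import shlex
from typing import List, Optional

# Values accepted by --weight-format according to the CLI help.
WEIGHT_FORMATS = {"fp32", "fp16", "int8", "int4"}


def build_export_command(
    model: str,
    output_dir: str,
    task: Optional[str] = None,
    weight_format: Optional[str] = None,
) -> List[str]:
    """Assemble an `optimum-cli export openvino` argument list (hypothetical helper)."""
    cmd = ["optimum-cli", "export", "openvino", "--model", model]
    if task is not None:
        # Local checkpoints need an explicit task, e.g. text-generation-with-past.
        cmd += ["--task", task]
    if weight_format is not None:
        if weight_format not in WEIGHT_FORMATS:
            raise ValueError(f"unsupported weight format: {weight_format}")
        cmd += ["--weight-format", weight_format]
    # The output directory is the final positional argument.
    cmd.append(output_dir)
    return cmd


cmd = build_export_command("local_llama", "ov_model/", task="text-generation-with-past")
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once optimum-cli is installed in the environment.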
Check out the help for more options:
optimum-cli export openvino --help
usage: optimum-cli export openvino [-h] -m MODEL [--task TASK] [--framework {pt,tf}] [--trust-remote-code] [--weight-format {fp32,fp16,int8,int4}]
[--library {transformers,diffusers,timm,sentence_transformers}] [--cache_dir CACHE_DIR] [--pad-token-id PAD_TOKEN_ID] [--ratio RATIO] [--sym]
[--group-size GROUP_SIZE] [--dataset DATASET] [--all-layers] [--awq] [--scale-estimation] [--sensitivity-metric SENSITIVITY_METRIC] [--num-samples NUM_SAMPLES]
[--disable-stateful] [--disable-convert-tokenizer]
output
optional arguments:
-h, --help show this help message and exit
Required arguments:
--model MODEL Model ID on huggingface.co or path on disk to load model from.
output Path indicating the directory where to store the generated OV model.
Optional arguments:
--task TASK The task to export the model for. If not specified, the task will be auto-inferred based on the model. Available tasks depend on the model, but are among: ['image-segmentation',
'feature-extraction', 'mask-generation', 'audio-classification', 'conversational', 'stable-diffusion-xl', 'question-answering', 'sentence-similarity', 'text2text-generation',
'masked-im', 'automatic-speech-recognition', 'fill-mask', 'image-to-text', 'text-generation', 'zero-shot-object-detection', 'multiple-choice', 'object-detection', 'stable-
diffusion', 'audio-xvector', 'text-to-audio', 'zero-shot-image-classification', 'token-classification', 'image-classification', 'depth-estimation', 'image-to-image', 'audio-
frame-classification', 'semantic-segmentation', 'text-classification']. For decoder models, use `xxx-with-past` to export the model using past key values in the decoder.
--framework {pt,tf} The framework to use for the export. If not provided, will attempt to use the local checkpoints original framework or what is available in the environment.
--trust-remote-code Allows to use custom code for the modeling hosted in the model repository. This option should only be set for repositories you trust and in which you have read the code, as it
will execute on your local machine arbitrary code present in the model repository.
--weight-format {fp32,fp16,int8,int4}
The weight format of the exported model.
--library {transformers,diffusers,timm,sentence_transformers}
The library used to load the model before export. If not provided, will attempt to infer the local checkpoints library.
--cache_dir CACHE_DIR
The path to a directory in which the downloaded model should be cached if the standard cache should not be used.
--pad-token-id PAD_TOKEN_ID
This is needed by some models, for some tasks. If not provided, will attempt to use the tokenizer to guess it.
--ratio RATIO A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit quantization. If set to 0.8, 80% of the layers will be quantized to int4 while
20% will be quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size and inference latency. Default value is 1.0.
--sym Whether to apply symmetric quantization
--group-size GROUP_SIZE
The group size to use for int4 quantization. Recommended value is 128 and -1 will results in per-column quantization.
--dataset DATASET The dataset used for data-aware compression or quantization with NNCF. You can use the one from the list ['wikitext2','c4','c4-new'] for language models or
['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'] for diffusion models.
--all-layers Whether embeddings and last MatMul layers should be compressed to INT4. If not provided an weight compression is applied, they are compressed to INT8.
--awq Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs, but requires additional time for tuning weights on a calibration dataset. To run AWQ,
please also provide a dataset argument. Note: it is possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped.
--scale-estimation Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between the original and compressed layers. Providing a dataset is required to run scale
estimation. Please note, that applying scale estimation takes additional memory and time.
--sensitivity-metric SENSITIVITY_METRIC
The sensitivity metric for assigning quantization precision to layers. Can be one of the following: ['weight_quantization_error', 'hessian_input_activation',
'mean_activation_variance', 'max_activation_variance', 'mean_activation_magnitude'].
--num-samples NUM_SAMPLES
The maximum number of samples to take from the dataset for quantization.
--disable-stateful Disable stateful converted models, stateless models will be generated instead. Stateful models are produced by default when this key is not used. In stateful models all kv-cache
inputs and outputs are hidden in the model and are not exposed as model inputs and outputs. If --disable-stateful option is used, it may result in sub-optimal inference
performance. Use it when you intentionally want to use a stateless model, for example, to be compatible with existing OpenVINO native inference code that expects kv-cache inputs
and outputs in the model.
--disable-convert-tokenizer
Do not add converted tokenizer and detokenizer OpenVINO models.
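You can also apply fp16, 8-bit, or 4-bit weight-only quantization to the linear, convolutional, and embedding layers when exporting your model by setting --weight-format to fp16, int8, or int4, respectively: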
optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int8 ov_model/
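For more information on the quantization parameters, see the documentation.
By default, models with more than 1 billion parameters are exported to the OpenVINO format with 8-bit weights. You can disable this with --weight-format fp32.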
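A rough estimate of weight storage shows why large models are compressed by default. The sketch below assumes storage is simply the parameter count times the bits per weight; it ignores quantization scales, zero points, and any layers kept at higher precision. The 8-billion parameter count is a round approximation for the model used in the examples above, and approx_weight_gb is a hypothetical helper.

```python
# Bits used to store one weight in each format accepted by --weight-format.
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def approx_weight_gb(num_params: int, weight_format: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[weight_format]
    return num_params * bits / 8 / 1e9


for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: ~{approx_weight_gb(8_000_000_000, fmt):.0f} GB")
```

Under these assumptions, each step down in precision halves the weight footprint; actual file sizes will be somewhat larger because of the quantization metadata.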
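Decoder models
For models with a decoder, reuse of past keys and values is enabled by default. This avoids recomputing the same intermediate activations at each generation step. To export the model without past keys and values, remove the -with-past suffix when specifying the task.
With K-V cache | Without K-V cache |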
---|---|
text-generation-with-past | text-generation |
text2text-generation-with-past | text2text-generation |
automatic-speech-recognition-with-past | automatic-speech-recognition |
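The mapping in the table is a plain suffix rule, which can be sketched as follows. The helper names with_past and without_past are hypothetical and only manipulate the task string; they do not check whether a given task actually supports a K-V cache.

```python
PAST_SUFFIX = "-with-past"


def without_past(task: str) -> str:
    """Return the task name for export without K-V cache reuse."""
    if task.endswith(PAST_SUFFIX):
        return task[: -len(PAST_SUFFIX)]
    return task


def with_past(task: str) -> str:
    """Return the task name for export with K-V cache reuse."""
    if task.endswith(PAST_SUFFIX):
        return task
    return task + PAST_SUFFIX


for task in ("text-generation", "text2text-generation", "automatic-speech-recognition"):
    print(f"{with_past(task)} | {without_past(task)}")
```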
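Diffusion models
When Stable Diffusion models are exported to the OpenVINO format, they are decomposed into separate components that are later combined during inference:
- Text encoder
- U-Net
- VAE encoder
- VAE decoder
To export your Stable Diffusion XL model to the OpenVINO IR format with the CLI, run: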
optimum-cli export openvino --model stabilityai/stable-diffusion-xl-base-1.0 ov_sdxl/
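When loading your model
You can also load your PyTorch checkpoint and convert it to the OpenVINO format on the fly by setting export=True when loading your model.
To easily save the resulting model, you can use the save_pretrained() method, which saves both the BIN and XML files describing the graph. It is useful to save the tokenizer to the same directory so that you can easily load the tokenizer for your model.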
- from transformers import AutoModelForCausalLM
+ from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
model_id = "meta-llama/Meta-Llama-3-8B"
- model = AutoModelForCausalLM.from_pretrained(model_id)
+ model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
save_directory = "ov_model"
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
After loading your model
from transformers import AutoModelForCausalLM
from optimum.exporters.openvino import export_from_model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
export_from_model(model, output="ov_model", task="text-generation-with-past")
Once the model is exported, you can load your OpenVINO model by replacing the AutoModelForXxx class with the corresponding OVModelForXxx class.
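As an illustration of that class swap, the following sketch loads an exported causal language model and generates a short continuation. It assumes the tokenizer was saved to the same directory as the model, as in the save_pretrained() example above; load_and_generate is a hypothetical helper, and the imports are deferred into the function so the definition itself does not require optimum-intel to be installed.

```python
def load_and_generate(model_dir: str, prompt: str, max_new_tokens: int = 20) -> str:
    """Load an exported OpenVINO causal LM from model_dir and complete the prompt."""
    # Deferred imports: only needed when the function is actually called.
    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    # Assumption: the tokenizer files live next to the exported model files.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # OVModelForCausalLM takes the place of AutoModelForCausalLM.
    model = OVModelForCausalLM.from_pretrained(model_dir)

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Calling load_and_generate("ov_model", "Hello") would then run inference on the directory produced by the export steps above.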