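Quickstart
At its core, 🤗 Optimum uses configuration objects to define the parameters for optimization on different accelerators. These objects are then used to instantiate dedicated optimizers, quantizers, and pruners.
Before applying quantization or optimization, we first need to export our model to the ONNX format.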
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> from transformers import AutoTokenizer
>>> model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_directory = "tmp/onnx/"
>>> # Load a model from transformers and export it to ONNX
>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
>>> # Save the onnx model and tokenizer
>>> ort_model.save_pretrained(save_directory)
>>> tokenizer.save_pretrained(save_directory)
Now let's see how to apply dynamic quantization with ONNX Runtime:
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig
>>> from optimum.onnxruntime import ORTQuantizer
>>> # Define the quantization methodology
>>> qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
>>> quantizer = ORTQuantizer.from_pretrained(ort_model)
>>> # Apply dynamic quantization on the model
>>> quantizer.quantize(save_dir=save_directory, quantization_config=qconfig)
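In this example, we quantized a model from the Hugging Face Hub, but it could also be a path to a local model directory. The result of applying the quantize() method is a model_quantized.onnx file that can be used to run inference. Here is an example of how to load an ONNX Runtime model and generate predictions with it: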
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> from transformers import pipeline, AutoTokenizer
>>> model = ORTModelForSequenceClassification.from_pretrained(save_directory, file_name="model_quantized.onnx")
>>> tokenizer = AutoTokenizer.from_pretrained(save_directory)
>>> cls_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> results = cls_pipeline("I love burritos!")
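To give a rough numerical intuition for the dynamic quantization applied earlier, the sketch below maps a list of floats to 8-bit integers using a scale and zero point computed on the fly from the values themselves, then maps them back. It is a pure-Python illustration under simplifying assumptions (unsigned 8-bit, one range per tensor). The helper names quantize_dynamic and dequantize are hypothetical and are not part of 🤗 Optimum or ONNX Runtime, whose kernels are considerably more involved.

```python
def quantize_dynamic(values, num_bits=8):
    """Quantize floats to unsigned integers using a range computed from the data itself."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Include 0.0 in the range so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    # Guard against an all-zero input, which would otherwise give a zero scale.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


# Made-up values standing in for one tensor seen at inference time.
activations = [-1.5, -0.2, 0.0, 0.7, 2.5]
q, scale, zero_point = quantize_dynamic(activations)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - r) for a, r in zip(activations, restored))
print(q, scale, zero_point, max_error)
```

The round trip is lossy: each restored value differs from the original by at most about one quantization step, which is the accuracy cost paid for the smaller integer representation.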
Similarly, you can apply static quantization by setting is_static to True when instantiating the QuantizationConfig object:
>>> qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=False)
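Static quantization relies on feeding batches of data through the model to estimate the activation quantization parameters ahead of inference time. To support this, 🤗 Optimum allows you to provide a calibration dataset. The calibration dataset can be a simple Dataset object from the 🤗 Datasets library, or any dataset hosted on the Hugging Face Hub. For this example, we pick the sst2 dataset that the model was originally trained on: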
>>> from functools import partial
>>> from optimum.onnxruntime.configuration import AutoCalibrationConfig
>>> # Define the processing function to apply to each example after loading the dataset
>>> def preprocess_fn(ex, tokenizer):
...     return tokenizer(ex["sentence"])
>>> # Create the calibration dataset
>>> calibration_dataset = quantizer.get_calibration_dataset(
...     "glue",
...     dataset_config_name="sst2",
...     preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
...     num_samples=50,
...     dataset_split="train",
... )
>>> # Create the calibration configuration containing the parameters related to calibration.
>>> calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
>>> # Perform the calibration step: computes the activations quantization ranges
>>> ranges = quantizer.fit(
...     dataset=calibration_dataset,
...     calibration_config=calibration_config,
...     operators_to_quantize=qconfig.operators_to_quantize,
... )
>>> # Apply static quantization on the model
>>> quantizer.quantize(
...     save_dir=save_directory,
...     calibration_tensors_range=ranges,
...     quantization_config=qconfig,
... )
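The calibration step can be pictured as tracking the smallest and largest activation values seen over a few batches, then freezing that range into a fixed scale and zero point. The following is a minimal pure-Python sketch of that idea under simplifying assumptions (unsigned 8-bit, one range per tensor, made-up data). MinMaxObserver and its methods are hypothetical names used for illustration and do not mirror how quantizer.fit or AutoCalibrationConfig.minmax are implemented.

```python
class MinMaxObserver:
    """Track the running min and max of activations across calibration batches."""

    def __init__(self):
        self.min_value = float("inf")
        self.max_value = float("-inf")

    def observe(self, batch):
        # Widen the recorded range if this batch contains more extreme values.
        self.min_value = min(self.min_value, min(batch))
        self.max_value = max(self.max_value, max(batch))

    def quantization_params(self, num_bits=8):
        # Derive a fixed scale and zero point from the observed range.
        qmax = 2 ** num_bits - 1
        lo = min(self.min_value, 0.0)
        hi = max(self.max_value, 0.0)
        scale = (hi - lo) / qmax or 1.0
        zero_point = round(-lo / scale)
        return scale, zero_point


# Made-up activation batches standing in for real calibration data.
calibration_batches = [
    [-0.8, 0.1, 1.9],
    [-2.0, 0.4, 0.6],
    [-0.3, 3.1, 0.0],
]

observer = MinMaxObserver()
for batch in calibration_batches:
    observer.observe(batch)

scale, zero_point = observer.quantization_params()
print(observer.min_value, observer.max_value, scale, zero_point)
```

Unlike the dynamic case, the range here is computed once from calibration data and reused for every later input, which is why the calibration samples should resemble the data seen at inference time.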
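As a final example, let's look at applying graph optimization techniques such as operator fusion and constant folding. As before, we load a configuration object, but this time we set the optimization level instead of the quantization approach: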
>>> from optimum.onnxruntime.configuration import OptimizationConfig
>>> # Here the optimization level is set to 1, enabling basic optimizations such as
>>> # redundant node elimination and constant folding. Higher optimization levels
>>> # result in a hardware-dependent optimized graph.
>>> optimization_config = OptimizationConfig(optimization_level=1)
Next, we load an optimizer to apply these optimizations to our model:
>>> from optimum.onnxruntime import ORTOptimizer
>>> optimizer = ORTOptimizer.from_pretrained(ort_model)
>>> # Optimize the model
>>> optimizer.optimize(save_dir=save_directory, optimization_config=optimization_config)
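As a rough illustration of what constant folding means, the sketch below uses Python's standard ast module to precompute sub-expressions made only of numeric constants, while leaving anything that depends on an input untouched. This is an analogy on Python expressions, not on ONNX graphs. ConstantFolder and fold are hypothetical names, and the graph optimizer in ONNX Runtime works on ONNX nodes with many more rules than this toy version.

```python
import ast
import operator

# Binary operators this toy folder knows how to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}


def _is_number(node):
    """True if the node is a literal int or float constant."""
    return isinstance(node, ast.Constant) and isinstance(node.value, (int, float))


class ConstantFolder(ast.NodeTransformer):
    """Replace binary operations on two numeric constants with their computed value."""

    def visit_BinOp(self, node):
        # Fold children first so nested constant expressions collapse bottom-up.
        self.generic_visit(node)
        op = _OPS.get(type(node.op))
        if op is not None and _is_number(node.left) and _is_number(node.right):
            folded = ast.Constant(value=op(node.left.value, node.right.value))
            return ast.copy_location(folded, node)
        return node


def fold(expression):
    """Return the expression source with constant sub-expressions precomputed."""
    tree = ast.parse(expression, mode="eval")
    tree = ConstantFolder().visit(tree)
    return ast.unparse(ast.fix_missing_locations(tree))


print(fold("x * (2 + 3) + 4 * 5"))
```

In this expression, the parts 2 + 3 and 4 * 5 are computed once ahead of time, while the multiplication by x stays in place because its value is only known at run time. The sketch needs Python 3.9 or later for ast.unparse.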
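That's it: the model is now optimized and ready for inference! As you can see, the process is similar in each case:
- Define the optimization / quantization strategy via an OptimizationConfig / QuantizationConfig object
- Instantiate an ORTQuantizer or ORTOptimizer class
- Apply the quantize() or optimize() method
- Run inference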
Check out the examples directory for more sophisticated usage.
Happy optimizing 🤗!