Accelerated Inference with Optimum and Transformers Pipelines

Published May 10, 2022

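Optimum now supports inference with Hugging Face Transformers pipelines, including text generation using ONNX Runtime.

The adoption of BERT and Transformers continues to grow. Transformer-based models now achieve state-of-the-art performance not only in natural language processing but also in computer vision, speech, and time series. 💬 🖼 🎤 ⏳

Companies are now moving from the experimentation and research phase to the production phase in order to use Transformer models for large-scale workloads. But by default, BERT and its friends are relatively slow, big, and complex compared to traditional machine learning algorithms.

To solve this challenge, we created Optimum, an extension of Hugging Face Transformers designed to accelerate the training and inference of Transformer models like BERT.

In this blog post, you will learn:

1. What is Optimum? An ELI5 (explain like I'm five)
2. New Optimum inference and pipeline features
3. End-to-end tutorial on accelerating RoBERTa for question answering, including quantization and optimization
4. Current limitations
5. Optimum inference FAQ
6. What's next?

Let's get started! 🚀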

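1. What is Optimum? An ELI5 (explain like I'm five)

Hugging Face Optimum is an open-source library and an extension of Hugging Face Transformers. It provides a unified API of performance optimization tools to achieve maximum efficiency when training and running models on accelerated hardware, including toolkits for optimized performance on Graphcore IPU and Habana Gaudi. Optimum can be used for accelerated training, quantization, graph optimization, and now also inference with support for transformers pipelines.

2. New Optimum inference and pipeline features

With the release of Optimum 1.2, we are adding support for inference and transformers pipelines. This allows Optimum users to keep the transformers API they are used to, combined with the power of accelerated runtimes like ONNX Runtime.

Switching from Transformers to Optimum inference

The Optimum inference models are API compatible with Hugging Face Transformers models. This means you can simply replace your AutoModelForXxx class with the corresponding ORTModelForXxx class in Optimum. For example, this is how you can use a question answering model in Optimum: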

from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import ORTModelForQuestionAnswering

-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2") # pytorch checkpoint
+model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2") # onnx checkpoint
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

optimum_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = optimum_qa(question, context)

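In this first release, we added support for ONNX Runtime, with more to come in the future! These new ORTModelForXX classes can now be used with transformers pipelines. They are also fully integrated into the Hugging Face Hub, so you can push and pull optimized checkpoints from the community. In addition, you can use ORTQuantizer and ORTOptimizer to first quantize and optimize your model and then run inference on it. See the end-to-end tutorial on accelerating RoBERTa for question answering, including quantization and optimization, for more details.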

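3. End-to-end tutorial on accelerating RoBERTa for question answering, including quantization and optimization

In this end-to-end tutorial on accelerating RoBERTa for question answering, you will learn how to:

1. Install Optimum for ONNX Runtime
2. Convert a Hugging Face Transformers model to ONNX for inference
3. Use the ORTOptimizer to optimize the model
4. Use the ORTQuantizer to apply dynamic quantization
5. Run accelerated inference using Transformers pipelines
6. Evaluate the performance and speed

Let's get started 🚀

This tutorial was created and run on an m5.xlarge AWS EC2 instance.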

3.1 Install `Optimum` for ONNX Runtime

Our first step is to install Optimum with the onnxruntime utilities.

pip install "optimum[onnxruntime]==1.2.0"

This will install all required packages for us, including transformers, torch, and onnxruntime. If you are going to use a GPU, you can install optimum with pip install optimum[onnxruntime-gpu].

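3.2 Convert a Hugging Face `Transformers` model to ONNX for inference

Before we can start optimizing, we need to convert our vanilla transformers model to the ONNX format. To do this, we will use the new ORTModelForQuestionAnswering class and call the from_pretrained() method with the from_transformers argument. The model we are using is deepset/roberta-base-squad2, a RoBERTa model fine-tuned on the SQuAD2 dataset that achieves an F1 score of 82.91, with question-answering as its task (feature).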

from pathlib import Path
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

model_id = "deepset/roberta-base-squad2"
onnx_path = Path("onnx")
task = "question-answering"

# load vanilla transformers and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

# test the model using the transformers pipeline, with handle_impossible_answer for squad_v2
optimum_qa = pipeline(task, model=model, tokenizer=tokenizer, handle_impossible_answer=True)
prediction = optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")

print(prediction)
# {'score': 0.9041663408279419, 'start': 11, 'end': 18, 'answer': 'Philipp'}

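We successfully converted our vanilla transformers model to ONNX and ran a first prediction with transformers.pipelines. Now let's optimize it. 🏎

If you want to learn more about exporting transformers models, check out the documentation: Export 🤗 Transformers models

3.3 Use the `ORTOptimizer` to optimize the model

After saving our ONNX checkpoint to onnx/, we can now use the ORTOptimizer to apply graph optimizations, such as operator fusion and constant folding, to reduce latency and speed up inference.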

from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# create ORTOptimizer and define optimization configuration
optimizer = ORTOptimizer.from_pretrained(model_id, feature=task)
optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations

# apply the optimization configuration to the model
optimizer.export(
    onnx_model_path=onnx_path / "model.onnx",
    onnx_optimized_model_output_path=onnx_path / "model-optimized.onnx",
    optimization_config=optimization_config,
)
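
To build intuition for what a graph optimization such as constant folding does, here is a toy sketch on a tiny expression tree. The tuple-based graph format and the `fold_constants` and `count_ops` helpers are made up for this illustration; ONNX Runtime operates on real ONNX graphs and its optimization passes are far more involved.

```python
# Toy expression graph: a node is a number (constant), a string (runtime input),
# or a tuple (op, left, right). This is NOT how ONNX Runtime is implemented.

def fold_constants(node):
    """Recursively replace sub-expressions whose operands are all constants."""
    if not isinstance(node, tuple):
        return node  # constant or input name: nothing to fold
    op, left, right = node
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        # both operands are known ahead of time, so compute the result once
        return left + right if op == "add" else left * right
    return (op, left, right)

def count_ops(node):
    """Number of operations that would have to run at inference time."""
    if not isinstance(node, tuple):
        return 0
    return 1 + count_ops(node[1]) + count_ops(node[2])

# x * (2 * 3) + (4 + 1): two of the four ops depend only on constants
graph = ("add", ("mul", "x", ("mul", 2, 3)), ("add", 4, 1))
folded = fold_constants(graph)

print(folded)                                     # ('add', ('mul', 'x', 6), 5)
print(count_ops(graph), "->", count_ops(folded))  # 4 -> 2
```

The folded graph computes the same result for every `x` but runs fewer operations per call, which is where the latency gain comes from.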

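To test performance, we can use the ORTModelForQuestionAnswering class again and provide an additional file_name parameter to load our optimized model. (This also works for models available on the Hub.)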

from optimum.onnxruntime import ORTModelForQuestionAnswering

# load optimized model
opt_model = ORTModelForQuestionAnswering.from_pretrained(onnx_path, file_name="model-optimized.onnx")

# test the optimized model using the transformers pipeline
opt_optimum_qa = pipeline(task, model=opt_model, tokenizer=tokenizer, handle_impossible_answer=True)
prediction = opt_optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")
print(prediction)
# {'score': 0.9041663408279419, 'start': 11, 'end': 18, 'answer': 'Philipp'}

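We will evaluate the performance changes in detail in step 3.6, Evaluate the performance and speed.

3.4 Use the `ORTQuantizer` to apply dynamic quantization

After we have optimized our model, we can accelerate it even more by quantizing it with the ORTQuantizer. The ORTQuantizer can be used to apply dynamic quantization to decrease the size of the model and to reduce latency and speed up inference.

We use avx512_vnni since the instance is powered by an Intel Cascade Lake CPU, which supports AVX-512.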

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# create ORTQuantizer and define quantization configuration
quantizer = ORTQuantizer.from_pretrained(model_id, feature=task)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)

# apply the quantization configuration to the model
quantizer.export(
    onnx_model_path=onnx_path / "model-optimized.onnx",
    onnx_quantized_model_output_path=onnx_path / "model-quantized.onnx",
    quantization_config=qconfig,
)
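
As a rough illustration of the idea behind int8 quantization, the sketch below maps a few float values to 8-bit integers with a single scale and then restores them. It is a minimal sketch assuming symmetric per-tensor quantization; `quantize_int8` and `dequantize` are hypothetical helpers and do not reflect what ORTQuantizer does internally. In dynamic quantization, scales for activations are computed like this at inference time rather than calibrated in advance.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Approximate the original floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.034, 1.0]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)  # [50, -127, 3, 100]
print(f"scale={scale:.4f} max_error={max_error:.4f}")
```

Each value now needs one byte instead of four, at the cost of a rounding error of at most half a quantization step. That rounding error is why accuracy should be re-evaluated after quantizing.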

We can now compare the size of this model as well as some latency numbers.

import os
# get model file size
size = os.path.getsize(onnx_path / "model.onnx")/(1024*1024)
print(f"Vanilla Onnx Model file size: {size:.2f} MB")
size = os.path.getsize(onnx_path / "model-quantized.onnx")/(1024*1024)
print(f"Quantized Onnx Model file size: {size:.2f} MB")

# Vanilla Onnx Model file size: 473.31 MB
# Quantized Onnx Model file size: 291.77 MB
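
The relative saving follows directly from the two file sizes printed above. The snippet below is just that arithmetic, using the reported numbers as inputs.

```python
vanilla_mb = 473.31    # size of model.onnx reported above
quantized_mb = 291.77  # size of model-quantized.onnx reported above

reduction = (vanilla_mb - quantized_mb) / vanilla_mb
ratio = vanilla_mb / quantized_mb

print(f"size reduction: {reduction:.1%}")   # size reduction: 38.4%
print(f"compression ratio: {ratio:.2f}x")   # compression ratio: 1.62x
```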

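We have reduced the size of our model from 473MB to 291MB, a reduction of roughly 38%. To run inference, we can use the ORTModelForQuestionAnswering class again and provide an additional file_name parameter to load our quantized model. (This also works for models available on the Hub.)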

# load quantized model
quantized_model = ORTModelForQuestionAnswering.from_pretrained(onnx_path, file_name="model-quantized.onnx")

# test the quantized model using the transformers pipeline
quantized_optimum_qa = pipeline(task, model=quantized_model, tokenizer=tokenizer, handle_impossible_answer=True)
prediction = quantized_optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")
print(prediction)
# {'score': 0.9246969819068909, 'start': 11, 'end': 18, 'answer': 'Philipp'}

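Nice! The model predicted the same answer.

3.5 Run accelerated inference using Transformers pipelines

Optimum has built-in support for transformers pipelines. This lets us use the same API we know from working with PyTorch and TensorFlow models. We have already used this feature in steps 3.2, 3.3, and 3.4 to test our converted and optimized models. At the time of writing, we support ONNX Runtime, with more runtimes to come in the future. Below is an example of how to use the transformers pipelines.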

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained(onnx_path)
model = ORTModelForQuestionAnswering.from_pretrained(onnx_path)

optimum_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
prediction = optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")

print(prediction)
# {'score': 0.9041663408279419, 'start': 11, 'end': 18, 'answer': 'Philipp'}

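In addition to this, we added a pipelines API to Optimum to guarantee more safety for your accelerated models. This means that if you try to use optimum.pipelines with an unsupported model or task, you will see an error. You can use optimum.pipelines as a replacement for transformers.pipelines.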

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering
from optimum.pipelines import pipeline

tokenizer = AutoTokenizer.from_pretrained(onnx_path)
model = ORTModelForQuestionAnswering.from_pretrained(onnx_path)

optimum_qa = pipeline("question-answering", model=model, tokenizer=tokenizer, handle_impossible_answer=True)
prediction = optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")

print(prediction)
# {'score': 0.9041663408279419, 'start': 11, 'end': 18, 'answer': 'Philipp'}

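3.6 Evaluate the performance and speed

During this end-to-end tutorial on accelerating RoBERTa for question answering, including quantization and optimization, we created 3 different models: a vanilla converted model, an optimized model, and a quantized model.

As the last step of the tutorial, we want to take a detailed look at the performance and accuracy of our models. Applying optimization techniques such as graph optimization or quantization affects not only performance (latency) but can also have an impact on the accuracy of the model. So accelerating your model comes with a trade-off.

Let's evaluate our models. Our transformers model deepset/roberta-base-squad2 was fine-tuned on the SQuAD2 dataset. This will be the dataset we use to evaluate our models.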

from datasets import load_metric,load_dataset

metric = load_metric("squad_v2")
dataset = load_dataset("squad_v2")["validation"]

print(f"length of dataset {len(dataset)}")
#length of dataset 11873

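We can now use the map function of datasets to iterate over the validation set of SQuAD 2 and run a prediction for each data point. For this, we write an evaluate helper method that uses our pipelines and applies some transformations to work with the squad_v2 metric.

This can take quite a while (1.5h).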

def evaluate(example):
  default = optimum_qa(question=example["question"], context=example["context"])
  optimized = opt_optimum_qa(question=example["question"], context=example["context"])
  quantized = quantized_optimum_qa(question=example["question"], context=example["context"])
  return {
      'reference': {'id': example['id'], 'answers': example['answers']},
      'default': {'id': example['id'],'prediction_text': default['answer'], 'no_answer_probability': 0.},
      'optimized': {'id': example['id'],'prediction_text': optimized['answer'], 'no_answer_probability': 0.},
      'quantized': {'id': example['id'],'prediction_text': quantized['answer'], 'no_answer_probability': 0.},
      }

result = dataset.map(evaluate)
# Uncomment the next line to run the evaluation on a 2,000-example subset of the dataset instead
# result = dataset.shuffle().select(range(2000)).map(evaluate)
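
For intuition about the two numbers the squad_v2 metric reports, here is a simplified sketch of SQuAD-style exact match and token-level F1 for a single prediction. The helper names are ours, and this is not the metric's actual implementation: the real metric also handles multiple reference answers, no-answer thresholds, and averaging over the dataset.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # an empty prediction only scores when the reference is empty too
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Philipp", "philipp."))                 # 1.0
print(round(f1_score("in Nuremberg", "Nuremberg"), 3))    # 0.667
```

Exact match is all-or-nothing, while F1 gives partial credit when the predicted span overlaps the reference, which is why the F1 numbers below are higher than the exact-match numbers.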

Now let's compare the results.

default_acc = metric.compute(predictions=result["default"], references=result["reference"])
optimized = metric.compute(predictions=result["optimized"], references=result["reference"])
quantized = metric.compute(predictions=result["quantized"], references=result["reference"])

print(f"vanilla model: exact={default_acc['exact']}% f1={default_acc['f1']}%")
print(f"optimized model: exact={optimized['exact']}% f1={optimized['f1']}%")
print(f"quantized model: exact={quantized['exact']}% f1={quantized['f1']}%")

# vanilla model: exact=79.07858165585783% f1=82.14970024570314%
# optimized model: exact=79.07858165585783% f1=82.14970024570314%
# quantized model: exact=78.75010528088941% f1=81.82526107204629%

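Our optimized and quantized model achieves an exact match of 78.75% and an F1 score of 81.83%, which is 99.61% of the original F1 score. Retaining more than 99% of the original model's accuracy is very good, especially since we used dynamic quantization.

Okay, let's test the performance (latency) of our optimized and quantized model.

But first, let's extend our context and question to a more meaningful sequence length of 128.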
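
The retained share can be recomputed from the metric values printed above. The snippet only repeats that arithmetic for both metrics.

```python
# values copied from the evaluation output above
vanilla = {"exact": 79.07858165585783, "f1": 82.14970024570314}
quantized = {"exact": 78.75010528088941, "f1": 81.82526107204629}

for name in ("exact", "f1"):
    retained = quantized[name] / vanilla[name]
    drop = vanilla[name] - quantized[name]
    print(f"{name}: retained {retained:.2%}, absolute drop {drop:.2f} points")

# exact: retained 99.58%, absolute drop 0.33 points
# f1: retained 99.61%, absolute drop 0.32 points
```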

context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question="As what is Philipp working?"

To keep it simple, we will use a Python loop to calculate the mean latency of our vanilla model and of the optimized and quantized model.

from time import perf_counter
import numpy as np

def measure_latency(pipe):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(question=question, context=context)
    # Timed run
    for _ in range(100):
        start_time = perf_counter()
        _ =  pipe(question=question, context=context)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(f"Vanilla model {measure_latency(optimum_qa)}")
print(f"Optimized & Quantized model {measure_latency(quantized_optimum_qa)}")

# Vanilla model Average latency (ms) - 117.61 +/- 8.48
# Optimized & Quantized model Average latency (ms) - 64.94 +/- 3.65

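We managed to reduce our model latency from 117.61ms to 64.94ms, a speedup of roughly 1.8x, while keeping 99.61% of the original F1 score. Keep in mind that we used a mid-range CPU instance with 2 physical cores. By switching to a GPU or a more performant CPU instance, e.g. one powered by Ice Lake, you can bring the latency down to a few milliseconds.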
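
Mean and standard deviation can hide occasional slow calls, so it can help to also look at percentiles. Below is a minimal sketch using only the standard library; `time_calls`, `summarize`, and `speedup` are hypothetical helpers, and the lambda is a dummy workload standing in for a pipeline call.

```python
from statistics import mean, quantiles
from time import perf_counter

def time_calls(fn, warmup=3, runs=20):
    """Return per-call latencies in milliseconds for a zero-argument callable."""
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(runs):
        start = perf_counter()
        fn()
        latencies.append(1000 * (perf_counter() - start))
    return latencies

def summarize(latencies_ms):
    """Mean plus median and 95th percentile of the measured latencies."""
    cuts = quantiles(latencies_ms, n=100)  # 99 cut points: cuts[49] is p50, cuts[94] is p95
    return {"mean": mean(latencies_ms), "p50": cuts[49], "p95": cuts[94]}

def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms

stats = summarize(time_calls(lambda: sum(range(10_000))))  # dummy workload
print({name: round(value, 3) for name, value in stats.items()})
print(f"{speedup(117.61, 64.94):.2f}x")  # 1.81x, from the averages reported above
```

To time a real pipeline, you could pass something like `lambda: quantized_optimum_qa(question=question, context=context)` as `fn`.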

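4. Current limitations

We have only just started supporting inference in https://github.com/huggingface/optimum, so we would also like to share the current limitations. All of them are on our roadmap and will be resolved in the near future.

Remote models larger than 2GB: Currently, only models smaller than 2GB can be loaded from the Hugging Face Hub. We are working on adding support for models larger than 2GB and for multi-file models.
Seq2Seq tasks/models: We do not yet support seq2seq tasks such as summarization, or models like T5, mostly due to the limited support for single-model export. We are actively working to solve this and give you the same experience you are familiar with in transformers.
Past key values: Generation models like GPT-2 use something called past key values, which are precomputed key-value pairs of the attention blocks and can be used to speed up decoding. Currently, ORTModelForCausalLM does not use past key values.
No cache: Currently, when an optimized model (*.onnx) is loaded, it is not cached locally.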
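
To illustrate why past key values matter for decoding speed, the toy sketch below counts how many key/value projections a decoder would compute with and without a cache. It is a simplified counting model with a hypothetical `generate` function; no attention is actually computed, and it does not reflect how ORTModelForCausalLM or GPT-2 are implemented.

```python
def generate(prompt_len, new_tokens, use_cache):
    """Count key/value projections needed to generate `new_tokens` tokens."""
    cache = []        # stands in for stored (key, value) pairs, one per token
    projections = 0   # how many key/value projections had to be computed
    seq_len = prompt_len
    for _ in range(new_tokens):
        if not use_cache:
            cache = []  # nothing is kept: recompute for the whole sequence
        missing = seq_len - len(cache)
        cache.extend(["kv"] * missing)
        projections += missing
        seq_len += 1    # the newly generated token joins the sequence
    return projections

print(generate(prompt_len=10, new_tokens=5, use_cache=False))  # 60
print(generate(prompt_len=10, new_tokens=5, use_cache=True))   # 14
```

Without a cache the work grows quadratically with the number of generated tokens, while with a cache each step only processes the newest token.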

5. Optimum inference FAQ

Which tasks are supported?

You can find a list of all supported tasks in the documentation. Currently supported pipelines tasks are `feature-extraction`, `text-classification`, `token-classification`, `question-answering`, `zero-shot-classification`, and `text-generation`.
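
If you want to fail fast in your own code before building a pipeline, a small guard based on the task list above could look like the sketch below. `SUPPORTED_TASKS` and `check_task` are hypothetical names, not part of Optimum, and the set only mirrors the tasks listed at the time of writing.

```python
# Tasks listed above as supported at the time of writing (may change in later releases)
SUPPORTED_TASKS = {
    "feature-extraction",
    "text-classification",
    "token-classification",
    "question-answering",
    "zero-shot-classification",
    "text-generation",
}

def check_task(task):
    """Hypothetical helper: raise early if the task is outside the list above."""
    if task not in SUPPORTED_TASKS:
        raise ValueError(
            f"Task {task!r} is not supported; choose one of {sorted(SUPPORTED_TASKS)}"
        )
    return task

print(check_task("question-answering"))  # question-answering
```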

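Which models are supported?

Any model that can be exported with transformers.onnx and has a supported task can be used. This includes, among others, BERT, ALBERT, GPT2, RoBERTa, XLM-RoBERTa, and DistilBERT.

Which runtimes are supported?

Currently, ONNX Runtime is supported. We are working on adding more in the future. Let us know if you are interested in a specific runtime.

How can I use Optimum with Transformers?

You can find examples and instructions in our documentation.

How can I use GPUs?

To use GPUs, you simply need to install optimum[onnxruntime-gpu], which will install the required GPU providers and use them by default.

How can I use a quantized and optimized model with pipelines?

You can load the optimized or quantized model with the from_pretrained method of the new ORTModelForXXX classes. You can learn more about it in our documentation.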

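6. What's next?

What's next for Optimum, you ask? A lot of things. We are focused on making Optimum the reference open-source toolkit for accelerating and optimizing transformers. To achieve this, we will resolve the current limitations, improve the documentation, create more content and examples, and push the limits of accelerating and optimizing transformers.

Besides the current limitations, some important features on the Optimum roadmap include:

Support for speech models (Wav2vec2) and speech tasks (automatic speech recognition)
Support for vision models (ViT) and vision tasks (image classification)
Improved performance by adding support for OrtValue and IOBinding
Easier ways to evaluate accelerated models
Support for other runtimes and providers such as TensorRT and AWS-Neuron

Thanks for reading! If you are as excited as I am about accelerating Transformers, making them efficient, and scaling them to billions of requests, you should apply: we are hiring. 🚀

If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.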
