Faster TensorFlow models in Hugging Face Transformers

Published January 26, 2021

In the last few months, the Hugging Face team has been working hard on improving Transformers' TensorFlow models to make them more robust and faster. The recent improvements focus mainly on two aspects:

  1. Computational performance: BERT, RoBERTa, ELECTRA and MPNet have been improved to significantly reduce computation time. This performance gain is noticeable for all computational aspects: graph/eager mode, TF Serving, and CPU/GPU/TPU devices.
  2. TensorFlow Serving: each of these TensorFlow models can be deployed with TensorFlow Serving to benefit from this computational performance gain at inference time.

Computational performance

To demonstrate the computational performance gains, we ran a comprehensive benchmark comparing BERT's performance with TensorFlow Serving in v4.2.0 against the official Google implementation. The benchmark was run on a V100 GPU with a sequence length of 128 (times are in milliseconds):

Batch size | Google implementation | v4.2.0 implementation | Relative difference Google/v4.2.0
1          | 6.7                   | 6.26                  | 6.79%
2          | 9.4                   | 8.68                  | 7.96%
4          | 14.4                  | 13.1                  | 9.45%
8          | 24                    | 21.5                  | 10.99%
16         | 46.6                  | 42.3                  | 9.67%
32         | 83.9                  | 80.4                  | 4.26%
64         | 171.5                 | 156                   | 9.47%
128        | 338.5                 | 309                   | 9.11%

The current Bert implementation in v4.2.0 is about 10% faster than Google's implementation. On top of that, it is also twice as fast as the implementation in the 4.1.1 release.
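
The exact benchmarking harness is not reproduced here, but the kind of measurement is easy to approximate. Below is a minimal, hypothetical sketch that times the compiled forward pass of TFBertForSequenceClassification in graph mode for a single batch size and a sequence length of 128. Keep in mind that the numbers in the table above were measured through TensorFlow Serving on a V100, so a local sketch like this will not match them exactly.

import time

import tensorflow as tf
from transformers import BertTokenizerFast, TFBertForSequenceClassification

# Hypothetical local timing sketch, not the harness used for the table above
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = TFBertForSequenceClassification.from_pretrained("bert-base-cased")

batch_size, seq_len = 8, 128
inputs = tokenizer(
    ["I love the new TensorFlow update in transformers."] * batch_size,
    padding="max_length",
    max_length=seq_len,
    return_tensors="tf",
)

# Wrap the forward pass in a tf.function to benchmark graph mode
@tf.function
def forward(input_ids, attention_mask, token_type_ids):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    # outputs[0] is the logits tensor, whether the model returns a tuple or a ModelOutput
    return outputs[0]

# Warmup call to exclude tracing/compilation time from the measurement
forward(**inputs).numpy()

runs = 100
start = time.perf_counter()
for _ in range(runs):
    # .numpy() forces the result back to the host so the timing includes the full execution
    forward(**inputs).numpy()
print(f"Average latency: {(time.perf_counter() - start) / runs * 1000:.2f} ms")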

TensorFlow Serving

The previous section showed that the brand-new Bert model got a dramatic computational performance boost in the latest release of Transformers. In this section, we will show you step by step how to deploy a Bert model with TensorFlow Serving in order to benefit from this performance gain in a production environment.

What is TensorFlow Serving?

TensorFlow Serving belongs to the set of tools provided by TensorFlow Extended (TFX) that makes deploying a model to a server easier than ever. TensorFlow Serving provides two APIs: one that can be called over HTTP requests and another one that uses gRPC to run inference on the server.

What is a SavedModel?

A SavedModel contains a standalone TensorFlow model, including its weights and its architecture. It does not require the original source code of the model to run, which makes it well suited for sharing with, or deploying on, any backend that supports reading a SavedModel, such as Java, Go, C++ or JavaScript. The internal structure of a SavedModel is represented as follows:

savedmodel
    /assets
        -> the assets needed by the model (if any)
    /variables
        -> the model checkpoints that contain the weights
    saved_model.pb -> protobuf file representing the model graph

How to install TensorFlow Serving?

There are three ways to install and use TensorFlow Serving:

  • through a Docker container,
  • through an apt package,
  • or using pip.

To keep things simple and compatible with all existing operating systems, we will use Docker in this tutorial.

How to create a SavedModel?

A SavedModel is the format expected by TensorFlow Serving. Since Transformers v4.2.0, creating a SavedModel comes with three additional features:

  1. The sequence length can be modified freely between runs.
  2. All model inputs are available for inference.
  3. The hidden states and the attentions are now each grouped into a single output when returned with output_hidden_states=True and output_attentions=True (as illustrated in the sketch below).
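
As an illustration of the third point, here is a sketch of how a SavedModel exposing the grouped attentions could be created; bert-base-cased is only a placeholder checkpoint here, and the attentions show up in the signature only when the config enables them:

from transformers import TFBertForSequenceClassification

# Sketch: enable output_attentions in the config so that the exported SavedModel
# groups the attention tensors of all layers into a single "attentions" output
model = TFBertForSequenceClassification.from_pretrained("bert-base-cased", output_attentions=True)

# saved_model=True writes a SavedModel next to the h5 weights, under my_model/saved_model/
model.save_pretrained("my_model", saved_model=True)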

Below, you can find the input and output representations of a TFBertForSequenceClassification saved as a TensorFlow SavedModel:

The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_attention_mask:0
  inputs['input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_input_ids:0
  inputs['token_type_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_token_type_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['attentions'] tensor_info:
      dtype: DT_FLOAT
      shape: (12, -1, 12, -1, -1)
      name: StatefulPartitionedCall:0
  outputs['logits'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 2)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
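
A representation like the one above can be printed with TensorFlow's saved_model_cli tool. It can also be checked from Python; here is a minimal sketch, assuming the SavedModel was written under my_model/saved_model/1 (the versioned folder created by save_pretrained):

import tensorflow as tf

# Sketch: load the SavedModel and inspect its default serving signature
loaded = tf.saved_model.load("my_model/saved_model/1")
serving_fn = loaded.signatures["serving_default"]

# Print the structured inputs and outputs of the signature (names, dtypes and shapes)
print(serving_fn.structured_input_signature)
print(serving_fn.structured_outputs)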

To pass inputs_embeds (the token embeddings) instead of input_ids (the token IDs) as input, we need to subclass the model in order to get a new serving signature. The following snippet shows how to do so:

from transformers import TFBertForSequenceClassification
import tensorflow as tf

# Creation of a subclass in order to define a new serving signature
class MyOwnModel(TFBertForSequenceClassification):
    # Decorate the serving method with the new input_signature
    # an input_signature represents the name, the data type and the shape of an expected input
    @tf.function(input_signature=[{
        "inputs_embeds": tf.TensorSpec((None, None, 768), tf.float32, name="inputs_embeds"),
        "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
        "token_type_ids": tf.TensorSpec((None, None), tf.int32, name="token_type_ids"),
    }])
    def serving(self, inputs):
        # call the model to process the inputs
        output = self.call(inputs)

        # return the formatted output
        return self.serving_output(output)

# Instantiate the model with the new serving method
model = MyOwnModel.from_pretrained("bert-base-cased")
# save it with saved_model=True in order to have a SavedModel version along with the h5 weights.
model.save_pretrained("my_model", saved_model=True)

The serving method has to be overridden with the new input_signature argument of the tf.function decorator. See the official documentation to know more about the input_signature argument. The serving method is used to define the behavior of the SavedModel when it is deployed with TensorFlow Serving. The SavedModel now looks as expected; see the new inputs_embeds input:

The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_attention_mask:0
  inputs['inputs_embeds'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, -1, 768)
      name: serving_default_inputs_embeds:0
  inputs['token_type_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_token_type_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['attentions'] tensor_info:
      dtype: DT_FLOAT
      shape: (12, -1, 12, -1, -1)
      name: StatefulPartitionedCall:0
  outputs['logits'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 2)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
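
Before deploying it, the new signature can also be exercised locally. Here is a minimal sanity-check sketch, assuming the subclassed model above was saved under my_model/saved_model/1, which calls serving_default with random embeddings just to verify the expected shapes:

import numpy as np
import tensorflow as tf

# Sketch: load the SavedModel produced by the subclassed model and call its
# serving signature with random embeddings of the expected hidden size (768)
loaded = tf.saved_model.load("my_model/saved_model/1")
serving_fn = loaded.signatures["serving_default"]

batch_size, seq_len = 1, 8
outputs = serving_fn(
    inputs_embeds=tf.constant(np.random.rand(batch_size, seq_len, 768), dtype=tf.float32),
    attention_mask=tf.ones((batch_size, seq_len), dtype=tf.int32),
    token_type_ids=tf.zeros((batch_size, seq_len), dtype=tf.int32),
)

# Print the output names and shapes, e.g. logits of shape (1, 2)
print({name: tensor.shape for name, tensor in outputs.items()})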

How to deploy and use a SavedModel?

Let's see step by step how to deploy and use a BERT model for sentiment classification.

Step 1

Create a SavedModel. To create a SavedModel, the Transformers library lets you load a PyTorch model called nateraw/bert-base-uncased-imdb, trained on the IMDB dataset, and convert it to a TensorFlow Keras model for you:

from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained("nateraw/bert-base-uncased-imdb", from_pt=True)
# the saved_model parameter is a flag to create a SavedModel version of the model at the same time as the h5 weights
model.save_pretrained("my_model", saved_model=True)

Step 2

Create and run a Docker container with the SavedModel inside. First, pull the TensorFlow Serving Docker image for CPU (for GPU, replace serving by serving:latest-gpu):

docker pull tensorflow/serving

Next, run a serving image as a daemon named serving_base:

docker run -d --name serving_base tensorflow/serving

Copy the newly created SavedModel into the serving_base container's models folder:

docker cp my_model/saved_model serving_base:/models/bert

Commit the container that serves the model by changing MODEL_NAME to match the model's name (here bert); the name (bert) corresponds to the name we want to give to our SavedModel:

docker commit --change "ENV MODEL_NAME bert" serving_base my_bert_model

Then kill the serving_base image run as a daemon, because we don't need it anymore:

docker kill serving_base

Finally, run our SavedModel image as a daemon, mapping port 8501 (REST API) and port 8500 (gRPC API) from the container to the host, and name the container bert:

docker run -d -p 8501:8501 -p 8500:8500 --name bert my_bert_model
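
Before sending predictions, you can check that the model is loaded and ready through TensorFlow Serving's model status API. A small sketch, assuming the container above is running locally:

import requests

# Query the model status endpoint exposed by TensorFlow Serving on the REST port
status = requests.get("http://localhost:8501/v1/models/bert")

# The response should report the model version with a state of AVAILABLE once the model is loaded
print(status.json())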

Step 3

Query the model through the REST API:

from transformers import BertTokenizerFast, BertConfig
import requests
import json
import numpy as np

sentence = "I love the new TensorFlow update in transformers."

# Load the corresponding tokenizer of our SavedModel
tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")

# Load the model config of our SavedModel
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")

# Tokenize the sentence
batch = tokenizer(sentence)

# Convert the batch into a proper dict
batch = dict(batch)

# Put the example into a list of size 1, that corresponds to the batch size
batch = [batch]

# The REST API needs a JSON that contains the key instances to declare the examples to process
input_data = {"instances": batch}

# Query the REST API, the path corresponds to http://host:port/model_version/models_root_folder/model_name:method
r = requests.post("http://localhost:8501/v1/models/bert:predict", data=json.dumps(input_data))

# Parse the JSON result. The results are contained in a list with a root key called "predictions"
# and as there is only one example, we take the first element of the list
result = json.loads(r.text)["predictions"][0]

# The returned results are probabilities that can be positive or negative, hence we take their absolute value
abs_scores = np.abs(result)

# Take the argmax that corresponds to the index of the max probability.
label_id = np.argmax(abs_scores)

# Print the proper LABEL with its index
print(config.id2label[label_id])

This should return POSITIVE. It is also possible to get the same result through the gRPC (Google Remote Procedure Call) API:

from transformers import BertTokenizerFast, BertConfig
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import grpc

sentence = "I love the new TensorFlow update in transformers."
tokenizer = BertTokenizerFast.from_pretrained("nateraw/bert-base-uncased-imdb")
config = BertConfig.from_pretrained("nateraw/bert-base-uncased-imdb")

# Tokenize the sentence but this time with TensorFlow tensors as output already batch sized to 1. Ex:
# {
#    'input_ids': <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[  101, 19082,   102]])>,
#    'token_type_ids': <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[0, 0, 0]])>,
#    'attention_mask': <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[1, 1, 1]])>
# }
batch = tokenizer(sentence, return_tensors="tf")

# Create a channel that will be connected to the gRPC port of the container
channel = grpc.insecure_channel("localhost:8500")

# Create a stub made for prediction. This stub will be used to send the gRPC request to the TF Server.
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Create a gRPC request made for prediction
request = predict_pb2.PredictRequest()

# Set the name of the model, for this use case it is bert
request.model_spec.name = "bert"

# Set which signature is used to format the gRPC query, here the default one
request.model_spec.signature_name = "serving_default"

# Set the input_ids input from the input_ids given by the tokenizer
# tf.make_tensor_proto turns a TensorFlow tensor into a Protobuf tensor
request.inputs["input_ids"].CopyFrom(tf.make_tensor_proto(batch["input_ids"]))

# Same with attention mask
request.inputs["attention_mask"].CopyFrom(tf.make_tensor_proto(batch["attention_mask"]))

# Same with token type ids
request.inputs["token_type_ids"].CopyFrom(tf.make_tensor_proto(batch["token_type_ids"]))

# Send the gRPC request to the TF Server
result = stub.Predict(request)

# The output is a protobuf where the only output is a list of probabilities
# assigned to the key logits. As the probabilities are floats, the list is
# read as a list of floats with .float_val
output = result.outputs["logits"].float_val

# Print the proper LABEL with its index
print(config.id2label[np.argmax(np.abs(output))])

Conclusion

Thanks to the recent updates of the TensorFlow models in transformers, one can now easily deploy their models in production with TensorFlow Serving. One of the next steps we are thinking about is to integrate the preprocessing part directly inside the SavedModel, to make things even easier.
