Faster Text Generation with TensorFlow and XLA

Published July 27, 2022

TL;DR: Text generation with TensorFlow in 🤗 transformers can now be compiled with XLA. It is up to 100x faster than before, and even faster than PyTorch. Check the colab below!

Text Generation

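As the quality of large language models has increased, so have our expectations of what those models can do. Ever since OpenAI released GPT-2 in particular, models with text generation capabilities have been in the spotlight. And for good reason: these models can be used to summarize, translate, and have even demonstrated zero-shot learning capabilities on some language tasks. This blog post shows how to make the most of this technology with TensorFlow.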

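The 🤗 transformers library started with NLP models, so text generation is of utmost importance to us. It is part of Hugging Face's democratization efforts to ensure that it is accessible, easily controllable, and efficient. A previous blog post covers the different types of text generation. Nevertheless, here is a quick recap of the core functionality. Feel free to skip it if you are familiar with our generate function and want to jump straight into the TensorFlow specifics.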

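Let's start with the basics. Text generation can be deterministic or stochastic, depending on the do_sample flag. By default it is set to False, which makes the output deterministic, also known as Greedy Decoding. When it is set to True, also known as Sampling, the output is stochastic, but you can still obtain reproducible results through the seed argument (with the same format as in stateless TensorFlow random number generation). As a rule of thumb, you want deterministic generation if you wish to obtain factual information from your model, and stochastic generation if you are aiming at more creative outputs.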

# Requires transformers >= 4.21.0;
# Sampling outputs may differ, depending on your hardware.
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")
model.config.pad_token_id = model.config.eos_token_id
inputs = tokenizer(["TensorFlow is"], return_tensors="tf")

generated = model.generate(**inputs, do_sample=True, seed=(42, 0))
print("Sampling output: ", tokenizer.decode(generated[0]))
# > Sampling output: TensorFlow is a great learning platform for learning about
# data structure and structure in data science..

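Depending on the target application, longer outputs might be desirable. You can control the length of the generated output with max_new_tokens, keeping in mind that longer generations require more resources.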

generated = model.generate(
    **inputs, do_sample=True, seed=(42, 0), max_new_tokens=5
)
print("Limiting to 5 new tokens:", tokenizer.decode(generated[0]))
# > Limiting to 5 new tokens: TensorFlow is a great learning platform for
generated = model.generate(
    **inputs, do_sample=True, seed=(42, 0), max_new_tokens=30
)
print("Limiting to 30 new tokens:", tokenizer.decode(generated[0]))
# > Limiting to 30 new tokens: TensorFlow is a great learning platform for
# learning about data structure and structure in data science................

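Sampling has a few knobs you can use to control randomness. The most important is temperature, which sets the overall entropy of your output: values below 1.0 prioritize sampling tokens with a higher likelihood, whereas values above 1.0 do the opposite. As the temperature approaches 0.0, the behavior reduces to Greedy Decoding, whereas very large values approximate uniform sampling.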

generated = model.generate(
    **inputs, do_sample=True, seed=(42, 0), temperature=0.7
)
print("Temperature 0.7: ", tokenizer.decode(generated[0]))
# > Temperature 0.7: TensorFlow is a great way to do things like this........
generated = model.generate(
    **inputs, do_sample=True, seed=(42, 0), temperature=1.5
)
print("Temperature 1.5: ", tokenizer.decode(generated[0]))
# > Temperature 1.5: TensorFlow is being developed for both Cython and Bamboo.
# On Bamboo...
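
To make the effect of temperature more concrete, here is a minimal sketch of temperature scaling on a softmax distribution. It is plain Python rather than the code used inside generate, and the logits are made-up values for a hypothetical 4-token vocabulary.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    max_scaled = max(scaled)
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary (illustrative values only).
toy_logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.1, 0.7, 1.0, 1.5, 100.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t:>5}: " + ", ".join(f"{p:.3f}" for p in probs))
```

With a small temperature almost all of the probability mass lands on the most likely token, which is why sampling then behaves like Greedy Decoding. With a very large temperature the four probabilities become nearly equal.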

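Contrary to Sampling, Greedy Decoding always picks the most likely token at each iteration of generation. However, it often results in sub-optimal outputs. You can increase the quality of the results through the num_beams argument. When it is larger than 1, it triggers Beam Search, which continuously explores high-probability sequences. This exploration comes at the cost of additional resources and computational time.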

generated = model.generate(**inputs, num_beams=2)
print("Beam Search output:", tokenizer.decode(generated[0]))
# > Beam Search output: TensorFlow is an open-source, open-source,
# distributed-source application framework for the
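
The following toy sketch illustrates why Beam Search can find a better sequence than Greedy Decoding. It is not the transformers implementation (which also handles end-of-sequence tokens, length penalties, and batching). TOY_MODEL is a hypothetical table of next-token probabilities invented for this example.

```python
import math

# Hypothetical next-token probabilities, keyed by the sequence generated so far.
TOY_MODEL = {
    (): {"a": 0.5, "b": 0.4, "c": 0.1},
    ("a",): {"x": 0.4, "y": 0.3, "z": 0.3},
    ("b",): {"x": 0.9, "y": 0.05, "z": 0.05},
    ("c",): {"x": 0.4, "y": 0.3, "z": 0.3},
}

def beam_search(model, num_beams, num_steps):
    """Keep the `num_beams` highest-scoring partial sequences at every step."""
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(num_steps):
        candidates = []
        for sequence, score in beams:
            for token, prob in model[sequence].items():
                candidates.append((sequence + (token,), score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    # Convert log-probabilities back to probabilities for readability.
    return [(sequence, math.exp(score)) for sequence, score in beams]

greedy = beam_search(TOY_MODEL, num_beams=1, num_steps=2)  # same as greedy decoding
beam = beam_search(TOY_MODEL, num_beams=2, num_steps=2)
print("Greedy (num_beams=1):", greedy[0])
print("Beam Search (num_beams=2):", beam[0])
```

Greedy commits to "a" because it is the most likely first token, and ends with a sequence of probability 0.2. With two beams, the search also keeps "b" alive and discovers the sequence ("b", "x"), whose overall probability is about 0.36.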

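Finally, when running Sampling or Beam Search, you can use num_return_sequences to return several sequences. For Sampling, it is equivalent to running generate multiple times from the same input prompt, while for Beam Search it returns the highest-scoring generated beams in descending order.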

generated = model.generate(**inputs, num_beams=2, num_return_sequences=2)
print(
    "All generated hypotheses:",
    "\n".join(tokenizer.decode(out) for out in generated)
)
# > All generated hypotheses: TensorFlow is an open-source, open-source,
# distributed-source application framework for the
# > TensorFlow is an open-source, open-source, distributed-source application
# framework that allows

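As you can see, the basics of text generation are straightforward to control. However, there are many options not covered in the examples above, and reading the documentation is encouraged for advanced use cases. Sadly, when you run generate with TensorFlow, you might notice that it takes a while to execute. If your target application expects low latency or a large number of input prompts, running text generation with TensorFlow looks like an expensive endeavour. 😬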

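Fear not, for the remainder of this blog post aims to demonstrate that one line of code can make a drastic improvement. If you'd rather jump straight into action, the colab has an interactive example you can fiddle with!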

TensorFlow and XLA

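XLA, or Accelerated Linear Algebra, is a compiler originally developed to accelerate TensorFlow models. Nowadays, it is also the compiler behind JAX, and it can even be used with PyTorch. Although the word "compiler" might sound daunting to some, XLA is simple to use with TensorFlow: it comes packaged inside the tensorflow library, and it can be triggered with the jit_compile argument in any graph-creating function.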

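For those of you familiar with TensorFlow 1 🧓, the concept of a TensorFlow graph comes naturally, as it was the only mode of operation. First, you defined the operations in a declarative fashion to create a graph. Afterwards, you could pipe inputs through the graph and observe the outputs. Fast, efficient, but painful to debug. With TensorFlow 2 came Eager Execution and the ability to code models imperatively. The TensorFlow team explains the difference in more detail in its blog post.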

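Hugging Face writes its TensorFlow models with Eager Execution in mind. Transparency is a core value, and being able to inspect the model internals at any point is very beneficial to that end. However, it also means that some uses of the models do not benefit from the performance advantages of graph mode out of the box (e.g. when calling model(args)).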

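Fortunately, the TensorFlow team has users like us covered 🥳! When you wrap a function containing TensorFlow code with tf.function, calling the wrapped function will attempt to convert it into a graph. If you are training a model, calling model.compile() (without run_eagerly=True) does precisely that wrapping, so that you benefit from graph mode when you call model.fit(). Since tf.function can be used on any function containing TensorFlow code, you can also use it on functions that go beyond model inference, creating a single optimized graph.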

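Now that you know how to create TensorFlow graphs, compiling them with XLA is straightforward: simply add jit_compile=True as an argument to the functions mentioned above (tf.function and tf.keras.Model.compile). Assuming everything went well (more on that below) and that you are using a GPU or a TPU, you will notice that the first call takes a while, but that the subsequent ones are much faster. Here is a simple example of a function that performs model inference and some post-processing of its outputs: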

# Note: execution times are deeply dependent on hardware -- a 3090 was used here.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer(["TensorFlow is"], return_tensors="tf")

def most_likely_next_token(inputs):
    model_output = model(inputs)
    return tf.argmax(model_output.logits[:, -1, :], axis=-1)

print("Calling regular function with TensorFlow code...")
most_likely_next_token(inputs)
# > Execution time -- 48.8 ms

In one line, you can create an XLA-accelerated function from the function above.

xla_most_likely_next_token = tf.function(most_likely_next_token, jit_compile=True)

print("Calling XLA function... (for the first time -- will be slow)")
xla_most_likely_next_token(inputs)
# > Execution time -- 3951.0 ms
print("Calling XLA function... (for the second time -- will be fast)")
xla_most_likely_next_token(inputs)
# > Execution time -- 1.6 ms

Text Generation using TensorFlow with XLA

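As with any optimization procedure, there is no free lunch, and XLA is no exception. From the perspective of a text generation user, there is only one technical aspect that you need to keep in mind. Without digging too much into details, XLA used in this fashion performs just-in-time (JIT) compilation of a tf.function when you call it, which relies on polymorphism.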

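When you compile a function this way, XLA keeps track of the shape and type of every tensor, as well as the data of every non-tensor function input. The function is compiled to a binary, and every time it is called with the same tensor shape and type (with ANY tensor data) and the same non-tensor arguments, the compiled function can be reused. Conversely, if you call the function with a different shape or type in an input tensor, or with a different non-tensor argument, a new costly compilation step takes place. Summarized in a simple example: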

# Note: execution times are deeply dependent on hardware -- a 3090 was used here.
import tensorflow as tf

@tf.function(jit_compile=True)
def max_plus_constant(tensor, scalar):
    return tf.math.reduce_max(tensor) + scalar

# Slow: XLA compilation will kick in, as it is the first call
max_plus_constant(tf.constant([0, 0, 0]), 1)
# > Execution time -- 520.4 ms

# Fast: Not the first call with this tensor shape, tensor type, and exact same
# non-tensor argument
max_plus_constant(tf.constant([1000, 0, -10]), 1)
# > Execution time -- 0.6 ms

# Slow: Different tensor type
max_plus_constant(tf.constant([0, 0, 0], dtype=tf.int64), 1)
# > Execution time -- 27.1 ms

# Slow: Different tensor shape
max_plus_constant(tf.constant([0, 0, 0, 0]), 1)
# > Execution time -- 25.5 ms

# Slow: Different non-tensor argument
max_plus_constant(tf.constant([0, 0, 0]), 2)
# > Execution time -- 24.9 ms
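
The reuse rule above can be pictured as a cache lookup. The sketch below is a toy model in plain Python, not how TensorFlow is implemented: compile_cache, signature and call_compiled are hypothetical names, and tensors are represented only by their shape and dtype.

```python
# Toy model of a compilation cache; not how TensorFlow is implemented.
compile_cache = {}

def signature(tensor_shape, tensor_dtype, scalar):
    """Tensors are keyed by shape and dtype; Python values are keyed by value."""
    return (tensor_shape, tensor_dtype, scalar)

def call_compiled(tensor_shape, tensor_dtype, scalar):
    key = signature(tensor_shape, tensor_dtype, scalar)
    if key in compile_cache:
        return "cache hit (fast)"
    compile_cache[key] = "compiled binary"  # placeholder for the expensive step
    return "cache miss (slow, compiles)"

calls = [
    ((3,), "int32", 1),  # first call
    ((3,), "int32", 1),  # same shape, dtype and scalar; tensor values do not matter
    ((3,), "int64", 1),  # different dtype
    ((4,), "int32", 1),  # different shape
    ((3,), "int32", 2),  # different non-tensor argument
]
outcomes = [call_compiled(*call) for call in calls]
for call, outcome in zip(calls, outcomes):
    print(call, "->", outcome)
```

Only the second call hits the cache, mirroring the timings in the example above. Note that the tensor's values never enter the key, whereas the Python scalar's value does.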

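In practice, for text generation, this simply means that the input should be padded to a multiple of a certain length (so that it has a limited number of possible shapes), and that using different options will be slow the first time you use them. Let's see what happens when you naively call generation with XLA.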

# Note: execution times are deeply dependent on hardware -- a 3090 was used here.
import time
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

# Notice the new argument, `padding_side="left"` -- decoder-only models, which can
# be instantiated with TFAutoModelForCausalLM, should be left-padded, as they
# continue generating from the input prompt.
tokenizer = AutoTokenizer.from_pretrained(
    "gpt2", padding_side="left", pad_token="</s>"
)
model = TFAutoModelForCausalLM.from_pretrained("gpt2")
model.config.pad_token_id = model.config.eos_token_id
input_1 = ["TensorFlow is"]
input_2 = ["TensorFlow is a"]

# One line to create an XLA generation function
xla_generate = tf.function(model.generate, jit_compile=True)

# Calls XLA generation without padding
tokenized_input_1 = tokenizer(input_1, return_tensors="tf")  # length = 4
tokenized_input_2 = tokenizer(input_2, return_tensors="tf")  # length = 5
print(f"`tokenized_input_1` shape = {tokenized_input_1.input_ids.shape}")
print(f"`tokenized_input_2` shape = {tokenized_input_2.input_ids.shape}")

print("Calling XLA generation with tokenized_input_1...")
print("(will be slow as it is the first call)")
start = time.time_ns()
xla_generate(**tokenized_input_1)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} ms\n")
# > Execution time -- 9565.1 ms

print("Calling XLA generation with tokenized_input_2...")
print("(has a different length = will trigger tracing again)")
start = time.time_ns()
xla_generate(**tokenized_input_2)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} ms\n")
# > Execution time -- 6815.0 ms

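Oh no, that's terribly slow! As mentioned above, one way to keep the number of different shape combinations in check is padding. The tokenizer classes have a pad_to_multiple_of argument that can be used to strike a balance between accepting any input length and limiting tracing.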

padding_kwargs = {"pad_to_multiple_of": 8, "padding": True}
tokenized_input_1_with_padding = tokenizer(
    input_1, return_tensors="tf", **padding_kwargs
)  # length = 8
tokenized_input_2_with_padding = tokenizer(
    input_2, return_tensors="tf", **padding_kwargs
)  # length = 8
print(
    "`tokenized_input_1_with_padding` shape = ",
    f"{tokenized_input_1_with_padding.input_ids.shape}"
)
print(
    "`tokenized_input_2_with_padding` shape = ",
    f"{tokenized_input_2_with_padding.input_ids.shape}"
)

print("Calling XLA generation with tokenized_input_1_with_padding...")
print("(slow, first time running with this length)")
start = time.time_ns()
xla_generate(**tokenized_input_1_with_padding)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} ms\n")
# > Execution time -- 6815.4 ms

print("Calling XLA generation with tokenized_input_2_with_padding...")
print("(will be fast!)")
start = time.time_ns()
xla_generate(**tokenized_input_2_with_padding)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} ms\n")
# > Execution time -- 19.3 ms
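
To see why padding to a multiple limits tracing, the sketch below counts how many distinct input lengths remain after padding. It is plain Python arithmetic, independent of the tokenizer. padded_length is a hypothetical helper, and the range of prompt lengths is an assumption made for illustration.

```python
import math

def padded_length(length, multiple):
    """Smallest multiple of `multiple` that is at least `length`."""
    return math.ceil(length / multiple) * multiple

prompt_lengths = range(1, 65)  # hypothetical prompt lengths, in tokens
for multiple in (1, 8, 32):
    shapes = {padded_length(n, multiple) for n in prompt_lengths}
    print(f"pad_to_multiple_of={multiple:>2}: {len(shapes)} distinct input lengths")
```

For prompts of 1 to 64 tokens, padding to a multiple of 8 leaves 8 possible input lengths instead of 64, so at most 8 compilations are needed per set of generation options. A larger multiple reduces that number further, at the price of more computation spent on padding tokens.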

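That's much better. Successive generation calls performed this way will be orders of magnitude faster than before. Keep in mind that trying new generation options, at any point, will trigger tracing.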

print("Calling XLA generation with the same input, but with new options...")
print("(slow again)")
start = time.time_ns()
xla_generate(**tokenized_input_1_with_padding, num_beams=2)
end = time.time_ns()
print(f"Execution time -- {(end - start) / 1e6:.1f} ms\n")
# > Execution time -- 9644.2 ms

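From a developer perspective, relying on XLA implies being aware of a few additional nuances. XLA shines when the sizes of the data structures are known in advance, such as in model training. On the other hand, when their dimensions are impossible to determine or certain dynamic slices are used, XLA fails to compile. Modern implementations of text generation are auto-regressive, and their natural behavior is to expand tensors and to abruptly interrupt some operations as they go. In other words, they are not XLA-friendly by default. We have rewritten our entire TensorFlow text generation codebase to vectorize operations and use fixed-size structures with padding. Our NLP models were also modified to correctly use their positional embeddings in the presence of padded structures. The result should be invisible to TensorFlow text generation users, except for the availability of XLA compilation.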
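
The difference between a growing sequence and a fixed-size padded structure can be sketched in a few lines. This is a toy illustration in plain Python, not the transformers code: toy_next_token is a hypothetical stand-in for a model forward pass, and the token ids are made up.

```python
PAD_TOKEN = 0
EOS_TOKEN = 9

def toy_next_token(tokens, length):
    """Hypothetical stand-in for a forward pass plus argmax."""
    return min(tokens[length - 1] + 2, EOS_TOKEN)

def generate_dynamic(prompt, max_length):
    """Eager-style loop: the sequence grows and the loop exits early."""
    tokens = list(prompt)
    while len(tokens) < max_length:
        tokens.append(toy_next_token(tokens, len(tokens)))
        if tokens[-1] == EOS_TOKEN:
            break
    return tokens

def generate_fixed_size(prompt, max_length):
    """Compilation-friendly loop: fixed-size buffer, fixed number of steps."""
    # Allocate the full output once; its size never changes during the loop.
    buffer = list(prompt) + [PAD_TOKEN] * (max_length - len(prompt))
    length = len(prompt)
    finished = False
    for position in range(len(prompt), max_length):
        token = toy_next_token(buffer, length)
        # Once finished, keep writing padding rather than exiting the loop.
        buffer[position] = PAD_TOKEN if finished else token
        if not finished:
            length += 1
            finished = token == EOS_TOKEN
    return buffer

dynamic_output = generate_dynamic([3, 4], max_length=8)
fixed_output = generate_fixed_size([3, 4], max_length=8)
print("dynamic:   ", dynamic_output)
print("fixed-size:", fixed_output)
```

Both loops produce the same tokens up to the end-of-sequence token, but the second one always returns a list of max_length elements and always runs the same number of steps. That is the kind of shape stability a compiler needs.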

Benchmarks and Conclusions

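Above you saw that you can convert TensorFlow functions into a graph and accelerate them with XLA compilation. Current forms of text generation are simply an auto-regressive function that alternates between a model forward pass and some post-processing, producing one token per iteration. Through XLA compilation, the entire process gets optimized, resulting in faster execution. But how much faster? The Gradio demo below contains some benchmarks comparing Hugging Face's text generation on multiple GPU models for the two main ML frameworks, TensorFlow and PyTorch.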

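If you explore the results, two conclusions become clear quickly:

As this blog post has been building up to, TensorFlow text generation is much faster when XLA is used. We are talking about speedups larger than 100x in some cases, which truly demonstrates the power of a compiled graph 🚀
TensorFlow text generation with XLA is the fastest option in the vast majority of cases, in some of them by as much as 9x, debunking the myth that PyTorch is the go-to framework for serious NLP tasks 💪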

Give the colab a go, and enjoy the power of text generation supercharged with XLA!
