XLA

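Accelerated Linear Algebra (XLA) is a linear algebra compiler that optimizes model runtime across different hardware and frameworks.

This guide focuses specifically on using XLA to accelerate TensorFlow models.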

TensorFlow

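XLA can accelerate TensorFlow models without any changes to the model's source code. It ships with the TensorFlow library and is triggered by passing jit_compile to any graph-creating function, such as tf.function.

If you're using Keras methods like fit and predict, enable XLA by passing jit_compile=True to compile.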

model.compile(jit_compile=True)
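For context, here is a slightly fuller sketch of the Keras path. The layer sizes, optimizer, loss, and random training data are placeholder assumptions chosen only to make the snippet runnable; the one line that matters is the jit_compile=True argument to compile.

```python
import tensorflow as tf

# Small classifier; layer sizes are arbitrary placeholders.
model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(5, activation="softmax"),
    ]
)

# jit_compile=True asks Keras to compile its training and inference steps with XLA.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", jit_compile=True)

# Random data, used here only so that fit and predict have something to run on.
features = tf.random.normal((64, 10))
labels = tf.random.uniform((64,), maxval=5, dtype=tf.int32)

history = model.fit(features, labels, batch_size=16, epochs=1, verbose=0)
predictions = model.predict(features, batch_size=16, verbose=0)
print(predictions.shape)
```

The first fit and predict calls include XLA compilation time, so any speedup only shows up on later batches.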

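XLA can be used to accelerate any arbitrary tf.function.

Models with a TensorFlow implementation, such as GPT2, T5, OPT, and Whisper, are XLA compatible. The speedup depends on the model, but in general, TensorFlow models in Transformers get a roughly 100x speedup.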
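As a minimal sketch of compiling a standalone function, the snippet below uses tf.function as a decorator with jit_compile=True. The function dense_relu and its tensor shapes are hypothetical and exist only for illustration.

```python
import tensorflow as tf


@tf.function(jit_compile=True)
def dense_relu(x, w, b):
    # A single dense layer followed by ReLU, compiled as one XLA computation.
    return tf.nn.relu(tf.matmul(x, w) + b)


x = tf.random.normal((4, 3))
w = tf.random.normal((3, 2))
b = tf.zeros((2,))

out = dense_relu(x, w, b)
print(out.shape)
```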

Functions

A typical forward pass in a TensorFlow model is shown below. To run the forward pass with XLA, wrap the model with tf.function and set jit_compile=True.

import tensorflow as tf

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(10, input_shape=(10,), activation="relu"), tf.keras.layers.Dense(5, activation="softmax")]
)
# Generate random inputs for the model.
batch_size = 16
input_vector_dim = 10
random_inputs = tf.random.normal((batch_size, input_vector_dim))

# Run a forward pass eagerly, without XLA.
_ = model(random_inputs)

# Run the same forward pass compiled with XLA.
xla_fn = tf.function(model, jit_compile=True)
_ = xla_fn(random_inputs)

The model's default call function is used to compile the XLA graph. To compile any other model function with XLA, wrap that function with tf.function in the same way.

my_xla_fn = tf.function(model.my_xla_fn, jit_compile=True)
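The sketch below makes the one-liner above concrete. TinyModel and its embed_and_normalize method are hypothetical names invented for this example; they stand in for whatever non-default method your model exposes.

```python
import tensorflow as tf


class TinyModel(tf.keras.Model):
    """Hypothetical model with one extra method besides call."""

    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(4)

    def call(self, inputs):
        return self.dense(inputs)

    def embed_and_normalize(self, inputs):
        # Extra method: project the inputs, then L2-normalize each row.
        return tf.math.l2_normalize(self.dense(inputs), axis=-1)


model = TinyModel()
inputs = tf.random.normal((2, 8))

# Call the model once eagerly so its weights exist before XLA compilation.
_ = model(inputs)

# Wrap the non-default method, not the model itself.
xla_embed = tf.function(model.embed_and_normalize, jit_compile=True)
normalized = xla_embed(inputs)
print(normalized.shape)
```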

Text generation

You can also use XLA to compile other model functions. For example, enable XLA for text generation by wrapping generate() with tf.function.

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM
# Will error if the minimal version of Transformers is not installed.
from transformers.utils import check_min_version

check_min_version("4.21.0")

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="</s>")
model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2")
input_string = ["TensorFlow is"]

xla_generate = tf.function(model.generate, jit_compile=True)

tokenized_input = tokenizer(input_string, return_tensors="tf")
generated_tokens = xla_generate(**tokenized_input, num_beams=2)

decoded_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
print(f"Generated -- {decoded_text}")
# Generated -- TensorFlow is an open-source, open-source, distributed-source application framework for the

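Tracing

The first time you execute an XLA-enabled function, it tries to infer the computation graph in a process known as tracing. This is a time-consuming step, but any subsequent calls to the function are much faster because the graph doesn't have to be traced again.

To ensure a function is only traced once, the inputs must have the same shape as when the graph was built. This usually isn't an issue for fixed input shapes like images, but it can be for variable-shaped inputs like text.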
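The following is a minimal sketch of this behavior and is not taken from the Transformers codebase. It relies on the fact that Python side effects inside a tf.function body only run while the function is being traced, so a plain Python counter reveals how many traces happened. The names scaled_sum and trace_count are hypothetical.

```python
import tensorflow as tf

trace_count = 0


@tf.function(jit_compile=True)
def scaled_sum(x):
    global trace_count
    # Python side effect: executes only during tracing, not on cached calls.
    trace_count += 1
    return tf.reduce_sum(x) * 2.0


scaled_sum(tf.ones((2, 4)))   # first call: traces and compiles
scaled_sum(tf.zeros((2, 4)))  # same shape and dtype: reuses the traced graph
scaled_sum(tf.ones((2, 8)))   # new shape: traces again

print(f"Number of traces: {trace_count}")
```

Three calls produce only two traces, because the second call matches the shape and dtype of the first.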

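One way to handle this is to pad your text so that its shape stays the same from call to call. Configure padding options in the tokenizer, such as pad_to_multiple_of, which rounds the padded length up to a multiple of the given value and so limits the number of distinct input shapes.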

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="</s>")
model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2")
input_string = ["TensorFlow is"]

xla_generate = tf.function(model.generate, jit_compile=True)

# Call tokenizer with padding options.
tokenized_input = tokenizer(input_string, pad_to_multiple_of=8, padding=True, return_tensors="tf")

generated_tokens = xla_generate(**tokenized_input, num_beams=2)
decoded_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
print(f"Generated -- {decoded_text}")
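The effect of pad_to_multiple_of on the number of distinct input shapes can be sketched in plain Python. The helpers padded_length and left_pad are hypothetical and only mimic the rounding arithmetic. They are not the tokenizer's implementation, and pad_id=0 is a placeholder value.

```python
def padded_length(num_tokens: int, multiple: int) -> int:
    """Round num_tokens up to the nearest multiple of `multiple`."""
    return ((num_tokens + multiple - 1) // multiple) * multiple


def left_pad(token_ids, multiple, pad_id=0):
    """Left-pad a list of token ids to the rounded-up length."""
    target = padded_length(len(token_ids), multiple)
    return [pad_id] * (target - len(token_ids)) + list(token_ids)


# Sequence lengths that might arrive across several generate calls.
lengths = [3, 5, 8, 9, 14]
padded = [padded_length(n, 8) for n in lengths]

print(padded)  # [8, 8, 8, 16, 16]
print(f"{len(set(lengths))} distinct lengths before padding, {len(set(padded))} after")
print(left_pad([11, 12, 13], 8))
```

Five different raw lengths collapse into two padded lengths, so an XLA-compiled function would need at most two traces for these inputs instead of five.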

Besides the input shape, changing the generation options at any point also triggers tracing.
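A rough analogy, assuming generation options reach the compiled function as plain Python values: tf.function treats Python arguments as part of the trace signature, so a new value produces a new trace. The function repeat_add and its num_steps argument are hypothetical stand-ins for generate and an option such as num_beams.

```python
import tensorflow as tf

trace_count = 0


@tf.function(jit_compile=True)
def repeat_add(x, num_steps):
    global trace_count
    # Runs only during tracing, so it counts how many graphs were built.
    trace_count += 1
    # num_steps is a Python int, so this loop is unrolled at trace time.
    for _ in range(num_steps):
        x = x + 1.0
    return x


x = tf.zeros((2,))
repeat_add(x, num_steps=2)  # first call: traces
repeat_add(x, num_steps=2)  # same option value: reuses the graph
result = repeat_add(x, num_steps=3)  # different option value: traces again

print(f"Number of traces: {trace_count}")
```

Keeping generation options fixed across calls avoids paying the tracing cost again.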

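Resources

Learn more about XLA with the following resources.

A notebook demonstrating XLA-compatible encoder-decoder and decoder-only text generation models.
The Faster Text Generation with TensorFlow and XLA blog post compares benchmarks for XLA-compatible models and provides a friendly introduction to XLA in TensorFlow.
The How Hugging Face improved Text Generation performance with XLA blog post discusses the design philosophy behind adding XLA to TensorFlow models in Transformers.
The Introduction to graphs and tf.function guide.
The Better performance with tf.function guide.
The XLA documentation.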
