Accelerate inference of text-to-image diffusion models


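Diffusion models are slower than their GAN counterparts because the reverse diffusion process is iterative and sequential. Several techniques address this limitation, such as progressive timestep distillation (LCM LoRA), model compression (SSD-1B), and reusing adjacent features of the denoiser (DeepCache).

However, you don't necessarily need these techniques to speed up inference. With PyTorch 2 alone, you can reduce the inference latency of text-to-image diffusion pipelines by up to 3x. This tutorial shows how to progressively apply the optimizations in PyTorch 2 to reduce inference latency. It uses the Stable Diffusion XL (SDXL) pipeline, but the techniques also apply to other text-to-image diffusion pipelines.

Make sure you're using the latest version of Diffusers: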

pip install -U diffusers

Then upgrade the other required libraries as well:

pip install -U transformers accelerate peft

Install PyTorch nightly to benefit from the latest and fastest kernels:

pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121

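The results reported below were measured on an 80GB 400W A100 with its clock rate set to the maximum. If you're interested in the full benchmarking code, take a look at huggingface/diffusion-fast.

Baseline

Let's start with a baseline. Disable reduced precision and the scaled_dot_product_attention (SDPA) function, which Diffusers uses automatically: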

from diffusers import StableDiffusionXLPipeline

# Load the pipeline in full-precision and place its model components on CUDA.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
).to("cuda")

# Run the attention ops without SDPA.
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]

This default setup takes 7.36 seconds.
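
The latency figures in this tutorial come from the benchmarking code in huggingface/diffusion-fast. To get comparable numbers on your own hardware, a small timing helper along the following lines may be enough. This is only a sketch, not the benchmark's actual code: `measure_latency` and its parameters are hypothetical names, and the workload timed here is a CPU stand-in for the pipeline call.

```python
import time


def measure_latency(fn, warmup=1, runs=3, sync=None):
    """Return the mean wall-clock time in seconds of one call to `fn`.

    `sync` is an optional callable that blocks until pending device work
    is done; it is invoked before the clock is started and before it is read.
    """
    # Warm-up calls absorb one-time costs such as compilation or caching.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / runs


# Stand-in workload so the sketch runs anywhere.
latency = measure_latency(lambda: sum(i * i for i in range(10_000)))
print(f"{latency:.6f} s per call")
```

For the pipeline itself, one could pass `lambda: pipe(prompt, num_inference_steps=30)` as `fn` and `torch.cuda.synchronize` as `sync`, so that queued GPU work is finished before the clock is read.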

bfloat16

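Enable the first optimization: reduced precision, and more specifically bfloat16. Using reduced precision has several benefits:

For inference, using a reduced numerical precision (such as float16 or bfloat16) doesn't affect generation quality but significantly reduces latency.
The benefit of bfloat16 over float16 depends on the hardware, but modern GPUs tend to favor bfloat16.
bfloat16 is more resilient than float16 when used with quantization, although the recent version of the quantization library (torchao) used here has no numerical issues with float16.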

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

# Run the attention ops without SDPA.
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]

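bfloat16 reduces the latency from 7.36 seconds to 4.63 seconds.

In our later experiments with float16, recent versions of torchao did not cause numerical problems with float16.

Take a look at the Speed up inference guide to learn more about running inference with reduced precision.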
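
The two 16-bit formats trade range against precision differently: bfloat16 keeps the 8 exponent bits of float32 but only 7 mantissa bits, while float16 has 5 exponent bits and 10 mantissa bits. The sketch below illustrates that difference with the standard library only. The helper names `to_float16` and `to_bfloat16` are hypothetical, and NaN handling is left out for brevity.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to IEEE float16 (5 exponent bits, 10 mantissa bits)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # Outside the float16 range; a float16 tensor would hold +/-inf here.
        return float("inf") if x > 0 else float("-inf")


def to_bfloat16(x: float) -> float:
    """Round x to bfloat16 (8 exponent bits, 7 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even, then keep the upper 16 bits of the float32 pattern.
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) >> 16
    return struct.unpack("<f", struct.pack("<I", (rounded << 16) & 0xFFFFFFFF))[0]


large, fine = 70000.0, 1.001
large_fp16, large_bf16 = to_float16(large), to_bfloat16(large)
small_fp16, small_bf16 = to_float16(fine), to_bfloat16(fine)

print(large_fp16, large_bf16)  # float16 overflows, bfloat16 stays finite
print(small_fp16, small_bf16)  # float16 keeps more digits near 1.0
```

Large intermediate values overflow in float16 but stay finite in bfloat16, at the cost of coarser rounding for values close together.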

SDPA

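Attention blocks are compute-intensive to run, but PyTorch's scaled_dot_product_attention function makes them much more efficient. Diffusers uses this function by default, so you don't need to change any code.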

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]

Scaled dot product attention reduces the latency from 4.63 seconds to 3.31 seconds.
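
To make concrete what the fused function computes, the following sketch compares an unfused attention implementation with torch.nn.functional.scaled_dot_product_attention on small random CPU tensors. The tensor shapes and the helper name `naive_attention` are illustrative assumptions, not the attention processor code in Diffusers.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Assumed toy shapes: (batch, heads, sequence length, head dimension).
q = torch.randn(2, 4, 16, 8)
k = torch.randn(2, 4, 16, 8)
v = torch.randn(2, 4, 16, 8)


def naive_attention(q, k, v):
    # softmax(Q K^T / sqrt(d)) V, materializing the full score matrix.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v


reference = naive_attention(q, k, v)
fused = F.scaled_dot_product_attention(q, k, v)

# Both paths compute the same quantity up to floating-point rounding.
max_abs_diff = (reference - fused).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The results agree numerically. Any speedup comes from how the fused function dispatches to optimized kernels where available, rather than from a change in the math.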

torch.compile

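PyTorch 2 includes torch.compile, which uses fast and optimized kernels. In Diffusers, the UNet and VAE are usually the modules that get compiled because they are the most compute-intensive. First, configure a few compiler flags (refer to the full list for more options):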

from diffusers import StableDiffusionXLPipeline
import torch

torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
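torch._inductor.config.coordinate_descent_check_all_directions = True

# Load the pipeline in bfloat16; SDPA is used by default.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")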

When compiling the UNet and VAE, it is also important to change their memory layout to "channels_last" to ensure maximum speed.

pipe.unet.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)
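
As a small illustration of what the memory layout change does, the sketch below converts a toy 4D tensor (an assumption standing in for model weights and activations) to channels_last on the CPU. The logical shape stays the same; only the strides, and therefore the physical ordering in memory, change.

```python
import torch

# Toy NCHW tensor: batch 2, 3 channels, height 4, width 5.
x = torch.randn(2, 3, 4, 5)
y = x.to(memory_format=torch.channels_last)

print(x.shape, x.stride())  # contiguous NCHW strides
print(y.shape, y.stride())  # same shape, channel stride is now 1

# The values are identical; only the physical layout differs.
same_values = torch.equal(x, y)
is_channels_last = y.is_contiguous(memory_format=torch.channels_last)
```

With channels_last, the channel values of one pixel sit next to each other in memory, which is the layout many convolution kernels prefer.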

Now compile and run inference:

# Compile the UNet and VAE.
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

# First call to `pipe` is slow, subsequent ones are faster.
image = pipe(prompt, num_inference_steps=30).images[0]

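torch.compile offers different backends and modes. For maximum inference speed, use "max-autotune" with the inductor backend. "max-autotune" uses CUDA graphs and optimizes the compilation graph specifically for latency. CUDA graphs greatly reduce the overhead of launching GPU operations by launching multiple GPU operations through a single CPU operation.

Using SDPA attention and compiling both the UNet and VAE cuts the latency from 3.31 seconds to 2.54 seconds.

Starting with PyTorch 2.3.1, you can control the caching behavior of torch.compile(). This is particularly useful for compilation modes like "max-autotune", which performs a grid search over several compilation flags to find the optimal configuration. Learn more in the Compile Time Caching in torch.compile tutorial.

Prevent graph breaks

Specifying fullgraph=True ensures there are no graph breaks in the underlying model, so you take full advantage of torch.compile without any performance degradation. For the UNet and VAE, this means changing how you access the return variables.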

- latents = unet(
-   latents, timestep=timestep, encoder_hidden_states=prompt_embeds
-).sample

+ latents = unet(
+   latents, timestep=timestep, encoder_hidden_states=prompt_embeds, return_dict=False
+)[0]

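Remove GPU sync after compilation

During the iterative reverse diffusion process, the step() function of the scheduler is called each time the denoiser predicts the less noisy latent embeddings. Inside step(), the sigmas variable is indexed, and when sigmas is placed on the GPU, this causes a communication sync between the CPU and GPU. The sync introduces latency, which becomes more noticeable once the denoiser has been compiled.

If the sigmas array always stays on the CPU, the CPU and GPU sync doesn't occur and this added latency is avoided. In general, CPU and GPU communication syncs should be eliminated or kept to a bare minimum because they affect inference latency.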
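
The idea can be sketched with a toy denoising loop. Everything here is an assumption made for illustration: the linearly spaced schedule, the zero-output stand-in for the denoiser, and the Euler-style update are not the actual scheduler code in Diffusers. The point is only that the schedule lives on the CPU, so reading a sigma never has to wait for the GPU.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumption: a toy noise schedule that is deliberately kept on the CPU.
sigmas = torch.linspace(14.6, 0.0, steps=31)
sigma_values = sigmas.tolist()  # plain Python floats, no device involved

latents = torch.randn(1, 4, 8, 8, device=device) * sigma_values[0]

for sigma, sigma_next in zip(sigma_values[:-1], sigma_values[1:]):
    # Stand-in for the (compiled) denoiser; it predicts an all-zero sample.
    denoised = torch.zeros_like(latents)
    # Euler-style update. The sigmas enter the kernels as Python scalars,
    # so nothing is read back from the device inside the loop.
    derivative = (latents - denoised) / sigma
    latents = latents + derivative * (sigma_next - sigma)

final_magnitude = latents.abs().max().item()
```

Had `sigmas` been a GPU tensor, reading one of its values on the host at every step (for example with `.item()`) would force the CPU to wait for the GPU before continuing.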

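Combine the attention block's projection matrices

The UNet and VAE in SDXL use Transformer-like blocks, which consist of attention blocks and feed-forward blocks.

In an attention block, the input is projected into three subspaces using three different projection matrices: Q, K, and V. These projections are performed separately on the input. However, the projection matrices can be combined horizontally into a single matrix so that the projection is performed in one step. This increases the size of the matrix multiplications for the input projections and increases the benefit of quantization.

You can combine the projection matrices with a single line of code: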

pipe.fuse_qkv_projections()

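This provides a minor improvement, from 2.54 seconds to 2.52 seconds.

Support for fuse_qkv_projections() is limited and experimental. It isn't available for many non-Stable Diffusion pipelines such as Kandinsky. You can refer to this PR to see how to enable it for other pipelines.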
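
The fusion described in this section can be illustrated with three small linear layers. This is a minimal sketch of the underlying idea, not the implementation behind fuse_qkv_projections(). The layer names `to_q`, `to_k`, `to_v`, `to_qkv` and the dimension of 64 are assumptions chosen for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
dim = 64

# Three separate projections, as in an unfused attention block.
to_q, to_k, to_v = (nn.Linear(dim, dim, bias=False) for _ in range(3))

# One fused projection whose weight stacks the three matrices
# along the output dimension (nn.Linear stores weights as (out, in)).
to_qkv = nn.Linear(dim, 3 * dim, bias=False)
with torch.no_grad():
    to_qkv.weight.copy_(torch.cat([to_q.weight, to_k.weight, to_v.weight], dim=0))

x = torch.randn(2, 10, dim)

# A single larger matmul, then split the result back into Q, K and V.
q, k, v = to_qkv(x).chunk(3, dim=-1)

matches = all(
    torch.allclose(fused, separate(x), atol=1e-5)
    for fused, separate in [(q, to_q), (k, to_k), (v, to_v)]
)
print(matches)
```

One matmul with a (3 * dim, dim) weight replaces three matmuls with (dim, dim) weights, which is why the fused form gives quantized kernels a larger matrix to work with.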

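Dynamic quantization

You can also use torchao, an ultra-lightweight PyTorch quantization library (commit SHA 54bcd5a10d0abbe7b0c045052029257099f83fd9), to apply dynamic int8 quantization to the UNet and VAE. Quantization adds conversion overhead to the model, which is hopefully offset by faster matmuls (dynamic quantization). If the matmuls are too small, these techniques may degrade performance.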
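
To show where the conversion overhead and the faster matmul come from, here is a minimal sketch of dynamic int8 quantization for one linear layer on the CPU. It is not torchao's implementation: the helper `quantize_symmetric`, the symmetric per-row scaling scheme, and the toy shapes are all assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 64)   # activations (batch, in_features)
w = torch.randn(32, 64)  # linear weight (out_features, in_features)


def quantize_symmetric(t, dim):
    # One scale per row so that the largest magnitude maps to 127.
    scale = t.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(t / scale).clamp(-127, 127).to(torch.int8)
    return q, scale


# Weights can be quantized once, ahead of time.
w_q, w_scale = quantize_symmetric(w, dim=1)
# Activations are quantized on the fly for every input ("dynamic").
x_q, x_scale = quantize_symmetric(x, dim=1)

# Integer matmul with int32 accumulation, then rescale back to float.
acc = x_q.to(torch.int32) @ w_q.to(torch.int32).t()
y_int8 = acc.to(torch.float32) * x_scale * w_scale.t()

y_ref = x @ w.t()
rel_error = ((y_int8 - y_ref).norm() / y_ref.norm()).item()
print(f"relative error: {rel_error:.4f}")
```

The quantize and rescale steps are the extra work. They pay off only when the integer matmul is large enough to outweigh them, which matches the caveat above about small matmuls.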

First, configure all the compiler flags:

from diffusers import StableDiffusionXLPipeline
import torch

# Notice the two new flags at the end.
torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True
torch._inductor.config.force_fuse_int_mm_with_mul = True
torch._inductor.config.use_mixed_mm = True

Certain linear layers in the UNet and VAE don't benefit from dynamic int8 quantization. You can filter out those layers with the dynamic_quant_filter_fn shown below.

def dynamic_quant_filter_fn(mod, *args):
    return (
        isinstance(mod, torch.nn.Linear)
        and mod.in_features > 16
        and (mod.in_features, mod.out_features)
        not in [
            (1280, 640),
            (1920, 1280),
            (1920, 640),
            (2048, 1280),
            (2048, 2560),
            (2560, 1280),
            (256, 128),
            (2816, 1280),
            (320, 640),
            (512, 1536),
            (512, 256),
            (512, 512),
            (640, 1280),
            (640, 1920),
            (640, 320),
            (640, 5120),
            (640, 640),
            (960, 320),
            (960, 640),
        ]
    )


def conv_filter_fn(mod, *args):
    return (
        isinstance(mod, torch.nn.Conv2d) and mod.kernel_size == (1, 1) and 128 in [mod.in_channels, mod.out_channels]
    )

Finally, apply all the optimizations discussed so far:

# SDPA + bfloat16.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

# Combine attention projection matrices.
pipe.fuse_qkv_projections()

# Change the memory layout.
pipe.unet.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)

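Because dynamic quantization is limited to linear layers, convert the appropriate pointwise convolution layers into linear layers to maximize its benefit.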

from torchao import swap_conv2d_1x1_to_linear

swap_conv2d_1x1_to_linear(pipe.unet, conv_filter_fn)
swap_conv2d_1x1_to_linear(pipe.vae, conv_filter_fn)
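
The swap is possible because a 1x1 convolution applies the same linear map to the channel vector at every pixel. The sketch below checks this equivalence on a toy layer. It is not the code behind swap_conv2d_1x1_to_linear, and the layer sizes are assumptions chosen for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)

conv = nn.Conv2d(128, 256, kernel_size=1)
linear = nn.Linear(128, 256)

# Copy the conv parameters: (256, 128, 1, 1) -> (256, 128).
with torch.no_grad():
    linear.weight.copy_(conv.weight.flatten(1))
    linear.bias.copy_(conv.bias)

x = torch.randn(2, 128, 8, 8)  # NCHW input

out_conv = conv(x)
# nn.Linear acts on the last dimension, so move channels last and back.
out_linear = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

equivalent = torch.allclose(out_conv, out_linear, atol=1e-4)
print(equivalent)
```

Once such a layer is expressed as nn.Linear, a filter like dynamic_quant_filter_fn can select it for quantization.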

Apply dynamic quantization:

from torchao import apply_dynamic_quant

apply_dynamic_quant(pipe.unet, dynamic_quant_filter_fn)
apply_dynamic_quant(pipe.vae, dynamic_quant_filter_fn)

Finally, compile and run inference:

pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]

Applying dynamic quantization reduces the latency from 2.52 seconds to 2.43 seconds.
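
To put the individual steps in perspective, the short script below computes the per-step and cumulative speedups from the latencies reported in this tutorial. The numbers are the ones quoted above; the step labels are shorthand introduced here.

```python
# Latencies in seconds, as reported in this tutorial (80GB A100).
latencies = [
    ("baseline (float32, no SDPA)", 7.36),
    ("bfloat16", 4.63),
    ("SDPA", 3.31),
    ("torch.compile", 2.54),
    ("fused QKV projections", 2.52),
    ("dynamic int8 quantization", 2.43),
]

step_speedups = {}
for (_, previous), (name, current) in zip(latencies, latencies[1:]):
    # Speedup of each step relative to the step before it.
    step_speedups[name] = previous / current
    print(f"{name:<28} {current:.2f}s  x{step_speedups[name]:.2f} vs. previous step")

total_speedup = latencies[0][1] / latencies[-1][1]
print(f"cumulative speedup: x{total_speedup:.2f}")
```

Most of the gain comes from bfloat16, SDPA, and torch.compile. QKV fusion and dynamic quantization each contribute a few percent on this hardware.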
