Compiling and offloading quantized models

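Optimizing a model usually means trading inference speed against memory usage. Caching, for example, speeds up inference but also increases memory consumption, because the outputs of intermediate attention layers have to be stored. A more balanced strategy combines model quantization, torch.compile, and one of several offloading methods.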

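For image generation, combining quantization with model offloading often gives the best trade-off between quality, speed, and memory. Group offloading is less effective for image generation: when the compute kernels finish quickly, the data transfers usually cannot be fully overlapped with computation, which leaves some CPU-GPU communication overhead exposed.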

For video generation, combining quantization with group offloading tends to work better, because video models are more compute-bound.
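The overlap argument can be made concrete with a toy timing model. This is only an illustrative sketch: the function name `exposed_transfer_time` and every timing below are hypothetical, not measurements, and real group offloading behaves in a more complicated way.

```python
# Toy model of group offloading with stream prefetching.
# All names and timings are hypothetical and chosen only for illustration.

def exposed_transfer_time(compute_ms, transfer_ms, num_groups):
    """Total time (ms) the GPU spends waiting for weights to arrive.

    While group i computes, group i + 1 is copied on a side stream, so each
    step only stalls for the part of the copy that outlasts the compute.
    The first group has nothing to hide behind, so its copy is fully exposed.
    """
    stall_per_group = max(0.0, transfer_ms - compute_ms)
    return transfer_ms + (num_groups - 1) * stall_per_group


# Short kernels (image-like): each copy takes longer than the compute it hides behind.
image_like = exposed_transfer_time(compute_ms=2.0, transfer_ms=5.0, num_groups=50)

# Long kernels (video-like): each copy finishes well before the compute does.
video_like = exposed_transfer_time(compute_ms=20.0, transfer_ms=5.0, num_groups=50)

print(f"image-like stall: {image_like} ms")
print(f"video-like stall: {video_like} ms")
```

Under these assumed numbers, the short-kernel case stalls on almost every group, while the long-kernel case only pays for the very first transfer.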

The table below compares combinations of optimization strategies and their effect on latency and memory usage for Flux.

Combination	Latency (s)	Memory usage (GB)
quantization	32.602	14.9453
quantization, torch.compile	25.847	14.9448
quantization, torch.compile, model CPU offloading	32.312	12.2369

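These results were benchmarked on Flux with an RTX 4090. The transformer and text_encoder components were quantized. If you are interested in evaluating your own model, refer to this benchmarking script.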
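To read the table as relative changes, the short calculation below compares each row against the quantization-only baseline. The numbers are copied from the table; the helper `relative_change` is a name introduced here only for illustration.

```python
# Latency (s) and memory usage (GB) copied from the table above.
results = {
    "quantization": (32.602, 14.9453),
    "quantization, torch.compile": (25.847, 14.9448),
    "quantization, torch.compile, model CPU offloading": (32.312, 12.2369),
}

base_latency, base_memory = results["quantization"]


def relative_change(value, baseline):
    """Signed fractional change of value with respect to baseline."""
    return (value - baseline) / baseline


for name, (latency, memory) in results.items():
    print(
        f"{name}: "
        f"latency {relative_change(latency, base_latency):+.1%}, "
        f"memory {relative_change(memory, base_memory):+.1%}"
    )
```

On these figures, torch.compile cuts latency by roughly 21% with essentially unchanged memory, while adding model CPU offloading gives back almost all of that speedup in exchange for about 18% less memory.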

This guide shows how to compile and offload a quantized model with bitsandbytes. Make sure you are using PyTorch nightly and the latest version of bitsandbytes.

pip install -U bitsandbytes

Quantization and torch.compile

Start by quantizing a model to reduce the memory needed to store it, then compile it to speed up inference.

Set Dynamo's `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models.

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize the transformer and the second text encoder to 4-bit NF4 with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# compile the transformer; the first call is slow while kernels are compiled and autotuned
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer.compile(mode="max-autotune", fullgraph=True)
pipeline("""
    cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
    highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
).images[0]

Quantization, torch.compile, and offloading

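In addition to quantization and torch.compile, try offloading if you need to reduce memory usage further. Offloading moves layers or whole model components from the CPU to the GPU as they are needed for computation.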

When offloading, configure Dynamo's `cache_size_limit` to avoid excessive recompilation, and set `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models.

Model CPU offloading
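A minimal sketch of model CPU offloading combined with quantization and compilation is given below. It reuses the quantization settings from the earlier example. The helper name `load_flux_with_model_offload` and the `cache_size_limit` value of 1000 are assumptions made for illustration, and the snippet has not been benchmarked here. Imports are deferred into the function so that defining it requires neither a GPU nor the model weights.

```python
def load_flux_with_model_offload(model_id="black-forest-labs/FLUX.1-dev"):
    """Load a 4-bit quantized pipeline, enable model CPU offloading, then compile.

    Hypothetical helper: the name and the cache_size_limit value are assumptions.
    """
    # Deferred imports: calling this function needs torch, diffusers, bitsandbytes and a GPU.
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.quantizers import PipelineQuantizationConfig

    # Allow more recompilations before Dynamo gives up (1000 is an assumed, generous value)
    # and let Dynamo handle the dynamic outputs of bitsandbytes models.
    torch._dynamo.config.cache_size_limit = 1000
    torch._dynamo.config.capture_dynamic_output_shape_ops = True

    # Same quantization settings as the earlier example.
    quant_config = PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": torch.bfloat16,
        },
        components_to_quantize=["transformer", "text_encoder_2"],
    )
    pipeline = DiffusionPipeline.from_pretrained(
        model_id,
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )

    # Move each component to the GPU only while it runs.
    # No .to("cuda") call here: offloading manages device placement itself.
    pipeline.enable_model_cpu_offload()

    # Compile with default settings after offloading has been enabled.
    pipeline.transformer.compile()
    return pipeline
```

With the dependencies and a GPU available, usage would look like `pipeline = load_flux_with_model_offload()` followed by `pipeline(prompt).images[0]`.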

Group offloading
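A comparable sketch for group offloading follows. It assumes a pipeline that has already been loaded with a quantization config, and it applies group offloading only to the transformer. The helper name `group_offload_and_compile`, the `leaf_level` offload type, and the `cache_size_limit` value are illustrative choices rather than settings confirmed by this guide.

```python
def group_offload_and_compile(pipeline):
    """Enable group offloading on pipeline.transformer, then compile it.

    Hypothetical helper: expects an already loaded (and quantized) Diffusers pipeline.
    """
    # Deferred import so that defining the helper does not require torch.
    import torch

    # Allow more recompilations (1000 is an assumed value) and let Dynamo
    # handle the dynamic outputs of bitsandbytes models.
    torch._dynamo.config.cache_size_limit = 1000
    torch._dynamo.config.capture_dynamic_output_shape_ops = True

    # Offload the transformer leaf module by leaf module, and prefetch the next
    # group on a separate CUDA stream so transfers can overlap with compute.
    pipeline.transformer.enable_group_offload(
        onload_device=torch.device("cuda"),
        offload_device=torch.device("cpu"),
        offload_type="leaf_level",
        use_stream=True,
    )

    # Compile with default settings after the offloading hooks are in place.
    pipeline.transformer.compile()
    return pipeline
```

The remaining components (text encoders, VAE) are left wherever the caller placed them; they must be on the GPU or offloaded separately before the pipeline is run.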
