AutoencoderDC

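The 2D Autoencoder model used in SANA and introduced in DCAE by Junyu Chen*, Han Cai*, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han from MIT HAN Lab.

The abstract from the paper is:

We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phase training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. Our code is available at this https URL.

The following DCAE models are released and supported in Diffusers.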

Diffusers format	Original format
`mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers`	`mit-han-lab/dc-ae-f32c32-sana-1.0`
`mit-han-lab/dc-ae-f32c32-in-1.0-diffusers`	`mit-han-lab/dc-ae-f32c32-in-1.0`
`mit-han-lab/dc-ae-f32c32-mix-1.0-diffusers`	`mit-han-lab/dc-ae-f32c32-mix-1.0`
`mit-han-lab/dc-ae-f64c128-in-1.0-diffusers`	`mit-han-lab/dc-ae-f64c128-in-1.0`
`mit-han-lab/dc-ae-f64c128-mix-1.0-diffusers`	`mit-han-lab/dc-ae-f64c128-mix-1.0`
`mit-han-lab/dc-ae-f128c512-in-1.0-diffusers`	`mit-han-lab/dc-ae-f128c512-in-1.0`
`mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers`	`mit-han-lab/dc-ae-f128c512-mix-1.0`
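
The checkpoint names in the table appear to encode the autoencoder geometry: `f32c32` reads as a spatial compression factor of 32 with 32 latent channels, which matches the compression ratios quoted in the abstract and the class default `latent_channels = 32`. The following is a minimal sketch under that naming assumption; `parse_dc_ae_name` and `latent_shape` are hypothetical helpers written for illustration and are not part of Diffusers.

```python
import re


def parse_dc_ae_name(repo_id: str) -> tuple[int, int]:
    """Extract (spatial_compression, latent_channels) from a name such as
    'mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers'.

    Assumes the 'f<ratio>c<channels>' tag follows the convention described above.
    """
    match = re.search(r"-f(\d+)c(\d+)-", repo_id)
    if match is None:
        raise ValueError(f"no f<ratio>c<channels> tag found in {repo_id!r}")
    return int(match.group(1)), int(match.group(2))


def latent_shape(repo_id: str, height: int, width: int) -> tuple[int, int, int]:
    """Expected (channels, height, width) of the latent for one input image."""
    factor, channels = parse_dc_ae_name(repo_id)
    if height % factor or width % factor:
        raise ValueError(f"image size must be a multiple of {factor}")
    return channels, height // factor, width // factor


print(latent_shape("mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers", 512, 512))
# (32, 16, 16)
print(latent_shape("mit-han-lab/dc-ae-f128c512-in-1.0-diffusers", 512, 512))
# (512, 4, 4)
```

Under this reading, a higher `f` value trades spatial resolution in the latent for more channels, which is the trade-off the abstract describes.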

This model was contributed by lawrence-cj.

Load a model in Diffusers format with from_pretrained().

import torch
from diffusers import AutoencoderDC

ae = AutoencoderDC.from_pretrained("mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers", torch_dtype=torch.float32).to("cuda")

Load a model in Diffusers via from_single_file

from diffusers import AutoencoderDC

ckpt_path = "https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.0/blob/main/model.safetensors"
model = AutoencoderDC.from_single_file(ckpt_path)

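The AutoencoderDC model has in and mix single-file checkpoint variants that have matching checkpoint keys but use different scaling factors. Diffusers cannot infer the correct config file from the checkpoint alone, so it defaults to configuring the model with the mix variant config file. To override the automatically determined config, pass the config argument when loading an in variant checkpoint from a single file.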

from diffusers import AutoencoderDC

ckpt_path = "https://huggingface.co/mit-han-lab/dc-ae-f128c512-in-1.0/blob/main/model.safetensors"
model = AutoencoderDC.from_single_file(ckpt_path, config="mit-han-lab/dc-ae-f128c512-in-1.0-diffusers")

AutoencoderDC

class diffusers.AutoencoderDC

( in_channels: int = 3 latent_channels: int = 32 attention_head_dim: int = 32 encoder_block_types: typing.Union[str, typing.Tuple[str]] = 'ResBlock' decoder_block_types: typing.Union[str, typing.Tuple[str]] = 'ResBlock' encoder_block_out_channels: typing.Tuple[int, ...] = (128, 256, 512, 512, 1024, 1024) decoder_block_out_channels: typing.Tuple[int, ...] = (128, 256, 512, 512, 1024, 1024) encoder_layers_per_block: typing.Tuple[int] = (2, 2, 2, 3, 3, 3) decoder_layers_per_block: typing.Tuple[int] = (3, 3, 3, 3, 3, 3) encoder_qkv_multiscales: typing.Tuple[typing.Tuple[int, ...], ...] = ((), (), (), (5,), (5,), (5,)) decoder_qkv_multiscales: typing.Tuple[typing.Tuple[int, ...], ...] = ((), (), (), (5,), (5,), (5,)) upsample_block_type: str = 'pixel_shuffle' downsample_block_type: str = 'pixel_unshuffle' decoder_norm_types: typing.Union[str, typing.Tuple[str]] = 'rms_norm' decoder_act_fns: typing.Union[str, typing.Tuple[str]] = 'silu' scaling_factor: float = 1.0 )

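Parameters

in_channels (int, defaults to 3) — The number of input channels in samples.
latent_channels (int, defaults to 32) — The number of channels in the latent space representation.
encoder_block_types (Union[str, Tuple[str]], defaults to "ResBlock") — The type(s) of block to use in the encoder.
decoder_block_types (Union[str, Tuple[str]], defaults to "ResBlock") — The type(s) of block to use in the decoder.
encoder_block_out_channels (Tuple[int, ...], defaults to (128, 256, 512, 512, 1024, 1024)) — The number of output channels for each block in the encoder.
decoder_block_out_channels (Tuple[int, ...], defaults to (128, 256, 512, 512, 1024, 1024)) — The number of output channels for each block in the decoder.
encoder_layers_per_block (Tuple[int], defaults to (2, 2, 2, 3, 3, 3)) — The number of layers per block in the encoder.
decoder_layers_per_block (Tuple[int], defaults to (3, 3, 3, 3, 3, 3)) — The number of layers per block in the decoder.
encoder_qkv_multiscales (Tuple[Tuple[int, ...], ...], defaults to ((), (), (), (5,), (5,), (5,))) — Multi-scale configurations for the encoder's QKV (query-key-value) transformations.
decoder_qkv_multiscales (Tuple[Tuple[int, ...], ...], defaults to ((), (), (), (5,), (5,), (5,))) — Multi-scale configurations for the decoder's QKV (query-key-value) transformations.
upsample_block_type (str, defaults to "pixel_shuffle") — The type of block to use for upsampling in the decoder.
downsample_block_type (str, defaults to "pixel_unshuffle") — The type of block to use for downsampling in the encoder.
decoder_norm_types (Union[str, Tuple[str]], defaults to "rms_norm") — The normalization type(s) to use in the decoder.
decoder_act_fns (Union[str, Tuple[str]], defaults to "silu") — The activation function(s) to use in the decoder.
scaling_factor (float, defaults to 1.0) — The multiplicative inverse of the root mean square of the latent features. This is used to scale the latent space to have unit variance when training the diffusion model. The latents are scaled with the formula z = z * scaling_factor before being passed to the diffusion model. When decoding, the latents are scaled back to the original scale with the formula z = 1 / scaling_factor * z.

An Autoencoder model introduced in DCAE and used in SANA.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods implemented for all models (such as downloading or saving).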
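
To make the role of the scaling_factor parameter above concrete, here is a minimal pure-Python sketch of the two formulas from its description. The latent values are made up for illustration, and `rms` is a hypothetical helper; in practice the factor ships with each checkpoint's config rather than being computed by the user.

```python
import math


def rms(values):
    """Root mean square of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Made-up latent features standing in for an encoder output.
latents = [4.0, -2.0, 0.0, 2.0, -4.0]

# scaling_factor is the multiplicative inverse of the latents' RMS.
scaling_factor = 1.0 / rms(latents)

# Before the diffusion model: z = z * scaling_factor
scaled = [z * scaling_factor for z in latents]

# Before decoding: z = 1 / scaling_factor * z
restored = [1.0 / scaling_factor * z for z in scaled]

print(round(rms(scaled), 6))  # 1.0
```

After scaling, the latents have an RMS of one, and applying the inverse formula recovers the original values up to floating-point error.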

wrapper

( *args **kwargs )

disable_slicing

( )

Disable sliced AE decoding. If enable_slicing was previously enabled, this method goes back to computing decoding in one step.

disable_tiling

( )

Disable tiled AE decoding. If enable_tiling was previously enabled, this method goes back to computing decoding in one step.

enable_slicing

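( )

Enable sliced AE decoding. When this option is enabled, the AE splits the input tensor into slices and computes decoding in several steps. This saves some memory and allows larger batch sizes.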
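
The idea behind sliced decoding can be sketched without any tensor library. The example below assumes that slicing happens along the batch dimension, one sample per slice; `decode_in_slices` and `toy_decode` are hypothetical stand-ins for illustration, not Diffusers functions.

```python
def decode_in_slices(latents, decode_fn):
    """Decode a batch one sample at a time, then re-assemble the batch.

    Peak memory is bounded by a single-sample decode instead of the full batch.
    """
    outputs = []
    for sample in latents:
        # Each call sees a batch of size one.
        outputs.extend(decode_fn([sample]))
    return outputs


def toy_decode(batch):
    """Hypothetical decoder: doubles every value of every sample."""
    return [[2 * v for v in sample] for sample in batch]


batch = [[1, 2], [3, 4], [5, 6]]

# Sliced and one-step decoding give the same result.
print(decode_in_slices(batch, toy_decode) == toy_decode(batch))  # True
```

Because every sample is decoded independently, the output is unchanged; only the peak memory and the number of decoder calls differ.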

enable_tiling

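( tile_sample_min_height: typing.Optional[int] = None tile_sample_min_width: typing.Optional[int] = None tile_sample_stride_height: typing.Optional[float] = None tile_sample_stride_width: typing.Optional[float] = None )

Parameters

tile_sample_min_height (int, optional) — The minimum height required for a sample to be separated into tiles across the height dimension.
tile_sample_min_width (int, optional) — The minimum width required for a sample to be separated into tiles across the width dimension.
tile_sample_stride_height (int, optional) — The minimum amount of overlap between two consecutive vertical tiles. This ensures that no tiling artifacts are produced across the height dimension.
tile_sample_stride_width (int, optional) — The stride between two consecutive horizontal tiles. This ensures that no tiling artifacts are produced across the width dimension.

Enable tiled AE decoding. When this option is enabled, the AE splits the input tensor into tiles and computes decoding and encoding in several steps. This saves a large amount of memory and allows larger images to be processed.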
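
How the minimum tile size and the stride interact can be illustrated along a single dimension. The sketch below assumes tiles start every `stride` pixels and span `tile_min` pixels, so neighbouring tiles overlap by `tile_min - stride`; `tile_windows` is a hypothetical helper and the numbers are illustrative, not values read from the library.

```python
def tile_windows(size: int, tile_min: int, stride: int) -> list[tuple[int, int]]:
    """Return (start, stop) index pairs that cover [0, size) along one dimension."""
    if size <= tile_min:
        # The sample is too small to be worth tiling: process it in one step.
        return [(0, size)]
    return [(start, min(start + tile_min, size)) for start in range(0, size, stride)]


windows = tile_windows(size=1024, tile_min=512, stride=448)
print(windows)   # [(0, 512), (448, 960), (896, 1024)]
print(512 - 448)  # 64 pixels shared by neighbouring tiles
```

The shared region between neighbouring tiles is what allows their outputs to be blended, which is how overlapping tiles avoid visible seams.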

DecoderOutput

class diffusers.models.autoencoders.vae.DecoderOutput

( sample: Tensor commit_loss: typing.Optional[torch.FloatTensor] = None )

Parameters

sample (torch.Tensor of shape (batch_size, num_channels, height, width)) — The decoded output sample from the last layer of the model.

Output of the decoding method.
