
AutoencoderKL

The variational autoencoder (VAE) model with KL loss was introduced in Auto-Encoding Variational Bayes by Diederik P. Kingma and Max Welling. The model is used in 🤗 Diffusers to encode images into latent representations and to decode latent representations back into images.

The abstract from the paper is:

How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
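
The snippet below is a minimal sketch of a typical encode/decode round trip. The checkpoint name is only an example and the image tensor is a placeholder; in practice you would pass a preprocessed image normalized to roughly [-1, 1].

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")  # example checkpoint
image = torch.randn(1, 3, 512, 512)  # placeholder for a preprocessed image tensor

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # (1, 4, 64, 64) latent representation
    reconstruction = vae.decode(latents).sample       # back to (1, 3, 512, 512)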

Loading from the original format

By default, AutoencoderKL should be loaded with from_pretrained(), but it can also be loaded from the original format using FromOriginalModelMixin.from_single_file as follows:

from diffusers import AutoencoderKL

url = "https://huggingface.co/stabilityai/sd-vae-ft-mse-original/blob/main/vae-ft-mse-840000-ema-pruned.safetensors"  # can also be a local file
model = AutoencoderKL.from_single_file(url)

AutoencoderKL

class diffusers.AutoencoderKL


( in_channels: int = 3 out_channels: int = 3 down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',) up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',) block_out_channels: typing.Tuple[int] = (64,) layers_per_block: int = 1 act_fn: str = 'silu' latent_channels: int = 4 norm_num_groups: int = 32 sample_size: int = 32 scaling_factor: float = 0.18215 shift_factor: typing.Optional[float] = None latents_mean: typing.Optional[typing.Tuple[float]] = None latents_std: typing.Optional[typing.Tuple[float]] = None force_upcast: float = True use_quant_conv: bool = True use_post_quant_conv: bool = True mid_block_add_attention: bool = True )

Parameters

  • in_channels (int, optional, defaults to 3) — Number of channels in the input image.
  • out_channels (int, optional, defaults to 3) — Number of channels in the output.
  • down_block_types (Tuple[str], optional, defaults to ("DownEncoderBlock2D",)) — Tuple of downsample block types.
  • up_block_types (Tuple[str], optional, defaults to ("UpDecoderBlock2D",)) — Tuple of upsample block types.
  • block_out_channels (Tuple[int], optional, defaults to (64,)) — Tuple of block output channels.
  • act_fn (str, optional, defaults to "silu") — The activation function to use.
  • latent_channels (int, optional, defaults to 4) — Number of channels in the latent space.
  • sample_size (int, optional, defaults to 32) — Sample input size.
  • scaling_factor (float, optional, defaults to 0.18215) — The component-wise standard deviation of the trained latent space, computed using the first batch of the training set. It is used to scale the latent space to unit variance when training the diffusion model. The latents are scaled with the formula z = z * scaling_factor before being passed to the diffusion model; when decoding, they are scaled back to the original scale with z = 1 / scaling_factor * z. For more details, refer to sections 4.3.2 and D.1 of the High-Resolution Image Synthesis with Latent Diffusion Models paper (see the sketch after this parameter list).
  • force_upcast (bool, optional, defaults to True) — If enabled, forces the VAE to run in float32 for high image resolution pipelines such as SD-XL. The VAE can be fine-tuned / trained to a lower range without losing too much precision, in which case force_upcast can be set to False - see: https://huggingface.co/madebyollin/sdxl-vae-fp16-fix
  • mid_block_add_attention (bool, optional, defaults to True) — If enabled, the mid_block of the Encoder and Decoder will have attention blocks. If set to False, the mid_block will only have resnet blocks.
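
As a rough illustration of how scaling_factor is typically applied around a diffusion model (a sketch only, reusing the vae and image placeholders from the example above; the diffusion model itself is elided):

latents = vae.encode(image).latent_dist.sample()
latents = latents * vae.config.scaling_factor  # scale latents to roughly unit variance
# ... latents are processed by the diffusion model (e.g. a UNet) here ...
image_out = vae.decode(latents / vae.config.scaling_factor).sample  # undo the scaling before decoding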

A VAE model with KL loss for encoding images into latents and decoding latent representations into images.

This model inherits from ModelMixin. Check the superclass documentation for its generic methods implemented for all models (such as downloading or saving).


disable_slicing


( )

Disable sliced VAE decoding. If enable_slicing was previously enabled, this method will go back to computing decoding in one step.

disable_tiling


( )

Disable tiled VAE decoding. If enable_tiling was previously enabled, this method will go back to computing decoding in one step.

enable_slicing


( )

Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.

enable_tiling


( use_tiling: bool = True )

Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow processing larger images.
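
A minimal sketch of how slicing and tiling are usually enabled to reduce memory use; the checkpoint name and latent shape are example values only.

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.enable_slicing()  # decode batch elements one at a time
vae.enable_tiling()   # split large inputs into overlapping tiles

latents = torch.randn(2, 4, 128, 128)  # example latents for two 1024x1024 images
with torch.no_grad():
    images = vae.decode(latents).sample

# call vae.disable_slicing() / vae.disable_tiling() to return to one-step decoding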

forward


( sample: Tensor sample_posterior: bool = False return_dict: bool = True generator: typing.Optional[torch._C.Generator] = None )

Parameters

  • sample (torch.Tensor) — Input sample.
  • sample_posterior (bool, optional, defaults to False) — Whether to sample from the posterior.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a DecoderOutput instead of a plain tuple.

fuse_qkv_projections


( )

Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query, key, value) are fused. For cross-attention modules, key and value projection matrices are fused.

This API is 🧪 experimental.
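
A brief sketch of toggling the experimental fused projections, reusing a vae instance loaded as in the examples above:

vae.fuse_qkv_projections()    # fuse query/key/value projections in the attention layers
# ... run inference ...
vae.unfuse_qkv_projections()  # restore the original, unfused projections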

set_attn_processor


( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.JointAttnProcessor2_0, diffusers.models.attention_processor.PAGJointAttnProcessor2_0, diffusers.models.attention_processor.PAGCFGJointAttnProcessor2_0, diffusers.models.attention_processor.FusedJointAttnProcessor2_0, diffusers.models.attention_processor.AllegroAttnProcessor2_0, diffusers.models.attention_processor.AuraFlowAttnProcessor2_0, diffusers.models.attention_processor.FusedAuraFlowAttnProcessor2_0, diffusers.models.attention_processor.FluxAttnProcessor2_0, diffusers.models.attention_processor.FluxAttnProcessor2_0_NPU, diffusers.models.attention_processor.FusedFluxAttnProcessor2_0, diffusers.models.attention_processor.FusedFluxAttnProcessor2_0_NPU, diffusers.models.attention_processor.CogVideoXAttnProcessor2_0, diffusers.models.attention_processor.FusedCogVideoXAttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.XLAFlashAttnProcessor2_0, diffusers.models.attention_processor.AttnProcessorNPU, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.MochiVaeAttnProcessor2_0, diffusers.models.attention_processor.MochiAttnProcessor2_0, diffusers.models.attention_processor.StableAudioAttnProcessor2_0, diffusers.models.attention_processor.HunyuanAttnProcessor2_0, diffusers.models.attention_processor.FusedHunyuanAttnProcessor2_0, diffusers.models.attention_processor.PAGHunyuanAttnProcessor2_0, diffusers.models.attention_processor.PAGCFGHunyuanAttnProcessor2_0, diffusers.models.attention_processor.LuminaAttnProcessor2_0, diffusers.models.attention_processor.FusedAttnProcessor2_0, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor2_0, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.SanaLinearAttnProcessor2_0, diffusers.models.attention_processor.PAGCFGSanaLinearAttnProcessor2_0, diffusers.models.attention_processor.PAGIdentitySanaLinearAttnProcessor2_0, diffusers.models.attention_processor.SanaMultiscaleLinearAttention, diffusers.models.attention_processor.SanaMultiscaleAttnProcessor2_0, diffusers.models.attention_processor.SanaMultiscaleAttentionProjection, diffusers.models.attention_processor.IPAdapterAttnProcessor, diffusers.models.attention_processor.IPAdapterAttnProcessor2_0, diffusers.models.attention_processor.IPAdapterXFormersAttnProcessor, diffusers.models.attention_processor.SD3IPAdapterJointAttnProcessor2_0, diffusers.models.attention_processor.PAGIdentitySelfAttnProcessor2_0, diffusers.models.attention_processor.PAGCFGIdentitySelfAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor]]] )

Parameters

  • processor (dict of AttentionProcessor or only AttentionProcessor) — The instantiated processor class or a dictionary of processor classes that will be set as the processor for all Attention layers.

    If processor is a dict, the key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors.

Sets the attention processor to use to compute attention.
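
A minimal sketch of swapping the attention processor; AttnProcessor2_0 is one of the processor classes listed in the signature above, and the checkpoint name is an example.

from diffusers import AutoencoderKL
from diffusers.models.attention_processor import AttnProcessor2_0

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.set_attn_processor(AttnProcessor2_0())  # use the same processor for every attention layer
print(vae.attn_processors)                  # inspect the processors currently in use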

set_default_attn_processor


( )

Disables custom attention processors and sets the default attention implementation.

tiled_decode


( z: Tensor return_dict: bool = True ) ~models.vae.DecoderOutput or tuple

Parameters

  • z (torch.Tensor) — Input batch of latent vectors.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a ~models.vae.DecoderOutput instead of a plain tuple.

Returns

~models.vae.DecoderOutput or tuple

If return_dict is True, a ~models.vae.DecoderOutput is returned, otherwise a plain tuple is returned.

Decode a batch of images using a tiled decoder.

tiled_encode


( x: Tensor return_dict: bool = True ) ~models.autoencoder_kl.AutoencoderKLOutput or tuple

Parameters

  • x (torch.Tensor) — Input batch of images.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a ~models.autoencoder_kl.AutoencoderKLOutput instead of a plain tuple.

Returns

~models.autoencoder_kl.AutoencoderKLOutput or tuple

If return_dict is True, a ~models.autoencoder_kl.AutoencoderKLOutput is returned, otherwise a plain tuple is returned.

Encode a batch of images using a tiled encoder.

When this option is enabled, the VAE will split the input tensor into tiles to compute encoding in several steps. This is useful to keep memory use constant regardless of image size. The end result of tiled encoding is different from non-tiled encoding because each tile uses a different encoder. To avoid tiling artifacts, the tiles overlap and are blended together to form a smooth output. You may still see tile-sized changes in the output, but they should be much less noticeable.

unfuse_qkv_projections


( )

Disable fused QKV projections if they are enabled.

This API is 🧪 experimental.

AutoencoderKLOutput

class diffusers.models.modeling_outputs.AutoencoderKLOutput


( latent_dist: DiagonalGaussianDistribution )

Parameters

  • latent_dist (DiagonalGaussianDistribution) — Encoded outputs of the Encoder represented as the mean and logvar of a DiagonalGaussianDistribution. DiagonalGaussianDistribution allows for sampling latents from the distribution.

Output of AutoencoderKL encoding method.
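
A short sketch of working with the returned latent_dist, reusing the vae and image placeholders from the first example:

output = vae.encode(image)       # AutoencoderKLOutput when return_dict=True
posterior = output.latent_dist   # a DiagonalGaussianDistribution
z = posterior.sample()           # stochastic sample from the posterior
z_det = posterior.mode()         # deterministic mode (the mean), often used at inference time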

DecoderOutput

class diffusers.models.autoencoders.vae.DecoderOutput


( sample: Tensor commit_loss: typing.Optional[torch.FloatTensor] = None )

Parameters

  • sample (torch.Tensor of shape (batch_size, num_channels, height, width)) — The decoded output sample from the last layer of the model.

Output of decoding method.

FlaxAutoencoderKL

class diffusers.FlaxAutoencoderKL


( in_channels: int = 3 out_channels: int = 3 down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',) up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',) block_out_channels: typing.Tuple[int] = (64,) layers_per_block: int = 1 act_fn: str = 'silu' latent_channels: int = 4 norm_num_groups: int = 32 sample_size: int = 32 scaling_factor: float = 0.18215 dtype: dtype = <class 'jax.numpy.float32'> parent: typing.Union[flax.linen.module.Module, flax.core.scope.Scope, flax.linen.module._Sentinel, NoneType] = <flax.linen.module._Sentinel object at 0x7f484d8851e0> name: typing.Optional[str] = None )

Parameters

  • in_channels (int, optional, defaults to 3) — Number of channels in the input image.
  • out_channels (int, optional, defaults to 3) — Number of channels in the output.
  • down_block_types (Tuple[str], optional, defaults to (DownEncoderBlock2D)) — Tuple of downsample block types.
  • up_block_types (Tuple[str], optional, defaults to (UpDecoderBlock2D)) — Tuple of upsample block types.
  • block_out_channels (Tuple[str], optional, defaults to (64,)) — Tuple of block output channels.
  • layers_per_block (int, optional, defaults to 2) — Number of ResNet layers for each block.
  • act_fn (str, optional, defaults to silu) — The activation function to use.
  • latent_channels (int, optional, defaults to 4) — Number of channels in the latent space.
  • norm_num_groups (int, optional, defaults to 32) — The number of groups for normalization.
  • sample_size (int, optional, defaults to 32) — Sample input size.
  • scaling_factor (float, optional, defaults to 0.18215) — The component-wise standard deviation of the trained latent space, computed using the first batch of the training set. It is used to scale the latent space to unit variance when training the diffusion model. The latents are scaled with the formula z = z * scaling_factor before being passed to the diffusion model; when decoding, they are scaled back to the original scale with z = 1 / scaling_factor * z. For more details, refer to sections 4.3.2 and D.1 of the High-Resolution Image Synthesis with Latent Diffusion Models paper.
  • dtype (jnp.dtype, optional, defaults to jnp.float32) — The dtype of the parameters.

Flax implementation of a VAE model with KL loss for decoding latent representations.

This model inherits from FlaxModelMixin. Check the superclass documentation for its generic methods implemented for all models (such as downloading or saving).

This model is a Flax Linen flax.linen.Module subclass. Use it as a regular Flax Linen module and refer to the Flax documentation for all matters related to its general usage and behavior.

Inherent JAX features such as the following are supported:

  • Just-In-Time (JIT) compilation
  • Automatic Differentiation
  • Vectorization
  • Parallelization
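
A rough sketch of how the Flax VAE is commonly used. The checkpoint name is a placeholder assumption (any repository providing compatible VAE weights would do), and the apply pattern follows the usual Flax convention of passing parameters explicitly.

import jax.numpy as jnp
from diffusers import FlaxAutoencoderKL

# from_pretrained returns the module and its parameters; from_pt=True converts PyTorch weights.
vae, vae_params = FlaxAutoencoderKL.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="vae", from_pt=True
)

image = jnp.zeros((1, 3, 512, 512))  # placeholder batch of preprocessed images
latent_dist = vae.apply({"params": vae_params}, image, method=vae.encode).latent_dist
latents = latent_dist.mode()  # deterministic latents
reconstruction = vae.apply({"params": vae_params}, latents, method=vae.decode).sample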

FlaxAutoencoderKLOutput

class diffusers.models.vae_flax.FlaxAutoencoderKLOutput


( latent_dist: FlaxDiagonalGaussianDistribution )

Parameters

  • latent_dist (FlaxDiagonalGaussianDistribution) — Encoded outputs of the Encoder represented as the mean and logvar of a FlaxDiagonalGaussianDistribution. FlaxDiagonalGaussianDistribution allows for sampling latents from the distribution.

Output of AutoencoderKL encoding method.

replace


( **updates )

Returns a new object replacing the specified fields with new values.

FlaxDecoderOutput

class diffusers.models.vae_flax.FlaxDecoderOutput


( sample: Array )

Parameters

  • sample (jnp.ndarray of shape (batch_size, num_channels, height, width)) — The decoded output sample from the last layer of the model.
  • dtype (jnp.dtype, optional, defaults to jnp.float32) — The dtype of the parameters.

Output of decoding method.

replace


( **updates )

Returns a new object replacing the specified fields with new values.
