AutoencoderKL
The variational autoencoder (VAE) model with KL loss was introduced in Auto-Encoding Variational Bayes by Diederik P. Kingma and Max Welling. The model is used in 🤗 Diffusers to encode images into latent representations and to decode latent representations back into images.
The abstract from the paper is:
How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
Loading from the original format
By default, AutoencoderKL should be loaded with from_pretrained(), but it can also be loaded from the original format using FromOriginalModelMixin.from_single_file as follows:
from diffusers import AutoencoderKL
url = "https://huggingface.co/stabilityai/sd-vae-ft-mse-original/blob/main/vae-ft-mse-840000-ema-pruned.safetensors" # can also be a local file
model = AutoencoderKL.from_single_file(url)
AutoencoderKL
class diffusers.AutoencoderKL
< source >( in_channels: int = 3 out_channels: int = 3 down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',) up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',) block_out_channels: typing.Tuple[int] = (64,) layers_per_block: int = 1 act_fn: str = 'silu' latent_channels: int = 4 norm_num_groups: int = 32 sample_size: int = 32 scaling_factor: float = 0.18215 shift_factor: typing.Optional[float] = None latents_mean: typing.Optional[typing.Tuple[float]] = None latents_std: typing.Optional[typing.Tuple[float]] = None force_upcast: float = True use_quant_conv: bool = True use_post_quant_conv: bool = True mid_block_add_attention: bool = True )
Parameters
- in_channels (int, optional, defaults to 3) — Number of channels in the input image.
- out_channels (int, optional, defaults to 3) — Number of channels in the output.
- down_block_types (Tuple[str], optional, defaults to ("DownEncoderBlock2D",)) — Tuple of downsample block types.
- up_block_types (Tuple[str], optional, defaults to ("UpDecoderBlock2D",)) — Tuple of upsample block types.
- block_out_channels (Tuple[int], optional, defaults to (64,)) — Tuple of block output channels.
- act_fn (str, optional, defaults to "silu") — The activation function to use.
- latent_channels (int, optional, defaults to 4) — Number of channels in the latent space.
- sample_size (int, optional, defaults to 32) — Sample input size.
- scaling_factor (float, optional, defaults to 0.18215) — The component-wise standard deviation of the trained latent space, computed using the first batch of the training set. It is used to scale the latent space to unit variance when training the diffusion model. The latents are scaled with the formula z = z * scaling_factor before being passed to the diffusion model; when decoding, they are scaled back to the original scale with z = 1 / scaling_factor * z (see the encode/decode sketch after the class description below). For more details, refer to sections 4.3.2 and D.1 of the High-Resolution Image Synthesis with Latent Diffusion Models paper.
- force_upcast (bool, optional, defaults to True) — If enabled, forces the VAE to run in float32 for high-resolution pipelines such as SD-XL. The VAE can be fine-tuned / trained to a lower range without losing too much precision, in which case force_upcast can be set to False (see: https://huggingface.co/madebyollin/sdxl-vae-fp16-fix).
- mid_block_add_attention (bool, optional, defaults to True) — If enabled, the mid_block of the Encoder and Decoder will have attention blocks. If set to False, the mid_block will only have resnet blocks.
A VAE model with KL loss for encoding images into latents and decoding latent representations into images.
This model inherits from ModelMixin. Check the superclass documentation for its generic methods implemented for all models (such as downloading or saving).
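The scaling described for scaling_factor can be exercised directly on the model. The following is a minimal sketch rather than canonical pipeline code; the checkpoint name and the random tensor standing in for a preprocessed image are assumptions for illustration.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
image = torch.randn(1, 3, 512, 512)  # stand-in for a preprocessed image in [-1, 1]

with torch.no_grad():
    # encode() returns an AutoencoderKLOutput holding a DiagonalGaussianDistribution
    posterior = vae.encode(image).latent_dist
    latents = posterior.sample() * vae.config.scaling_factor  # z = z * scaling_factor
    # scale back before decoding: z = 1 / scaling_factor * z
    reconstruction = vae.decode(latents / vae.config.scaling_factor).sample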
disable_slicing
Disable sliced VAE decoding. If enable_slicing was previously enabled, this method will go back to computing decoding in one step.
disable_tiling
Disable tiled VAE decoding. If enable_tiling was previously enabled, this method will go back to computing decoding in one step.
enable_slicing
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
enable_tiling
Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow processing larger images.
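A minimal sketch of these memory-saving switches, assuming vae is the AutoencoderKL instance loaded in the sketch above:
vae.enable_slicing()   # decode the batch one image at a time to save memory
vae.enable_tiling()    # split large inputs into overlapping tiles for encode/decode

# ... run memory-heavy decoding here ...

vae.disable_slicing()  # return to single-step decoding
vae.disable_tiling()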
forward
< source >( sample: Tensor sample_posterior: bool = False return_dict: bool = True generator: typing.Optional[torch._C.Generator] = None )
fuse_qkv_projections
Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query, key, value) are fused. For cross-attention modules, key and value projection matrices are fused.
This API is 🧪 experimental.
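A short sketch of toggling the fusion, assuming vae as above; unfuse_qkv_projections is assumed here to be the matching call that undoes it:
vae.fuse_qkv_projections()    # fuse q/k/v (and k/v for cross-attention) projections
# ... run inference ...
vae.unfuse_qkv_projections()  # assumed counterpart that restores the unfused projections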
set_attn_processor
< source >( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.JointAttnProcessor2_0, diffusers.models.attention_processor.PAGJointAttnProcessor2_0, diffusers.models.attention_processor.PAGCFGJointAttnProcessor2_0, diffusers.models.attention_processor.FusedJointAttnProcessor2_0, diffusers.models.attention_processor.AllegroAttnProcessor2_0, diffusers.models.attention_processor.AuraFlowAttnProcessor2_0, diffusers.models.attention_processor.FusedAuraFlowAttnProcessor2_0, diffusers.models.attention_processor.FluxAttnProcessor2_0, diffusers.models.attention_processor.FluxAttnProcessor2_0_NPU, diffusers.models.attention_processor.FusedFluxAttnProcessor2_0, diffusers.models.attention_processor.FusedFluxAttnProcessor2_0_NPU, diffusers.models.attention_processor.CogVideoXAttnProcessor2_0, diffusers.models.attention_processor.FusedCogVideoXAttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.XLAFlashAttnProcessor2_0, diffusers.models.attention_processor.AttnProcessorNPU, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.MochiVaeAttnProcessor2_0, diffusers.models.attention_processor.MochiAttnProcessor2_0, diffusers.models.attention_processor.StableAudioAttnProcessor2_0, diffusers.models.attention_processor.HunyuanAttnProcessor2_0, diffusers.models.attention_processor.FusedHunyuanAttnProcessor2_0, diffusers.models.attention_processor.PAGHunyuanAttnProcessor2_0, diffusers.models.attention_processor.PAGCFGHunyuanAttnProcessor2_0, diffusers.models.attention_processor.LuminaAttnProcessor2_0, diffusers.models.attention_processor.FusedAttnProcessor2_0, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor2_0, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.SanaLinearAttnProcessor2_0, diffusers.models.attention_processor.PAGCFGSanaLinearAttnProcessor2_0, diffusers.models.attention_processor.PAGIdentitySanaLinearAttnProcessor2_0, diffusers.models.attention_processor.SanaMultiscaleLinearAttention, diffusers.models.attention_processor.SanaMultiscaleAttnProcessor2_0, diffusers.models.attention_processor.SanaMultiscaleAttentionProjection, diffusers.models.attention_processor.IPAdapterAttnProcessor, diffusers.models.attention_processor.IPAdapterAttnProcessor2_0, diffusers.models.attention_processor.IPAdapterXFormersAttnProcessor, diffusers.models.attention_processor.SD3IPAdapterJointAttnProcessor2_0, diffusers.models.attention_processor.PAGIdentitySelfAttnProcessor2_0, diffusers.models.attention_processor.PAGCFGIdentitySelfAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor]]] )
Parameters
- processor (dict of AttentionProcessor or only AttentionProcessor) — The instantiated processor class or a dictionary of processor classes that will be set as the processor for all Attention layers. If processor is a dict, the key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors.
Sets the attention processor to use to compute attention.
set_default_attn_processor
Disables custom attention processors and sets the default attention implementation.
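A minimal sketch of both calls, assuming vae as above; AttnProcessor2_0 is used only as an illustrative processor choice:
from diffusers.models.attention_processor import AttnProcessor2_0

vae.set_attn_processor(AttnProcessor2_0())  # same processor instance for all Attention layers
vae.set_default_attn_processor()            # restore the library's default implementation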
tiled_decode
< source >( z: Tensor return_dict: bool = True ) → ~models.vae.DecoderOutput or tuple
Decode a batch of images using a tiled decoder.
tiled_encode
< source >( x: Tensor return_dict: bool = True ) → ~models.autoencoder_kl.AutoencoderKLOutput or tuple
Encode a batch of images using a tiled encoder.
When this option is enabled, the VAE will split the input tensor into tiles to compute encoding in several steps. This is useful to keep memory use constant regardless of image size. The end result of tiled encoding is different from non-tiled encoding because each tile uses a different encoder. To avoid tiling artifacts, the tiles overlap and are blended together to form a smooth output. You may still see tile-sized changes in the output, but they should be much less noticeable.
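A sketch of calling the tiled codecs directly on a large input, assuming vae as above; the 1024x1024 size is arbitrary, and in normal use enable_tiling() is assumed to route encode()/decode() to these methods for large inputs:
big_image = torch.randn(1, 3, 1024, 1024)  # arbitrary large input

with torch.no_grad():
    enc = vae.tiled_encode(big_image)   # AutoencoderKLOutput, per the signature above
    z = enc.latent_dist.sample()
    rec = vae.tiled_decode(z).sample    # DecoderOutput.sample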
AutoencoderKLOutput
class diffusers.models.modeling_outputs.AutoencoderKLOutput
< source >( latent_dist: DiagonalGaussianDistribution )
Output of AutoencoderKL encoding method.
DecoderOutput
class diffusers.models.autoencoders.vae.DecoderOutput
< source >( sample: Tensor commit_loss: typing.Optional[torch.FloatTensor] = None )
Output of decoding method.
FlaxAutoencoderKL
class diffusers.FlaxAutoencoderKL
< source >( in_channels: int = 3 out_channels: int = 3 down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',) up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',) block_out_channels: typing.Tuple[int] = (64,) layers_per_block: int = 1 act_fn: str = 'silu' latent_channels: int = 4 norm_num_groups: int = 32 sample_size: int = 32 scaling_factor: float = 0.18215 dtype: dtype = <class 'jax.numpy.float32'> parent: typing.Union[flax.linen.module.Module, flax.core.scope.Scope, flax.linen.module._Sentinel, NoneType] = <flax.linen.module._Sentinel object at 0x7f484d8851e0> name: typing.Optional[str] = None )
Parameters
- in_channels (int, optional, defaults to 3) — Number of channels in the input image.
- out_channels (int, optional, defaults to 3) — Number of channels in the output.
- down_block_types (Tuple[str], optional, defaults to (DownEncoderBlock2D)) — Tuple of downsample block types.
- up_block_types (Tuple[str], optional, defaults to (UpDecoderBlock2D)) — Tuple of upsample block types.
- block_out_channels (Tuple[int], optional, defaults to (64,)) — Tuple of block output channels.
- layers_per_block (int, optional, defaults to 1) — Number of ResNet layers for each block.
- act_fn (str, optional, defaults to silu) — The activation function to use.
- latent_channels (int, optional, defaults to 4) — Number of channels in the latent space.
- norm_num_groups (int, optional, defaults to 32) — The number of groups for normalization.
- sample_size (int, optional, defaults to 32) — Sample input size.
- scaling_factor (float, optional, defaults to 0.18215) — The component-wise standard deviation of the trained latent space, computed using the first batch of the training set. It is used to scale the latent space to unit variance when training the diffusion model. The latents are scaled with the formula z = z * scaling_factor before being passed to the diffusion model; when decoding, they are scaled back to the original scale with z = 1 / scaling_factor * z. For more details, refer to sections 4.3.2 and D.1 of the High-Resolution Image Synthesis with Latent Diffusion Models paper.
- dtype (jnp.dtype, optional, defaults to jnp.float32) — The dtype of the parameters.
Flax implementation of a VAE model with KL loss for decoding latent representations.
This model inherits from FlaxModelMixin. Check the superclass documentation for its generic methods implemented for all models (such as downloading or saving).
This model is a Flax Linen flax.linen.Module subclass. Use it as a regular Flax Linen module and refer to the Flax documentation for all matters related to its general usage and behavior.
Inherent JAX features such as the following are supported:
- Just-In-Time (JIT) compilation
- Automatic Differentiation
- Vectorization
- Parallelization
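A minimal sketch of loading the Flax VAE; the checkpoint name and the presence of Flax weights in its vae subfolder are assumptions for illustration:
import jax.numpy as jnp
from diffusers import FlaxAutoencoderKL

# from_pretrained for Flax models returns the module and its parameters separately
vae, vae_params = FlaxAutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae", dtype=jnp.float32
)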
FlaxAutoencoderKLOutput
class diffusers.models.vae_flax.FlaxAutoencoderKLOutput
< source >( latent_dist: FlaxDiagonalGaussianDistribution )
Output of AutoencoderKL encoding method.
replace
Returns a new object replacing the specified fields with new values.
FlaxDecoderOutput
class diffusers.models.vae_flax.FlaxDecoderOutput
< source >( sample: Array )
Output of decoding method.
replace
Returns a new object replacing the specified fields with new values.