Diffusers documentation

AutoencoderKLCosmos


Cosmos Tokenizers.

Supported models:

nvidia/Cosmos-1.0-Tokenizer-CV8x8x8

The model can be loaded with the following code snippet.

from diffusers import AutoencoderKLCosmos

vae = AutoencoderKLCosmos.from_pretrained("nvidia/Cosmos-1.0-Tokenizer-CV8x8x8", subfolder="vae")

AutoencoderKLCosmos

class diffusers.AutoencoderKLCosmos

( in_channels: int = 3 out_channels: int = 3 latent_channels: int = 16 encoder_block_out_channels: typing.Tuple[int, ...] = (128, 256, 512, 512) decode_block_out_channels: typing.Tuple[int, ...] = (256, 512, 512, 512) attention_resolutions: typing.Tuple[int, ...] = (32,) resolution: int = 1024 num_layers: int = 2 patch_size: int = 4 patch_type: str = 'haar' scaling_factor: float = 1.0 spatial_compression_ratio: int = 8 temporal_compression_ratio: int = 8 latents_mean: typing.Optional[typing.List[float]] = [0.11362758, -0.0171717, 0.03071163, 0.02046862, 0.01931456, 0.02138567, 0.01999342, 0.02189187, 0.02011935, 0.01872694, 0.02168613, 0.02207148, 0.01986941, 0.01770413, 0.02067643, 0.02028245, 0.19125476, 0.04556972, 0.0595558, 0.05315534, 0.05496629, 0.05356264, 0.04856596, 0.05327453, 0.05410472, 0.05597149, 0.05524866, 0.05181874, 0.05071663, 0.05204537, 0.0564108, 0.05518042, 0.01306714, 0.03341161, 0.03847246, 0.02810185, 0.02790166, 0.02920026, 0.02823597, 0.02631033, 0.0278531, 0.02880507, 0.02977769, 0.03145441, 0.02888389, 0.03280773, 0.03484927, 0.03049198, -0.00197727, 0.07534957, 0.04963879, 0.05530893, 0.05410828, 0.05252541, 0.05029899, 0.05321025, 0.05149245, 0.0511921, 0.04643495, 0.04604527, 0.04631618, 0.04404101, 0.04403536, 0.04499495, -0.02994183, -0.04787003, -0.01064558, -0.01779824, -0.01490502, -0.02157517, -0.0204778, -0.02180816, -0.01945375, -0.02062863, -0.02192209, -0.02520639, -0.02246656, -0.02427533, -0.02683363, -0.02762006, 0.08019473, -0.13005368, -0.07568636, -0.06082374, -0.06036175, -0.05875364, -0.05921887, -0.05869788, -0.05273941, -0.052565, -0.05346428, -0.05456541, -0.053657, -0.05656897, -0.05728589, -0.05321847, 0.16718403, -0.00390146, 0.0379406, 0.0356561, 0.03554131, 0.03924074, 0.03873615, 0.04187329, 0.04226924, 0.04378717, 0.04684274, 0.05117614, 0.04547792, 0.05251586, 0.05048339, 0.04950784, 0.09564418, 0.0547128, 0.08183969, 0.07978633, 0.08076023, 0.08108605, 0.08011818, 0.07965573, 0.08187773, 0.08350263, 
0.08101469, 0.0786941, 0.0774442, 0.07724521, 0.07830418, 0.07599796, -0.04987567, 0.05923908, -0.01058746, -0.01177603, -0.01116162, -0.01364149, -0.01546014, -0.0117213, -0.01780043, -0.01648314, -0.02100247, -0.02104417, -0.02482123, -0.02611689, -0.02561143, -0.02597336, -0.05364667, 0.08211684, 0.04686937, 0.04605641, 0.04304186, 0.0397355, 0.03686767, 0.04087112, 0.03704741, 0.03706401, 0.03120073, 0.03349091, 0.03319963, 0.03205781, 0.03195127, 0.03180481, 0.16427967, -0.11048453, -0.04595276, -0.04982893, -0.05213465, -0.04809378, -0.05080318, -0.04992863, -0.04493337, -0.0467619, -0.04884703, -0.04627892, -0.04913311, -0.04955709, -0.04533982, -0.04570218, -0.10612928, -0.05121198, -0.06761009, -0.07251801, -0.07265285, -0.07417855, -0.07202412, -0.07499027, -0.07625481, -0.07535747, -0.07638787, -0.07920305, -0.07596069, -0.07959418, -0.08265036, -0.07955471, -0.16888915, 0.0753242, 0.04062594, 0.03375093, 0.03337452, 0.03699376, 0.03651138, 0.03611023, 0.03555622, 0.03378554, 0.0300498, 0.03395559, 0.02941847, 0.03156432, 0.03431173, 0.03016853, -0.03415358, -0.01699573, -0.04029295, -0.04912157, -0.0498858, -0.04917918, -0.04918056, -0.0525189, -0.05325506, -0.05341973, -0.04983329, -0.04883146, -0.04985548, -0.04736718, -0.0462027, -0.04836091, 0.02055675, 0.03419799, -0.02907669, -0.04350509, -0.04156144, -0.04234421, -0.04446109, -0.04461774, -0.04882839, -0.04822346, -0.04502493, -0.0506244, -0.05146913, -0.04655267, -0.04862994, -0.04841615, 0.20312774, -0.07208502, -0.03635615, -0.03556088, -0.04246174, -0.04195838, -0.04293778, -0.04071276, -0.04240569, -0.04125213, -0.04395144, -0.03959096, -0.04044993, -0.04015875, -0.04088107, -0.03885176] latents_std: typing.Optional[typing.List[float]] = [0.56700271, 0.65488982, 0.65589428, 0.66524369, 0.66619784, 0.6666382, 0.6720838, 0.66955978, 0.66928875, 0.67108786, 0.67092526, 0.67397463, 0.67894882, 0.67668313, 0.67769569, 0.67479557, 0.85245121, 0.8688373, 0.87348086, 0.88459337, 0.89135885, 
0.8910504, 0.89714909, 0.89947474, 0.90201765, 0.90411824, 0.90692616, 0.90847772, 0.90648711, 0.91006982, 0.91033435, 0.90541548, 0.84960359, 0.85863352, 0.86895317, 0.88460612, 0.89245003, 0.89451706, 0.89931005, 0.90647358, 0.90338236, 0.90510076, 0.91008312, 0.90961218, 0.9123717, 0.91313171, 0.91435546, 0.91565102, 0.91877103, 0.85155135, 0.857804, 0.86998034, 0.87365264, 0.88161767, 0.88151032, 0.88758916, 0.89015514, 0.89245576, 0.89276224, 0.89450496, 0.90054202, 0.89994133, 0.90136105, 0.90114892, 0.77755755, 0.81456852, 0.81911844, 0.83137071, 0.83820474, 0.83890373, 0.84401101, 0.84425181, 0.84739357, 0.84798753, 0.85249585, 0.85114998, 0.85160935, 0.85626358, 0.85677862, 0.85641026, 0.69903517, 0.71697885, 0.71696913, 0.72583169, 0.72931731, 0.73254126, 0.73586977, 0.73734969, 0.73664582, 0.74084908, 0.74399322, 0.74471819, 0.74493188, 0.74824578, 0.75024873, 0.75274801, 0.8187142, 0.82251883, 0.82616025, 0.83164483, 0.84072375, 0.8396467, 0.84143305, 0.84880769, 0.8503468, 0.85196948, 0.85211051, 0.85386664, 0.85410017, 0.85439342, 0.85847849, 0.85385275, 0.67583984, 0.68259847, 0.69198853, 0.69928843, 0.70194328, 0.70467001, 0.70755547, 0.70917857, 0.71007699, 0.70963502, 0.71064079, 0.71027333, 0.71291167, 0.71537536, 0.71902508, 0.71604162, 0.72450989, 0.71979928, 0.72057378, 0.73035461, 0.73329622, 0.73660028, 0.73891461, 0.74279994, 0.74105692, 0.74002433, 0.74257588, 0.74416119, 0.74543899, 0.74694443, 0.74747062, 0.74586403, 0.90176988, 0.90990674, 0.91106802, 0.92163783, 0.92390233, 0.93056196, 0.93482202, 0.93642414, 0.93858379, 0.94064975, 0.94078934, 0.94325715, 0.94955301, 0.94814706, 0.95144123, 0.94923073, 0.49853548, 0.64968109, 0.6427654, 0.64966393, 0.6487664, 0.65203559, 0.6584242, 0.65351611, 0.65464371, 0.6574859, 0.65626335, 0.66123748, 0.66121179, 0.66077942, 0.66040152, 0.66474909, 0.61986589, 0.69138134, 0.6884557, 0.6955843, 0.69765401, 0.70015347, 0.70529598, 0.70468754, 0.70399523, 0.70479989, 0.70887572, 0.71126866, 
0.7097227, 0.71249932, 0.71231949, 0.71175605, 0.35586974, 0.68723857, 0.68973219, 0.69958478, 0.6943453, 0.6995818, 0.70980215, 0.69899458, 0.70271689, 0.70095056, 0.69912851, 0.70522696, 0.70392174, 0.70916915, 0.70585734, 0.70373541, 0.98101336, 0.89024764, 0.89607251, 0.90678179, 0.91308665, 0.91812348, 0.91980827, 0.92480654, 0.92635667, 0.92887944, 0.93338072, 0.93468094, 0.93619436, 0.93906063, 0.94191772, 0.94471723, 0.83202779, 0.84106231, 0.84463632, 0.85829508, 0.86319661, 0.86751342, 0.86914337, 0.87085921, 0.87286359, 0.87537396, 0.87931138, 0.88054478, 0.8811838, 0.88872558, 0.88942474, 0.88934827, 0.44025335, 0.63061613, 0.63110614, 0.63601959, 0.6395812, 0.64104342, 0.65019929, 0.6502797, 0.64355946, 0.64657205, 0.64847094, 0.64728117, 0.64972943, 0.65162975, 0.65328044, 0.64914775] )

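Parameters

in_channels (int, defaults to 3): Number of input channels.
out_channels (int, defaults to 3): Number of output channels.
latent_channels (int, defaults to 16): Number of latent channels.
encoder_block_out_channels (Tuple[int, ...], defaults to (128, 256, 512, 512)): Number of output channels for each encoder down block.
decode_block_out_channels (Tuple[int, ...], defaults to (256, 512, 512, 512)): Number of output channels for each decoder up block.
attention_resolutions (Tuple[int, ...], defaults to (32,)): List of image/video resolutions at which attention is applied.
resolution (int, defaults to 1024): Base image/video resolution used to determine whether a block should have an attention layer.
num_layers (int, defaults to 2): Number of ResNet blocks in each encoder/decoder block.
patch_size (int, defaults to 4): Patch size used to patchify the input image/video.
patch_type (str, defaults to haar): Patching method applied to the input image/video. Either haar or rearrange.
scaling_factor (float, defaults to 1.0): The component-wise standard deviation of the trained latent space, computed on the first batch of the training set. It is used to scale the latent space to unit variance when training the diffusion model: latents are scaled with z = z * scaling_factor before being passed to the diffusion model, and scaled back to the original scale when decoding with z = 1 / scaling_factor * z. For more details, see sections 4.3.2 and D.1 of the High-Resolution Image Synthesis with Latent Diffusion Models paper. Not applicable to Cosmos, but it defaults to 1.0 for consistency.
spatial_compression_ratio (int, defaults to 8): Spatial compression ratio applied by the VAE. The number of downsampling blocks is derived from it.
temporal_compression_ratio (int, defaults to 8): Temporal compression ratio applied by the VAE. The number of downsampling blocks is derived from it.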

Autoencoder used in Cosmos.
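The two compression ratios determine the shape of the latent tensor. The sketch below computes it for a video laid out as (batch, channels, frames, height, width). It assumes, as a hedge rather than something this page states, that the temporal axis is causal: the first frame is encoded on its own and each further group of temporal_compression_ratio frames becomes one latent frame, so the frame count must be 1 + k * temporal_compression_ratio. cosmos_latent_shape is a hypothetical helper, not part of diffusers.

```python
def cosmos_latent_shape(
    batch: int,
    num_frames: int,
    height: int,
    width: int,
    latent_channels: int = 16,
    spatial_compression_ratio: int = 8,
    temporal_compression_ratio: int = 8,
) -> tuple:
    """Hypothetical helper: latent (B, C, T, H, W) shape for a video input.

    Assumes a causal temporal layout in which the first frame is encoded on its
    own and every further group of `temporal_compression_ratio` frames becomes
    one latent frame, so num_frames must equal 1 + k * temporal_compression_ratio.
    """
    if num_frames < 1 or (num_frames - 1) % temporal_compression_ratio != 0:
        raise ValueError(
            f"num_frames={num_frames} is not of the form 1 + k * {temporal_compression_ratio}"
        )
    if height % spatial_compression_ratio or width % spatial_compression_ratio:
        raise ValueError("height and width must be divisible by spatial_compression_ratio")
    latent_frames = 1 + (num_frames - 1) // temporal_compression_ratio
    return (
        batch,
        latent_channels,
        latent_frames,
        height // spatial_compression_ratio,
        width // spatial_compression_ratio,
    )


# Class defaults listed above: 8x spatial, 8x temporal, 16 latent channels.
print(cosmos_latent_shape(1, 121, 704, 1280))  # (1, 16, 16, 88, 160)
```

With these defaults, 121 input frames at 704x1280 map to 16 latent frames at 88x160.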

wrapper

( *args **kwargs )

wrapper

( *args **kwargs )

disable_slicing

( )

Disable sliced VAE decoding. If enable_slicing was previously enabled, this method goes back to computing decoding in one step.

disable_tiling

( )

Disable tiled VAE decoding. If enable_tiling was previously enabled, this method goes back to computing decoding in one step.

enable_slicing

( )

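Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices and computes decoding in several steps. This saves some memory and allows larger batch sizes.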
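Slicing works because each batch element is decoded independently, so decoding one element at a time gives the same result as decoding the whole batch while keeping peak memory at the level of a batch of size 1. The plain-Python sketch below illustrates that idea; decode_sliced and ToyDecoder are hypothetical stand-ins (the real VAE operates on tensors inside the model), not the diffusers implementation.

```python
from typing import Callable, List

Sample = List[float]  # stand-in for one batch element (a tensor in practice)
Batch = List[Sample]


def decode_sliced(latents: Batch, decode_fn: Callable[[Batch], Batch]) -> Batch:
    """Decode a batch one element at a time and concatenate the results."""
    if len(latents) <= 1:
        return decode_fn(latents)
    decoded: Batch = []
    for i in range(len(latents)):
        decoded.extend(decode_fn(latents[i : i + 1]))  # slice of size 1
    return decoded


class ToyDecoder:
    """Hypothetical per-element decoder that records the largest batch it saw."""

    def __init__(self) -> None:
        self.max_batch_seen = 0

    def __call__(self, batch: Batch) -> Batch:
        self.max_batch_seen = max(self.max_batch_seen, len(batch))
        return [[2.0 * v + 1.0 for v in sample] for sample in batch]


batch = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
sliced_decoder = ToyDecoder()
sliced = decode_sliced(batch, sliced_decoder)
assert sliced == ToyDecoder()(batch)  # same result as decoding in one step
print(sliced_decoder.max_batch_seen)  # 1
```

The trade-off is speed: several small decoder calls are usually slower than one batched call, which is why slicing is opt-in.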

enable_tiling

( tile_sample_min_height: typing.Optional[int] = None tile_sample_min_width: typing.Optional[int] = None tile_sample_min_num_frames: typing.Optional[int] = None tile_sample_stride_height: typing.Optional[float] = None tile_sample_stride_width: typing.Optional[float] = None tile_sample_stride_num_frames: typing.Optional[float] = None )

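Parameters

tile_sample_min_height (int, optional): The minimum height required for a sample to be split into tiles along the height dimension.
tile_sample_min_width (int, optional): The minimum width required for a sample to be split into tiles along the width dimension.
tile_sample_min_num_frames (int, optional): The minimum number of frames required for a sample to be split into tiles along the frame dimension.
tile_sample_stride_height (int, optional): The stride between two consecutive vertical tiles. A stride smaller than the tile height makes consecutive tiles overlap, which prevents tiling artifacts along the height dimension.
tile_sample_stride_width (int, optional): The stride between two consecutive horizontal tiles. A stride smaller than the tile width makes consecutive tiles overlap, which prevents tiling artifacts along the width dimension.
tile_sample_stride_num_frames (int, optional): The stride between two consecutive frame tiles. A stride smaller than the tile length makes consecutive tiles overlap, which prevents tiling artifacts along the frame dimension.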

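Enable tiled VAE decoding. When this option is enabled, the VAE splits the input tensor into tiles and computes encoding and decoding in several steps. This saves a large amount of memory and makes it possible to process larger images.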
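The tile size and stride interact as follows: tiles start stride samples apart, so neighbouring tiles overlap by (tile size minus stride) samples, and that overlap is cross-faded to hide seams. The one-dimensional sketch below illustrates the idea in plain Python. tiled_apply_1d and blend are hypothetical helpers, the linear cross-fade is an assumed blending scheme, and the real VAE applies the idea over height, width and frames on tensors.

```python
from typing import Callable, List

Signal = List[float]


def blend(prev: Signal, cur: Signal, extent: int) -> Signal:
    """Cross-fade the first `extent` samples of `cur` with the tail of `prev`."""
    cur = list(cur)
    extent = min(len(prev), len(cur), extent)
    for x in range(extent):
        w = x / extent  # weight ramps from 0 (all prev) toward 1 (all cur)
        cur[x] = prev[-extent + x] * (1.0 - w) + cur[x] * w
    return cur


def tiled_apply_1d(
    signal: Signal, fn: Callable[[Signal], Signal], tile: int, stride: int
) -> Signal:
    """Apply `fn` to overlapping tiles of `signal` and stitch the results.

    Tiles of length `tile` start every `stride` samples, so neighbouring
    tiles overlap by `tile - stride` samples; that overlap is blended.
    """
    if not 0 < stride <= tile:
        raise ValueError("need 0 < stride <= tile")
    overlap = tile - stride
    tiles = [fn(signal[i : i + tile]) for i in range(0, len(signal), stride)]
    out: Signal = []
    for idx, cur in enumerate(tiles):
        if idx > 0 and overlap > 0:
            cur = blend(tiles[idx - 1], cur, overlap)
        out.extend(cur[:stride])  # keep only the non-redundant part of each tile
    return out


def double(xs: Signal) -> Signal:
    # Pointwise toy "model": tiling should not change its output.
    return [2.0 * v for v in xs]


signal = [float(v) for v in range(20)]
tiled = tiled_apply_1d(signal, double, tile=8, stride=6)  # overlap of 2 samples
```

For a pointwise function the stitched result matches processing the whole signal at once; for a real decoder, whose output near a tile border depends on context, the cross-fade is what hides the seam.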

AutoencoderKLOutput

class diffusers.models.modeling_outputs.AutoencoderKLOutput

( latent_dist: DiagonalGaussianDistribution )

Parameters

latent_dist (DiagonalGaussianDistribution): Encoded outputs of the Encoder, represented as the mean and log-variance of a DiagonalGaussianDistribution, which allows sampling latents from the distribution.

Output of the AutoencoderKL encoding method.
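Sampling from a diagonal Gaussian given its mean and log-variance uses the reparameterization z = mean + exp(0.5 * logvar) * eps with eps drawn from a standard normal, while the mode is simply the mean. The sketch below shows that arithmetic in plain Python on flat lists. ToyDiagonalGaussian is a hypothetical stand-in, not the diffusers DiagonalGaussianDistribution class, and the log-variance clamp range is an assumed safeguard.

```python
import math
import random
from typing import List, Optional


class ToyDiagonalGaussian:
    """Hypothetical stand-in for a diagonal Gaussian over a flat latent vector."""

    def __init__(self, mean: List[float], logvar: List[float]) -> None:
        if len(mean) != len(logvar):
            raise ValueError("mean and logvar must have the same length")
        self.mean = list(mean)
        # Assumed safeguard: clamp log-variance so exp() stays finite.
        self.logvar = [min(max(lv, -30.0), 20.0) for lv in logvar]
        self.std = [math.exp(0.5 * lv) for lv in self.logvar]

    def sample(self, rng: Optional[random.Random] = None) -> List[float]:
        rng = rng or random.Random()
        # Reparameterization: z = mean + std * eps, eps ~ N(0, 1).
        return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(self.mean, self.std)]

    def mode(self) -> List[float]:
        # The most likely latent of a Gaussian is its mean.
        return list(self.mean)


dist = ToyDiagonalGaussian(mean=[0.5, -1.0], logvar=[0.0, -2.0])
z = dist.sample(random.Random(0))  # stochastic latent
z_det = dist.mode()  # deterministic latent
```

Sampling keeps the encoder stochastic during training; taking the mode gives a deterministic latent for inference.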

DecoderOutput

class diffusers.models.autoencoders.vae.DecoderOutput

( sample: Tensor commit_loss: typing.Optional[torch.FloatTensor] = None )

Parameters

sample (torch.Tensor of shape (batch_size, num_channels, height, width)): The decoded output sample from the last layer of the model.

Output of the decoding method.
