Diffusers

Transformer2DModel

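A Transformer model for image-like data from CompVis, based on the Vision Transformer introduced by Dosovitskiy et al. The Transformer2DModel accepts discrete (classes of vector embeddings) or continuous (actual embeddings) inputs.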

When the input is **continuous**:

Project the input and reshape it to `(batch_size, sequence_length, feature_dimension)`.
Apply the Transformer blocks in the standard way.
Reshape to an image.
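
The three continuous-input steps above can be sketched in plain PyTorch. This is a minimal illustration, not the library's implementation: the 1x1 convolutions used for projection, the stock `nn.TransformerEncoderLayer` standing in for the Transformer block, and all sizes are assumptions chosen for the example.

```python
import torch
from torch import nn

batch_size, channels, height, width = 2, 32, 8, 8
inner_dim = 64  # hypothetical feature dimension (heads * head_dim)

x = torch.randn(batch_size, channels, height, width)

# 1. Project the channels, then flatten the spatial grid into a sequence.
proj_in = nn.Conv2d(channels, inner_dim, kernel_size=1)
tokens = proj_in(x)  # (batch, inner_dim, height, width)
tokens = tokens.permute(0, 2, 3, 1).reshape(batch_size, height * width, inner_dim)

# 2. Apply a Transformer block (a stock PyTorch encoder layer stands in here).
block = nn.TransformerEncoderLayer(d_model=inner_dim, nhead=4, batch_first=True)
tokens = block(tokens)  # (batch, height * width, inner_dim)

# 3. Reshape the sequence back to an image and project to the input channels.
proj_out = nn.Conv2d(inner_dim, channels, kernel_size=1)
image = tokens.reshape(batch_size, height, width, inner_dim).permute(0, 3, 1, 2)
image = proj_out(image)  # (batch, channels, height, width)
```

Here `sequence_length` is `height * width`, so every spatial position becomes one token.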

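When the input is **discrete**:

It is assumed that one of the input classes is the masked latent pixel. The predicted classes of the unnoised image do not include a prediction for the masked pixel, because the unnoised image cannot be masked.

Convert the input (classes of latent pixels) to embeddings and apply positional embeddings.
Apply the Transformer blocks in the standard way.
Predict the classes of the unnoised image.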
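
The discrete-input steps above can be sketched in the same spirit. This is a simplified illustration under stated assumptions: a single learned position table, a stock `nn.TransformerEncoderLayer` as the Transformer block, and the small sizes below are all hypothetical choices, not the library's exact layers.

```python
import torch
from torch import nn
import torch.nn.functional as F

batch_size, height, width = 2, 4, 4
num_latent_pixels = height * width
num_vector_embeds = 10  # hypothetical class count, including the masked class
inner_dim = 32

# One class index per latent pixel.
latent_classes = torch.randint(0, num_vector_embeds, (batch_size, num_latent_pixels))

# 1. Embed the classes and add learned position embeddings.
class_embed = nn.Embedding(num_vector_embeds, inner_dim)
pos_embed = nn.Embedding(num_latent_pixels, inner_dim)
positions = torch.arange(num_latent_pixels)
tokens = class_embed(latent_classes) + pos_embed(positions)  # (batch, pixels, inner_dim)

# 2. Apply a Transformer block (a stock PyTorch encoder layer stands in here).
block = nn.TransformerEncoderLayer(d_model=inner_dim, nhead=4, batch_first=True)
tokens = block(tokens)

# 3. Predict classes for the unnoised image. The masked class is excluded,
#    so there are num_vector_embeds - 1 output classes.
to_logits = nn.Linear(inner_dim, num_vector_embeds - 1)
logits = to_logits(tokens).permute(0, 2, 1)  # (batch, classes, pixels)
log_probs = F.log_softmax(logits, dim=1)     # normalized over the class axis
```

The output has one fewer class than the input vocabulary, which matches the `num_vector_embeds - 1` dimension documented for the discrete output.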

Transformer2DModel

class diffusers.Transformer2DModel

( num_attention_heads: int = 16 attention_head_dim: int = 88 in_channels: typing.Optional[int] = None out_channels: typing.Optional[int] = None num_layers: int = 1 dropout: float = 0.0 norm_num_groups: int = 32 cross_attention_dim: typing.Optional[int] = None attention_bias: bool = False sample_size: typing.Optional[int] = None num_vector_embeds: typing.Optional[int] = None patch_size: typing.Optional[int] = None activation_fn: str = 'geglu' num_embeds_ada_norm: typing.Optional[int] = None use_linear_projection: bool = False only_cross_attention: bool = False double_self_attention: bool = False upcast_attention: bool = False norm_type: str = 'layer_norm' norm_elementwise_affine: bool = True norm_eps: float = 1e-05 attention_type: str = 'default' caption_channels: int = None interpolation_scale: float = None use_additional_conditions: typing.Optional[bool] = None )

Parameters

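num_attention_heads (int, optional, defaults to 16) — The number of heads to use for multi-head attention.
attention_head_dim (int, optional, defaults to 88) — The number of channels in each head.
in_channels (int, optional) — The number of channels in the input and output (specify if the input is **continuous**).
num_layers (int, optional, defaults to 1) — The number of Transformer block layers to use.
dropout (float, optional, defaults to 0.0) — The dropout probability to use.
cross_attention_dim (int, optional) — The number of `encoder_hidden_states` dimensions to use.
sample_size (int, optional) — The width of the latent images (specify if the input is **discrete**). This is fixed during training because it determines how many position embeddings are learned.
num_vector_embeds (int, optional) — The number of classes of the vector embeddings of the latent pixels (specify if the input is **discrete**). Includes the class for the masked latent pixel.
activation_fn (str, optional, defaults to "geglu") — The activation function to use in the feed-forward network.
num_embeds_ada_norm (int, optional) — The number of diffusion steps used during training. Pass this if at least one of the normalization layers is `AdaLayerNorm`. This is fixed during training because it determines how many embeddings, which are added to the hidden states, are learned. During inference, you can denoise for up to, but not more than, `num_embeds_ada_norm` steps.
attention_bias (bool, optional) — Configures whether the attention in the `TransformerBlocks` should contain a bias parameter.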

A 2D Transformer model for image-like data.

forward

( hidden_states: Tensor encoder_hidden_states: typing.Optional[torch.Tensor] = None timestep: typing.Optional[torch.LongTensor] = None added_cond_kwargs: typing.Dict[str, torch.Tensor] = None class_labels: typing.Optional[torch.LongTensor] = None cross_attention_kwargs: typing.Dict[str, typing.Any] = None attention_mask: typing.Optional[torch.Tensor] = None encoder_attention_mask: typing.Optional[torch.Tensor] = None return_dict: bool = True )

Parameters

hidden_states (torch.LongTensor of shape `(batch size, num latent pixels)` if discrete, torch.Tensor of shape `(batch size, channel, height, width)` if continuous) — The input `hidden_states`.
encoder_hidden_states (torch.Tensor of shape `(batch size, sequence len, embed dims)`, optional) — Conditional embeddings for the cross-attention layer. If not given, cross-attention defaults to self-attention.
timestep (torch.LongTensor, optional) — Used to indicate the denoising step. Optional timestep to be applied as an embedding in `AdaLayerNorm`.
class_labels (torch.LongTensor of shape `(batch size, num classes)`, optional) — Used to indicate class-label conditioning. Optional class labels to be applied as an embedding in `AdaLayerNormZero`.
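cross_attention_kwargs (Dict[str, Any], optional) — A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined under `self.processor` in diffusers.models.attention_processor.
attention_mask (torch.Tensor, optional) — An attention mask of shape `(batch, key_tokens)` applied to `encoder_hidden_states`. A value of `1` keeps the token and a value of `0` discards it. The mask is converted to a bias, which adds large negative values to the attention scores of the "discard" tokens.
encoder_attention_mask (torch.Tensor, optional) — Cross-attention mask applied to `encoder_hidden_states`. Two formats are supported:
- Mask `(batch, sequence_length)`: True = keep, False = discard.
- Bias `(batch, 1, sequence_length)`: 0 = keep, -10000 = discard.
If `ndim == 2`, the input is interpreted as a mask and then converted to a bias consistent with the format above. This bias is added to the cross-attention scores.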
return_dict (bool, optional, defaults to True) — Whether to return a Transformer2DModelOutput instead of a plain tuple.

The Transformer2DModel forward method.
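
The mask-to-bias conversion described for `encoder_attention_mask` can be illustrated with a few lines of PyTorch. This is a sketch of the documented behavior (0 = keep, -10000 = discard, with a singleton dimension added for broadcasting). The tensor values and the all-zero `scores` are made up for the example.

```python
import torch

# Boolean mask of shape (batch, sequence_length): True = keep, False = discard.
mask = torch.tensor([[True, True, False],
                     [True, False, False]])

# Convert to an additive bias: 0 for kept tokens, -10000 for discarded ones.
bias = (1 - mask.to(torch.float32)) * -10000.0  # (batch, sequence_length)
# Add a singleton dimension so the bias broadcasts over every query token.
bias = bias.unsqueeze(1)                        # (batch, 1, sequence_length)

# Adding the bias before the softmax pushes discarded tokens to ~0 weight.
scores = torch.zeros(2, 4, 3)                   # (batch, query_tokens, key_tokens)
weights = torch.softmax(scores + bias, dim=-1)
```

In the first batch element, the two kept tokens share the attention weight equally and the discarded third token receives effectively none.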

Transformer2DModelOutput

class diffusers.models.modeling_outputs.Transformer2DModelOutput

( sample: torch.Tensor )

Parameters

sample (torch.Tensor of shape `(batch_size, num_channels, height, width)`, or of shape `(batch size, num_vector_embeds - 1, num_latent_pixels)` if Transformer2DModel is discrete) — The hidden states output conditioned on the `encoder_hidden_states` input. If discrete, returns probability distributions for the unnoised latent pixels.

The output of Transformer2DModel.
