Diffusers documentation
PriorTransformer
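The Prior Transformer was originally introduced in Hierarchical Text-Conditional Image Generation with CLIP Latents by Ramesh et al. It predicts CLIP image embeddings from CLIP text embeddings, and the image embeddings are predicted through a denoising diffusion process.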
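The abstract from the paper is:
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.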
PriorTransformer
class diffusers.PriorTransformer
( num_attention_heads: int = 32, attention_head_dim: int = 64, num_layers: int = 20, embedding_dim: int = 768, num_embeddings = 77, additional_embeddings = 4, dropout: float = 0.0, time_embed_act_fn: str = 'silu', norm_in_type: typing.Optional[str] = None, embedding_proj_norm_type: typing.Optional[str] = None, encoder_hid_proj_type: typing.Optional[str] = 'linear', added_emb_type: typing.Optional[str] = 'prd', time_embed_dim: typing.Optional[int] = None, embedding_proj_dim: typing.Optional[int] = None, clip_embed_dim: typing.Optional[int] = None )
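Parameters
- num_attention_heads (int, optional, defaults to 32): The number of heads to use for multi-head attention.
- attention_head_dim (int, optional, defaults to 64): The number of channels in each head.
- num_layers (int, optional, defaults to 20): The number of Transformer blocks to use.
- embedding_dim (int, optional, defaults to 768): The dimension of the model input hidden_states.
- num_embeddings (int, optional, defaults to 77): The number of token embeddings in the text conditioning input, i.e. encoder_hidden_states has shape (batch_size, num_embeddings, embedding_dim).
- additional_embeddings (int, optional, defaults to 4): The number of additional tokens appended to the projected hidden_states. The actual length of the hidden_states sequence used is num_embeddings + additional_embeddings.
- dropout (float, optional, defaults to 0.0): The dropout probability to use.
- time_embed_act_fn (str, optional, defaults to 'silu'): The activation function used to create the timestep embeddings.
- norm_in_type (str, optional, defaults to None): The normalization layer to apply to the hidden states before they are passed to the Transformer blocks. Set it to None if no normalization is needed.
- embedding_proj_norm_type (str, optional, defaults to None): The normalization layer to apply to the input proj_embedding. Set it to None if no normalization is needed.
- encoder_hid_proj_type (str, optional, defaults to 'linear'): The projection layer to apply to the input encoder_hidden_states. Set it to None if encoder_hidden_states is None.
- added_emb_type (str, optional, defaults to 'prd'): An additional embedding used to condition the model. Choose from 'prd' or None. With 'prd', the model adds a token indicating the (quantized) dot product between the text embedding and the image embedding, as proposed in the unCLIP paper https://arxiv.org/abs/2204.06125. With None, no additional embedding is added.
- time_embed_dim (int, optional, defaults to None): The dimension of the timestep embeddings. If None, it is set to num_attention_heads * attention_head_dim.
- embedding_proj_dim (int, optional, defaults to None): The dimension of proj_embedding. If None, it is set to embedding_dim.
- clip_embed_dim (int, optional, defaults to None): The dimension of the output. If None, it is set to embedding_dim.
A Prior Transformer model.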
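Several dimensions above fall back to other arguments when left as None. The helper below is hypothetical (resolve_prior_dims and ResolvedPriorDims are not part of diffusers, and inner_dim is just an illustrative name for num_attention_heads * attention_head_dim); it applies the documented fallback rules in plain Python so you can check what a configuration resolves to before building a model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResolvedPriorDims:
    """Hypothetical container for the dimensions a PriorTransformer config resolves to."""
    inner_dim: int        # num_attention_heads * attention_head_dim
    time_embed_dim: int
    embedding_proj_dim: int
    clip_embed_dim: int
    sequence_length: int  # num_embeddings + additional_embeddings


def resolve_prior_dims(
    num_attention_heads: int = 32,
    attention_head_dim: int = 64,
    embedding_dim: int = 768,
    num_embeddings: int = 77,
    additional_embeddings: int = 4,
    time_embed_dim: Optional[int] = None,
    embedding_proj_dim: Optional[int] = None,
    clip_embed_dim: Optional[int] = None,
) -> ResolvedPriorDims:
    """Hypothetical helper applying the documented fallbacks for None-valued dimensions."""
    inner_dim = num_attention_heads * attention_head_dim
    return ResolvedPriorDims(
        inner_dim=inner_dim,
        # time_embed_dim falls back to num_attention_heads * attention_head_dim
        time_embed_dim=inner_dim if time_embed_dim is None else time_embed_dim,
        # embedding_proj_dim and clip_embed_dim fall back to embedding_dim
        embedding_proj_dim=embedding_dim if embedding_proj_dim is None else embedding_proj_dim,
        clip_embed_dim=embedding_dim if clip_embed_dim is None else clip_embed_dim,
        # length of the token sequence the Transformer blocks operate on
        sequence_length=num_embeddings + additional_embeddings,
    )


defaults = resolve_prior_dims()
print(defaults)
# ResolvedPriorDims(inner_dim=2048, time_embed_dim=2048, embedding_proj_dim=768, clip_embed_dim=768, sequence_length=81)
```

With the defaults, the Transformer blocks work on 77 + 4 = 81 tokens of width 32 * 64 = 2048, while the input and output embeddings stay 768-dimensional. In the current diffusers source, the four additional tokens (with added_emb_type='prd') appear to be the projected proj_embedding, the timestep embedding, the projected hidden_states, and the prd token; check the source of your installed version before relying on that layout.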
forward
( hidden_states, timestep: typing.Union[torch.Tensor, float, int], proj_embedding: torch.Tensor, encoder_hidden_states: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.BoolTensor] = None, return_dict: bool = True ) → PriorTransformerOutput or tuple
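Parameters
- hidden_states (torch.Tensor of shape (batch_size, embedding_dim)): The currently predicted image embeddings.
- timestep (torch.LongTensor, float, or int): The current denoising step.
- proj_embedding (torch.Tensor of shape (batch_size, embedding_dim)): The projected embedding vector the denoising process is conditioned on.
- encoder_hidden_states (torch.Tensor of shape (batch_size, num_embeddings, embedding_dim)): Hidden states of the text embeddings the denoising process is conditioned on.
- attention_mask (torch.BoolTensor of shape (batch_size, num_embeddings)): Text mask for the text embeddings.
- return_dict (bool, optional, defaults to True): Whether to return a PriorTransformerOutput instead of a plain tuple.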
Returns
PriorTransformerOutput or tuple
If return_dict is True, a PriorTransformerOutput is returned; otherwise a tuple is returned whose first element is the sample tensor.
The PriorTransformer forward method.
set_attn_processor
( processor: typing.Union[AttentionProcessor, typing.Dict[str, AttentionProcessor]] )
where AttentionProcessor is any one of the following classes from diffusers.models.attention_processor: AttnProcessor, CustomDiffusionAttnProcessor, AttnAddedKVProcessor, AttnAddedKVProcessor2_0, JointAttnProcessor2_0, PAGJointAttnProcessor2_0, PAGCFGJointAttnProcessor2_0, FusedJointAttnProcessor2_0, AllegroAttnProcessor2_0, AuraFlowAttnProcessor2_0, FusedAuraFlowAttnProcessor2_0, FluxAttnProcessor2_0, FluxAttnProcessor2_0_NPU, FusedFluxAttnProcessor2_0, FusedFluxAttnProcessor2_0_NPU, CogVideoXAttnProcessor2_0, FusedCogVideoXAttnProcessor2_0, XFormersAttnAddedKVProcessor, XFormersAttnProcessor, XLAFlashAttnProcessor2_0, AttnProcessorNPU, AttnProcessor2_0, MochiVaeAttnProcessor2_0, MochiAttnProcessor2_0, StableAudioAttnProcessor2_0, HunyuanAttnProcessor2_0, FusedHunyuanAttnProcessor2_0, PAGHunyuanAttnProcessor2_0, PAGCFGHunyuanAttnProcessor2_0, LuminaAttnProcessor2_0, FusedAttnProcessor2_0, CustomDiffusionXFormersAttnProcessor, CustomDiffusionAttnProcessor2_0, SlicedAttnProcessor, SlicedAttnAddedKVProcessor, SanaLinearAttnProcessor2_0, PAGCFGSanaLinearAttnProcessor2_0, PAGIdentitySanaLinearAttnProcessor2_0, SanaMultiscaleLinearAttention, SanaMultiscaleAttnProcessor2_0, SanaMultiscaleAttentionProjection, IPAdapterAttnProcessor, IPAdapterAttnProcessor2_0, IPAdapterXFormersAttnProcessor, SD3IPAdapterJointAttnProcessor2_0, PAGIdentitySelfAttnProcessor2_0, PAGCFGIdentitySelfAttnProcessor2_0, LoRAAttnProcessor, LoRAAttnProcessor2_0, LoRAXFormersAttnProcessor, LoRAAttnAddedKVProcessor.
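Sets the attention processor to use to compute attention.
Disables custom attention processors and sets the default attention implementation.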
PriorTransformerOutput
class diffusers.models.transformers.prior_transformer.PriorTransformerOutput
( predicted_image_embedding: torch.Tensor )
The output of PriorTransformer.