SanaPipeline

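SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers, from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han.

The abstract from the paper is:

We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on a laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with a modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.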
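To make the effect of the 32× autoencoder concrete, the following sketch counts latent spatial positions for a square image under 8× and 32× compression. It is plain arithmetic under the assumption that the compression factor applies to each spatial side; the helper name `latent_positions` is hypothetical, and the actual token count also depends on the transformer's patch size, which the abstract does not state.

```python
def latent_positions(image_size: int, compression: int) -> int:
    """Number of spatial positions in the latent of a square image."""
    side = image_size // compression
    return side * side


for size in (1024, 4096):
    ae8 = latent_positions(size, 8)
    ae32 = latent_positions(size, 32)
    # Going from 8x to 32x per side shrinks the grid by (32 / 8) ** 2 = 16.
    print(f"{size}x{size}: 8x AE -> {ae8}, 32x AE -> {ae32} ({ae8 // ae32}x fewer)")
```

Under these assumptions the latent grid is 16 times smaller at any resolution, which is where the reduction in latent tokens comes from.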

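Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

This pipeline was contributed by lawrence-cj and chenjy2003. The original codebase can be found here. The original weights can be found under hf.co/Efficient-Large-Model.

Available models

Model	Recommended dtype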
`Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers`	`torch.bfloat16`
`Efficient-Large-Model/Sana_1600M_1024px_diffusers`	`torch.float16`
`Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers`	`torch.float16`
`Efficient-Large-Model/Sana_1600M_512px_diffusers`	`torch.float16`
`Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers`	`torch.float16`
`Efficient-Large-Model/Sana_600M_1024px_diffusers`	`torch.float16`
`Efficient-Large-Model/Sana_600M_512px_diffusers`	`torch.float16`

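Refer to this collection for more information.

Note: The recommended dtype applies to the transformer weights. The text encoder and VAE weights must stay in torch.bfloat16 or torch.float32 for the model to work correctly. Refer to the inference example below to see how to load the model with the recommended dtype.

Make sure to pass the variant argument for downloaded checkpoints to use less disk space. Set it to "fp16" for models whose recommended dtype is torch.float16, and to "bf16" for models whose recommended dtype is torch.bfloat16. By default, torch.float32 weights are downloaded, which take up twice as much disk space. Additionally, torch.float32 weights can be downcast on the fly by specifying the torch_dtype argument. Read more about this in the documentation.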
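As a small illustration of the variant rule, the sketch below picks the variant string for a few of the models listed in the table. The dictionary names and the `variant_for` helper are hypothetical conveniences, not part of Diffusers; only the model IDs, dtypes, and variant strings come from this page.

```python
# Recommended transformer dtypes for a few models from the table above.
RECOMMENDED_DTYPE = {
    "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers": "bfloat16",
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers": "float16",
    "Efficient-Large-Model/Sana_600M_512px_diffusers": "float16",
}

# Recommended dtype name -> checkpoint variant, as described above.
VARIANT_FOR_DTYPE = {"float16": "fp16", "bfloat16": "bf16"}


def variant_for(model_id: str) -> str:
    """Hypothetical helper: return the variant string for a listed model."""
    return VARIANT_FOR_DTYPE[RECOMMENDED_DTYPE[model_id]]


for model_id in RECOMMENDED_DTYPE:
    print(model_id, "->", variant_for(model_id))

# The result would then be passed when loading, for example:
#   SanaPipeline.from_pretrained(model_id, variant=variant_for(model_id))
```

The dtypes of the text encoder and VAE still need to follow the note above, whichever variant is downloaded.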

SanaPipeline

class diffusers.SanaPipeline

( tokenizer: AutoTokenizer text_encoder: AutoModelForCausalLM vae: AutoencoderDC transformer: SanaTransformer2DModel scheduler: DPMSolverMultistepScheduler )

Pipeline for text-to-image generation using Sana.

__call__

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: str = '' num_inference_steps: int = 20 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 4.5 num_images_per_prompt: typing.Optional[int] = 1 height: int = 1024 width: int = 1024 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True clean_caption: bool = True use_resolution_binning: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 300 complex_human_instruction: typing.List[str] = ["Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. 
Evaluate the level of detail in the user prompt:", '- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.', '- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.', 'Here are examples of how to transform or refine prompts:', '- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.', '- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.', 'Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:', 'User Prompt: '] ) → SanaPipelineOutput or tuple

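Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, prompt_embeds must be passed instead.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_inference_steps (int, optional, defaults to 20) — The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers that support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers that support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
guidance_scale (float, optional, defaults to 4.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
height (int, optional, defaults to 1024) — The height in pixels of the generated image.
width (int, optional, defaults to 1024) — The width in pixels of the generated image.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler and is ignored for other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators used to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for the text embeddings.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. For PixArt-Sigma this negative prompt should be "". If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for the negative text embeddings.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a SanaPipelineOutput instead of a plain tuple.
attention_kwargs (Dict[str, Any], optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
clean_caption (bool, optional, defaults to True) — Whether or not to clean the caption before creating embeddings. Requires beautifulsoup4 and ftfy to be installed. If the dependencies are not installed, the embeddings are created from the raw prompt.
use_resolution_binning (bool, defaults to True) — If set to True, the requested height and width are first mapped to the closest resolutions using ASPECT_RATIO_1024_BIN. After the produced latents are decoded into images, they are resized back to the requested resolution. Useful for generating non-square images.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list are passed as the callback_kwargs argument. Only variables listed in the ._callback_tensor_inputs attribute of the pipeline class can be included.
max_sequence_length (int, defaults to 300) — Maximum sequence length to use with the prompt.
complex_human_instruction (List[str], optional) — Instructions for complex human attention: https://github.com/NVlabs/Sana/blob/main/configs/sana_app_config/Sana_1600M_app.yaml#L55.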
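To illustrate what use_resolution_binning describes, the sketch below maps a requested size to the bin with the closest aspect ratio. The bin table, its values, and the helper name `closest_bin` are illustrative assumptions only; they are not the library's ASPECT_RATIO_1024_BIN table or its implementation.

```python
# Illustrative bins only: aspect ratio (height / width) -> (height, width).
# These values are placeholders, not the library's ASPECT_RATIO_1024_BIN.
EXAMPLE_BINS = {
    "0.5": (704, 1408),
    "1.0": (1024, 1024),
    "2.0": (1408, 704),
}


def closest_bin(height: int, width: int, bins=EXAMPLE_BINS):
    """Return the (height, width) bin whose aspect ratio is nearest."""
    aspect_ratio = height / width
    key = min(bins, key=lambda ratio: abs(float(ratio) - aspect_ratio))
    return bins[key]


print(closest_bin(1000, 1000))  # square request
print(closest_bin(600, 1200))   # wide request
print(closest_bin(1500, 800))   # tall request
```

In this reading, generation runs at the binned size and the decoded image is then resized back to the size that was requested.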

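Returns

SanaPipelineOutput or tuple

If return_dict is True, a SanaPipelineOutput is returned; otherwise a tuple is returned whose first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Example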

>>> import torch
>>> from diffusers import SanaPipeline

>>> pipe = SanaPipeline.from_pretrained(
...     "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers", torch_dtype=torch.float32
... )
>>> pipe.to("cuda")
>>> pipe.text_encoder.to(torch.bfloat16)
>>> pipe.transformer = pipe.transformer.to(torch.bfloat16)

>>> image = pipe(prompt='a cyberpunk cat with a neon sign that says "Sana"')[0]
>>> image[0].save("output.png")
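The callback_on_step_end parameter can be sketched with a small logging callback. The function name `log_latents_callback` is hypothetical, the call at the end uses stand-in values instead of a real pipeline, and returning callback_kwargs is an assumption based on how Diffusers pipelines generally read values back from the callback; this page does not state it.

```python
def log_latents_callback(pipe, step, timestep, callback_kwargs):
    """Print the latents shape at the end of each denoising step."""
    latents = callback_kwargs["latents"]
    # Tensors expose .shape; the stand-in list used below does not.
    shape = getattr(latents, "shape", None)
    print(f"step {step}, timestep {timestep}: latents shape = {shape}")
    # Assumption: the pipeline expects the (possibly modified) dict back.
    return callback_kwargs


# Stand-in invocation so the sketch runs without a model or GPU.
result = log_latents_callback(None, 0, 999, {"latents": [[0.0]]})
```

With a loaded pipeline, it would be passed as `pipe(prompt=..., callback_on_step_end=log_latents_callback, callback_on_step_end_tensor_inputs=["latents"])`.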

encode_prompt

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True negative_prompt: str = '' num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None clean_caption: bool = False max_sequence_length: int = 300 complex_human_instruction: typing.Optional[typing.List[str]] = None lora_scale: typing.Optional[float] = None )

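Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). For PixArt-Alpha, this should be "".
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
num_images_per_prompt (int, optional, defaults to 1) — The number of images that should be generated per prompt.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. For Sana, it should be the embeddings of the "" string.
clean_caption (bool, defaults to False) — If True, the function preprocesses and cleans the provided caption before encoding.
max_sequence_length (int, defaults to 300) — Maximum sequence length to use for the prompt.
complex_human_instruction (list[str], defaults to complex_human_instruction) — If complex_human_instruction is not empty, the function uses the complex human instruction for the prompt.

Encodes the prompt into text encoder hidden states.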
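One way to picture how complex_human_instruction affects the prompt is as a text prefix, as in the sketch below. Joining the instruction lines with newlines and prepending them to each prompt is an assumption made for illustration, and `apply_instruction` is a hypothetical helper, not a Diffusers function.

```python
def apply_instruction(prompts, complex_human_instruction=None):
    """Prefix each prompt with the instruction text, if one is given."""
    if not complex_human_instruction:
        # Empty or missing instruction: prompts are used unchanged.
        return list(prompts)
    # Assumption: instruction lines are joined with newlines.
    prefix = "\n".join(complex_human_instruction)
    return [prefix + prompt for prompt in prompts]


# Last two entries of the default instruction shown in the signature above.
instruction = [
    "Please generate only the enhanced description for the prompt below "
    "and avoid including any additional commentary or evaluations:",
    "User Prompt: ",
]

for text in apply_instruction(["A cat sleeping"], instruction):
    print(text)
```

Because the default instruction ends with 'User Prompt: ', the user's text lands directly after that label.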

SanaPAGPipeline

class diffusers.SanaPAGPipeline

( tokenizer: AutoTokenizer text_encoder: AutoModelForCausalLM vae: AutoencoderDC transformer: SanaTransformer2DModel scheduler: FlowMatchEulerDiscreteScheduler pag_applied_layers: typing.Union[str, typing.List[str]] = 'transformer_blocks.0' )

Pipeline for text-to-image generation using Sana. This pipeline supports the use of Perturbed Attention Guidance (PAG).

__call__

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: str = '' num_inference_steps: int = 20 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 4.5 num_images_per_prompt: typing.Optional[int] = 1 height: int = 1024 width: int = 1024 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True clean_caption: bool = True use_resolution_binning: bool = True callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 300 complex_human_instruction: typing.List[str] = ["Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. 
Evaluate the level of detail in the user prompt:", '- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.', '- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.', 'Here are examples of how to transform or refine prompts:', '- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.', '- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.', 'Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:', 'User Prompt: '] pag_scale: float = 3.0 pag_adaptive_scale: float = 0.0 ) → ImagePipelineOutput or tuple

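Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, prompt_embeds must be passed instead.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_inference_steps (int, optional, defaults to 20) — The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers that support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers that support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
guidance_scale (float, optional, defaults to 4.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
height (int, optional, defaults to 1024) — The height in pixels of the generated image.
width (int, optional, defaults to 1024) — The width in pixels of the generated image.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler and is ignored for other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators used to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for the text embeddings.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. For PixArt-Sigma this negative prompt should be "". If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for the negative text embeddings.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
clean_caption (bool, optional, defaults to True) — Whether or not to clean the caption before creating embeddings. Requires beautifulsoup4 and ftfy to be installed. If the dependencies are not installed, the embeddings are created from the raw prompt.
use_resolution_binning (bool, defaults to True) — If set to True, the requested height and width are first mapped to the closest resolutions using ASPECT_RATIO_1024_BIN. After the produced latents are decoded into images, they are resized back to the requested resolution. Useful for generating non-square images.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list are passed as the callback_kwargs argument. Only variables listed in the ._callback_tensor_inputs attribute of the pipeline class can be included.
max_sequence_length (int, defaults to 300) — Maximum sequence length to use with the prompt.
complex_human_instruction (List[str], optional) — Instructions for complex human attention: https://github.com/NVlabs/Sana/blob/main/configs/sana_app_config/Sana_1600M_app.yaml#L55.
pag_scale (float, optional, defaults to 3.0) — The scale factor for perturbed attention guidance. If set to 0.0, perturbed attention guidance is not used.
pag_adaptive_scale (float, optional, defaults to 0.0) — The adaptive scale factor for perturbed attention guidance. If set to 0.0, pag_scale is used.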
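The roles of guidance_scale and pag_scale can be sketched as a weighted combination of three noise predictions. This follows a commonly used formulation of classifier-free guidance with an added perturbed-attention term; it is an assumption for illustration, not a formula stated on this page, and it does not model pag_adaptive_scale. The function name `combine_predictions` is hypothetical, and plain Python lists stand in for tensors.

```python
def combine_predictions(uncond, text, perturbed, guidance_scale=4.5, pag_scale=3.0):
    """Combine unconditional, text-conditioned, and perturbed predictions."""
    return [
        # CFG term pushes toward the text prediction; the PAG term pushes
        # away from the prediction made with perturbed attention.
        u + guidance_scale * (t - u) + pag_scale * (t - p)
        for u, t, p in zip(uncond, text, perturbed)
    ]


uncond, text, perturbed = [0.0], [1.0], [0.5]
print(combine_predictions(uncond, text, perturbed))            # defaults
print(combine_predictions(uncond, text, perturbed, 4.5, 0.0))  # PAG disabled
print(combine_predictions(uncond, text, perturbed, 1.0, 0.0))  # text prediction only
```

Setting pag_scale to 0.0 removes the perturbed term, which matches the parameter description: perturbed attention guidance is then not used.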

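Returns

ImagePipelineOutput or tuple

If return_dict is True, an ImagePipelineOutput is returned; otherwise a tuple is returned whose first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Example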

>>> import torch
>>> from diffusers import SanaPAGPipeline

>>> pipe = SanaPAGPipeline.from_pretrained(
...     "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
...     pag_applied_layers=["transformer_blocks.8"],
...     torch_dtype=torch.float32,
... )
>>> pipe.to("cuda")
>>> pipe.text_encoder.to(torch.bfloat16)
>>> pipe.transformer = pipe.transformer.to(torch.bfloat16)

>>> image = pipe(prompt='a cyberpunk cat with a neon sign that says "Sana"')[0]
>>> image[0].save("output.png")

encode_prompt

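Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). For PixArt-Alpha, this should be "".
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
num_images_per_prompt (int, optional, defaults to 1) — The number of images that should be generated per prompt.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. For Sana, it should be the embeddings of the "" string.
clean_caption (bool, defaults to False) — If True, the function preprocesses and cleans the provided caption before encoding.
max_sequence_length (int, defaults to 300) — Maximum sequence length to use for the prompt.
complex_human_instruction (list[str], defaults to complex_human_instruction) — If complex_human_instruction is not empty, the function uses the complex human instruction for the prompt.

Encodes the prompt into text encoder hidden states.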

SanaPipelineOutput

class diffusers.pipelines.sana.pipeline_output.SanaPipelineOutput

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )

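Parameters

images (List[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size, or a numpy array of shape (batch_size, height, width, num_channels). The PIL images or numpy array represent the denoised images of the diffusion pipeline.

Output class for Sana pipelines.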
