Stable Diffusion 3

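Stable Diffusion 3 (SD3) was proposed in Scaling Rectified Flow Transformers for High-Resolution Image Synthesis by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Muller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach.

The abstract from the paper is:

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations.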
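The "straight line" between data and noise mentioned in the abstract can be illustrated with a few lines of plain Python. This is only a toy sketch on three-element lists, not the training code from the paper; the function names and the toy values are made up for illustration.

```python
import random

def rectified_flow_point(x0, noise, t):
    """Point on the straight line from data (t=0) to noise (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]

def rectified_flow_velocity(x0, noise):
    """Constant velocity along that line: d z_t / dt = noise - x0."""
    return [b - a for a, b in zip(x0, noise)]

x0 = [0.5, -1.0, 2.0]                       # toy "data" sample
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # Gaussian noise sample

z_half = rectified_flow_point(x0, noise, 0.5)   # halfway between data and noise
velocity = rectified_flow_velocity(x0, noise)   # same for every t on the line
```

Because the path is linear in t, the velocity does not depend on t, which is the conceptual simplicity the abstract refers to.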

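Usage Example

As the model is gated, before using it with diffusers you first need to go to the Stable Diffusion 3 Medium Hugging Face page, fill in the form, and accept the gate. Once you have access, you need to log in so that your system knows you have accepted the gate.

Use the command below to log in:

huggingface-cli login

The SD3 pipeline uses three text encoders to generate an image. Model offloading is necessary for it to run on most commodity hardware. Use the torch.float16 data type for additional memory savings.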

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe.to("cuda")

image = pipe(
    prompt="a photo of a cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    height=1024,
    width=1024,
    guidance_scale=7.0,
).images[0]

image.save("sd3_hello_world.png")

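Note: Stable Diffusion 3.5 can also be run using the SD3 pipeline, and all the optimizations and techniques mentioned here apply to it as well. There are three official models in the SD3 family.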

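Image Prompting with IP-Adapters

An IP-Adapter lets you prompt SD3 with images in addition to the text prompt. This is especially useful when you have a reference image for a complex concept that is difficult to express through text alone. To load and use an IP-Adapter, you need:

image_encoder: Pre-trained vision model used to obtain image features, usually a CLIP image encoder.
feature_extractor: Image processor that prepares the input image for the chosen image_encoder.
ip_adapter_id: Checkpoint containing the parameters of the image cross-attention layers and the image projection.

IP-Adapters are trained for a specific model architecture, so they also work with fine-tuned variants of the base model. You can use the SD3IPAdapterMixin.set_ip_adapter_scale function to adjust how strongly the output aligns with the image prompt. The higher the value, the more closely the model follows the image prompt. A default value of 0.5 is typically a good balance, ensuring the model considers both the text and image prompts equally.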

import torch
from PIL import Image

from diffusers import StableDiffusion3Pipeline
from transformers import SiglipVisionModel, SiglipImageProcessor

image_encoder_id = "google/siglip-so400m-patch14-384"
ip_adapter_id = "InstantX/SD3.5-Large-IP-Adapter"

feature_extractor = SiglipImageProcessor.from_pretrained(
    image_encoder_id,
    torch_dtype=torch.float16
)
image_encoder = SiglipVisionModel.from_pretrained(
    image_encoder_id,
    torch_dtype=torch.float16
).to("cuda")

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.float16,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
).to("cuda")

pipe.load_ip_adapter(ip_adapter_id)
pipe.set_ip_adapter_scale(0.6)

ref_img = Image.open("image.jpg").convert('RGB')

image = pipe(
    width=1024,
    height=1024,
    prompt="a cat",
    negative_prompt="lowres, low quality, worst quality",
    num_inference_steps=24,
    guidance_scale=5.0,
    ip_adapter_image=ref_img
).images[0]

image.save("result.jpg")

[Figure: IP-Adapter examples with the prompt "a cat"]

See the IP-Adapter guide to learn more about how IP-Adapters work.
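To get a feel for the scale described above, it can help to render the same prompt at several values and compare the results. The helper below is a minimal sketch: `sweep_ip_adapter_scale` and its default scale values are hypothetical, and it relies only on `set_ip_adapter_scale` and the `ip_adapter_image` argument shown in the example above.

```python
def sweep_ip_adapter_scale(pipe, ref_img, scales=(0.3, 0.5, 0.7), **call_kwargs):
    """Generate one image per IP-Adapter scale and return {scale: image}.

    `pipe` is expected to be a pipeline with an IP-Adapter already loaded;
    `call_kwargs` are forwarded unchanged to the pipeline call.
    """
    results = {}
    for scale in scales:
        # Higher values make the output follow the reference image more closely.
        pipe.set_ip_adapter_scale(scale)
        results[scale] = pipe(ip_adapter_image=ref_img, **call_kwargs).images[0]
    return results
```

With the `pipe` and `ref_img` objects from the example above, a call such as `sweep_ip_adapter_scale(pipe, ref_img, prompt="a cat", num_inference_steps=24, guidance_scale=5.0)` would return three images keyed by scale.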

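Memory Optimisations for SD3

SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using fp16 precision. The following sections outline a few memory optimizations in Diffusers that make it easier to run SD3 on low-resource hardware.

Running Inference with Model Offloading

The most basic memory optimization available in Diffusers lets you offload the components of the model to the CPU during inference, saving memory at the cost of a slight increase in inference latency. Model offloading moves a model component onto the GPU only when it needs to be executed, while keeping the remaining components on the CPU.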

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="a photo of a cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    height=1024,
    width=1024,
    guidance_scale=7.0,
).images[0]

image.save("sd3_hello_world.png")
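The choice between moving the whole pipeline to the GPU and enabling model offloading can be wrapped in a small helper. This is a sketch under stated assumptions: `place_pipeline` is a hypothetical name, and the 24 GB threshold is simply borrowed from the figure quoted above as a rough heuristic, not a measured requirement.

```python
def place_pipeline(pipe, total_vram_gb, full_gpu_threshold_gb=24.0):
    """Move the whole pipeline to the GPU if VRAM allows, else enable CPU offload.

    Returns the name of the strategy that was applied.
    """
    if total_vram_gb >= full_gpu_threshold_gb:
        pipe.to("cuda")                    # everything resident on the GPU
        return "full_gpu"
    pipe.enable_model_cpu_offload()        # components move to the GPU on demand
    return "model_cpu_offload"
```

The VRAM figure could be obtained with `torch.cuda.get_device_properties(0).total_memory / 1024**3` before calling the helper.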

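Dropping the T5 Text Encoder during Inference

Removing the memory-intensive 4.7B parameter T5-XXL text encoder during inference can significantly decrease the memory requirements for SD3 with only a slight loss in performance.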

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,
    tokenizer_3=None,
    torch_dtype=torch.float16
)
pipe.to("cuda")

image = pipe(
    prompt="a photo of a cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    height=1024,
    width=1024,
    guidance_scale=7.0,
).images[0]

image.save("sd3_hello_world-no-T5.png")

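Using a Quantized Version of the T5 Text Encoder

You can use the bitsandbytes library to load and quantize the T5-XXL text encoder to 8-bit precision. This lets you keep using all three text encoders while only slightly impacting performance.

First install the bitsandbytes library.

pip install bitsandbytes

Then load the T5-XXL model using the BitsAndBytesConfig.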

import torch
from diffusers import StableDiffusion3Pipeline
from transformers import T5EncoderModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_id = "stabilityai/stable-diffusion-3-medium-diffusers"
text_encoder = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_3",
    quantization_config=quantization_config,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    text_encoder_3=text_encoder,
    device_map="balanced",
    torch_dtype=torch.float16
)

image = pipe(
    prompt="a photo of a cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    height=1024,
    width=1024,
    guidance_scale=7.0,
).images[0]

image.save("sd3_hello_world-8bit-T5.png")

You can find the end-to-end script here.
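A back-of-the-envelope calculation shows why quantizing the T5 encoder helps. The sketch below uses the 4.7B parameter count quoted in the previous section and counts weight storage only; activations, quantization metadata, and framework overhead are ignored, so real usage will be somewhat higher.

```python
T5_XXL_ENCODER_PARAMS = 4.7e9  # parameter count quoted earlier on this page

def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(T5_XXL_ENCODER_PARAMS, 16)  # roughly 9.4 GB
int8_gb = weight_memory_gb(T5_XXL_ENCODER_PARAMS, 8)   # roughly 4.7 GB
saved_gb = fp16_gb - int8_gb                           # roughly 4.7 GB saved
```

Dropping the encoder entirely, as in the previous section, removes the full fp16 figure instead of half of it.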

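Performance Optimizations for SD3

Using Torch Compile to Speed Up Inference

Using compiled components in the SD3 pipeline can speed up inference by as much as 4X. The following code snippet demonstrates how to compile the Transformer and VAE components of the SD3 pipeline.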

import torch
from diffusers import StableDiffusion3Pipeline

torch.set_float32_matmul_precision("high")

torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16
).to("cuda")
pipe.set_progress_bar_config(disable=True)

pipe.transformer.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)

pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

# Warm Up
prompt = "a photo of a cat holding a sign that says hello world"
for _ in range(3):
    _ = pipe(prompt=prompt, generator=torch.manual_seed(1))

# Run Inference
image = pipe(prompt=prompt, generator=torch.manual_seed(1)).images[0]
image.save("sd3_hello_world.png")

Check out the full script here.
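Compiled components are slow on their first calls, which is why the snippet above runs warm-up iterations before the real inference. To measure the effect on your own hardware, a small timing helper along these lines can be used; `benchmark` is a hypothetical helper, not part of Diffusers.

```python
import time

def benchmark(fn, warmup=3, runs=5):
    """Call fn() `warmup` times untimed, then return mean seconds over `runs` timed calls."""
    for _ in range(warmup):
        fn()                      # compilation and autotuning happen here
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

For example, `benchmark(lambda: pipe(prompt=prompt, generator=torch.manual_seed(1)))` could be run once with and once without the `torch.compile` lines to compare mean latency.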

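Using Long Prompts with the T5 Text Encoder

By default, the T5 text encoder prompt uses a maximum sequence length of 256. This can be adjusted by setting max_sequence_length to accept fewer or more tokens. Keep in mind that longer sequences require additional resources and result in longer generation times, for example during batch inference.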

prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature’s body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree.  As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight"

image = pipe(
    prompt=prompt,
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
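Before raising max_sequence_length, it can be useful to check whether a prompt actually exceeds the current limit. The helpers below are a minimal sketch with hypothetical names; they assume the tokenizer is callable on a string and returns an object with an `input_ids` attribute, as Hugging Face tokenizers do.

```python
def count_tokens(tokenizer, text):
    """Number of token ids the tokenizer produces for `text` (special tokens included)."""
    return len(tokenizer(text).input_ids)

def exceeds_limit(tokenizer, text, max_sequence_length=256):
    """True if `text` would be truncated at `max_sequence_length` tokens."""
    return count_tokens(tokenizer, text) > max_sequence_length
```

With a loaded pipeline, `exceeds_limit(pipe.tokenizer_3, prompt)` would indicate whether the T5 prompt needs a larger max_sequence_length.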

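Sending a Different Prompt to the T5 Text Encoder

You can send a different prompt to the CLIP text encoders and the T5 text encoder to prevent the prompt from being truncated by the CLIP text encoders and to improve generation.

The prompts for the CLIP text encoders are still truncated to the 77-token limit.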

prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. A river of warm, melted butter, pancake-like foliage in the background, a towering pepper mill standing in for a tree."

prompt_3 = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature’s body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree.  As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight"

image = pipe(
    prompt=prompt,
    prompt_3=prompt_3,
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]

Tiny AutoEncoder for Stable Diffusion 3

Tiny AutoEncoder for Stable Diffusion (TAESD3) is a tiny distilled version of Stable Diffusion 3's VAE by Ollin Boer Bohan that can decode StableDiffusion3Pipeline latents almost instantly.

To use it with Stable Diffusion 3:

import torch
from diffusers import StableDiffusion3Pipeline, AutoencoderTiny

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd3", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("cheesecake.png")

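Loading the Original Checkpoints via from_single_file

The SD3Transformer2DModel and StableDiffusion3Pipeline classes support loading the original checkpoints via the from_single_file method. This method lets you load the original checkpoint files that were used to train the models.

Loading the Original Checkpoints for the SD3Transformer2DModel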

from diffusers import SD3Transformer2DModel

model = SD3Transformer2DModel.from_single_file("https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium.safetensors")

Loading the Single Checkpoint for the StableDiffusion3Pipeline

Loading the Single File Checkpoint without T5

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips.safetensors",
    torch_dtype=torch.float16,
    text_encoder_3=None
)
pipe.enable_model_cpu_offload()

image = pipe("a picture of a cat holding a sign that says hello world").images[0]
image.save('sd3-single-file.png')

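Loading the Single File Checkpoint with T5

The following example loads a checkpoint stored in an 8-bit floating point format, which requires PyTorch 2.3 or later.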

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips_t5xxlfp8.safetensors",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

image = pipe("a picture of a cat holding a sign that says hello world").images[0]
image.save('sd3-single-file-t5-fp8.png')

Loading the Single File Checkpoint for the Stable Diffusion 3.5 Transformer Model

import torch
from diffusers import SD3Transformer2DModel, StableDiffusion3Pipeline

transformer = SD3Transformer2DModel.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-3.5-large/blob/main/sd3.5_large.safetensors",
    torch_dtype=torch.bfloat16,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
image = pipe("a cat holding a sign that says hello world").images[0]
image.save("sd35.png")

StableDiffusion3Pipeline

class diffusers.StableDiffusion3Pipeline

( transformer: SD3Transformer2DModel scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKL text_encoder: CLIPTextModelWithProjection tokenizer: CLIPTokenizer text_encoder_2: CLIPTextModelWithProjection tokenizer_2: CLIPTokenizer text_encoder_3: T5EncoderModel tokenizer_3: T5TokenizerFast image_encoder: PreTrainedModel = None feature_extractor: BaseImageProcessor = None )

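Parameters

transformer (SD3Transformer2DModel) — Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModelWithProjection) — CLIP, specifically the clip-vit-large-patch14 variant, with an additional projection layer that is initialized with a diagonal matrix with hidden_size as its dimension.
text_encoder_2 (CLIPTextModelWithProjection) — CLIP, specifically the laion/CLIP-ViT-bigG-14-laion2B-39B-b160k variant.
text_encoder_3 (T5EncoderModel) — Frozen text encoder. Stable Diffusion 3 uses T5, specifically the t5-v1_1-xxl variant.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
tokenizer_2 (CLIPTokenizer) — Second tokenizer of class CLIPTokenizer.
tokenizer_3 (T5TokenizerFast) — Tokenizer of class T5Tokenizer.
image_encoder (PreTrainedModel, optional) — Pre-trained vision model for the IP-Adapter.
feature_extractor (BaseImageProcessor, optional) — Image processor for the IP-Adapter.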

__call__

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None prompt_3: typing.Union[str, typing.List[str], NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 28 sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 7.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_3: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True joint_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 256 skip_guidance_layers: typing.List[int] = None skip_layer_guidance_scale: float = 2.8 skip_layer_guidance_stop: float = 0.2 skip_layer_guidance_start: float = 0.01 mu: typing.Optional[float] = None ) → ~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput or tuple

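Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, you have to pass prompt_embeds instead.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used instead.
prompt_3 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_3 and text_encoder_3. If not defined, prompt is used instead.
height (int, optional, defaults to self.transformer.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image. This is set to 1024 by default for the best results.
width (int, optional, defaults to self.transformer.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image. This is set to 1024 by default for the best results.
num_inference_steps (int, optional, defaults to 28) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers that support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
guidance_scale (float, optional, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, you have to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used instead.
negative_prompt_3 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_3 and text_encoder_3. If not defined, negative_prompt is used instead.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators to make generation deterministic.
latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the prompt input argument.
negative_pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds are generated from the negative_prompt input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP-Adapters.
ip_adapter_image_embeds (torch.Tensor, optional) — Pre-generated image embeddings for the IP-Adapter. Should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput instead of a plain tuple.
joint_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
max_sequence_length (int, defaults to 256) — Maximum sequence length to use with the prompt.
skip_guidance_layers (List[int], optional) — A list of integers that specify layers to skip during guidance. If not provided, all layers will be used for guidance. If provided, the guidance will only be applied to the layers specified in the list. The value recommended by StabilityAI for Stable Diffusion 3.5 Medium is [7, 8, 9].
skip_layer_guidance_scale (float, optional, defaults to 2.8) — The scale of the guidance for the layers specified in skip_guidance_layers. The guidance will be applied to the layers specified in skip_guidance_layers with a scale of skip_layer_guidance_scale. The guidance will be applied to the rest of the layers with a scale of 1.
skip_layer_guidance_stop (float, optional, defaults to 0.2) — The step at which the guidance for the layers specified in skip_guidance_layers will stop, given as a fraction of the total steps. The guidance will be applied to the layers specified in skip_guidance_layers until the fraction specified in skip_layer_guidance_stop. The value recommended by StabilityAI for Stable Diffusion 3.5 Medium is 0.2.
skip_layer_guidance_start (float, optional, defaults to 0.01) — The step at which the guidance for the layers specified in skip_guidance_layers will start, given as a fraction of the total steps. The guidance will be applied to the layers specified in skip_guidance_layers from the fraction specified in skip_layer_guidance_start. The value recommended by StabilityAI for Stable Diffusion 3.5 Medium is 0.01.
mu (float, optional) — The mu value used for dynamic_shifting.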
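Two of the parameters above are easier to reason about with numbers. The sketch below shows the standard classifier-free guidance combination on toy lists, and which step indices fall inside the skip-layer guidance window for the default values. The helper names are hypothetical, and the strict inequalities at the window boundaries are an assumption about how the fractions are interpreted, not a statement of the pipeline's exact behavior.

```python
def apply_cfg(uncond, text, guidance_scale):
    """Classifier-free guidance: uncond + w * (text - uncond), elementwise."""
    return [u + guidance_scale * (t - u) for u, t in zip(uncond, text)]

def skip_layer_steps(num_inference_steps, start=0.01, stop=0.2):
    """Step indices strictly inside the (start, stop) fraction of the schedule."""
    return [
        i for i in range(num_inference_steps)
        if num_inference_steps * start < i < num_inference_steps * stop
    ]

# Toy predictions: with w = 7.0 the result moves far past the text prediction.
guided = apply_cfg([0.0, 1.0], [1.0, 3.0], guidance_scale=7.0)

# Default window (0.01 to 0.2) over the default 28 steps.
active = skip_layer_steps(28)
```

With guidance_scale equal to 1 the combination reduces to the text-conditioned prediction, which is why guidance is described as enabled only for values above 1.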

Returns: ~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput or tuple

A ~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.
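Code that has to work with both return styles can normalize the result in one place. The function below is a small sketch with a hypothetical name; it relies only on the two shapes described above (an output object with an `images` attribute, or a tuple whose first element is the list of images).

```python
def first_image(result):
    """Return the first generated image from either return style."""
    if isinstance(result, tuple):
        return result[0][0]      # tuple: first element is the list of images
    return result.images[0]      # output object: images live on `.images`
```

For instance, `first_image(pipe(prompt, return_dict=False))` and `first_image(pipe(prompt))` would both yield a single image.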

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import StableDiffusion3Pipeline

>>> pipe = StableDiffusion3Pipeline.from_pretrained(
...     "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
... )
>>> pipe.to("cuda")
>>> prompt = "A cat holding a sign that says hello world"
>>> image = pipe(prompt).images[0]
>>> image.save("sd3.png")
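The callback_on_step_end parameter described above can be used to observe the denoising loop. The factory below is a minimal sketch with hypothetical names; returning the callback_kwargs dictionary unchanged is an assumption based on how Diffusers callbacks are generally consumed, and it leaves the tensors untouched.

```python
def make_step_logger(log):
    """Build a callback that appends (step, timestep, tensor_names) to `log`."""
    def on_step_end(pipe, step, timestep, callback_kwargs):
        # Record which tensors were handed to the callback at this step.
        log.append((step, timestep, sorted(callback_kwargs)))
        # Hand the tensors back unmodified.
        return callback_kwargs
    return on_step_end
```

A call such as `pipe(prompt, callback_on_step_end=make_step_logger(log), callback_on_step_end_tensor_inputs=["latents"])` would then fill `log` with one entry per denoising step.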

encode_image

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] device: device ) → torch.Tensor

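Parameters

image (PipelineImageInput) — Input image to be encoded.
device (torch.device) — Torch device.

Returns: torch.Tensor

The encoded image feature representation.

Encodes the given image into a feature representation using a pre-trained image encoder.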

encode_prompt

( prompt: typing.Union[str, typing.List[str]] prompt_2: typing.Union[str, typing.List[str]] prompt_3: typing.Union[str, typing.List[str]] device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_3: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None clip_skip: typing.Optional[int] = None max_sequence_length: int = 256 lora_scale: typing.Optional[float] = None )

Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used in all text encoders.
prompt_3 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_3 and text_encoder_3. If not defined, prompt is used in all text encoders.
device (torch.device) — Torch device.
num_images_per_prompt (int) — Number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, you have to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in all the text encoders.
negative_prompt_3 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_3 and text_encoder_3. If not defined, negative_prompt is used in all the text encoders.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the prompt input argument.
negative_pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds are generated from the negative_prompt input argument.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.

prepare_ip_adapter_image_embeds

( ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[torch.Tensor] = None device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True )

Parameters

ip_adapter_image (PipelineImageInput, optional) — The input image from which to extract features for the IP-Adapter.
ip_adapter_image_embeds (torch.Tensor, optional) — Precomputed image embeddings.
device (torch.device, optional) — Torch device.
num_images_per_prompt (int, defaults to 1) — Number of images that should be generated per prompt.
do_classifier_free_guidance (bool, defaults to True) — Whether to use classifier-free guidance or not.

Prepares image embeddings for use in the IP-Adapter.

Either ip_adapter_image or ip_adapter_image_embeds must be passed.
