Stable Diffusion XL

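Stable Diffusion XL (SDXL) was proposed in SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach.

The abstract from the paper is:

We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: the increase of model parameters is mainly due to more attention blocks and a larger cross-attention context, as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.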

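Tips

Using SDXL with a DPM++ scheduler for fewer than 50 steps is known to produce visual artifacts because the solver becomes numerically unstable. To fix this issue, take a look at this PR, which recommends the following for ODE/SDE solvers:
- set use_karras_sigmas=True or lu_lambdas=True to improve image quality
- set euler_at_final=True if you're using a solver with uniform step sizes (DPM++2M or DPM++2M SDE)
Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and likely won't work well for default checkpoints like stabilityai/stable-diffusion-xl-base-1.0.
SDXL can pass a different prompt to each of the text encoders it was trained on. We can even pass different parts of the same prompt to the text encoders.
SDXL output images can be improved by making use of a refiner model in an image-to-image setting.
SDXL offers negative_original_size, negative_crops_coords_top_left, and negative_target_size to negatively condition the model on image resolution and cropping parameters.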
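
To make the first tip more concrete, the sketch below computes the noise-level spacing from Karras et al. (2022), which is the kind of schedule that use_karras_sigmas=True is understood to select. It is a minimal illustration in plain Python, not the scheduler's own code: the function name karras_sigmas, the value rho=7.0, and the sigma_min/sigma_max numbers are assumptions chosen for the example.

```python
from typing import List


def karras_sigmas(num_steps: int, sigma_min: float, sigma_max: float, rho: float = 7.0) -> List[float]:
    """Noise levels from sigma_max down to sigma_min, spaced as in Karras et al. (2022)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    max_inv_rho = sigma_max ** (1.0 / rho)
    min_inv_rho = sigma_min ** (1.0 / rho)
    sigmas = []
    for i in range(num_steps):
        ramp = i / (num_steps - 1)  # runs from 0.0 to 1.0
        sigmas.append((max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho)
    return sigmas


# Illustrative values only; a real scheduler derives its range from its beta schedule.
sigmas = karras_sigmas(num_steps=25, sigma_min=0.03, sigma_max=14.6)
```

The schedule takes large steps at high noise and small steps near the end, which is where a low step count tends to hurt most. In a pipeline, these options are normally passed when building the scheduler, for example DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True, euler_at_final=True).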

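To learn how to use SDXL for various tasks, how to optimize performance, and other usage examples, take a look at the Stable Diffusion XL guide.

Check out the Stability AI Hub organization for the official base and refiner model checkpoints!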

StableDiffusionXLPipeline

class diffusers.StableDiffusionXLPipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers image_encoder: CLIPVisionModelWithProjection = None feature_extractor: CLIPImageProcessor = None force_zeros_for_empty_prompt: bool = True add_watermarker: typing.Optional[bool] = None )

Parameters

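vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text encoder. Stable Diffusion XL uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
text_encoder_2 (CLIPTextModelWithProjection) — Second frozen text encoder. Stable Diffusion XL uses the text and pool portion of CLIP, specifically the laion/CLIP-ViT-bigG-14-laion2B-39B-b160k variant.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
tokenizer_2 (CLIPTokenizer) — Second tokenizer of class CLIPTokenizer.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
force_zeros_for_empty_prompt (bool, optional, defaults to True) — Whether the negative prompt embeddings should always be forced to 0. Also see the config of stabilityai/stable-diffusion-xl-base-1.0.
add_watermarker (bool, optional) — Whether to use the invisible_watermark library to watermark output images. If not defined, it defaults to True if the package is installed; otherwise no watermarker is used.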

Pipeline for text-to-image generation using Stable Diffusion XL.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

The pipeline also inherits the following loading methods

load_textual_inversion() for loading textual inversion embeddings
from_single_file() for loading .ckpt files
load_lora_weights() for loading LoRA weights
save_lora_weights() for saving LoRA weights
load_ip_adapter() for loading IP Adapters

__call__

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Optional[typing.Tuple[int, int]] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Optional[typing.Tuple[int, int]] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None 
callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → ~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput or tuple

Parameters

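prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text encoders.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image. This is set to 1024 by default for the best results. Anything below 512 pixels won't work well for stabilityai/stable-diffusion-xl-base-1.0 and checkpoints that are not specifically fine-tuned on low resolutions.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image. This is set to 1024 by default for the best results. Anything below 512 pixels won't work well for stabilityai/stable-diffusion-xl-base-1.0 and checkpoints that are not specifically fine-tuned on low resolutions.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers that support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers that support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
denoising_end (float, optional) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally terminated early. As a result, the returned sample will still retain a substantial amount of noise, as determined by the discrete timesteps selected by the scheduler. The denoising_end parameter should ideally be used when this pipeline forms part of a "Mixture of Denoisers" multi-pipeline setup, as elaborated in Refining the Image Output.
guidance_scale (float, optional, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is 1 or less).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler; ignored for other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from the prompt input argument.
negative_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from the negative_prompt input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list with the same length as the number of IP Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput instead of a plain tuple.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
guidance_rescale (float, optional, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed. guidance_rescale is defined as φ in equation 16 of that paper. The guidance rescale factor should fix overexposure when using zero terminal SNR.
original_size (Tuple[int], optional, defaults to (1024, 1024)) — If original_size is not the same as target_size, the image will appear to be down- or upsampled. original_size defaults to (height, width) if not specified. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — crops_coords_top_left can be used to generate an image that appears to be "cropped" from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
target_size (Tuple[int], optional, defaults to (1024, 1024)) — For most cases, target_size should be set to the desired height and width of the generated image. If not specified, it defaults to (height, width). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
negative_original_size (Tuple[int], optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a specific image resolution. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
negative_crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — To negatively condition the generation process based on specific crop coordinates. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
negative_target_size (Tuple[int], optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a target image resolution. It should be the same as target_size for most cases. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.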
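
The size and crop parameters above (original_size, crops_coords_top_left, target_size, and their negative_ counterparts) end up as a short vector of integers that is fed to the UNet alongside the text conditioning. The following is a minimal sketch of how such a vector can be assembled and how the defaults fall back to (height, width). The helper names build_time_ids and resolve_micro_conditioning are hypothetical, and the fallback for the negative branch reflects an assumption about the pipeline's behavior rather than a quotation of its code.

```python
from typing import List, Optional, Tuple

Size = Tuple[int, int]


def build_time_ids(original_size: Size, crops_coords_top_left: Size, target_size: Size) -> List[int]:
    """Flatten the three (int, int) tuples into one six-element conditioning vector."""
    return list(original_size + crops_coords_top_left + target_size)


def resolve_micro_conditioning(
    height: int,
    width: int,
    original_size: Optional[Size] = None,
    crops_coords_top_left: Size = (0, 0),
    target_size: Optional[Size] = None,
    negative_original_size: Optional[Size] = None,
    negative_crops_coords_top_left: Size = (0, 0),
    negative_target_size: Optional[Size] = None,
) -> Tuple[List[int], List[int]]:
    """Return (positive, negative) conditioning vectors."""
    # Unspecified sizes default to the requested output size.
    original_size = original_size or (height, width)
    target_size = target_size or (height, width)
    positive = build_time_ids(original_size, crops_coords_top_left, target_size)
    # Assumption: negative conditioning is only distinct when both negative sizes are given.
    if negative_original_size is not None and negative_target_size is not None:
        negative = build_time_ids(
            negative_original_size, negative_crops_coords_top_left, negative_target_size
        )
    else:
        negative = positive
    return positive, negative


default_pos, default_neg = resolve_micro_conditioning(1024, 1024)
_, low_res_neg = resolve_micro_conditioning(
    1024, 1024, negative_original_size=(512, 512), negative_target_size=(1024, 1024)
)
```

Passing a small negative_original_size, as in the second call, steers the model away from images that look upsampled from a low resolution.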

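~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput or tuple

~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.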
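
The guidance_scale and guidance_rescale parameters of this call combine two noise predictions at every step. The sketch below shows the arithmetic on plain Python lists: classifier-free guidance first, then the standard-deviation rescaling from Common Diffusion Noise Schedules and Sample Steps are Flawed. The function names and the toy numbers are assumptions for illustration; a real pipeline applies the same idea to latent tensors.

```python
from statistics import pstdev
from typing import List


def classifier_free_guidance(uncond: List[float], text: List[float], guidance_scale: float) -> List[float]:
    """Element-wise uncond + w * (text - uncond)."""
    return [u + guidance_scale * (t - u) for u, t in zip(uncond, text)]


def rescale_guidance(guided: List[float], text: List[float], guidance_rescale: float) -> List[float]:
    """Pull the spread of the guided prediction back toward that of the text prediction."""
    factor = pstdev(text) / pstdev(guided)
    rescaled = [g * factor for g in guided]
    # guidance_rescale blends between the plain guided result (0.0) and the rescaled one (1.0).
    return [guidance_rescale * r + (1.0 - guidance_rescale) * g for r, g in zip(rescaled, guided)]


# Toy "noise predictions" (hypothetical numbers, four elements instead of a latent tensor).
noise_uncond = [0.10, -0.20, 0.05, 0.30]
noise_text = [0.40, -0.10, 0.25, 0.10]

guided = classifier_free_guidance(noise_uncond, noise_text, guidance_scale=5.0)
fully_rescaled = rescale_guidance(guided, noise_text, guidance_rescale=1.0)
```

With guidance_scale equal to 1 the guided prediction collapses to the text prediction, which is why guidance only takes effect for values above 1.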

Examples

>>> import torch
>>> from diffusers import StableDiffusionXLPipeline

>>> pipe = StableDiffusionXLPipeline.from_pretrained(
...     "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")

>>> prompt = "a photo of an astronaut riding a horse on mars"
>>> image = pipe(prompt).images[0]
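
The callback_on_step_end parameter described above expects a function with the signature (pipe, step, timestep, callback_kwargs) that returns callback_kwargs. The sketch below shows that contract without loading a model. fake_denoising_loop is a hypothetical stand-in for the pipeline's internal loop, and the latents are plain lists rather than tensors.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


def make_logging_callback(log: List[Tuple[int, int]]) -> Callable[..., Dict[str, Any]]:
    """Build a callback that records (step, timestep) and halves the latents."""

    def on_step_end(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
        log.append((step, timestep))
        # Tensors requested through callback_on_step_end_tensor_inputs arrive by name.
        latents = callback_kwargs["latents"]
        # Hypothetical modification, only to show that returned values are picked up.
        callback_kwargs["latents"] = [0.5 * value for value in latents]
        return callback_kwargs

    return on_step_end


def fake_denoising_loop(timesteps: Sequence[int], callback: Callable[..., Dict[str, Any]]) -> List[float]:
    """Stand-in for the pipeline's loop (not the real implementation)."""
    latents = [1.0, -1.0]
    for step, timestep in enumerate(timesteps):
        outputs = callback(None, step, timestep, {"latents": latents})
        latents = outputs.pop("latents", latents)
    return latents


step_log: List[Tuple[int, int]] = []
final_latents = fake_denoising_loop([999, 500, 0], make_logging_callback(step_log))
```

With a real pipeline, the same callback shape would be passed as callback_on_step_end=..., together with callback_on_step_end_tensor_inputs=["latents"], and the latents would be a torch.Tensor.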

encode_prompt

( prompt: str prompt_2: typing.Optional[str] = None device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None negative_prompt_2: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

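prompt (str or List[str], optional) — The prompt to be encoded.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text encoders.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is 1 or less).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from the prompt input argument.
negative_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.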
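
Two details of the encoding step can be sketched without loading any model: the fallback from prompt_2 to prompt, and the way the hidden states of the two text encoders are joined along the feature axis to form the cross-attention context. The helper names below are hypothetical. The widths 768 and 1280 are the hidden sizes of the two CLIP text encoders named in the constructor parameters, and the all-zero embeddings are placeholders for real encoder outputs.

```python
from typing import List, Optional, Tuple

CLIP_L_DIM = 768       # hidden size of the first text encoder (CLIP ViT-L/14)
OPENCLIP_G_DIM = 1280  # hidden size of the second text encoder (OpenCLIP ViT-bigG/14)


def resolve_prompts(prompt: str, prompt_2: Optional[str] = None) -> Tuple[str, str]:
    """Return the text sent to each encoder; prompt_2 falls back to prompt."""
    return prompt, prompt_2 or prompt


def concat_token_embeddings(embeds_1: List[List[float]], embeds_2: List[List[float]]) -> List[List[float]]:
    """Concatenate per-token hidden states from both encoders along the feature axis."""
    if len(embeds_1) != len(embeds_2):
        raise ValueError("both encoders must produce the same number of tokens")
    return [a + b for a, b in zip(embeds_1, embeds_2)]


# Placeholder hidden states for a 77-token sequence (real values come from the encoders).
num_tokens = 77
embeds_1 = [[0.0] * CLIP_L_DIM for _ in range(num_tokens)]
embeds_2 = [[0.0] * OPENCLIP_G_DIM for _ in range(num_tokens)]
prompt_embeds = concat_token_embeddings(embeds_1, embeds_2)
```

Each token therefore carries a 2048-dimensional vector, which is the larger cross-attention context mentioned in the paper abstract.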

get_guidance_scale_embedding

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

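w (torch.Tensor) — Guidance scale values for which to generate embedding vectors that subsequently enrich the timestep embeddings.
embedding_dim (int, optional, defaults to 512) — Dimension of the embeddings to generate.
dtype (torch.dtype, optional, defaults to torch.float32) — Data type of the generated embeddings.

torch.Tensor

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298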
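
The embedding referenced above is a sinusoidal encoding of the guidance scale. Below is a plain-Python sketch of that scheme, returning nested lists instead of a tensor. It assumes the same conventions as the usual port of the linked code (input scaled by 1000, frequencies spaced geometrically up to 10000, sine half followed by cosine half, zero padding for odd dimensions). Treat those details as assumptions rather than a guaranteed match with the library.

```python
import math
from typing import List, Sequence


def guidance_scale_embedding(w: Sequence[float], embedding_dim: int = 512) -> List[List[float]]:
    """Sinusoidal embedding of guidance scale values, one row per entry of w."""
    half_dim = embedding_dim // 2
    if half_dim < 2:
        raise ValueError("embedding_dim must be at least 4")
    log_step = math.log(10000.0) / (half_dim - 1)
    freqs = [math.exp(-log_step * i) for i in range(half_dim)]
    rows = []
    for value in w:
        scaled = value * 1000.0
        angles = [scaled * f for f in freqs]
        row = [math.sin(a) for a in angles] + [math.cos(a) for a in angles]
        if embedding_dim % 2 == 1:
            row.append(0.0)  # zero-pad odd dimensions
        rows.append(row)
    return rows


emb = guidance_scale_embedding([0.0, 7.5], embedding_dim=8)
emb_odd = guidance_scale_embedding([7.5], embedding_dim=9)
```

The result has len(w) rows and embedding_dim columns, matching the shape stated in the return description.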

StableDiffusionXLImg2ImgPipeline

class diffusers.StableDiffusionXLImg2ImgPipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers image_encoder: CLIPVisionModelWithProjection = None feature_extractor: CLIPImageProcessor = None requires_aesthetics_score: bool = False force_zeros_for_empty_prompt: bool = True add_watermarker: typing.Optional[bool] = None )

Parameters

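vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text encoder. Stable Diffusion XL uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
text_encoder_2 (CLIPTextModelWithProjection) — Second frozen text encoder. Stable Diffusion XL uses the text and pool portion of CLIP, specifically the laion/CLIP-ViT-bigG-14-laion2B-39B-b160k variant.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
tokenizer_2 (CLIPTokenizer) — Second tokenizer of class CLIPTokenizer.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
requires_aesthetics_score (bool, optional, defaults to False) — Whether the unet requires an aesthetic_score condition to be passed during inference. Also see the config of stabilityai/stable-diffusion-xl-refiner-1.0.
force_zeros_for_empty_prompt (bool, optional, defaults to True) — Whether the negative prompt embeddings should always be forced to 0. Also see the config of stabilityai/stable-diffusion-xl-base-1.0.
add_watermarker (bool, optional) — Whether to use the invisible_watermark library to watermark output images. If not defined, it defaults to True if the package is installed; otherwise no watermarker is used.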

Pipeline for image-to-image generation using Stable Diffusion XL.

The pipeline also inherits the following loading methods

load_textual_inversion() for loading textual inversion embeddings
from_single_file() for loading .ckpt files
load_lora_weights() for loading LoRA weights
save_lora_weights() for saving LoRA weights
load_ip_adapter() for loading IP Adapters

__call__

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None strength: float = 0.3 num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_start: typing.Optional[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Tuple[int, int] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Tuple[int, int] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None aesthetic_score: float = 6.0 negative_aesthetic_score: float = 2.5 clip_skip: typing.Optional[int] = None 
callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → ~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple

Parameters

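prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text encoders.
image (torch.Tensor or PIL.Image.Image or np.ndarray or List[torch.Tensor] or List[PIL.Image.Image] or List[np.ndarray]) — The image(s) to modify with the pipeline.
strength (float, optional, defaults to 0.3) — Conceptually, indicates how much to transform the reference image. Must be between 0 and 1. image is used as a starting point, and the larger the strength, the more noise is added to it. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise is at its maximum and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 therefore essentially ignores image. Note that if denoising_start is specified, the value of strength is ignored.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers that support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers that support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
denoising_start (float, optional) — When specified, indicates the fraction (between 0.0 and 1.0) of the total denoising process to be bypassed before it is initiated. Consequently, the initial part of the denoising process is skipped and it is assumed that the passed image is a partly denoised image. Note that when this is specified, strength is ignored. The denoising_start parameter is particularly beneficial when this pipeline is integrated into a "Mixture of Denoisers" multi-pipeline setup, as detailed in Refine Image Quality.
denoising_end (float, optional) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally terminated early. As a result, the returned sample will still retain a substantial amount of noise (for example, with denoising_end set to 0.8, the final 20% of timesteps are still needed) and should be denoised by a successor pipeline that has denoising_start set to 0.8, so that it only denoises the final 20% of the schedule. The denoising_end parameter should ideally be used when this pipeline forms part of a "Mixture of Denoisers" multi-pipeline setup, as elaborated in Refine Image Quality.
guidance_scale (float, optional, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is 1 or less).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler; ignored for other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from the prompt input argument.
negative_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from the negative_prompt input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list with the same length as the number of IP Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput instead of a plain tuple.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
guidance_rescale (float, optional, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed. guidance_rescale is defined as φ in equation 16 of that paper. The guidance rescale factor should fix overexposure when using zero terminal SNR.
original_size (Tuple[int], optional, defaults to (1024, 1024)) — If original_size is not the same as target_size, the image will appear to be down- or upsampled. original_size defaults to (height, width) if not specified. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — crops_coords_top_left can be used to generate an image that appears to be "cropped" from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
target_size (Tuple[int], optional, defaults to (1024, 1024)) — For most cases, target_size should be set to the desired height and width of the generated image. If not specified, it defaults to (height, width). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
negative_original_size (Tuple[int], optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a specific image resolution. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
negative_crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — To negatively condition the generation process based on specific crop coordinates. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
negative_target_size (Tuple[int], optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a target image resolution. It should be the same as target_size for most cases. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
aesthetic_score (float, optional, defaults to 6.0) — Used to simulate an aesthetic score of the generated image by influencing the positive text condition. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
negative_aesthetic_score (float, optional, defaults to 2.5) — Can be used to simulate an aesthetic score of the generated image by influencing the negative text condition. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.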
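
The interaction between strength and num_inference_steps described above can be made concrete with a little arithmetic. The sketch below assumes the common image-to-image convention of keeping only the last int(num_inference_steps * strength) entries of the schedule. The function name img2img_timesteps and the evenly spaced schedule are illustrative assumptions, not the pipeline's own code.

```python
from typing import List


def img2img_timesteps(timesteps: List[int], strength: float) -> List[int]:
    """Return the tail of the schedule that is actually run for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    num_inference_steps = len(timesteps)
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - steps_to_run, 0)
    return timesteps[t_start:]


# A hypothetical 50-step schedule in descending order: 980, 960, ..., 0.
schedule = list(range(980, -1, -20))

default_run = img2img_timesteps(schedule, strength=0.3)
full_run = img2img_timesteps(schedule, strength=1.0)
```

Under this assumption the default strength of 0.3 with 50 steps runs only the last 15 steps, which is why low strength values stay close to the input image and finish faster.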

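~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple

~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.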
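
For checkpoints whose config sets requires_aesthetics_score, the size conditioning carries aesthetic_score in place of target_size. The sketch below illustrates that switch under the assumption that the score simply replaces the last two entries of the conditioning vector. The helper name refiner_time_ids is hypothetical.

```python
from typing import List, Tuple, Union

Number = Union[int, float]


def refiner_time_ids(
    original_size: Tuple[int, int],
    crops_coords_top_left: Tuple[int, int],
    target_size: Tuple[int, int],
    aesthetic_score: float,
    requires_aesthetics_score: bool,
) -> List[Number]:
    """Build the conditioning vector with or without an aesthetic score."""
    if requires_aesthetics_score:
        # Assumed layout: sizes and crop offsets followed by a single score value.
        return list(original_size + crops_coords_top_left + (aesthetic_score,))
    return list(original_size + crops_coords_top_left + target_size)


positive_ids = refiner_time_ids((1024, 1024), (0, 0), (1024, 1024), 6.0, True)
negative_ids = refiner_time_ids((1024, 1024), (0, 0), (1024, 1024), 2.5, True)
plain_ids = refiner_time_ids((1024, 1024), (0, 0), (1024, 1024), 6.0, False)
```

The defaults of 6.0 and 2.5 push the positive branch toward higher-rated images and the negative branch toward lower-rated ones.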

Examples

>>> import torch
>>> from diffusers import StableDiffusionXLImg2ImgPipeline
>>> from diffusers.utils import load_image

>>> pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
...     "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")
>>> url = "https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/aa_xl/000000009.png"

>>> init_image = load_image(url).convert("RGB")
>>> prompt = "a photo of an astronaut riding a horse on mars"
>>> image = pipe(prompt, image=init_image).images[0]
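
When a base pipeline and a refiner share one schedule, denoising_end on the first and denoising_start on the second should name the same fraction. The sketch below shows how such a fraction can be turned into a timestep cutoff that splits a schedule into two parts. The function name split_timesteps, the 1000 training timesteps, and the 10-step schedule are assumptions for illustration, and the real pipelines may handle some schedulers differently.

```python
from typing import List, Tuple


def split_timesteps(
    timesteps: List[int], fraction: float, num_train_timesteps: int = 1000
) -> Tuple[List[int], List[int]]:
    """Split a descending schedule at the given fraction of the denoising process."""
    cutoff = int(round(num_train_timesteps - fraction * num_train_timesteps))
    first_stage = [t for t in timesteps if t >= cutoff]   # run with denoising_end=fraction
    second_stage = [t for t in timesteps if t < cutoff]   # run with denoising_start=fraction
    return first_stage, second_stage


# A hypothetical 10-step schedule: 900, 800, ..., 0.
schedule_10 = list(range(900, -1, -100))
base_steps, refiner_steps = split_timesteps(schedule_10, fraction=0.8)
```

With a fraction of 0.8 the first stage covers the eight noisiest steps and the second stage the final two, so the two pipelines together run the schedule exactly once.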

encode_prompt

Parameters

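prompt (str or List[str], optional) — The prompt to be encoded.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text encoders.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is 1 or less).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from the prompt input argument.
negative_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.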

get_guidance_scale_embedding

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

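w (torch.Tensor) — Guidance scale values for which to generate embedding vectors that subsequently enrich the timestep embeddings.
embedding_dim (int, optional, defaults to 512) — Dimension of the embeddings to generate.
dtype (torch.dtype, optional, defaults to torch.float32) — Data type of the generated embeddings.

torch.Tensor

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298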

StableDiffusionXLInpaintPipeline

class diffusers.StableDiffusionXLInpaintPipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers image_encoder: CLIPVisionModelWithProjection = None feature_extractor: CLIPImageProcessor = None requires_aesthetics_score: bool = False force_zeros_for_empty_prompt: bool = True add_watermarker: typing.Optional[bool] = None )

Parameters

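vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text encoder. Stable Diffusion XL uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
text_encoder_2 (CLIPTextModelWithProjection) — Second frozen text encoder. Stable Diffusion XL uses the text and pool portion of CLIP, specifically the laion/CLIP-ViT-bigG-14-laion2B-39B-b160k variant.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
tokenizer_2 (CLIPTokenizer) — Second tokenizer of class CLIPTokenizer.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
requires_aesthetics_score (bool, optional, defaults to False) — Whether the unet requires an aesthetic_score condition to be passed during inference. Also see the config of stabilityai/stable-diffusion-xl-refiner-1.0.
force_zeros_for_empty_prompt (bool, optional, defaults to True) — Whether the negative prompt embeddings should always be forced to 0. Also see the config of stabilityai/stable-diffusion-xl-base-1.0.
add_watermarker (bool, optional) — Whether to use the invisible_watermark library to watermark output images. If not defined, it defaults to True if the package is installed; otherwise no watermarker is used.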

Pipeline for inpainting using Stable Diffusion XL.

The pipeline also inherits the following loading methods

load_textual_inversion() for loading textual inversion embeddings
from_single_file() for loading .ckpt files
load_lora_weights() for loading LoRA weights
save_lora_weights() for saving LoRA weights
load_ip_adapter() for loading IP Adapters

__call__

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None mask_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None masked_image_latents: Tensor = None height: typing.Optional[int] = None width: typing.Optional[int] = None padding_mask_crop: typing.Optional[int] = None strength: float = 0.9999 num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_start: typing.Optional[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Tuple[int, int] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Tuple[int, int] = None 
negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None aesthetic_score: float = 6.0 negative_aesthetic_score: float = 2.5 clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → ~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple

Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds. instead.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to the tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text-encoders
image (PIL.Image.Image) — Image，或表示图像批次的张量，将被修复，即图像的部分将被 mask_image 遮罩，并根据 prompt 重新绘制。
mask_image (PIL.Image.Image) — Image，或表示图像批次的张量，用于遮罩 image。蒙版中的白色像素将被重新绘制，而黑色像素将被保留。如果 mask_image 是 PIL 图像，它将在使用前转换为单通道（亮度）。如果它是张量，则应包含一个颜色通道 (L) 而不是 3 个，因此预期形状应为 (B, H, W, 1)。
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — 生成图像的像素高度。默认设置为 1024 以获得最佳效果。对于 stabilityai/stable-diffusion-xl-base-1.0 以及未针对低分辨率进行微调的检查点，低于 512 像素的任何值都无法很好地工作。
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — 生成图像的像素宽度。默认设置为 1024 以获得最佳效果。对于 stabilityai/stable-diffusion-xl-base-1.0 以及未针对低分辨率进行微调的检查点，低于 512 像素的任何值都无法很好地工作。
padding_mask_crop (int, optional, defaults to None) — 应用于图像和蒙版的裁剪边距大小。如果为 None，则不将裁剪应用于图像和 mask_image。如果 padding_mask_crop 不为 None，它将首先找到一个具有与图像相同宽高比的矩形区域，并包含所有蒙版区域，然后根据 padding_mask_crop 扩展该区域。然后将基于扩展区域裁剪图像和 mask_image，然后再调整大小为原始图像大小以进行修复。当蒙版区域很小而图像很大并且包含与修复无关的信息（例如背景）时，这非常有用。
strength (float, optional, defaults to 0.9999) — 从概念上讲，表示要转换参考 image 的蒙版部分的程度。必须介于 0 和 1 之间。 image 将用作起点，strength 越大，向其添加的噪声就越多。去噪步骤的数量取决于最初添加的噪声量。当 strength 为 1 时，添加的噪声将最大，并且去噪过程将运行 num_inference_steps 中指定的完整迭代次数。因此，值为 1 实际上会忽略参考 image 的蒙版部分。请注意，如果将 denoising_start 声明为整数，则将忽略 strength 的值。
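num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers that support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers that support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
denoising_start (float, optional) — When specified, indicates the fraction (between 0.0 and 1.0) of the total denoising process to be bypassed before it is initiated. The initial part of the denoising process is therefore skipped, and the passed image is assumed to be a partly denoised image. Note that when this is specified, strength is ignored. The denoising_start parameter is particularly beneficial when this pipeline is integrated into a "Mixture of Denoisers" multi-pipeline setup, as detailed in Refining the Image Output.
denoising_end (float, optional) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally terminated early. The returned sample will still retain a substantial amount of noise and should be denoised by a successor pipeline. For example, with denoising_end set to 0.8, roughly the final 20% of timesteps are still needed, and the successor pipeline should have denoising_start set to 0.8 so that it only denoises the final 20% of the scheduler. The denoising_end parameter should ideally be used when this pipeline forms part of a "Mixture of Denoisers" multi-pipeline setup, as detailed in Refining the Image Output.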
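guidance_scale (float, optional, defaults to 7.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, you must pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the prompt input argument.
negative_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds are generated from the negative_prompt input argument.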
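ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP-Adapters.
ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list with the same length as the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler and is ignored by other schedulers.
generator (torch.Generator, optional) — One or a list of torch generators used to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL (PIL.Image.Image) and np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a StableDiffusionXLPipelineOutput instead of a plain tuple.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.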
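original_size (Tuple[int], optional, defaults to (1024, 1024)) — If original_size is not the same as target_size, the image will appear to be downsampled or upsampled. original_size defaults to (height, width) if not specified. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — crops_coords_top_left can be used to generate an image that appears to be "cropped" from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
target_size (Tuple[int], optional, defaults to (1024, 1024)) — For most cases, target_size should be set to the desired height and width of the generated image. If not specified, it defaults to (height, width). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
negative_original_size (Tuple[int], optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a specific image resolution. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
negative_crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — To negatively condition the generation process based on specific crop coordinates. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
negative_target_size (Tuple[int], optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a target image resolution. It should be the same as target_size for most cases. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
aesthetic_score (float, optional, defaults to 6.0) — Used to simulate an aesthetic score of the generated image by influencing the positive text condition. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
negative_aesthetic_score (float, optional, defaults to 2.5) — Used to simulate an aesthetic score of the generated image by influencing the negative text condition. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list are passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.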
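
A minimal sketch of a callback_on_step_end function, using the four-argument signature listed above. The names log_step_callback and seen_steps are hypothetical. The sketch also assumes that the pipeline expects the callback to return the callback_kwargs dictionary, which this page does not state explicitly.

```python
from typing import Any, Dict, List

# Hypothetical container that records which denoising steps were observed.
seen_steps: List[int] = []


def log_step_callback(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Record the step index and hand the tensors back unchanged."""
    seen_steps.append(step)
    # callback_kwargs only holds the names requested through
    # callback_on_step_end_tensor_inputs (by default just "latents").
    # Assumption: the pipeline reads possibly modified tensors from the return value.
    return callback_kwargs
```

Under these assumptions it would be passed as pipe(..., callback_on_step_end=log_step_callback, callback_on_step_end_tensor_inputs=["latents"]).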

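~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput or tuple

~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.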
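
The denoising_end and denoising_start parameters described above split one noise schedule between two pipelines. The sketch below illustrates the idea on a plain list of timesteps. split_timesteps is a hypothetical helper, and the 1000 training timesteps, the evenly spaced schedule, and the rounding rule are assumptions rather than the pipeline's confirmed internals.

```python
from typing import List, Tuple


def split_timesteps(
    timesteps: List[int], fraction: float, num_train_timesteps: int = 1000
) -> Tuple[List[int], List[int]]:
    """Split a descending timestep schedule at `fraction` of the denoising process."""
    # Timesteps count down from noisy to clean, so the first `fraction` of the
    # process corresponds to timesteps at or above this cutoff.
    cutoff = int(round(num_train_timesteps - fraction * num_train_timesteps))
    first_stage = [t for t in timesteps if t >= cutoff]  # role of denoising_end=fraction
    second_stage = [t for t in timesteps if t < cutoff]  # role of denoising_start=fraction
    return first_stage, second_stage


# Assumed schedule: 50 evenly spaced steps 980, 960, ..., 0.
schedule = list(range(980, -1, -20))
base_steps, refiner_steps = split_timesteps(schedule, 0.8)
print(len(base_steps), len(refiner_steps))  # 40 10
```

With a fraction of 0.8, the first pipeline covers the 40 noisiest steps and the successor pipeline the remaining 10, and no timestep is handled twice.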

Examples

>>> import torch
>>> from diffusers import StableDiffusionXLInpaintPipeline
>>> from diffusers.utils import load_image

>>> pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
...     "stabilityai/stable-diffusion-xl-base-1.0",
...     torch_dtype=torch.float16,
...     variant="fp16",
...     use_safetensors=True,
... )
>>> pipe.to("cuda")

>>> img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
>>> mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

>>> init_image = load_image(img_url).convert("RGB")
>>> mask_image = load_image(mask_url).convert("RGB")

>>> prompt = "A majestic tiger sitting on a bench"
>>> image = pipe(
...     prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=50, strength=0.80
... ).images[0]
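
The example above passes strength=0.80 with num_inference_steps=50. As a rough sketch of how strength relates to the number of denoising steps that actually run, assuming the product is simply truncated to an integer (effective_steps is a hypothetical helper, not a pipeline method):

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Estimate how many denoising steps run for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # More strength means more added noise, hence more steps needed to remove it.
    return min(int(num_inference_steps * strength), num_inference_steps)


print(effective_steps(50, 0.80))  # 40
print(effective_steps(50, 1.0))   # 50
```

Under this assumption, the call in the example would run 40 of the 50 scheduled steps, while a strength of 1 runs all of them.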

encode_prompt

< source >

Parameters

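prompt (str or List[str], optional) — The prompt to be encoded.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text encoders.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, you must pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the prompt input argument.
negative_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds are generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that is applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.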

Encodes the prompt into text encoder hidden states.
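
A sketch of how the outputs of encode_prompt might be fed back into the pipeline through the prompt_embeds family of parameters. It assumes encode_prompt returns four tensors in the order prompt_embeds, negative_prompt_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds, which this page does not state. generate_from_embeds is a hypothetical helper, and pipe stands for any already loaded SDXL pipeline.

```python
from typing import Any


def generate_from_embeds(pipe: Any, prompt: str, negative_prompt: str, **call_kwargs: Any) -> Any:
    """Encode the prompts once, then call the pipeline with the embeddings."""
    # Assumed return order; verify it against the installed diffusers version.
    (
        prompt_embeds,
        negative_prompt_embeds,
        pooled_prompt_embeds,
        negative_pooled_prompt_embeds,
    ) = pipe.encode_prompt(
        prompt=prompt,
        negative_prompt=negative_prompt,
        do_classifier_free_guidance=True,
    )
    # Pass embeddings instead of text; extra arguments such as image or
    # mask_image are forwarded unchanged.
    return pipe(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        pooled_prompt_embeds=pooled_prompt_embeds,
        negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
        **call_kwargs,
    )
```

Encoding once and reusing the embeddings can be convenient when the same prompt is used for several generations, or when the embeddings are adjusted (for example, for prompt weighting) before the call.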

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

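w (torch.Tensor) — The guidance scale values for which to generate embedding vectors, which subsequently enrich the timestep embeddings.
embedding_dim (int, optional, defaults to 512) — Dimension of the embeddings to generate.
dtype (torch.dtype, optional, defaults to torch.float32) — Data type of the generated embeddings.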

torch.Tensor

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298
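
A pure-Python sketch of a sinusoidal embedding of the guidance scale, in the style of the code linked above. The scaling of w by 1000 and the frequency base of 10000 are assumptions drawn from common practice, and guidance_scale_embedding is a hypothetical stand-in that works on lists rather than torch tensors. It is only meant to illustrate the (len(w), embedding_dim) output shape.

```python
import math
from typing import List


def guidance_scale_embedding(w: List[float], embedding_dim: int = 512) -> List[List[float]]:
    """Return one sinusoidal embedding row per guidance scale in `w`."""
    half_dim = embedding_dim // 2
    # Geometric frequency ladder from 1 down to 1/10000 (assumed base).
    step = math.log(10000.0) / (half_dim - 1)
    freqs = [math.exp(-step * i) for i in range(half_dim)]
    rows = []
    for scale in w:
        angles = [scale * 1000.0 * f for f in freqs]  # assumed 1000x scaling
        row = [math.sin(a) for a in angles] + [math.cos(a) for a in angles]
        if embedding_dim % 2 == 1:
            row.append(0.0)  # zero-pad odd dimensions
        rows.append(row)
    return rows


emb = guidance_scale_embedding([0.0, 7.5], embedding_dim=8)
print(len(emb), len(emb[0]))  # 2 8
```

Each row concatenates the sine half and the cosine half, so a guidance scale of 0 maps to zeros followed by ones.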
