T2I-Adapter

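T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models by Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie.

Using the pretrained models, we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details.

The abstract of the paper is the following:

The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.

This model was contributed by the community contributor HimariO ❤️.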

StableDiffusionAdapterPipeline

class diffusers.StableDiffusionAdapterPipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel adapter: typing.Union[diffusers.models.adapter.T2IAdapter, diffusers.models.adapter.MultiAdapter, typing.List[diffusers.models.adapter.T2IAdapter]] scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor requires_safety_checker: bool = True )

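Parameters

adapter (T2IAdapter or MultiAdapter or List[T2IAdapter]) — Provides additional conditioning to the unet during the denoising process. If you set multiple Adapters as a list, the outputs from each Adapter are added together to create one combined additional conditioning.
adapter_weights (List[float], optional, defaults to None) — List of floats representing the weight by which each adapter's output is multiplied before the outputs are added together.
vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text encoder. Stable Diffusion uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
safety_checker (StableDiffusionSafetyChecker) — Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for details.
feature_extractor (CLIPImageProcessor) — Model that extracts features from generated images to be used as inputs for the safety_checker.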
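
The way adapter and adapter_weights combine several adapters can be pictured with a small sketch. This is a toy illustration in plain Python, not the library's code: the function name combine_adapter_features is hypothetical, feature maps are reduced to flat lists of floats, and the equal-weight fallback when no weights are given is an assumption made for this example.

```python
from typing import List, Optional, Sequence


def combine_adapter_features(
    per_adapter_features: Sequence[Sequence[float]],
    adapter_weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Multiply each adapter's output by its weight, then sum element-wise."""
    num_adapters = len(per_adapter_features)
    if adapter_weights is None:
        # Assumption for this sketch: weight all adapters equally.
        adapter_weights = [1.0 / num_adapters] * num_adapters
    if len(adapter_weights) != num_adapters:
        raise ValueError("expected exactly one weight per adapter")

    combined = [0.0] * len(per_adapter_features[0])
    for weight, features in zip(adapter_weights, per_adapter_features):
        for i, value in enumerate(features):
            combined[i] += weight * value
    return combined


# Two toy "adapter outputs" standing in for real feature maps.
depth_features = [1.0, 2.0, 3.0]
sketch_features = [0.5, 0.5, 0.5]
combined = combine_adapter_features([depth_features, sketch_features], [0.8, 0.2])
```

In the real pipeline the features are multi-scale tensors, but the weighting-then-summing order is the part this sketch is meant to show.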

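Pipeline for text-to-image generation using Stable Diffusion augmented with T2I-Adapter https://arxiv.org/abs/2302.08453

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).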

__call__

( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[PIL.Image.Image]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None adapter_conditioning_scale: typing.Union[float, typing.List[float]] = 1.0 clip_skip: typing.Optional[int] = None ) → ~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput or tuple

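Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, you must pass prompt_embeds instead.
image (torch.Tensor, PIL.Image.Image, List[torch.Tensor] or List[PIL.Image.Image] or List[List[PIL.Image.Image]]) — The Adapter input condition. The Adapter uses this input condition to generate guidance for the Unet. If the type is specified as torch.Tensor, it is passed to the Adapter as is. PIL.Image.Image can also be accepted as an image. The control image is automatically resized to fit the output image.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers that support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers that support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
guidance_scale (float, optional, defaults to 7.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, you must pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
eta (float, optional, defaults to 0.0) — Corresponds to the parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler and is ignored for other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttnProcessor as defined under self.processor in diffusers.models.attention_processor.
adapter_conditioning_scale (float or List[float], optional, defaults to 1.0) — The outputs of the adapter are multiplied by adapter_conditioning_scale before they are added to the residual in the original unet. If multiple adapters are specified in init, you can set the corresponding scale as a list.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.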
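
The effect of guidance_scale can be illustrated with the usual classifier-free guidance combination of an unconditional and a text-conditioned noise prediction. The following is a minimal sketch on plain Python lists; apply_guidance is a hypothetical helper written for this page, not a function exported by the library.

```python
from typing import List


def apply_guidance(
    noise_uncond: List[float],
    noise_text: List[float],
    guidance_scale: float,
) -> List[float]:
    """Move the prediction from the unconditional one toward the text-conditioned one."""
    return [
        uncond + guidance_scale * (text - uncond)
        for uncond, text in zip(noise_uncond, noise_text)
    ]


uncond = [0.0, 1.0]  # toy prediction for the empty / negative prompt
text = [1.0, 3.0]    # toy prediction for the actual prompt

guided = apply_guidance(uncond, text, guidance_scale=7.5)
no_guidance = apply_guidance(uncond, text, guidance_scale=1.0)
```

With guidance_scale equal to 1 the result is just the text-conditioned prediction, which matches the statement above that guidance only takes effect for values greater than 1.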

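Returns

~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput or tuple

~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) content, according to the safety_checker.

Function invoked when calling the pipeline for generation.

Examples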

>>> from PIL import Image
>>> from diffusers.utils import load_image
>>> import torch
>>> from diffusers import StableDiffusionAdapterPipeline, T2IAdapter

>>> image = load_image(
...     "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_ref.png"
... )

>>> color_palette = image.resize((8, 8))
>>> color_palette = color_palette.resize((512, 512), resample=Image.Resampling.NEAREST)

>>> adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_color_sd14v1", torch_dtype=torch.float16)
>>> pipe = StableDiffusionAdapterPipeline.from_pretrained(
...     "CompVis/stable-diffusion-v1-4",
...     adapter=adapter,
...     torch_dtype=torch.float16,
... )

>>> pipe.to("cuda")

>>> out_image = pipe(
...     "At night, glowing cubes in front of the beach",
...     image=color_palette,
... ).images[0]
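
The callback and callback_steps parameters of __call__ can be pictured with a toy denoising loop. This is only a sketch of the calling pattern under simplifying assumptions: run_denoising_loop is a hypothetical stand-in for the pipeline's internal loop, the latents are a single float, and the exact step index the library passes may differ depending on the scheduler.

```python
from typing import Callable, List, Optional, Tuple


def run_denoising_loop(
    timesteps: List[int],
    callback: Optional[Callable[[int, int, float], None]] = None,
    callback_steps: int = 1,
) -> float:
    """Toy loop that invokes `callback` once every `callback_steps` steps."""
    latents = 0.0  # stand-in for the latent tensor
    for step, timestep in enumerate(timesteps):
        latents += 1.0  # placeholder for one denoising update
        if callback is not None and step % callback_steps == 0:
            callback(step, timestep, latents)
    return latents


seen: List[Tuple[int, int]] = []


def record(step: int, timestep: int, latents: float) -> None:
    # Same argument order as documented: (step, timestep, latents).
    seen.append((step, timestep))


final_latents = run_denoising_loop(
    [900, 700, 500, 300, 100], callback=record, callback_steps=2
)
```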

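enable_attention_slicing

( slice_size: typing.Union[int, str, NoneType] = 'auto' )

Parameters

slice_size (str or int, optional, defaults to "auto") — When "auto", halves the input to the attention heads, so attention is computed in two steps. If "max", the maximum amount of memory is saved by running only one slice at a time. If a number is provided, uses as many slices as attention_head_dim // slice_size. In this case, attention_head_dim must be a multiple of slice_size.

Enable sliced attention computation. When this option is enabled, the attention module splits the input tensor in slices to compute attention in several steps. For more than one attention head, the computation is performed sequentially over each head. This is useful to save some memory in exchange for a small speed decrease.

⚠️ Don't enable attention slicing if you're already using scaled_dot_product_attention (SDPA) from PyTorch 2.0 or xFormers. These attention computations are already very memory efficient, so you won't need to enable this function. If you enable attention slicing with SDPA or xFormers, it can lead to serious slowdowns!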
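
How the three kinds of slice_size values translate into a number of sequential attention steps can be sketched as follows. The helper number_of_slices is hypothetical and only encodes the rules stated in the parameter description; treating "max" as slices of size 1 is an interpretation made for this example.

```python
from typing import Union


def number_of_slices(attention_head_dim: int, slice_size: Union[int, str] = "auto") -> int:
    """Return how many sequential steps sliced attention would take."""
    if slice_size == "auto":
        # The input is halved, so attention is computed in two steps.
        return 2
    if slice_size == "max":
        # Assumption: one slice at a time corresponds to a slice size of 1.
        return attention_head_dim
    if attention_head_dim % slice_size != 0:
        raise ValueError("attention_head_dim must be a multiple of slice_size")
    return attention_head_dim // slice_size


auto_steps = number_of_slices(8)         # "auto"
max_steps = number_of_slices(8, "max")   # most memory saved, most steps
custom_steps = number_of_slices(8, 4)    # 8 // 4 slices
```

More slices mean a smaller peak memory footprint per step and more sequential work, which is the trade-off described above.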

Examples

>>> import torch
>>> from diffusers import StableDiffusionPipeline

>>> pipe = StableDiffusionPipeline.from_pretrained(
...     "runwayml/stable-diffusion-v1-5",
...     torch_dtype=torch.float16,
...     use_safetensors=True,
... )

>>> prompt = "a photo of an astronaut riding a horse on mars"
>>> pipe.enable_attention_slicing()
>>> image = pipe(prompt).images[0]

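disable_attention_slicing

( )

Disable sliced attention computation. If enable_attention_slicing was previously called, attention is computed in one step.

enable_vae_slicing

( )

Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.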
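
The idea behind sliced VAE decoding can be sketched without any model: decode the batch one sample at a time and concatenate the results, so only one sample's activations are live at once. Both decode_in_slices and toy_decode below are hypothetical stand-ins; the per-sample slice granularity is an assumption of this sketch.

```python
from typing import Callable, List

Latent = List[float]


def decode_in_slices(
    latent_batch: List[Latent],
    decode_fn: Callable[[List[Latent]], List[Latent]],
) -> List[Latent]:
    """Decode a batch one sample at a time and concatenate the outputs."""
    decoded: List[Latent] = []
    for latent in latent_batch:
        # Each call sees a batch of size 1, which bounds peak memory.
        decoded.extend(decode_fn([latent]))
    return decoded


def toy_decode(batch: List[Latent]) -> List[Latent]:
    # Stand-in for a VAE decoder: doubles every value.
    return [[2.0 * value for value in latent] for latent in batch]


batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sliced = decode_in_slices(batch, toy_decode)
full = toy_decode(batch)
```

The sliced and the full decode give the same output; only the amount of work done per step changes.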

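disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method goes back to computing decoding in one step.

enable_xformers_memory_efficient_attention

( attention_op: typing.Optional[typing.Callable] = None )

Parameters

attention_op (Callable, optional) — Override the default None operator for use as the op argument to the memory_efficient_attention() function of xFormers.

Enable memory efficient attention from xFormers. When this option is enabled, you should observe lower GPU memory usage and a potential speed up during inference. Speed up during training is not guaranteed.

⚠️ When memory efficient attention and sliced attention are both enabled, memory efficient attention takes precedence.

Examples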

>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from xformers.ops import MemoryEfficientAttentionFlashAttentionOp

>>> pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")
>>> pipe.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)
>>> # Workaround for not accepting attention shape using VAE for Flash Attention
>>> pipe.vae.enable_xformers_memory_efficient_attention(attention_op=None)

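disable_xformers_memory_efficient_attention

( )

Disable memory efficient attention from xFormers.

encode_prompt

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, you must pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that is applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.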
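
The clip_skip rule ("a value of 1 means the pre-final layer") amounts to indexing the list of per-layer hidden states from the end. A minimal sketch, assuming the text encoder returns one hidden state per layer in order; select_hidden_state is a hypothetical helper and the layers are represented by strings for readability.

```python
from typing import List, Optional


def select_hidden_state(hidden_states: List[str], clip_skip: Optional[int] = None) -> str:
    """Pick the layer output used for the prompt embeddings."""
    if clip_skip is None:
        # No skipping: use the final layer's output.
        return hidden_states[-1]
    # Skip `clip_skip` layers counted from the end.
    return hidden_states[-(clip_skip + 1)]


layers = ["embeddings", "layer_1", "layer_2", "layer_3"]

final_layer = select_hidden_state(layers)
pre_final_layer = select_hidden_state(layers, clip_skip=1)
two_back = select_hidden_state(layers, clip_skip=2)
```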

get_guidance_scale_embedding

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

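Parameters

w (torch.Tensor) — The guidance scale values for which to generate embedding vectors, which subsequently enrich the timestep embeddings.
embedding_dim (int, optional, defaults to 512) — Dimension of the embeddings to generate.
dtype (torch.dtype, optional, defaults to torch.float32) — Data type of the generated embeddings.

Returns

torch.Tensor

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298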
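
The shape contract above, (len(w), embedding_dim), can be illustrated with a sinusoidal embedding written in plain Python. This is a sketch of the general technique, not the library's code: guidance_scale_embedding is a hypothetical name, the frequency range up to 10000 and the multiplication of w by 1000 are assumptions, and lists are used in place of tensors.

```python
import math
from typing import List, Sequence


def guidance_scale_embedding(w: Sequence[float], embedding_dim: int = 512) -> List[List[float]]:
    """Return one sinusoidal embedding row per guidance scale value."""
    half_dim = embedding_dim // 2
    # Log-spaced frequencies from 1 down to 1/10000 (assumed range).
    log_step = math.log(10000.0) / (half_dim - 1)
    freqs = [math.exp(-log_step * i) for i in range(half_dim)]

    rows: List[List[float]] = []
    for scale in w:
        scaled = scale * 1000.0  # assumed pre-scaling of the guidance value
        row = [math.sin(scaled * f) for f in freqs]
        row += [math.cos(scaled * f) for f in freqs]
        if embedding_dim % 2 == 1:
            row.append(0.0)  # zero-pad odd dimensions
        rows.append(row)
    return rows


emb = guidance_scale_embedding([7.5, 5.0], embedding_dim=8)
```

Each row holds the sine terms followed by the cosine terms for the same frequencies.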

StableDiffusionXLAdapterPipeline

class diffusers.StableDiffusionXLAdapterPipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: UNet2DConditionModel adapter: typing.Union[diffusers.models.adapter.T2IAdapter, diffusers.models.adapter.MultiAdapter, typing.List[diffusers.models.adapter.T2IAdapter]] scheduler: KarrasDiffusionSchedulers force_zeros_for_empty_prompt: bool = True feature_extractor: CLIPImageProcessor = None image_encoder: CLIPVisionModelWithProjection = None )

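Parameters

adapter (T2IAdapter or MultiAdapter or List[T2IAdapter]) — Provides additional conditioning to the unet during the denoising process. If you set multiple Adapters as a list, the outputs from each Adapter are added together to create one combined additional conditioning.
adapter_weights (List[float], optional, defaults to None) — List of floats representing the weight by which each adapter's output is multiplied before the outputs are added together.
vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text encoder. Stable Diffusion uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
safety_checker (StableDiffusionSafetyChecker) — Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for details.
feature_extractor (CLIPImageProcessor) — Model that extracts features from generated images to be used as inputs for the safety_checker.

Pipeline for text-to-image generation using Stable Diffusion augmented with T2I-Adapter https://arxiv.org/abs/2302.08453

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

load_textual_inversion() for loading textual inversion embeddings
from_single_file() for loading .ckpt files
load_lora_weights() for loading LoRA weights
save_lora_weights() for saving LoRA weights
load_ip_adapter() for loading IP Adapters

__call__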

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Optional[typing.Tuple[int, int]] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Optional[typing.Tuple[int, int]] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None adapter_conditioning_scale: typing.Union[float, typing.List[float]] = 1.0 adapter_conditioning_factor: float = 1.0 clip_skip: typing.Optional[int] = None ) → ~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput or tuple

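Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, you must pass prompt_embeds instead.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text encoders.
image (torch.Tensor, PIL.Image.Image, List[torch.Tensor] or List[PIL.Image.Image] or List[List[PIL.Image.Image]]) — The Adapter input condition. The Adapter uses this input condition to generate guidance for the Unet. If the type is specified as torch.Tensor, it is passed to the Adapter as is. PIL.Image.Image can also be accepted as an image. The control image is automatically resized to fit the output image.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image. Anything below 512 pixels won't work well for stabilityai/stable-diffusion-xl-base-1.0 and checkpoints that are not specifically fine-tuned on low resolutions.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image. Anything below 512 pixels won't work well for stabilityai/stable-diffusion-xl-base-1.0 and checkpoints that are not specifically fine-tuned on low resolutions.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers that support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers that support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
denoising_end (float, optional) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally terminated early. As a result, the returned sample still retains a substantial amount of noise, as determined by the discrete timesteps selected by the scheduler. The denoising_end parameter should ideally be used when this pipeline forms part of a "Mixture of Denoisers" multi-pipeline setup, as elaborated in Refining the Image Output.
guidance_scale (float, optional, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, you must pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
eta (float, optional, defaults to 0.0) — Corresponds to the parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler and is ignored for other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the prompt input argument.
negative_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds are generated from the negative_prompt input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list with the same length as the number of IP Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.stable_diffusion_xl.StableDiffusionAdapterPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
guidance_rescale (float, optional, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed. guidance_rescale is defined as φ in equation 16 of Common Diffusion Noise Schedules and Sample Steps are Flawed. The guidance rescale factor should fix overexposure when using zero terminal SNR.
original_size (Tuple[int], optional, defaults to (1024, 1024)) — If original_size is not the same as target_size, the image will appear to be down- or upsampled. original_size defaults to (height, width) if not specified. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — crops_coords_top_left can be used to generate an image that appears to be "cropped" from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
target_size (Tuple[int], optional, defaults to (1024, 1024)) — For most cases, target_size should be set to the desired height and width of the generated image. If not specified, it defaults to (height, width). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
negative_original_size (Tuple[int], optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a specific image resolution. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
negative_crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — To negatively condition the generation process based on specific crop coordinates. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
negative_target_size (Tuple[int], optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a target image resolution. For most cases, it should be the same as target_size. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
adapter_conditioning_scale (float or List[float], optional, defaults to 1.0) — The outputs of the adapter are multiplied by adapter_conditioning_scale before they are added to the residual in the original unet. If multiple adapters are specified in init, you can set the corresponding scale as a list.
adapter_conditioning_factor (float, optional, defaults to 1.0) — The fraction of timesteps for which the adapter is applied. If adapter_conditioning_factor is 0.0, the adapter is not applied at all. If adapter_conditioning_factor is 1.0, the adapter is applied for all timesteps. If adapter_conditioning_factor is 0.5, the adapter is applied for half of the timesteps.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.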
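
The meaning of adapter_conditioning_factor as a fraction of timesteps can be made concrete with a small sketch. The helper adapter_active_steps is hypothetical, and the choice to apply the adapter to the earliest steps of the schedule (rather than some other subset) is an assumption of this example.

```python
from typing import List


def adapter_active_steps(
    num_inference_steps: int,
    adapter_conditioning_factor: float = 1.0,
) -> List[bool]:
    """Return, for each denoising step, whether the adapter residuals are applied."""
    cutoff = int(num_inference_steps * adapter_conditioning_factor)
    # Assumption: the adapter is used for the first `cutoff` steps only.
    return [step < cutoff for step in range(num_inference_steps)]


always = adapter_active_steps(10, 1.0)
half = adapter_active_steps(10, 0.5)
never = adapter_active_steps(10, 0.0)
short_run = adapter_active_steps(4, 0.5)
```

This differs from adapter_conditioning_scale, which changes how strongly the adapter output is weighted rather than on how many steps it is used.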

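Returns

~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput or tuple

~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples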

>>> import torch
>>> from diffusers import T2IAdapter, StableDiffusionXLAdapterPipeline, DDPMScheduler
>>> from diffusers.utils import load_image

>>> sketch_image = load_image("https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png").convert("L")

>>> model_id = "stabilityai/stable-diffusion-xl-base-1.0"

>>> adapter = T2IAdapter.from_pretrained(
...     "Adapter/t2iadapter",
...     subfolder="sketch_sdxl_1.0",
...     torch_dtype=torch.float16,
...     adapter_type="full_adapter_xl",
... )
>>> scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

>>> pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
...     model_id, adapter=adapter, torch_dtype=torch.float16, variant="fp16", scheduler=scheduler
... ).to("cuda")

>>> generator = torch.manual_seed(42)
>>> sketch_image_out = pipe(
...     prompt="a photo of a dog in real world, high quality",
...     negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality",
...     image=sketch_image,
...     generator=generator,
...     guidance_scale=7.5,
... ).images[0]
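
The size-related conditioning parameters of __call__ (original_size, crops_coords_top_left, target_size) fall back to the requested height and width when left unset. A minimal sketch of that defaulting follows; build_size_conditioning is a hypothetical helper, and packing the three tuples into one flat list is an assumption about how such values might be passed on to the model.

```python
from typing import List, Optional, Tuple


def build_size_conditioning(
    height: int,
    width: int,
    original_size: Optional[Tuple[int, int]] = None,
    crops_coords_top_left: Tuple[int, int] = (0, 0),
    target_size: Optional[Tuple[int, int]] = None,
) -> List[int]:
    """Apply the documented defaults and pack the size conditioning values."""
    if original_size is None:
        original_size = (height, width)
    if target_size is None:
        target_size = (height, width)
    # Assumed layout: original size, crop offset, target size.
    return list(original_size + crops_coords_top_left + target_size)


defaults = build_size_conditioning(1024, 1024)
upsampled_look = build_size_conditioning(1024, 1024, original_size=(512, 512))
```

In the second call original_size differs from target_size, which is the situation described above where the result appears down- or upsampled.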

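enable_attention_slicing

( slice_size: typing.Union[int, str, NoneType] = 'auto' )

Parameters

slice_size (str or int, optional, defaults to "auto") — When "auto", halves the input to the attention heads, so attention is computed in two steps. If "max", the maximum amount of memory is saved by running only one slice at a time. If a number is provided, uses as many slices as attention_head_dim // slice_size. In this case, attention_head_dim must be a multiple of slice_size.

Enable sliced attention computation. When this option is enabled, the attention module splits the input tensor in slices to compute attention in several steps. For more than one attention head, the computation is performed sequentially over each head. This is useful to save some memory in exchange for a small speed decrease.

⚠️ Don't enable attention slicing if you're already using scaled_dot_product_attention (SDPA) from PyTorch 2.0 or xFormers. These attention computations are already very memory efficient, so you won't need to enable this function. If you enable attention slicing with SDPA or xFormers, it can lead to serious slowdowns!

Examples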

>>> import torch
>>> from diffusers import StableDiffusionPipeline

>>> pipe = StableDiffusionPipeline.from_pretrained(
...     "runwayml/stable-diffusion-v1-5",
...     torch_dtype=torch.float16,
...     use_safetensors=True,
... )

>>> prompt = "a photo of an astronaut riding a horse on mars"
>>> pipe.enable_attention_slicing()
>>> image = pipe(prompt).images[0]

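disable_attention_slicing

( )

Disable sliced attention computation. If enable_attention_slicing was previously called, attention is computed in one step.

enable_vae_slicing

( )

Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.

disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method goes back to computing decoding in one step.

enable_xformers_memory_efficient_attention

( attention_op: typing.Optional[typing.Callable] = None )

Parameters

attention_op (Callable, optional) — Override the default None operator for use as the op argument to the memory_efficient_attention() function of xFormers.

Enable memory efficient attention from xFormers. When this option is enabled, you should observe lower GPU memory usage and a potential speed up during inference. Speed up during training is not guaranteed.

⚠️ When memory efficient attention and sliced attention are both enabled, memory efficient attention takes precedence.

Examples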

>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from xformers.ops import MemoryEfficientAttentionFlashAttentionOp

>>> pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")
>>> pipe.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)
>>> # Workaround for not accepting attention shape using VAE for Flash Attention
>>> pipe.vae.enable_xformers_memory_efficient_attention(attention_op=None)

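disable_xformers_memory_efficient_attention

( )

Disable memory efficient attention from xFormers.

encode_prompt

( prompt: str prompt_2: typing.Optional[str] = None device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None negative_prompt_2: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text encoders.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, you must pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the prompt input argument.
negative_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds are generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that is applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.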
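
The fallback rule for the second text encoder (prompt_2 and negative_prompt_2 default to prompt and negative_prompt) can be sketched in a few lines. resolve_prompts is a hypothetical helper that only models this fallback; tokenization and encoding are left out.

```python
from typing import Dict, Optional, Tuple


def resolve_prompts(
    prompt: str,
    prompt_2: Optional[str] = None,
    negative_prompt: Optional[str] = None,
    negative_prompt_2: Optional[str] = None,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Decide which (prompt, negative prompt) pair each text encoder receives."""
    if prompt_2 is None:
        prompt_2 = prompt  # reuse the main prompt for the second encoder
    if negative_prompt_2 is None:
        negative_prompt_2 = negative_prompt
    return {
        "text_encoder": (prompt, negative_prompt),
        "text_encoder_2": (prompt_2, negative_prompt_2),
    }


shared = resolve_prompts("a photo of a dog", negative_prompt="low quality")
split = resolve_prompts("a photo of a dog", prompt_2="oil painting style")
```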

get_guidance_scale_embedding

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

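Parameters

w (torch.Tensor) — The guidance scale values for which to generate embedding vectors, which subsequently enrich the timestep embeddings.
embedding_dim (int, optional, defaults to 512) — Dimension of the embeddings to generate.
dtype (torch.dtype, optional, defaults to torch.float32) — Data type of the generated embeddings.

Returns

torch.Tensor

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298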
