MultiDiffusion

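MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation is by Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel.

The abstract from the paper is:

Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.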
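As a rough illustration of the idea in the abstract, the sketch below applies a per-window operation to overlapping windows of a 1-D signal and averages the results wherever windows overlap. It is a toy under stated assumptions, not the pipeline's implementation: fuse_views and window_mean are hypothetical names, and window_mean merely stands in for a denoising step.

```python
from typing import Callable, List, Tuple


def fuse_views(
    latents: List[float],
    views: List[Tuple[int, int]],
    process_view: Callable[[List[float]], List[float]],
) -> List[float]:
    """Process each window independently, then average overlapping results."""
    value = [0.0] * len(latents)
    count = [0] * len(latents)
    for start, end in views:
        out = process_view(latents[start:end])
        for offset, v in enumerate(out):
            value[start + offset] += v
            count[start + offset] += 1
    # Positions not covered by any window keep their input value.
    return [v / c if c else x for v, c, x in zip(value, count, latents)]


def window_mean(crop: List[float]) -> List[float]:
    """Hypothetical stand-in for a denoising step: flatten a crop to its mean."""
    mean = sum(crop) / len(crop)
    return [mean] * len(crop)


# Two windows of length 3 that overlap on positions 1 and 2.
fused = fuse_views([2.0, 4.0, 6.0, 8.0], [(0, 3), (1, 4)], window_mean)
```

Positions covered by both windows receive the average of the two per-window results, which is what lets neighbouring windows agree on shared content.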

You can find additional information about MultiDiffusion on the project page and in the original codebase, and you can try it out in a demo.

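Tips

When calling StableDiffusionPanoramaPipeline, you can set the view_batch_size parameter to a value greater than 1. On some high-performance GPUs this speeds up generation, at the cost of higher VRAM usage.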
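One way to picture what view_batch_size controls is as simple chunking of the list of views, so that several views are denoised in one forward pass. The helper below is a hypothetical sketch (batch_views is not a Diffusers function), and the 25 integers merely stand in for view coordinates.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def batch_views(views: Sequence[T], view_batch_size: int) -> List[List[T]]:
    """Split views into consecutive chunks of at most view_batch_size items."""
    if view_batch_size < 1:
        raise ValueError("view_batch_size must be >= 1")
    return [
        list(views[i : i + view_batch_size])
        for i in range(0, len(views), view_batch_size)
    ]


views = list(range(25))  # stand-in for 25 view tuples
batches = batch_views(views, view_batch_size=8)
```

With 25 views and a batch size of 8, the last batch holds a single view, so memory use is bounded by the batch size rather than by the number of views.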

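To generate panorama-like images, pass the width parameter accordingly. We recommend a width of 2048, which is the default.

Circular padding is applied to ensure there are no stitching artifacts when working with panoramas, so that the transition from the rightmost part to the leftmost part is seamless. With circular padding enabled (circular_padding=True), the operation takes an extra crop past the rightmost point of the image, allowing the model to "see" the transition from the rightmost part to the leftmost part. This helps maintain visual consistency in a 360-degree sense and creates a proper panorama that can be viewed with 360-degree panorama viewers. When decoding latents in Stable Diffusion, circular padding is applied as well, so that the decoded latents match in RGB space.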
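The "extra crop past the rightmost point" can be sketched on a single row of values: a window that runs off the right edge continues from the left edge. This is a minimal illustration with a hypothetical helper (crop_with_wrap); the pipeline operates on latent tensors rather than Python lists.

```python
from typing import List


def crop_with_wrap(row: List[int], start: int, end: int) -> List[int]:
    """Return row[start:end], wrapping past the right edge back to the left."""
    width = len(row)
    return [row[i % width] for i in range(start, end)]


row = list(range(10))
inside = crop_with_wrap(row, 2, 5)    # window fully inside the row
wrapped = crop_with_wrap(row, 7, 12)  # window that crosses the right edge
```

Because the wrapped window contains both the last and the first columns, whatever is generated there has to be consistent across the seam.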

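For example, without circular padding (the default), a stitching artifact is visible. With circular padding (circular_padding=True), the right and left parts match.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.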

StableDiffusionPanoramaPipeline

class diffusers.StableDiffusionPanoramaPipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel scheduler: DDIMScheduler safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor image_encoder: typing.Optional[transformers.models.clip.modeling_clip.CLIPVisionModelWithProjection] = None requires_safety_checker: bool = True )

__call__

( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = 512 width: typing.Optional[int] = 2048 num_inference_steps: int = 50 timesteps: typing.List[int] = None guidance_scale: float = 7.5 view_batch_size: int = 1 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 circular_padding: bool = False clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs: typing.Any ) → StableDiffusionPipelineOutput or tuple

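Parameters

prompt (str or List[str], optional): The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
height (int, optional, defaults to 512): The height in pixels of the generated image.
width (int, optional, defaults to 2048): The width in pixels of the generated image. The width is kept high because the pipeline is meant to generate panorama-like images.
num_inference_steps (int, optional, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
timesteps (List[int], optional): The timesteps at which to generate the image. If not specified, the scheduler's default timestep spacing strategy is used.
guidance_scale (float, optional, defaults to 7.5): A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance is enabled when guidance_scale > 1.
view_batch_size (int, optional, defaults to 1): The batch size used to denoise the split views. On some high-performance GPUs, a higher view batch size speeds up generation at the cost of higher VRAM usage.
negative_prompt (str or List[str], optional): The prompt or prompts to guide what not to include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale <= 1).
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
eta (float, optional, defaults to 0.0): Corresponds to the parameter eta (η) from the DDIM paper. Only applies to DDIMScheduler and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional): A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional): Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
ip_adapter_image (PipelineImageInput, optional): Optional image input to work with IP-Adapters.
ip_adapter_image_embeds (List[torch.Tensor], optional): Pre-generated image embeddings for IP-Adapter. It should be a list with the same length as the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between PIL.Image and np.array.
return_dict (bool, optional, defaults to True): Whether to return a StableDiffusionPipelineOutput instead of a plain tuple.
cross_attention_kwargs (dict, optional): A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
guidance_rescale (float, optional, defaults to 0.0): A rescaling factor for the guidance embeddings. A value of 0.0 means no rescaling is applied.
circular_padding (bool, optional, defaults to False): If set to True, circular padding is applied to ensure there are no stitching artifacts. Circular padding allows the model to seamlessly generate a transition from the rightmost part of the image to the leftmost part, maintaining consistency in a 360-degree sense.
clip_skip (int, optional): Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.
callback_on_step_end (Callable, optional): A function called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs includes all tensors specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List[str], optional): The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list are passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.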
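The effect of guidance_scale in the list above can be made concrete with the standard classifier-free guidance combination: the prediction moves from the unconditional output toward the text-conditioned one, scaled by guidance_scale. The helper below is a hypothetical sketch on plain lists (apply_guidance is not a Diffusers function); the pipeline works on noise-prediction tensors.

```python
from typing import List


def apply_guidance(
    noise_uncond: List[float],
    noise_text: List[float],
    guidance_scale: float,
) -> List[float]:
    """Classifier-free guidance: uncond + scale * (text - uncond)."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0]  # toy unconditional prediction
text = [1.0, 3.0]    # toy text-conditioned prediction

guided = apply_guidance(uncond, text, guidance_scale=7.5)
neutral = apply_guidance(uncond, text, guidance_scale=1.0)
```

At a scale of exactly 1 the result equals the text-conditioned prediction, which is why guidance, and with it negative_prompt, only has an effect for values above 1.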

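StableDiffusionPipelineOutput or tuple

If return_dict is True, a StableDiffusionPipelineOutput is returned; otherwise a tuple is returned, where the first element is a list of the generated images and the second element is a list of bools indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content.

The call function to the pipeline for generation.

Example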

>>> import torch
>>> from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

>>> model_ckpt = "stabilityai/stable-diffusion-2-base"
>>> scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler")
>>> pipe = StableDiffusionPanoramaPipeline.from_pretrained(
...     model_ckpt, scheduler=scheduler, torch_dtype=torch.float16
... )

>>> pipe = pipe.to("cuda")

>>> prompt = "a photo of the dolomites"
>>> image = pipe(prompt).images[0]

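decode_latents_with_padding

( latents: Tensor padding: int = 8 ) → torch.Tensor

Parameters

latents (torch.Tensor): The input latents to decode.
padding (int, optional): The number of latents to add on each side for circular inference. Defaults to 8.

torch.Tensor

The decoded image with the padding removed.

Decode the given latents with padding for circular inference.

Notes

Padding is added to remove boundary artifacts and improve output quality.
This slightly increases memory usage.
The padded pixels are then removed from the decoded image.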
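The pad, decode, crop sequence described in the notes can be sketched on a single row of latents. Everything here is an illustrative assumption: toy_decode is a hypothetical stand-in for the VAE decoder that repeats each latent 8 times, and SCALE = 8 is the assumed latent-to-pixel factor.

```python
from typing import Callable, List

Row = List[float]
SCALE = 8  # assumed number of output pixels per latent column


def toy_decode(row: Row) -> Row:
    """Hypothetical decoder stand-in: repeat each latent SCALE times."""
    return [x for x in row for _ in range(SCALE)]


def decode_with_padding(latents: Row, decode: Callable[[Row], Row], padding: int = 8) -> Row:
    """Wrap-pad a latent row, decode it, then crop the padded pixels."""
    if padding <= 0:
        return decode(latents)
    # The right edge is copied to the left side and the left edge to the right side.
    padded = latents[-padding:] + latents + latents[:padding]
    image = decode(padded)
    pad_pixels = padding * SCALE
    return image[pad_pixels:-pad_pixels]


latents = [float(i) for i in range(16)]
image = decode_with_padding(latents, toy_decode, padding=8)
```

With this purely local toy decoder the padded and unpadded results are identical; the padding matters for a real decoder because its output near each edge depends on neighbouring latents, which the wrap-around copy supplies.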

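encode_prompt

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

prompt (str or List[str], optional): The prompt to be encoded.
device (torch.device): The torch device.
num_images_per_prompt (int): The number of images that should be generated per prompt.
do_classifier_free_guidance (bool): Whether to use classifier-free guidance.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (that is, when guidance_scale is not greater than 1).
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, for example prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, for example prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
lora_scale (float, optional): A LoRA scale that is applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional): Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.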

get_guidance_scale_embedding

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

w (torch.Tensor): The guidance scale values for which embedding vectors are generated; these embeddings subsequently enrich the timestep embeddings.
embedding_dim (int, optional, defaults to 512): Dimension of the embeddings to generate.
dtype (torch.dtype, optional, defaults to torch.float32): Data type of the generated embeddings.

torch.Tensor

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298
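For intuition, here is a pure-Python sketch of a sinusoidal embedding of this kind, producing one row of sines followed by cosines per guidance value. The function name guidance_scale_embedding_sketch is hypothetical, and the 1000x input scaling and the 10000 frequency base are assumptions about the helper's constants rather than values stated on this page.

```python
import math
from typing import List


def guidance_scale_embedding_sketch(w: List[float], embedding_dim: int = 512) -> List[List[float]]:
    """Sinusoidal embedding: one row of [sin..., cos...] per guidance value."""
    half_dim = embedding_dim // 2
    if half_dim < 2:
        raise ValueError("embedding_dim must be at least 4")
    # Geometrically spaced frequencies from 1 down to 1/10000 (assumed base).
    step = math.log(10000.0) / (half_dim - 1)
    freqs = [math.exp(-step * i) for i in range(half_dim)]
    rows = []
    for scale in w:
        angles = [scale * 1000.0 * f for f in freqs]  # assumed 1000x scaling
        row = [math.sin(a) for a in angles] + [math.cos(a) for a in angles]
        if embedding_dim % 2 == 1:
            row.append(0.0)  # zero-pad odd dimensions
        rows.append(row)
    return rows


emb = guidance_scale_embedding_sketch([6.5, 0.0], embedding_dim=8)
```

The result has len(w) rows and embedding_dim columns, matching the documented output shape.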

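get_views

( panorama_height: int panorama_width: int window_size: int = 64 stride: int = 8 circular_padding: bool = False ) → List[Tuple[int, int, int, int]]

Parameters

panorama_height (int): The height of the panorama.
panorama_width (int): The width of the panorama.
window_size (int, optional): The size of the window. Defaults to 64.
stride (int, optional): The stride value. Defaults to 8.
circular_padding (bool, optional): Whether to apply circular padding. Defaults to False.

List[Tuple[int, int, int, int]]

A list of tuples representing the views. Each tuple contains four integers giving the start and end coordinates of the window in the panorama.

Generates a list of views based on the given parameters. This defines the mappings F_i (see Eq. 7 in the MultiDiffusion paper, https://huggingface.co/papers/2302.08113). If the panorama's height or width is smaller than the window size, a single block is returned along that dimension.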
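The sliding-window layout can be sketched as follows. This is a standalone approximation, not the method itself: get_views_sketch is a hypothetical name, and it assumes that pixel dimensions are first divided by a VAE scale factor of 8 so that window_size and stride are measured in latent units.

```python
from typing import List, Tuple

View = Tuple[int, int, int, int]  # (h_start, h_end, w_start, w_end)


def get_views_sketch(
    panorama_height: int,
    panorama_width: int,
    window_size: int = 64,
    stride: int = 8,
    circular_padding: bool = False,
    vae_scale: int = 8,  # assumed pixel-to-latent factor
) -> List[View]:
    latent_h = panorama_height // vae_scale
    latent_w = panorama_width // vae_scale
    blocks_h = (latent_h - window_size) // stride + 1 if latent_h > window_size else 1
    if circular_padding:
        # Windows may start anywhere along the width and wrap past the right edge.
        blocks_w = latent_w // stride if latent_w > window_size else 1
    else:
        blocks_w = (latent_w - window_size) // stride + 1 if latent_w > window_size else 1
    views = []
    for i in range(blocks_h * blocks_w):
        h_start = (i // blocks_w) * stride
        w_start = (i % blocks_w) * stride
        views.append((h_start, h_start + window_size, w_start, w_start + window_size))
    return views


plain = get_views_sketch(512, 2048)
wrapped = get_views_sketch(512, 2048, circular_padding=True)
```

Under these assumptions the default 512 x 2048 panorama yields 25 views without circular padding and 32 with it; in the circular case the last windows end beyond the latent width of 256, which is where the wrap-around crop comes in.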

StableDiffusionPipelineOutput

class diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] nsfw_content_detected: typing.Optional[typing.List[bool]] )

Parameters

images (List[PIL.Image.Image] or np.ndarray): List of denoised PIL images of length batch_size, or a NumPy array of shape (batch_size, height, width, num_channels).
nsfw_content_detected (List[bool]): List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content, or None if safety checking could not be performed.

Output class for Stable Diffusion pipelines.
