Kandinsky 2.2

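Kandinsky 2.2 was created by Arseniy Shakhmatov, Anton Razzhigaev, Aleksandr Nikolich, Vladimir Arkhipkin, Igor Pavlov, Andrey Kuznetsov, and Denis Dimitrov.

The description from its GitHub page is:

Kandinsky 2.2 brings substantial improvements upon its predecessor, Kandinsky 2.1, by introducing a new, more powerful image encoder - CLIP-ViT-G and the ControlNet support. The switch to CLIP-ViT-G as the image encoder significantly increases the model's capability to generate more aesthetic pictures and better understand text, thus enhancing the model's overall performance. The addition of the ControlNet mechanism allows the model to effectively control the process of generating images. This leads to more accurate and visually appealing outputs and opens new possibilities for text-guided image manipulation.

The original codebase can be found at ai-forever/Kandinsky-2.

Check out the Kandinsky Community organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting.

Make sure to check out the schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.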

KandinskyV22PriorPipeline

class diffusers.KandinskyV22PriorPipeline

< source >

( prior: PriorTransformer image_encoder: CLIPVisionModelWithProjection text_encoder: CLIPTextModelWithProjection tokenizer: CLIPTokenizer scheduler: UnCLIPScheduler image_processor: CLIPImageProcessor )

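Parameters

prior (PriorTransformer) — The canonical unCLIP prior used to approximate the image embedding from the text embedding.
image_encoder (CLIPVisionModelWithProjection) — Frozen image encoder.
text_encoder (CLIPTextModelWithProjection) — Frozen text encoder.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
scheduler (UnCLIPScheduler) — A scheduler used in combination with prior to generate image embeddings.
image_processor (CLIPImageProcessor) — An image processor used to preprocess images for CLIP.

Pipeline for generating the image prior for Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).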

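__call__

< source >

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: int = 1 num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None guidance_scale: float = 4.0 output_type: typing.Optional[str] = 'pt' return_dict: bool = True callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] ) → KandinskyPriorPipelineOutput or tuple

Parameters

prompt (str or List[str]) — The prompt or prompts to guide the image generation.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
num_inference_steps (int, optional, defaults to 25) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
output_type (str, optional, defaults to "pt") — The output format of the generated image. Choose between "np" (np.array) or "pt" (torch.Tensor).
return_dict (bool, optional, defaults to True) — Whether or not to return a KandinskyPriorPipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, optional) — A function called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.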
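
The effect of guidance_scale can be made concrete with a small sketch. It assumes the usual classifier-free guidance combination, in which the unconditional prediction is pushed toward the conditional one by a factor of guidance_scale. The helper name apply_guidance and the toy vectors are illustrative only and are not part of the diffusers API.

```python
from typing import List


def apply_guidance(uncond: List[float], cond: List[float], guidance_scale: float) -> List[float]:
    """Combine two predictions element-wise: uncond + scale * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy "predictions" standing in for model outputs (assumed values).
uncond = [0.0, 1.0, 2.0]
cond = [1.0, 1.0, 0.0]

# With a scale of 1 the result is exactly the conditional prediction.
unguided = apply_guidance(uncond, cond, guidance_scale=1.0)

# A larger scale extrapolates past the conditional prediction.
guided = apply_guidance(uncond, cond, guidance_scale=4.0)
```

In this sketch a scale of 1.0 leaves the conditional prediction unchanged, which is consistent with guidance only taking effect for guidance_scale > 1.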

Returns

KandinskyPriorPipelineOutput or tuple

Function invoked when calling the pipeline for generation.

Example

>>> from diffusers import KandinskyV22Pipeline, KandinskyV22PriorPipeline
>>> import torch

>>> pipe_prior = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior")
>>> pipe_prior.to("cuda")
>>> prompt = "red cat, 4k photo"
>>> image_emb, negative_image_emb = pipe_prior(prompt).to_tuple()

>>> pipe = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder")
>>> pipe.to("cuda")
>>> image = pipe(
...     image_embeds=image_emb,
...     negative_image_embeds=negative_image_emb,
...     height=768,
...     width=768,
...     num_inference_steps=50,
... ).images
>>> image[0].save("cat.png")

interpolate

< source >

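( images_and_prompts: typing.List[typing.Union[str, PIL.Image.Image, torch.Tensor]] weights: typing.List[float] num_images_per_prompt: int = 1 num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None negative_prior_prompt: typing.Optional[str] = None negative_prompt: str = '' guidance_scale: float = 4.0 device = None ) → KandinskyPriorPipelineOutput or tuple

Parameters

images_and_prompts (List[Union[str, PIL.Image.Image, torch.Tensor]]) — List of prompts and images to guide the image generation.
weights (List[float]) — List of weights for each condition in images_and_prompts.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
num_inference_steps (int, optional, defaults to 25) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
negative_prior_prompt (str, optional) — The prompt not to guide the prior diffusion process. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
negative_prompt (str or List[str], optional) — The prompt not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.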
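
To illustrate how weights relates to images_and_prompts, here is a minimal sketch that assumes each condition is first turned into an embedding and the embeddings are then mixed as a weighted sum. The function interpolate_embeddings and the two-dimensional toy embeddings are hypothetical; the actual pipeline works on CLIP embeddings and may differ in detail.

```python
from typing import List


def interpolate_embeddings(embeddings: List[List[float]], weights: List[float]) -> List[float]:
    """Return the weighted sum of equally sized embedding vectors."""
    if len(embeddings) != len(weights):
        raise ValueError("embeddings and weights must have the same length")
    mixed = [0.0] * len(embeddings[0])
    for embedding, weight in zip(embeddings, weights):
        for i, value in enumerate(embedding):
            mixed[i] += weight * value
    return mixed


# Three toy conditions (for example one prompt and two images), assumed values.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.25, 0.25, 0.5]
mixed = interpolate_embeddings(embeddings, weights)
```

Under this assumption, one weight is needed per entry of images_and_prompts, and a larger weight pulls the mixed embedding toward that condition.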

Returns

KandinskyPriorPipelineOutput or tuple

Function invoked when using the prior pipeline for interpolation.

Example

>>> from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
>>> from diffusers.utils import load_image
>>> import PIL
>>> import torch
>>> from torchvision import transforms

>>> pipe_prior = KandinskyV22PriorPipeline.from_pretrained(
...     "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
... )
>>> pipe_prior.to("cuda")
>>> img1 = load_image(
...     "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
...     "/kandinsky/cat.png"
... )
>>> img2 = load_image(
...     "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
...     "/kandinsky/starry_night.jpeg"
... )
>>> images_texts = ["a cat", img1, img2]
>>> weights = [0.3, 0.3, 0.4]
>>> out = pipe_prior.interpolate(images_texts, weights)
>>> pipe = KandinskyV22Pipeline.from_pretrained(
...     "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
... )
>>> pipe.to("cuda")
>>> image = pipe(
...     image_embeds=out.image_embeds,
...     negative_image_embeds=out.negative_image_embeds,
...     height=768,
...     width=768,
...     num_inference_steps=50,
... ).images[0]
>>> image.save("starry_cat.png")

KandinskyV22Pipeline

class diffusers.KandinskyV22Pipeline

< source >

( unet: UNet2DConditionModel scheduler: DDPMScheduler movq: VQModel )

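Parameters

scheduler (Union[DDIMScheduler,DDPMScheduler]) — A scheduler used in combination with unet to generate image latents.
unet (UNet2DConditionModel) — Conditional U-Net architecture used to denoise the image embedding.
movq (VQModel) — MoVQ decoder used to generate the image from the latents.

Pipeline for text-to-image generation using Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).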

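__call__

< source >

( image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] negative_image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] height: int = 512 width: int = 512 num_inference_steps: int = 100 guidance_scale: float = 4.0 num_images_per_prompt: int = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → ImagePipelineOutput or tuple

Parameters

image_embeds (torch.Tensor or List[torch.Tensor]) — The CLIP image embeddings for the text prompt, used to condition the image generation.
negative_image_embeds (torch.Tensor or List[torch.Tensor]) — The CLIP image embeddings for the negative text prompt, used to condition the image generation.
height (int, optional, defaults to 512) — The height in pixels of the generated image.
width (int, optional, defaults to 512) — The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array) or "pt" (torch.Tensor).
return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, optional) — A function called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.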
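
The shape of a callback_on_step_end function can be sketched as follows. The callback log_step takes the four arguments listed above and hands callback_kwargs back to the caller. The loop fake_denoising_loop is a hypothetical stand-in written only so the sketch runs without a model; it is not diffusers code, and the assumption that the loop reads tensors back from the returned dictionary should be checked against the library.

```python
from typing import Any, Callable, Dict, List

seen_steps: List[int] = []


def log_step(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Record the step index and return the tensors unchanged."""
    seen_steps.append(step)
    return callback_kwargs


def fake_denoising_loop(
    callback: Callable[[Any, int, int, Dict[str, Any]], Dict[str, Any]],
    timesteps: List[int],
    tensor_inputs: List[str],
) -> List[float]:
    """Hypothetical stand-in for a pipeline loop; it is not diffusers code."""
    state: Dict[str, Any] = {"latents": [0.0]}
    for step, timestep in enumerate(timesteps):
        # Only the names listed in tensor_inputs are exposed to the callback.
        callback_kwargs = {name: state[name] for name in tensor_inputs}
        returned = callback(None, step, timestep, callback_kwargs)
        # Values handed back by the callback replace the loop's own copies.
        state.update(returned)
    return state["latents"]


final_latents = fake_denoising_loop(log_step, [999, 500, 0], ["latents"])
```

With a real pipeline, a function shaped like log_step would be passed as callback_on_step_end, with callback_on_step_end_tensor_inputs left at its default of ['latents'].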

Returns

ImagePipelineOutput or tuple

Function invoked when calling the pipeline for generation.

Example

>>> from diffusers import KandinskyV22Pipeline, KandinskyV22PriorPipeline
>>> import torch

>>> pipe_prior = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior")
>>> pipe_prior.to("cuda")
>>> prompt = "red cat, 4k photo"
>>> out = pipe_prior(prompt)
>>> image_emb = out.image_embeds
>>> zero_image_emb = out.negative_image_embeds
>>> pipe = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder")
>>> pipe.to("cuda")
>>> image = pipe(
...     image_embeds=image_emb,
...     negative_image_embeds=zero_image_emb,
...     height=768,
...     width=768,
...     num_inference_steps=50,
... ).images
>>> image[0].save("cat.png")

KandinskyV22CombinedPipeline

class diffusers.KandinskyV22CombinedPipeline

< source >

( unet: UNet2DConditionModel scheduler: DDPMScheduler movq: VQModel prior_prior: PriorTransformer prior_image_encoder: CLIPVisionModelWithProjection prior_text_encoder: CLIPTextModelWithProjection prior_tokenizer: CLIPTokenizer prior_scheduler: UnCLIPScheduler prior_image_processor: CLIPImageProcessor )

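Parameters

scheduler (Union[DDIMScheduler,DDPMScheduler]) — A scheduler used in combination with unet to generate image latents.
unet (UNet2DConditionModel) — Conditional U-Net architecture used to denoise the image embedding.
movq (VQModel) — MoVQ decoder used to generate the image from the latents.
prior_prior (PriorTransformer) — The canonical unCLIP prior used to approximate the image embedding from the text embedding.
prior_image_encoder (CLIPVisionModelWithProjection) — Frozen image encoder.
prior_text_encoder (CLIPTextModelWithProjection) — Frozen text encoder.
prior_tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
prior_scheduler (UnCLIPScheduler) — A scheduler used in combination with prior to generate image embeddings.
prior_image_processor (CLIPImageProcessor) — An image processor used to preprocess images for CLIP.

Combined pipeline for text-to-image generation using Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).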

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_inference_steps: int = 100 guidance_scale: float = 4.0 num_images_per_prompt: int = 1 height: int = 512 width: int = 512 prior_guidance_scale: float = 4.0 prior_num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 return_dict: bool = True prior_callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None prior_callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] ) → ImagePipelineOutput or tuple

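Parameters

prompt (str or List[str]) — The prompt or prompts to guide the image generation.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
num_inference_steps (int, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
height (int, optional, defaults to 512) — The height in pixels of the generated image.
width (int, optional, defaults to 512) — The width in pixels of the generated image.
prior_guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
prior_num_inference_steps (int, optional, defaults to 25) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array) or "pt" (torch.Tensor).
return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
prior_callback_on_step_end (Callable, optional) — A function called at the end of each denoising step during inference of the prior pipeline. The function is called with the following arguments: prior_callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict).
prior_callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the prior_callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of the prior pipeline class.
callback_on_step_end (Callable, optional) — A function called at the end of each denoising step during inference of the decoder pipeline. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

Returns

ImagePipelineOutput or tuple

Function invoked when calling the pipeline for generation.

Example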

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"

image = pipe(prompt=prompt, num_inference_steps=25).images[0]

enable_sequential_cpu_offload

< source >

( gpu_id: typing.Optional[int] = None device: typing.Union[torch.device, str] = 'cuda' )

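Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the unet, text_encoder, vae and safety checker have their state dicts saved to CPU, are then moved to torch.device('meta'), and are loaded to the GPU only when their specific submodule has its forward method called. Note that offloading happens on a submodule basis. Memory savings are higher than with enable_model_cpu_offload, but performance is lower.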

KandinskyV22ControlnetPipeline

class diffusers.KandinskyV22ControlnetPipeline

< source >

( unet: UNet2DConditionModel scheduler: DDPMScheduler movq: VQModel )

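Parameters

scheduler (DDIMScheduler) — A scheduler used in combination with unet to generate image latents.
unet (UNet2DConditionModel) — Conditional U-Net architecture used to denoise the image embedding.
movq (VQModel) — MoVQ decoder used to generate the image from the latents.

Pipeline for text-to-image generation using Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).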

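__call__

< source >

( image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] negative_image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] hint: Tensor height: int = 512 width: int = 512 num_inference_steps: int = 100 guidance_scale: float = 4.0 num_images_per_prompt: int = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 return_dict: bool = True ) → ImagePipelineOutput or tuple

Parameters

prompt (str or List[str]) — The prompt or prompts to guide the image generation.
hint (torch.Tensor) — The ControlNet condition.
image_embeds (torch.Tensor or List[torch.Tensor]) — The CLIP image embeddings for the text prompt, used to condition the image generation.
negative_image_embeds (torch.Tensor or List[torch.Tensor]) — The CLIP image embeddings for the negative text prompt, used to condition the image generation.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
height (int, optional, defaults to 512) — The height in pixels of the generated image.
width (int, optional, defaults to 512) — The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array) or "pt" (torch.Tensor).
callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.

Returns

ImagePipelineOutput or tuple

Function invoked when calling the pipeline for generation.

Example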

KandinskyV22PriorEmb2EmbPipeline

class diffusers.KandinskyV22PriorEmb2EmbPipeline

< source >

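Parameters

prior (PriorTransformer) — The canonical unCLIP prior used to approximate the image embedding from the text embedding.
image_encoder (CLIPVisionModelWithProjection) — Frozen image encoder.
text_encoder (CLIPTextModelWithProjection) — Frozen text encoder.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
scheduler (UnCLIPScheduler) — A scheduler used in combination with prior to generate image embeddings.

Pipeline for generating the image prior for Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).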

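__call__

< source >

( prompt: typing.Union[str, typing.List[str]] image: typing.Union[torch.Tensor, typing.List[torch.Tensor], PIL.Image.Image, typing.List[PIL.Image.Image]] strength: float = 0.3 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: int = 1 num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None guidance_scale: float = 4.0 output_type: typing.Optional[str] = 'pt' return_dict: bool = True ) → KandinskyPriorPipelineOutput or tuple

Parameters

prompt (str or List[str]) — The prompt or prompts to guide the image generation.
strength (float, optional, defaults to 0.3) — Conceptually, indicates how much to transform the reference emb. Must be between 0 and 1. image is used as a starting point, and the larger the strength, the more noise is added to it. The number of denoising steps depends on the amount of noise initially added.
emb (torch.Tensor) — The image embedding.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
num_inference_steps (int, optional, defaults to 25) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators to make generation deterministic.
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
output_type (str, optional, defaults to "pt") — The output format of the generated image. Choose between "np" (np.array) or "pt" (torch.Tensor).
return_dict (bool, optional, defaults to True) — Whether or not to return a KandinskyPriorPipelineOutput instead of a plain tuple.

Returns

KandinskyPriorPipelineOutput or tuple

Function invoked when calling the pipeline for generation.

Example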

>>> from diffusers import KandinskyV22Pipeline, KandinskyV22PriorEmb2EmbPipeline
>>> from diffusers.utils import load_image
>>> import torch

>>> # Load the prior that maps a prompt and a reference image to image embeddings.
>>> pipe_prior = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
...     "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
... )
>>> pipe_prior.to("cuda")

>>> prompt = "red cat, 4k photo"
>>> img = load_image(
...     "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
...     "/kandinsky/cat.png"
... )
>>> image_emb, negative_image_emb = pipe_prior(prompt, image=img, strength=0.2).to_tuple()

>>> # Load the decoder that turns the embeddings into an image.
>>> pipe = KandinskyV22Pipeline.from_pretrained(
...     "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
... )
>>> pipe.to("cuda")

>>> image = pipe(
...     image_embeds=image_emb,
...     negative_image_embeds=negative_image_emb,
...     height=768,
...     width=768,
...     num_inference_steps=100,
... ).images

>>> image[0].save("cat.png")

interpolate

< source >

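Parameters

images_and_prompts (List[Union[str, PIL.Image.Image, torch.Tensor]]) — List of prompts and images to guide the image generation.
weights (List[float]) — List of weights for each condition in images_and_prompts.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
num_inference_steps (int, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
negative_prior_prompt (str, optional) — The prompt not to guide the prior diffusion process. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
negative_prompt (str or List[str], optional) — The prompt not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.

Returns

KandinskyPriorPipelineOutput or tuple

Function invoked when using the prior pipeline for interpolation.

Example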

>>> from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22Pipeline
>>> from diffusers.utils import load_image
>>> import PIL

>>> import torch
>>> from torchvision import transforms

>>> pipe_prior = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
...     "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
... )
>>> pipe_prior.to("cuda")

>>> img1 = load_image(
...     "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
...     "/kandinsky/cat.png"
... )

>>> img2 = load_image(
...     "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
...     "/kandinsky/starry_night.jpeg"
... )

>>> images_texts = ["a cat", img1, img2]
>>> weights = [0.3, 0.3, 0.4]
>>> image_emb, zero_image_emb = pipe_prior.interpolate(images_texts, weights).to_tuple()

>>> pipe = KandinskyV22Pipeline.from_pretrained(
...     "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
... )
>>> pipe.to("cuda")

>>> image = pipe(
...     image_embeds=image_emb,
...     negative_image_embeds=zero_image_emb,
...     height=768,
...     width=768,
...     num_inference_steps=150,
... ).images[0]

>>> image.save("starry_cat.png")

KandinskyV22Img2ImgPipeline

class diffusers.KandinskyV22Img2ImgPipeline

< source >

( unet: UNet2DConditionModel scheduler: DDPMScheduler movq: VQModel )

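Parameters

scheduler (DDIMScheduler) — A scheduler used in combination with unet to generate image latents.
unet (UNet2DConditionModel) — Conditional U-Net architecture used to denoise the image embedding.
movq (VQModel) — MoVQ decoder used to generate the image from the latents.

Pipeline for image-to-image generation using Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).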

__call__

< source >

( image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] image: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[torch.Tensor], typing.List[PIL.Image.Image]] negative_image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] height: int = 512 width: int = 512 num_inference_steps: int = 100 guidance_scale: float = 4.0 strength: float = 0.3 num_images_per_prompt: int = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → ImagePipelineOutput or tuple

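Parameters

image_embeds (torch.Tensor or List[torch.Tensor]) — The CLIP image embeddings for the text prompt, used to condition the image generation.
image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, or tensor representing an image batch, that will be used as the starting point for the process. Can also accept image latents as image; if latents are passed directly, they will not be encoded again.
strength (float, optional, defaults to 0.3) — Conceptually, indicates how much to transform the reference image. Must be between 0 and 1. image is used as a starting point, and the larger the strength, the more noise is added to it. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise is maximal and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1, therefore, essentially ignores image.
negative_image_embeds (torch.Tensor or List[torch.Tensor]) — The CLIP image embeddings for the negative text prompt, used to condition the image generation.
height (int, optional, defaults to 512) — The height in pixels of the generated image.
width (int, optional, defaults to 512) — The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators to make generation deterministic.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array) or "pt" (torch.Tensor).
return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, optional) — A function called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.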
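
The relationship between strength and the number of denoising steps that actually run can be sketched as below. This assumes the convention used by many image-to-image pipelines, where the first part of the schedule is skipped and only about num_inference_steps * strength steps are executed. The function name steps_actually_run is hypothetical, and the exact rule in this pipeline is an assumption to verify against the source.

```python
def steps_actually_run(num_inference_steps: int, strength: float) -> int:
    """Estimate how many denoising steps execute for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Truncate toward zero and never exceed the requested number of steps.
    return min(int(num_inference_steps * strength), num_inference_steps)


default_case = steps_actually_run(100, 0.3)  # the documented defaults
half_case = steps_actually_run(100, 0.5)
full_case = steps_actually_run(100, 1.0)     # the input image is effectively ignored
```

Under this assumption, lowering strength both preserves more of the input image and shortens inference, since fewer steps are executed.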

Returns

ImagePipelineOutput or tuple

Function invoked when calling the pipeline for generation.

Example

KandinskyV22Img2ImgCombinedPipeline

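class diffusers.KandinskyV22Img2ImgCombinedPipeline

< source >

Parameters

scheduler (Union[DDIMScheduler,DDPMScheduler]) — A scheduler used in combination with unet to generate image latents.
unet (UNet2DConditionModel) — Conditional U-Net architecture used to denoise the image embedding.
movq (VQModel) — MoVQ decoder used to generate the image from the latents.
prior_prior (PriorTransformer) — The canonical unCLIP prior used to approximate the image embedding from the text embedding.
prior_image_encoder (CLIPVisionModelWithProjection) — Frozen image encoder.
prior_text_encoder (CLIPTextModelWithProjection) — Frozen text encoder.
prior_tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
prior_scheduler (UnCLIPScheduler) — A scheduler used in combination with prior to generate image embeddings.
prior_image_processor (CLIPImageProcessor) — An image processor used to preprocess images for CLIP.

Combined pipeline for image-to-image generation using Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).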

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] image: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[torch.Tensor], typing.List[PIL.Image.Image]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_inference_steps: int = 100 guidance_scale: float = 4.0 strength: float = 0.3 num_images_per_prompt: int = 1 height: int = 512 width: int = 512 prior_guidance_scale: float = 4.0 prior_num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 return_dict: bool = True prior_callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None prior_callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] ) → ImagePipelineOutput or tuple

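Parameters

prompt (str or List[str]) — The prompt or prompts to guide the image generation.
image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, or tensor representing an image batch, that will be used as the starting point for the process. Can also accept image latents as image; if latents are passed directly, they will not be encoded again.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
strength (float, optional, defaults to 0.3) — Conceptually, indicates how much to transform the reference image. Must be between 0 and 1. image is used as a starting point, and the larger the strength, the more noise is added to it. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise is maximal and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1, therefore, essentially ignores image.
num_inference_steps (int, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
height (int, optional, defaults to 512) — The height in pixels of the generated image.
width (int, optional, defaults to 512) — The width in pixels of the generated image.
prior_guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
prior_num_inference_steps (int, optional, defaults to 25) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array) or "pt" (torch.Tensor).
callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.

Returns

ImagePipelineOutput or tuple

Function invoked when calling the pipeline for generation.

Example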

from diffusers import AutoPipelineForImage2Image
import torch
import requests
from io import BytesIO
from PIL import Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"

# Download the starting image and shrink it in place to fit within 768x768.
response = requests.get(url)
original_image = Image.open(BytesIO(response.content)).convert("RGB")
original_image.thumbnail((768, 768))

image = pipe(
    prompt=prompt, negative_prompt=negative_prompt, image=original_image, num_inference_steps=25
).images[0]

enable_model_cpu_offload

< source >

( gpu_id: typing.Optional[int] = None device: typing.Union[torch.device, str] = 'cuda' )

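Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared to enable_sequential_cpu_offload, this method moves one whole model at a time to the GPU when its forward method is called, and the model remains on the GPU until the next model runs. Memory savings are lower than with enable_sequential_cpu_offload, but performance is much better due to the iterative execution of the unet.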

enable_sequential_cpu_offload

< source >

( gpu_id: typing.Optional[int] = None device: typing.Union[torch.device, str] = 'cuda' )

KandinskyV22ControlnetImg2ImgPipeline

class diffusers.KandinskyV22ControlnetImg2ImgPipeline

< source >

( unet: UNet2DConditionModel scheduler: DDPMScheduler movq: VQModel )

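Parameters

scheduler (DDIMScheduler) — A scheduler used in combination with unet to generate image latents.
unet (UNet2DConditionModel) — Conditional U-Net architecture used to denoise the image embedding.
movq (VQModel) — MoVQ decoder used to generate the image from the latents.

Pipeline for image-to-image generation using Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).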

call

< source >

( image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] image: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[torch.Tensor], typing.List[PIL.Image.Image]] negative_image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] hint: Tensor height: int = 512 width: int = 512 num_inference_steps: int = 100 guidance_scale: float = 4.0 strength: float = 0.3 num_images_per_prompt: int = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None output_type: typing.Optional[str] = 'pil' callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 return_dict: bool = True ) → ImagePipelineOutput 或 tuple

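Parameters

image_embeds (torch.Tensor or List[torch.Tensor]) — The CLIP image embeddings for the text prompt, used to condition the image generation.
image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, or tensor representing an image batch, that will be used as the starting point for the process. Image latents can also be passed as image; if latents are passed directly, they are not encoded again.
strength (float, optional, defaults to 0.3) — Conceptually, indicates how much to transform the reference image. Must be between 0 and 1. image is used as a starting point, and the larger the strength, the more noise is added to it. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise is maximal and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 therefore essentially ignores image.
hint (torch.Tensor) — The controlnet condition.
negative_image_embeds (torch.Tensor or List[torch.Tensor]) — The CLIP image embeddings for the negative text prompt, used to condition the image generation.
height (int, optional, defaults to 512) — The height in pixels of the generated image.
width (int, optional, defaults to 512) — The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array) or "pt" (torch.Tensor).
callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.

Returns

ImagePipelineOutput or tuple

Function invoked when calling the pipeline for generation.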

Examples
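The relationship between strength and the number of denoising steps that actually run can be sketched in a few lines. This is a minimal illustration, assuming the pipeline truncates the scheduler's timestep list in proportion to strength, as image-to-image pipelines commonly do; the helper name effective_denoising_steps is hypothetical and is not part of the pipeline's API.

```python
def effective_denoising_steps(num_inference_steps: int, strength: float) -> int:
    """Return how many denoising iterations run for a given strength.

    Assumption: the first (1 - strength) share of the schedule is skipped,
    because the input image is only noised up to an intermediate timestep.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError(f"strength must be in [0, 1], got {strength}")
    # Number of steps that correspond to the noise added to the input image.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    # Index of the first timestep that is actually executed.
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start


for strength in (0.0, 0.3, 0.5, 1.0):
    print(strength, effective_denoising_steps(100, strength))
```

Under this assumption, the default strength of 0.3 with 100 inference steps runs 30 denoising iterations, and a strength of 1 runs all 100, which matches the statement that a value of 1 essentially ignores image.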

KandinskyV22InpaintPipeline

class diffusers.KandinskyV22InpaintPipeline

< source >

( unet: UNet2DConditionModel scheduler: DDPMScheduler movq: VQModel )

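Parameters

scheduler (DDIMScheduler) — A scheduler to be used in combination with unet to generate image latents.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the image embedding.
movq (VQModel) — MoVQ decoder to generate the image from the latents.

Pipeline for text-guided image inpainting using Kandinsky 2.2

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).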

call

< source >

( image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] image: typing.Union[torch.Tensor, PIL.Image.Image] mask_image: typing.Union[torch.Tensor, PIL.Image.Image, numpy.ndarray] negative_image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] height: int = 512 width: int = 512 num_inference_steps: int = 100 guidance_scale: float = 4.0 num_images_per_prompt: int = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → ImagePipelineOutput or tuple

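Parameters

image_embeds (torch.Tensor or List[torch.Tensor]) — The CLIP image embeddings for the text prompt, used to condition the image generation.
image (PIL.Image.Image) — Image, or tensor representing an image batch, which will be inpainted, i.e. parts of the image will be masked out with mask_image and repainted according to prompt.
mask_image (np.array) — Tensor representing an image batch, used to mask image. White pixels in the mask will be repainted, while black pixels will be preserved. If mask_image is a PIL image, it is converted to a single channel (luminance) before use. If it is a tensor, it should contain one color channel (L) instead of 3, so the expected shape is (B, H, W, 1).
negative_image_embeds (torch.Tensor or List[torch.Tensor]) — The CLIP image embeddings for the negative text prompt, used to condition the image generation.
height (int, optional, defaults to 512) — The height in pixels of the generated image.
width (int, optional, defaults to 512) — The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array) or "pt" (torch.Tensor).
return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

Returns

ImagePipelineOutput or tuple

Function invoked when calling the pipeline for generation.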
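The callback_on_step_end contract described in the parameter list can be sketched without loading a model. The toy loop below is a stand-in for the pipeline's denoising loop and is purely illustrative: toy_denoising_loop, record_latents, the list-of-floats "latents" and the timestep values are all hypothetical. It assumes, as is the usual convention for step-end callbacks in diffusers, that the callback returns the callback_kwargs dictionary so the loop can pick up any tensors the callback modified.

```python
from typing import Any, Callable, Dict, List, Tuple

# (step, timestep, largest absolute latent value) recorded at each step.
history: List[Tuple[int, int, float]] = []


def record_latents(
    pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """Step-end callback: inspect the latents, then hand the dict back."""
    latents = callback_kwargs["latents"]
    history.append((step, timestep, max(abs(v) for v in latents)))
    return callback_kwargs


def toy_denoising_loop(
    callback: Callable[[Any, int, int, Dict[str, Any]], Dict[str, Any]],
    tensor_inputs: List[str],
    timesteps: List[int],
) -> List[float]:
    """Hypothetical stand-in for a pipeline loop; not the real implementation."""
    latents = [1.0, -2.0, 0.5]
    for step, timestep in enumerate(timesteps):
        latents = [v * 0.5 for v in latents]  # placeholder for one denoising update
        available = {"latents": latents}
        # Only the tensors named in tensor_inputs are exposed to the callback.
        callback_kwargs = {name: available[name] for name in tensor_inputs}
        outputs = callback(None, step, timestep, callback_kwargs)
        latents = outputs.pop("latents", latents)
    return latents


final_latents = toy_denoising_loop(record_latents, ["latents"], [999, 666, 333])
print(history)
```

With a real pipeline, a function with the same four-argument shape would be passed as callback_on_step_end, and the names listed in callback_on_step_end_tensor_inputs would decide which tensors appear in callback_kwargs.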

Examples
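As a small illustration of the mask convention described for mask_image (white pixels are repainted, black pixels are preserved), the sketch below builds a rectangular mask with NumPy. The helper name rectangle_mask and the 768 x 768 image size are assumptions made for this example; the sketch only constructs the array and does not call the pipeline.

```python
import numpy as np


def rectangle_mask(
    height: int, width: int, top: int, bottom: int, left: int, right: int
) -> np.ndarray:
    """Return a float32 mask that is 1 (repaint) inside the rectangle
    and 0 (keep) everywhere else."""
    mask = np.zeros((height, width), dtype=np.float32)
    mask[top:bottom, left:right] = 1.0
    return mask


# Repaint a band at the top of a 768x768 image, leaving 250 columns untouched
# on each side.
mask = rectangle_mask(768, 768, top=0, bottom=250, left=250, right=768 - 250)
repaint_fraction = float(mask.mean())
print(mask.shape, repaint_fraction)
```

An array built this way could then be passed as mask_image alongside the image to inpaint; checking the repainted fraction beforehand is a cheap way to catch an inverted or empty mask.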

KandinskyV22InpaintCombinedPipeline

class diffusers.KandinskyV22InpaintCombinedPipeline

< source >

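Parameters

scheduler (Union[DDIMScheduler,DDPMScheduler]) — A scheduler to be used in combination with unet to generate image latents.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the image embedding.
movq (VQModel) — MoVQ decoder to generate the image from the latents.
prior_prior (PriorTransformer) — The canonical unCLIP prior to approximate the image embedding from the text embedding.
prior_image_encoder (CLIPVisionModelWithProjection) — Frozen image encoder.
prior_text_encoder (CLIPTextModelWithProjection) — Frozen text encoder.
prior_tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
prior_scheduler (UnCLIPScheduler) — A scheduler to be used in combination with prior to generate image embeddings.
prior_image_processor (CLIPImageProcessor) — An image_processor used to preprocess images for CLIP.

Combined pipeline for inpainting generation using Kandinsky

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).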

call

< source >

( prompt: typing.Union[str, typing.List[str]] image: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[torch.Tensor], typing.List[PIL.Image.Image]] mask_image: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[torch.Tensor], typing.List[PIL.Image.Image]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_inference_steps: int = 100 guidance_scale: float = 4.0 num_images_per_prompt: int = 1 height: int = 512 width: int = 512 prior_guidance_scale: float = 4.0 prior_num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True prior_callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None prior_callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → ImagePipelineOutput or tuple

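Parameters

prompt (str or List[str]) — The prompt or prompts to guide the image generation.
image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, or tensor representing an image batch, that will be used as the starting point for the process. Image latents can also be passed as image; if latents are passed directly, they are not encoded again.
mask_image (np.array) — Tensor representing an image batch, used to mask image. White pixels in the mask will be repainted, while black pixels will be preserved. If mask_image is a PIL image, it is converted to a single channel (luminance) before use. If it is a tensor, it should contain one color channel (L) instead of 3, so the expected shape is (B, H, W, 1).
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
num_inference_steps (int, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
height (int, optional, defaults to 512) — The height in pixels of the generated image.
width (int, optional, defaults to 512) — The width in pixels of the generated image.
prior_guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
prior_num_inference_steps (int, optional, defaults to 25) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array) or "pt" (torch.Tensor).
return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
prior_callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: prior_callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict).
prior_callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the prior_callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

Returns

ImagePipelineOutput or tuple

Function invoked when calling the pipeline for generation.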
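The effect of guidance_scale can be made concrete with the standard classifier-free guidance combination of an unconditional and a conditional noise prediction. The sketch below uses plain Python lists in place of tensors; apply_guidance is a hypothetical helper, and it is an assumption that the pipeline combines its two predictions in exactly this form.

```python
from typing import List


def apply_guidance(
    noise_uncond: List[float], noise_cond: List[float], guidance_scale: float
) -> List[float]:
    """Classifier-free guidance: start from the unconditional prediction and
    move along the direction (conditional - unconditional), scaled by
    guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


uncond = [0.0, 1.0]
cond = [1.0, 3.0]

print(apply_guidance(uncond, cond, 1.0))  # equals the conditional prediction
print(apply_guidance(uncond, cond, 4.0))  # pushed further toward the prompt
```

With a scale of 1 the result is simply the conditional prediction, which is why guidance only has an effect for guidance_scale > 1; larger values extrapolate further along the prompt direction, which is the source of the quality trade-off mentioned in the parameter description.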

Examples

from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image
import torch
import numpy as np

pipe = AutoPipelineForInpainting.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

original_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png"
)

mask = np.zeros((768, 768), dtype=np.float32)
# Let's mask out an area above the cat's head
mask[:250, 250:-250] = 1

image = pipe(prompt=prompt, image=original_image, mask_image=mask, num_inference_steps=25).images[0]

enable_sequential_cpu_offload

< source >

( gpu_id: typing.Optional[int] = None device: typing.Union[torch.device, str] = 'cuda' )
