Kandinsky 3

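Kandinsky 3 was created by Vladimir Arkhipkin, Anastasia Maltseva, Igor Pavlov, Andrei Filatov, Arseniy Shakhmatov, Andrey Kuznetsov, Denis Dimitrov, and Zein Shaheen.

The description from its GitHub page:

Kandinsky 3.0 is an open-source text-to-image diffusion model built upon the Kandinsky2-x model family. Compared with its predecessors, the model's text understanding and visual quality were improved by increasing the size of the text encoder and of the diffusion U-Net, respectively.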

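Its architecture includes 3 main components:

FLAN-UL2, an encoder-decoder model based on the T5 architecture.
A new U-Net architecture featuring BigGAN-deep blocks, which doubles the depth while maintaining the same number of parameters.
Sber-MoVQGAN, a decoder proven to have superior results in image restoration.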

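The original codebase can be found at ai-forever/Kandinsky-3.

Check out the Kandinsky Community organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting.

Make sure to check out the schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.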

Kandinsky3Pipeline

class diffusers.Kandinsky3Pipeline

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: Kandinsky3UNet scheduler: DDPMScheduler movq: VQModel )

__call__

( prompt: typing.Union[str, typing.List[str]] = None num_inference_steps: int = 25 guidance_scale: float = 3.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 height: typing.Optional[int] = 1024 width: typing.Optional[int] = 1024 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None negative_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True latents = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → ImagePipelineOutput or tuple

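Parameters

prompt (str or List[str], optional): The prompt or prompts to guide the image generation. If not defined, prompt_embeds must be passed instead.
num_inference_steps (int, optional, defaults to 25): The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
timesteps (List[int], optional): Custom timesteps to use for the denoising process. If not defined, num_inference_steps equally spaced timesteps are used. Must be in descending order.
guidance_scale (float, optional, defaults to 3.0): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when guidance is not used (i.e., when guidance_scale is not greater than 1).
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
height (int, optional, defaults to 1024): The height in pixels of the generated image.
width (int, optional, defaults to 1024): The width in pixels of the generated image.
eta (float, optional, defaults to 0.0): Corresponds to the parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler and is ignored for other schedulers.
generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generators used to make generation deterministic.
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
attention_mask (torch.Tensor, optional): Pre-generated attention mask. Must be provided if prompt_embeds is passed directly.
negative_attention_mask (torch.Tensor, optional): Pre-generated negative attention mask. Must be provided if negative_prompt_embeds is passed directly.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between PIL (PIL.Image.Image) and np.array.
return_dict (bool, optional, defaults to True): Whether or not to return an ImagePipelineOutput instead of a plain tuple.
callback (Callable, optional): A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1): The frequency at which the callback function is called. If not specified, the callback is called at every step.
clean_caption (bool, optional, defaults to True): Whether or not to clean the caption before creating embeddings. Requires beautifulsoup4 and ftfy to be installed. If the dependencies are not installed, the embeddings are created from the raw prompt.
cross_attention_kwargs (dict, optional): A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.

Returns

ImagePipelineOutput or tuple

Function invoked when calling the pipeline for generation.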
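
The guidance_scale description above refers to w in equation 2 of the Imagen paper. As a rough illustration of that formula only, the sketch below combines a conditional and an unconditional noise prediction as w * cond + (1 - w) * uncond. The function name apply_guidance and the use of plain Python lists are assumptions made for this example; the pipeline works on tensors, and its internal formulation may differ.

```python
from typing import List


def apply_guidance(cond: List[float], uncond: List[float], w: float) -> List[float]:
    """Imagen-style classifier-free guidance: w * cond + (1 - w) * uncond.

    Hypothetical helper for illustration; not part of the Diffusers API.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [w * c + (1.0 - w) * u for c, u in zip(cond, uncond)]


# Made-up noise predictions for a two-element "image".
cond = [1.0, 1.0]
uncond = [0.0, 1.0]

# w = 1 reproduces the conditional prediction (no extra guidance).
no_guidance = apply_guidance(cond, uncond, w=1.0)

# w > 1 extrapolates away from the unconditional prediction.
guided = apply_guidance(cond, uncond, w=3.0)
```

Where the two predictions agree (the second element), guidance has no effect; where they differ, a larger w amplifies the difference.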

Examples

>>> from diffusers import AutoPipelineForText2Image
>>> import torch

>>> pipe = AutoPipelineForText2Image.from_pretrained(
...     "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
... )
>>> pipe.enable_model_cpu_offload()

>>> prompt = "A photograph of the inside of a subway train. There are raccoons sitting on the seats. One of them is reading a newspaper. The window shows the city in the background."

>>> generator = torch.Generator(device="cpu").manual_seed(0)
>>> image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]

encode_prompt

( prompt do_classifier_free_guidance = True num_images_per_prompt = 1 device = None negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None _cut_context = False attention_mask: typing.Optional[torch.Tensor] = None negative_attention_mask: typing.Optional[torch.Tensor] = None )

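Parameters

prompt (str or List[str], optional): The prompt to be encoded.
device (torch.device, optional): The torch device on which to place the resulting embeddings.
num_images_per_prompt (int, optional, defaults to 1): The number of images that should be generated per prompt.
do_classifier_free_guidance (bool, optional, defaults to True): Whether or not to use classifier-free guidance.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when guidance is not used (i.e., when guidance_scale is not greater than 1).
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
attention_mask (torch.Tensor, optional): Pre-generated attention mask. Must be provided if prompt_embeds is passed directly.
negative_attention_mask (torch.Tensor, optional): Pre-generated negative attention mask. Must be provided if negative_prompt_embeds is passed directly.

Encodes the prompt into text encoder hidden states.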

Kandinsky3Img2ImgPipeline

class diffusers.Kandinsky3Img2ImgPipeline

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel unet: Kandinsky3UNet scheduler: DDPMScheduler movq: VQModel )

__call__

( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[torch.Tensor], typing.List[PIL.Image.Image]] = None strength: float = 0.3 num_inference_steps: int = 25 guidance_scale: float = 3.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None negative_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → ImagePipelineOutput or tuple

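Parameters

prompt (str or List[str], optional): The prompt or prompts to guide the image generation. If not defined, prompt_embeds must be passed instead.
image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]): Image, or tensor representing an image batch, that will be used as the starting point for the process.
strength (float, optional, defaults to 0.3): Indicates the extent to which the reference image is transformed. Must be between 0 and 1. image is used as a starting point, and the higher the strength, the more noise is added. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise is maximal and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 essentially ignores image.
num_inference_steps (int, optional, defaults to 25): The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 3.0): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when guidance is not used (i.e., when guidance_scale is not greater than 1).
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generators used to make generation deterministic.
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
attention_mask (torch.Tensor, optional): Pre-generated attention mask. Must be provided if prompt_embeds is passed directly.
negative_attention_mask (torch.Tensor, optional): Pre-generated negative attention mask. Must be provided if negative_prompt_embeds is passed directly.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between PIL (PIL.Image.Image) and np.array.
return_dict (bool, optional, defaults to True): Whether or not to return an ImagePipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, optional): A function called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional): The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list are passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

Returns

ImagePipelineOutput or tuple

Function invoked when calling the pipeline for generation.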
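
The strength description above states that the number of denoising steps depends on how much noise is initially added. A minimal sketch of that relationship follows, assuming the convention common to Diffusers image-to-image pipelines, in which roughly int(num_inference_steps * strength) of the scheduled steps are actually run. The helper name effective_steps is hypothetical, and the exact rounding used inside Kandinsky3Img2ImgPipeline is an assumption.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Estimate how many denoising steps run for a given strength.

    Hypothetical helper mirroring the usual image-to-image convention;
    it is not a function exported by Diffusers.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Number of steps' worth of noise added to the input image.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    # Index of the first scheduled step that is actually executed.
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start


default_run = effective_steps(25, 0.3)   # the signature defaults
example_run = effective_steps(25, 0.75)  # the values used in the example call
full_run = effective_steps(25, 1.0)      # strength 1 runs every step
```

Under this assumption, a low strength both preserves more of the input image and shortens the run, because only the tail of the schedule is executed.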

Examples

>>> from diffusers import AutoPipelineForImage2Image
>>> from diffusers.utils import load_image
>>> import torch

>>> pipe = AutoPipelineForImage2Image.from_pretrained(
...     "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
... )
>>> pipe.enable_model_cpu_offload()

>>> prompt = "A painting of the inside of a subway train with tiny raccoons."
>>> image = load_image(
...     "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky3/t2i.png"
... )

>>> generator = torch.Generator(device="cpu").manual_seed(0)
>>> image = pipe(prompt, image=image, strength=0.75, num_inference_steps=25, generator=generator).images[0]
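
The callback_on_step_end parameter described above receives the pipeline, the step index, the timestep, and a callback_kwargs dictionary. The sketch below shows one possible callback that records progress. The names log_step and seen_steps are hypothetical, and returning callback_kwargs follows the usual Diffusers callback convention, which is assumed to apply here. The last line calls the function by hand with placeholder values so the example runs without loading a model.

```python
from typing import Any, Dict, List, Tuple

# Collected (step, timestep) pairs; a hypothetical progress log.
seen_steps: List[Tuple[int, int]] = []


def log_step(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Record the step and hand the tensors back unchanged."""
    seen_steps.append((step, int(timestep)))
    # Assumed convention: the pipeline reads tensors back from the returned dict.
    return callback_kwargs


# Stand-in for what the pipeline would do at the end of a denoising step.
result = log_step(None, 0, 999, {"latents": "placeholder"})
```

With a loaded pipeline, such a function would be passed as callback_on_step_end=log_step in the pipeline call, with callback_on_step_end_tensor_inputs left at its default of ['latents'].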

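encode_prompt

Parameters

prompt (str or List[str], optional): The prompt to be encoded.
device (torch.device, optional): The torch device on which to place the resulting embeddings.
num_images_per_prompt (int, optional, defaults to 1): The number of images that should be generated per prompt.
do_classifier_free_guidance (bool, optional, defaults to True): Whether or not to use classifier-free guidance.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when guidance is not used (i.e., when guidance_scale is not greater than 1).
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
attention_mask (torch.Tensor, optional): Pre-generated attention mask. Must be provided if prompt_embeds is passed directly.
negative_attention_mask (torch.Tensor, optional): Pre-generated negative attention mask. Must be provided if negative_prompt_embeds is passed directly.

Encodes the prompt into text encoder hidden states.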
