Kandinsky

The Kandinsky models are a series of multilingual text-to-image generation models. The Kandinsky 2.0 model uses two multilingual text encoders and concatenates those results for the UNet.

Kandinsky 2.1 changes the architecture to include an image prior model (CLIP) to generate a mapping between text and image embeddings. The mapping provides better text-image alignment, and it is used together with the text embeddings during training, leading to higher quality results. Finally, Kandinsky 2.1 uses a Modulating Quantized Vectors (MoVQ) decoder, which adds a spatial conditional normalization layer to increase photorealism, to decode the latents into images.

Kandinsky 2.2 improves on the previous model by replacing the image encoder of the image prior model with a larger CLIP-ViT-G model to improve quality. The image prior model was also retrained on images with different resolutions and aspect ratios to generate higher-resolution images at different sizes.

Kandinsky 3 simplifies the architecture and moves away from the two-stage generation process involving the prior model and diffusion model. Instead, Kandinsky 3 uses Flan-UL2 to encode text, a UNet with BigGan-deep blocks, and Sber-MoVQGAN to decode the latents into images. Text understanding and generated image quality are primarily achieved by using a larger text encoder and UNet.

This guide will show you how to use the Kandinsky models for text-to-image, image-to-image, inpainting, interpolation, and more.

Before you begin, make sure you have the following libraries installed:

# uncomment to install the necessary libraries in Colab
#!pip install -q diffusers transformers accelerate

The usage of Kandinsky 2.1 and 2.2 is very similar! The only difference is that Kandinsky 2.2 doesn't accept a prompt as an input when decoding the latents; it only accepts image_embeds during decoding.

Kandinsky 3 has a more concise architecture and doesn't require a prior model. This means its usage is identical to other diffusion models like Stable Diffusion XL.
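To make the decoding difference concrete, here is a minimal sketch of the Kandinsky 2.2 flow, assuming the kandinsky-community/kandinsky-2-2-prior and kandinsky-community/kandinsky-2-2-decoder checkpoints; note that the decoder call receives only embeddings, not the prompt:

from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
import torch

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16).to("cuda")
pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda")

# the prior still consumes the text prompt to produce image embeddings...
image_embeds, negative_image_embeds = prior_pipeline("A cat wearing a top hat", guidance_scale=1.0).to_tuple()
# ...but the 2.2 decoder takes only the embeddings, no prompt
image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]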

Text-to-image

To use the Kandinsky models for any task, you always start by setting up the prior pipeline to encode the prompt and generate the image embeddings. The prior pipeline also generates negative_image_embeds that correspond to the negative prompt "". For better results, you can pass an actual negative_prompt to the prior pipeline, but this will increase the effective batch size of the prior pipeline by 2x.

Kandinsky 2.1
Kandinsky 2.2
Kandinsky 3
from diffusers import KandinskyPriorPipeline, KandinskyPipeline
import torch

prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16).to("cuda")
pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16).to("cuda")

prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality" # optional to include a negative prompt, but results are usually better
image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, guidance_scale=1.0).to_tuple()

Now pass all the prompts and embeddings to the KandinskyPipeline to generate an image:

image = pipeline(prompt, image_embeds=image_embeds, negative_prompt=negative_prompt, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
image

🤗 Diffusers also provides an end-to-end API with the KandinskyCombinedPipeline and KandinskyV22CombinedPipeline, meaning you don't have to separately load the prior and text-to-image pipeline. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the prior_guidance_scale and prior_num_inference_steps parameters if you want.

Use the AutoPipelineForText2Image to automatically call the combined pipelines under the hood:

Kandinsky 2.1
Kandinsky 2.2
from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()

prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"

image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
image

Image-to-image

For image-to-image, pass the initial image and a text prompt to condition the image to the pipeline. Start by loading the prior pipeline:

Kandinsky 2.1
Kandinsky 2.2
Kandinsky 3
import torch
from diffusers import KandinskyImg2ImgPipeline, KandinskyPriorPipeline

prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyImg2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
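The Kandinsky 2.2 setup follows the same pattern; here is a sketch of the equivalent loading step, assuming the kandinsky-community/kandinsky-2-2-prior and kandinsky-community/kandinsky-2-2-decoder checkpoints:

import torch
from diffusers import KandinskyV22Img2ImgPipeline, KandinskyV22PriorPipeline

# same two-stage setup as 2.1, only the classes and checkpoints change
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")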

Download an image to condition the generation on:

from diffusers.utils import load_image

# download image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)
original_image = original_image.resize((768, 512))

Generate the image_embeds and negative_image_embeds with the prior pipeline:

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt).to_tuple()

Now pass the original image and all the prompts and embeddings to the pipeline to generate an image:

Kandinsky 2.1
Kandinsky 2.2
Kandinsky 3
from diffusers.utils import make_image_grid

image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)

🤗 Diffusers also provides an end-to-end API with the KandinskyImg2ImgCombinedPipeline and KandinskyV22Img2ImgCombinedPipeline, meaning you don't have to separately load the prior and image-to-image pipeline. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the prior_guidance_scale and prior_num_inference_steps parameters if you want.

Use the AutoPipelineForImage2Image to automatically call the combined pipelines under the hood:

Kandinsky 2.1
Kandinsky 2.2
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
import torch

pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True)
pipeline.enable_model_cpu_offload()

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)

original_image.thumbnail((768, 768))

image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0]
make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)

Inpainting

⚠️ The Kandinsky models now use ⬜️ white pixels to represent the masked area instead of black pixels. If you are using KandinskyInpaintPipeline in production, you need to change the mask to use white pixels:

# For PIL input
import PIL.ImageOps
mask = PIL.ImageOps.invert(mask)

# For PyTorch and NumPy input
mask = 1 - mask

For inpainting, you'll need the original image, a mask of the area to replace in the original image, and a text prompt describing what to inpaint. Load the prior pipeline:

Kandinsky 2.1
Kandinsky 2.2
from diffusers import KandinskyInpaintPipeline, KandinskyPriorPipeline
from diffusers.utils import load_image, make_image_grid
import torch
import numpy as np
from PIL import Image

prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

Load an initial image and create a mask:

init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
mask = np.zeros((768, 768), dtype=np.float32)
# mask area above cat's head
mask[:250, 250:-250] = 1

Generate the embeddings with the prior pipeline:

prompt = "a hat"
prior_output = prior_pipeline(prompt)

Now pass the initial image, the mask, and the prompt and embeddings to the pipeline to generate an image:

Kandinsky 2.1
Kandinsky 2.2
output_image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)

You can also use the end-to-end KandinskyInpaintCombinedPipeline and KandinskyV22InpaintCombinedPipeline to call the prior and decoder pipelines together under the hood. Use the AutoPipelineForInpainting for this:

Kandinsky 2.1
Kandinsky 2.2
import torch
import numpy as np
from PIL import Image
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image, make_image_grid

pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
mask = np.zeros((768, 768), dtype=np.float32)
# mask area above cat's head
mask[:250, 250:-250] = 1
prompt = "a hat"

output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)
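For Kandinsky 2.2, the same end-to-end call should work after swapping in the 2.2 inpaint checkpoint; a sketch, assuming the kandinsky-community/kandinsky-2-2-decoder-inpaint checkpoint name:

import torch
from diffusers import AutoPipelineForInpainting

# the 2.2 combined inpainting pipeline is selected automatically from the checkpoint
pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

The rest of the example (image, mask, prompt, and the final call) is unchanged.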

Interpolation

Interpolation allows you to explore the latent space between the image and text embeddings, which is a cool way to see some of the prior model's intermediate outputs. Load the prior pipeline and two images you'd like to interpolate:

Kandinsky 2.1
Kandinsky 2.2
from diffusers import KandinskyPriorPipeline, KandinskyPipeline
from diffusers.utils import load_image, make_image_grid
import torch

prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)
a cat
Van Gogh's Starry Night painting

Specify the text or images to interpolate, and set the weights for each. Experiment with the weights to see how they affect the interpolation!

images_texts = ["a cat", img_1, img_2]
weights = [0.3, 0.3, 0.4]

Call the interpolate function to generate the embeddings, and then pass them to the pipeline to generate the image:

Kandinsky 2.1
Kandinsky 2.2
# prompt can be left empty
prompt = ""
prior_out = prior_pipeline.interpolate(images_texts, weights)

pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

image = pipeline(prompt, **prior_out, height=768, width=768).images[0]
image
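A sketch of the same interpolation with Kandinsky 2.2, reusing images_texts and weights from above and assuming the kandinsky-community/kandinsky-2-2-prior and kandinsky-community/kandinsky-2-2-decoder checkpoints; note the 2.2 decoder takes the interpolated embeddings without a prompt:

from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
import torch

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
prior_out = prior_pipeline.interpolate(images_texts, weights)

pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
# unpack image_embeds and negative_image_embeds from the prior output
image = pipeline(**prior_out, height=768, width=768).images[0]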

ControlNet

⚠️ ControlNet is only supported for Kandinsky 2.2!

ControlNet enables conditioning a large pretrained diffusion model with additional inputs such as a depth map or edge detection. For example, you can condition Kandinsky 2.2 with a depth map so the model understands and preserves the structure of the depth image.

Let's load an image and extract its depth map:

from diffusers.utils import load_image

img = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
).resize((768, 768))
img

Then you can use the depth-estimation Pipeline from 🤗 Transformers to process the image and retrieve the depth map:

import torch
import numpy as np

from transformers import pipeline

def make_hint(image, depth_estimator):
    image = depth_estimator(image)["depth"]  # PIL depth map
    image = np.array(image)
    image = image[:, :, None]  # add a channel dimension
    image = np.concatenate([image, image, image], axis=2)  # replicate to 3 channels
    detected_map = torch.from_numpy(image).float() / 255.0  # scale to [0, 1]
    hint = detected_map.permute(2, 0, 1)  # HWC -> CHW
    return hint

depth_estimator = pipeline("depth-estimation")
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")

Text-to-image

Load the prior pipeline and the KandinskyV22ControlnetPipeline:

from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

pipeline = KandinskyV22ControlnetPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
).to("cuda")

Generate the image embeddings from a prompt and negative prompt:

prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"

generator = torch.Generator(device="cuda").manual_seed(43)

image_emb, zero_image_emb = prior_pipeline(
    prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator
).to_tuple()

Finally, pass the image embeddings and the depth image to the KandinskyV22ControlnetPipeline to generate an image:

image = pipeline(image_embeds=image_emb, negative_image_embeds=zero_image_emb, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0]
image

Image-to-image

For image-to-image with ControlNet, you'll need the KandinskyV22PriorEmb2EmbPipeline to generate the image embeddings from a text prompt and an image, and the KandinskyV22ControlnetImg2ImgPipeline to generate an image from the initial image and the image embeddings.

Process and extract a depth map of an initial image of a cat with the depth-estimation Pipeline from 🤗 Transformers:

import torch
import numpy as np

from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline
from diffusers.utils import load_image
from transformers import pipeline

img = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
).resize((768, 768))

def make_hint(image, depth_estimator):
    image = depth_estimator(image)["depth"]
    image = np.array(image)
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    detected_map = torch.from_numpy(image).float() / 255.0
    hint = detected_map.permute(2, 0, 1)
    return hint

depth_estimator = pipeline("depth-estimation")
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")

Load the prior pipeline and the KandinskyV22ControlnetImg2ImgPipeline:

prior_pipeline = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

pipeline = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
).to("cuda")

Pass a text prompt and the initial image to the prior pipeline to generate the image embeddings:

prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"

generator = torch.Generator(device="cuda").manual_seed(43)

img_emb = prior_pipeline(prompt=prompt, image=img, strength=0.85, generator=generator)
negative_emb = prior_pipeline(prompt=negative_prior_prompt, image=img, strength=1, generator=generator)

Now you can run the KandinskyV22ControlnetImg2ImgPipeline to generate an image from the initial image and the image embeddings:

image = pipeline(image=img, strength=0.5, image_embeds=img_emb.image_embeds, negative_image_embeds=negative_emb.image_embeds, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0]
make_image_grid([img.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)

Optimization

Kandinsky is unique because it requires a prior pipeline to generate the mappings, and a second pipeline to decode the latents into an image. Optimization efforts should be focused on the second pipeline because that is where the bulk of the computation is done. Here are some tips to improve Kandinsky during inference.

  1. Enable xFormers if you're using PyTorch < 2.0:
  from diffusers import DiffusionPipeline
  import torch

  pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+ pipe.enable_xformers_memory_efficient_attention()
  2. Enable torch.compile if you're using PyTorch >= 2.0 to automatically use scaled dot-product attention (SDPA):
  pipe.unet.to(memory_format=torch.channels_last)
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

This is the same as explicitly setting the attention processor to use AttnAddedKVProcessor2_0:

from diffusers.models.attention_processor import AttnAddedKVProcessor2_0

pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())
  3. Offload the model to the CPU with enable_model_cpu_offload() to avoid out-of-memory errors:
  from diffusers import DiffusionPipeline
  import torch

  pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+ pipe.enable_model_cpu_offload()
  4. By default, the text-to-image pipeline uses the DDIMScheduler, but you can replace it with another scheduler like DDPMScheduler to see how that affects the tradeoff between inference speed and image quality:
from diffusers import DDPMScheduler
from diffusers import DiffusionPipeline

scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler")
pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16, use_safetensors=True).to("cuda")