ControlNet

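ControlNet is a type of model for controlling image diffusion models by conditioning the model on an additional input image. There are many types of conditioning inputs (canny edge, user sketch, human pose, depth, and more) you can use to control a diffusion model. This is useful because it gives you greater control over image generation, making it easier to generate specific images without experimenting as much with different text prompts or denoising values.

Check out Section 3.5 of the ControlNet paper v1 for a list of ControlNet implementations on various conditioning inputs. You can find the official Stable Diffusion ControlNet conditioned models on lllyasviel's Hub profile, and more community-trained ones on the Hub.

For Stable Diffusion XL (SDXL) ControlNet models, you can find them on the 🤗 Diffusers Hub organization, or you can browse community-trained ones on the Hub.

A ControlNet model has two sets of weights (or blocks) connected by a zero-convolution layer:

a locked copy keeps everything a large pretrained diffusion model has learned
a trainable copy is trained on the additional conditioning input

Since the locked copy preserves the pretrained model, training and implementing a ControlNet on a new conditioning input is as fast as finetuning any other model because you aren't training the model from scratch.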
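The role of the zero-convolution layer can be illustrated with a toy example. This is only a minimal sketch in plain Python, not the actual ControlNet implementation: `zero_conv_1x1`, `controlled_block`, and the stand-in `block` are hypothetical names, and the conditioning is simply added to the input for brevity. The point is that a zero-initialized layer contributes nothing at the start of training, so the combined output starts out identical to the locked copy's output.

```python
def zero_conv_1x1(features, weight=0.0, bias=0.0):
    """On a single channel, a 1x1 convolution reduces to weight * x + bias."""
    return [weight * x + bias for x in features]


def controlled_block(x, locked_block, trainable_block, condition, zero_weight=0.0):
    # The locked copy only ever sees the original input.
    locked_out = locked_block(x)
    # The trainable copy sees the input plus the conditioning signal.
    trainable_out = trainable_block([xi + ci for xi, ci in zip(x, condition)])
    # The zero convolution gates how much of the trainable output is added back.
    residual = zero_conv_1x1(trainable_out, weight=zero_weight)
    return [lo + r for lo, r in zip(locked_out, residual)]


def block(values):
    """Hypothetical stand-in for a pretrained block: doubles every value."""
    return [2.0 * v for v in values]


x = [1.0, 2.0, 3.0]
condition = [10.0, 10.0, 10.0]

# Zero weight (initial state): the output equals the locked block's output.
at_init = controlled_block(x, block, block, condition, zero_weight=0.0)
# Non-zero weight (as if learned): the conditioning now influences the output.
after_training = controlled_block(x, block, block, condition, zero_weight=0.5)

print(at_init)
print(after_training)
```

Because the residual starts at zero, training begins from the behavior of the pretrained model rather than from noise.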

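This guide will show you how to use ControlNet for text-to-image, image-to-image, inpainting, and more! There are many types of ControlNet conditioning inputs to choose from, but this guide focuses on only a few of them. Feel free to experiment with other conditioning inputs!

Before you begin, make sure you have the following libraries installed: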

# uncomment to install the necessary libraries in Colab
#!pip install -q diffusers transformers accelerate opencv-python

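Text-to-image

For text-to-image, you normally pass a text prompt to the model. But with ControlNet, you can specify an additional conditioning input. Let's condition the model with a canny image, a white outline of an image on a black background. This way, the ControlNet can use the canny image as a control to guide the model to generate an image with the same outline.

Load an image and use the opencv-python library to extract the canny image: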

from diffusers.utils import load_image, make_image_grid
from PIL import Image
import cv2
import numpy as np

original_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)

image = np.array(original_image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)

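Original image

Canny image

Next, load a ControlNet model conditioned on canny edge detection and pass it to the StableDiffusionControlNetPipeline. Use the faster UniPCMultistepScheduler and enable model offloading to speed up inference and reduce memory usage.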

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
import torch

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

Now pass your prompt and canny image to the pipeline:

output = pipe(
    "the mona lisa", image=canny_image
).images[0]
make_image_grid([original_image, canny_image, output], rows=1, cols=3)

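Image-to-image

For image-to-image, you'd typically pass an initial image and a prompt to the pipeline to generate a new image. With ControlNet, you can pass an additional conditioning input to guide the model. Let's condition the model with a depth map, an image that contains spatial information. This way, the ControlNet can use the depth map as a control to guide the model to generate an image that preserves spatial information.

You'll use the StableDiffusionControlNetImg2ImgPipeline for this task, which differs from the StableDiffusionControlNetPipeline in that it lets you pass an initial image as the starting point for the image generation process.

Load an image and use the depth-estimation Pipeline from 🤗 Transformers to extract the depth map of the image: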

import torch
import numpy as np

from transformers import pipeline
from diffusers.utils import load_image, make_image_grid

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-img2img.jpg"
)

def get_depth_map(image, depth_estimator):
    image = depth_estimator(image)["depth"]
    image = np.array(image)
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    detected_map = torch.from_numpy(image).float() / 255.0
    depth_map = detected_map.permute(2, 0, 1)
    return depth_map

depth_estimator = pipeline("depth-estimation")
depth_map = get_depth_map(image, depth_estimator).unsqueeze(0).half().to("cuda")
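To make the tensor layout produced by get_depth_map easier to see, here is a NumPy-only sketch that mirrors the same steps on a tiny synthetic depth array. The function name `to_control_layout` and the 2x2 input are made up for illustration, and NumPy stands in for the torch calls used above.

```python
import numpy as np


def to_control_layout(depth):
    """depth: (H, W) uint8 array -> (1, 3, H, W) float32 array in [0, 1]."""
    # Repeat the single depth channel three times: (H, W) -> (H, W, 3).
    stacked = np.stack([depth] * 3, axis=-1)
    # Scale 8-bit values to the [0, 1] range.
    scaled = stacked.astype(np.float32) / 255.0
    # Channels first, then add a batch dimension: (H, W, 3) -> (1, 3, H, W).
    return scaled.transpose(2, 0, 1)[None]


# Made-up 2x2 depth map, for illustration only.
depth = np.array([[0, 255], [128, 64]], dtype=np.uint8)
control = to_control_layout(depth)

print(control.shape)
print(control.min(), control.max())
```

The three channels are identical copies of the depth values, which matches the three-channel image shape the control input is given in the code above.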

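Next, load a ControlNet model conditioned on depth maps and pass it to the StableDiffusionControlNetImg2ImgPipeline. Use the faster UniPCMultistepScheduler and enable model offloading to speed up inference and reduce memory usage.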

from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
import torch

controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

Now pass your prompt, initial image, and depth map to the pipeline:

output = pipe(
    "lego batman and robin", image=image, control_image=depth_map,
).images[0]
make_image_grid([image, output], rows=1, cols=2)

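Original image

Generated image

Inpainting

For inpainting, you need an initial image, a mask image, and a prompt describing what to replace the mask with. ControlNet models let you add another control image to condition the model. Let's condition the model with an inpainting mask. This way, the ControlNet can use the inpainting mask as a control to guide the model to generate an image within the mask area.

Load an initial image and a mask image: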

from diffusers.utils import load_image, make_image_grid

init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint.jpg"
)
init_image = init_image.resize((512, 512))

mask_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint-mask.jpg"
)
mask_image = mask_image.resize((512, 512))
make_image_grid([init_image, mask_image], rows=1, cols=2)

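Create a function to prepare the control image from the initial and mask images. This creates a tensor in which a pixel of init_image is marked as masked (set to -1.0) if the corresponding pixel in mask_image is above a threshold (0.5 after scaling to the [0, 1] range).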

import numpy as np
import torch

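def make_inpaint_condition(image, image_mask):
    # scale both images to [0, 1]; the mask is reduced to a single channel
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0

    # height and width of the image and the mask must match
    assert image.shape[:2] == image_mask.shape[:2], "image and image_mask must have the same size"
    image[image_mask > 0.5] = -1.0  # set as masked pixel
    # add a batch dimension and move channels first: HWC -> NCHW
    image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
    image = torch.from_numpy(image)
    return image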

control_image = make_inpaint_condition(init_image, mask_image)
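To see what the masking step does, the following NumPy-only sketch applies the same thresholding to a tiny synthetic image. The helper name `mark_masked_pixels` and the 2x2 inputs are hypothetical and exist only for illustration; the real function above additionally converts PIL images and returns a torch tensor.

```python
import numpy as np


def mark_masked_pixels(image, mask, threshold=0.5):
    """image: (H, W, 3) floats in [0, 1]; mask: (H, W) floats in [0, 1]."""
    assert image.shape[:2] == mask.shape[:2], "image and mask sizes differ"
    out = image.copy()
    # A 2D boolean mask selects whole pixels, so all three channels become -1.0.
    out[mask > threshold] = -1.0
    # Add a batch dimension and move channels first: (H, W, 3) -> (1, 3, H, W).
    return out[None].transpose(0, 3, 1, 2)


# Made-up inputs: a uniform gray image with only the top-right pixel masked.
image = np.full((2, 2, 3), 0.5, dtype=np.float32)
mask = np.array([[0.0, 1.0], [0.0, 0.0]], dtype=np.float32)

cond = mark_masked_pixels(image, mask)
print(cond.shape)
print(cond[0, :, 0, 1])  # channel values of the masked pixel
print(cond[0, :, 1, 0])  # channel values of an unmasked pixel
```

Since valid pixel values lie in [0, 1], the value -1.0 cannot be confused with real image content.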

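Original image

Mask image

Load a ControlNet model conditioned on inpainting and pass it to the StableDiffusionControlNetInpaintPipeline. Use the faster UniPCMultistepScheduler and enable model offloading to speed up inference and reduce memory usage.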

from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler

controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

Now pass your prompt, initial image, mask image, and control image to the pipeline:

output = pipe(
    "corgi face with large ears, detailed, pixar, animated, disney",
    num_inference_steps=20,
    eta=1.0,
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
).images[0]
make_image_grid([init_image, mask_image, output], rows=1, cols=3)

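Guess mode

Guess mode does not require supplying a prompt to a ControlNet at all! This forces the ControlNet encoder to do its best to "guess" the contents of the input control map (depth map, pose estimation, canny edge, etc.).

Guess mode adjusts the scale of the output residuals from a ControlNet by a fixed ratio depending on the block depth. The shallowest DownBlock corresponds to 0.1, and as the blocks get deeper, the scale increases exponentially such that the scale of the MidBlock output becomes 1.0.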
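The scaling schedule described above can be sketched as a log-spaced sequence from 0.1 to 1.0. This is an illustrative sketch rather than the library's code: `guess_mode_scales` is a hypothetical helper, and the count of 12 down-block residuals is an assumption made for the example.

```python
def guess_mode_scales(num_down_residuals):
    """Log-spaced scales from 0.1 (shallowest block) to 1.0 (mid block)."""
    total = num_down_residuals + 1  # down-block residuals plus the mid block
    # Exponents run linearly from -1 to 0, so the scales grow exponentially.
    return [10 ** (-1 + i / (total - 1)) for i in range(total)]


# Assumption for illustration: 12 down-block residuals plus one mid-block residual.
scales = guess_mode_scales(12)

for depth, scale in enumerate(scales):
    print(depth, round(scale, 3))
```

Each scale is a constant multiple of the previous one, which is what makes the growth exponential rather than linear.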

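Guess mode does not have any impact on prompt conditioning, and you can still provide a prompt if you want.

Set guess_mode=True in the pipeline. A guidance_scale value between 3.0 and 5.0 is recommended.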

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image, make_image_grid
import numpy as np
import torch
from PIL import Image
import cv2

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", use_safetensors=True)
pipe = StableDiffusionControlNetPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, use_safetensors=True).to("cuda")

original_image = load_image("https://huggingface.co/takuma104/controlnet_dev/resolve/main/bird_512x512.png")

image = np.array(original_image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)

image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0).images[0]
make_image_grid([original_image, canny_image, image], rows=1, cols=3)

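Regular mode with prompt

Guess mode without prompt

ControlNet with Stable Diffusion XL

There aren't many ControlNet models compatible with Stable Diffusion XL (SDXL) at the moment, but we've trained two full-sized ControlNet models for SDXL, conditioned on canny edge detection and depth maps. We're also experimenting with smaller versions of these SDXL-compatible ControlNet models so they are easier to run on resource-constrained hardware. You can find these checkpoints on the 🤗 Diffusers Hub organization!

Let's use an SDXL ControlNet conditioned on canny images to generate an image. Start by loading an image and preparing the canny image: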

from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
from diffusers.utils import load_image, make_image_grid
from PIL import Image
import cv2
import numpy as np
import torch

original_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
)

image = np.array(original_image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
make_image_grid([original_image, canny_image], rows=1, cols=2)

Original image

Canny image

Load an SDXL ControlNet model conditioned on canny edge detection and pass it to the StableDiffusionXLControlNetPipeline. You can also enable model offloading to reduce memory usage.

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True
)
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16,
    use_safetensors=True
)
pipe.enable_model_cpu_offload()

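Now pass your prompt (and optionally a negative prompt if you're using one) and canny image to the pipeline:

The controlnet_conditioning_scale parameter determines how much weight to assign to the conditioning inputs. A value of 0.5 is recommended for good generalization, but feel free to experiment with this number!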

prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
negative_prompt = 'low quality, bad quality, sketches'

image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    image=canny_image,
    controlnet_conditioning_scale=0.5,
).images[0]
make_image_grid([original_image, canny_image, image], rows=1, cols=3)

You can also use StableDiffusionXLControlNetPipeline in guess mode by setting guess_mode=True:

from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
from diffusers.utils import load_image, make_image_grid
import numpy as np
import torch
import cv2
from PIL import Image

prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
negative_prompt = "low quality, bad quality, sketches"

original_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
)

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
)
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, vae=vae, torch_dtype=torch.float16, use_safetensors=True
)
pipe.enable_model_cpu_offload()

image = np.array(original_image)
image = cv2.Canny(image, 100, 200)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)

image = pipe(
    prompt, negative_prompt=negative_prompt, controlnet_conditioning_scale=0.5, image=canny_image, guess_mode=True,
).images[0]
make_image_grid([original_image, canny_image, image], rows=1, cols=3)

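You can use a refiner model with StableDiffusionXLControlNetPipeline to improve image quality, just as you can with a regular StableDiffusionXLPipeline. See the Refine image quality section to learn how to use the refiner model. Make sure to use StableDiffusionXLControlNetPipeline and pass image and controlnet_conditioning_scale.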

base = StableDiffusionXLControlNetPipeline(...)
image = base(
    prompt=prompt,
    controlnet_conditioning_scale=0.5,
    image=canny_image,
    num_inference_steps=40,
    denoising_end=0.8,
    output_type="latent",
).images
# rest exactly as with StableDiffusionXLPipeline
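As a rough guide to what denoising_end=0.8 means for the call above, the arithmetic below estimates how the 40 steps are divided between the base pipeline and a refiner that picks up from the same point. This is a back-of-the-envelope sketch: `split_steps` is a hypothetical helper, and the exact step counts in the library depend on the scheduler's timestep spacing, so treat the numbers as approximate.

```python
def split_steps(num_inference_steps, denoising_end):
    """Approximate how many steps run before and after the handoff point."""
    base_steps = round(num_inference_steps * denoising_end)
    refiner_steps = num_inference_steps - base_steps
    return base_steps, refiner_steps


# Values taken from the base pipeline call above.
base_steps, refiner_steps = split_steps(num_inference_steps=40, denoising_end=0.8)
print(base_steps, refiner_steps)
```

The base pipeline returns latents (output_type="latent") instead of a decoded image so the remaining steps can continue from them.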

MultiControlNet

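Replace the SDXL model with a model like stable-diffusion-v1-5/stable-diffusion-v1-5 to use multiple conditioning inputs with Stable Diffusion models.

You can compose multiple ControlNet conditionings from different image inputs to create a MultiControlNet. To get better results, it is often helpful to:

mask conditionings so that they don't overlap (for example, mask the area of a canny image where the pose conditioning is located)
experiment with the controlnet_conditioning_scale parameter to determine how much weight to assign to each conditioning input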
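Conceptually, each ControlNet produces its own residuals, which are weighted by that ControlNet's conditioning scale and then added together. The sketch below shows this with plain Python lists; `combine_controlnet_residuals` is a hypothetical helper and the residual values are made up, so it illustrates the idea rather than the library's implementation.

```python
def combine_controlnet_residuals(residuals_per_controlnet, scales):
    """Scale each ControlNet's residuals, then sum them element-wise."""
    assert len(residuals_per_controlnet) == len(scales), "one scale per ControlNet"
    combined = [0.0] * len(residuals_per_controlnet[0])
    for residuals, scale in zip(residuals_per_controlnet, scales):
        combined = [c + scale * r for c, r in zip(combined, residuals)]
    return combined


# Made-up residual values, for illustration only.
pose_residuals = [1.0, 0.0, 2.0]
canny_residuals = [0.5, 1.0, 0.0]

combined = combine_controlnet_residuals(
    [pose_residuals, canny_residuals],
    scales=[1.0, 0.8],  # one weight per conditioning input
)
print(combined)
```

Where one conditioning contributes zeros, as in a masked region, the other passes through unchanged, which is one reason non-overlapping masks help.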

In this example, you'll combine a canny image and a human pose estimation image to generate a new image.

Prepare the canny image conditioning:

from diffusers.utils import load_image, make_image_grid
from PIL import Image
import numpy as np
import cv2

original_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
)
image = np.array(original_image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)

# zero out middle columns of image where pose will be overlaid
zero_start = image.shape[1] // 4
zero_end = zero_start + image.shape[1] // 2
image[:, zero_start:zero_end] = 0

image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
make_image_grid([original_image, canny_image], rows=1, cols=2)

Original image

Canny image

For human pose estimation, install controlnet_aux:

# uncomment to install the necessary library in Colab
#!pip install -q controlnet-aux

Prepare the human pose estimation conditioning:

from controlnet_aux import OpenposeDetector

openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
original_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"
)
openpose_image = openpose(original_image)
make_image_grid([original_image, openpose_image], rows=1, cols=2)

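Original image

Human pose image

Load a list of ControlNet models that correspond to each conditioning, and pass them to the StableDiffusionXLControlNetPipeline. Use the faster UniPCMultistepScheduler and enable model offloading to reduce memory usage.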

from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL, UniPCMultistepScheduler
import torch

controlnets = [
    ControlNetModel.from_pretrained(
        "thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16
    ),
    ControlNetModel.from_pretrained(
        "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
    ),
]

vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, vae=vae, torch_dtype=torch.float16, use_safetensors=True
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

Now you can pass your prompt (and optionally a negative prompt if you're using one), canny image, and pose image to the pipeline:

prompt = "a giant standing in a fantasy landscape, best quality"
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"

generator = torch.manual_seed(1)

images = [openpose_image.resize((1024, 1024)), canny_image.resize((1024, 1024))]

images = pipe(
    prompt,
    image=images,
    num_inference_steps=25,
    generator=generator,
    negative_prompt=negative_prompt,
    num_images_per_prompt=3,
    controlnet_conditioning_scale=[1.0, 0.8],
).images
make_image_grid([original_image, canny_image, openpose_image,
                images[0].resize((512, 512)), images[1].resize((512, 512)), images[2].resize((512, 512))], rows=2, cols=3)
