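Introduction to Stable Diffusion

This chapter introduces the building blocks of Stable Diffusion, a generative AI model that produces unique photorealistic images from text and image prompts. It was first released in 2022 through a collaboration between Stability AI, RunwayML, and the CompVis Group at LMU Munich, and it builds on the latent diffusion research paper.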

What will you learn from this chapter?

- The fundamental components of Stable Diffusion
- How to use the text-to-image, image-to-image, and inpainting pipelines

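What does Stable Diffusion need to work?

To keep this section interesting, we will answer a series of questions that walk through the basic components of the Stable Diffusion process. Each component is discussed only briefly, because all of them are already covered in our Diffusers course. You can also revisit our previous chapters, which cover GANs and diffusion models in detail.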

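What strategy does Stable Diffusion use to learn new information?
- It uses the forward and reverse processes of a diffusion model. In the forward process, we add Gaussian noise to an image until only random noise remains. The final noisy version of the image is usually unrecognizable.
- In the reverse process, a trained neural network starts from pure noise and denoises the image step by step until an actual image emerges.

Both processes run over a finite number of steps T (T = 1000 in the DDPM paper). You start the process at time $t_0$ by sampling a real image from your data distribution. At each timestep $t$, the forward process samples some noise from a Gaussian distribution and adds it to the image from the previous timestep. For more mathematical intuition, read the Hugging Face Blog post on diffusion models.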
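The forward process described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the training code of any library: the linear schedule (beta from 1e-4 to 0.02) follows the DDPM paper's setup, while the toy four-pixel "image" and the names `forward_step` and `signal_fraction` are assumptions made for this example.

```python
import math
import random

T = 1000  # number of diffusion steps, as in the DDPM paper

# Assumed linear noise schedule: beta grows from 1e-4 to 0.02.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]


def forward_step(x_prev, beta, rng):
    """One forward step: shrink the previous image slightly and add Gaussian noise."""
    keep = math.sqrt(1.0 - beta)
    scale = math.sqrt(beta)
    return [keep * x + scale * rng.gauss(0.0, 1.0) for x in x_prev]


rng = random.Random(0)
x = [0.5, -0.25, 1.0, -1.0]  # toy "image": four pixels scaled to [-1, 1]

signal_fraction = 1.0  # how much of the original image survives
for beta in betas:
    x = forward_step(x, beta, rng)
    signal_fraction *= math.sqrt(1.0 - beta)

print(f"remaining signal after {T} steps: {signal_fraction:.4f}")
```

After all T steps the surviving fraction of the original signal is close to zero, which is why the final sample is indistinguishable from pure Gaussian noise.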

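Since our images can be very large, how do we compress them?

Large images need more compute to process. This becomes especially apparent in an operation called self-attention: the cost grows quadratically with the number of pixels. For example, an image 128 pixels wide and 128 pixels high has four times as many pixels as a 64x64 image. Because of how self-attention works, processing the larger image takes not four times the memory and compute but sixteen times (4 times 4 equals 16). This makes very high-resolution images challenging to handle, because they demand a lot of resources.

Latent diffusion models address this cost by using a variational autoencoder (VAE) to shrink images to a more manageable size. The idea is that many images contain repeated or unnecessary information. After training on a large amount of data, a VAE can compress an image into a smaller, more condensed form that still retains the essential features of the original.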
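The arithmetic above can be checked directly. The sketch below treats every pixel as one attention token, which is a simplification. The latent shape in the second half (a 512x512x3 image mapped to a 64x64x4 latent) is an assumption based on the Stable Diffusion v1 setup and is not stated in this chapter.

```python
def attention_entries(height, width):
    """Number of pairwise attention scores when every pixel is one token."""
    tokens = height * width
    return tokens * tokens


small = attention_entries(64, 64)
large = attention_entries(128, 128)

pixel_ratio = (128 * 128) // (64 * 64)
attention_ratio = large // small
print(pixel_ratio, attention_ratio)

# Assumed Stable Diffusion v1 setup: the VAE maps a 512x512x3 image
# to a 64x64x4 latent (8x downsampling per side).
image_values = 512 * 512 * 3
latent_values = 64 * 64 * 4
compression = image_values / latent_values
print(compression)
```

Running diffusion on the latent instead of the pixels shrinks the number of values the U-Net has to process, and the self-attention cost falls even faster because of the quadratic scaling.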

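Since we use prompts, how do we fuse text with images?

At inference time, we feed in a description of the image we want to see together with some pure noise as a starting point, and the model does its best to "denoise" that random input into something that matches the caption. SD relies on a pre-trained transformer-based model called CLIP. CLIP's text encoder was designed to turn image captions into a form that can be used to compare images and text, so it is well suited to creating useful representations from image descriptions. The input prompt is first tokenized (based on a large vocabulary in which every word or sub-word is assigned a specific token) and then passed through the CLIP text encoder, which produces one vector per token: 768-dimensional for SD 1.X or 1024-dimensional for SD 2.X. For consistency, prompts are always padded or truncated to 77 tokens, so the final representation used as conditioning is a tensor of shape 77x768 (SD 1.X) or 77x1024 (SD 2.X) per prompt.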
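The pad-or-truncate step can be illustrated without loading a real tokenizer. This is a sketch under stated assumptions: the special token IDs are the ones used by the CLIP tokenizer for SD 1.X as far as we know, the prompt token IDs are hypothetical, and `pad_or_truncate` is a helper written for this example rather than a library function.

```python
# Assumed special token IDs of the CLIP tokenizer used by SD 1.X.
BOS_ID = 49406  # start-of-text
EOS_ID = 49407  # end-of-text, also used here as the padding token
MAX_LENGTH = 77


def pad_or_truncate(token_ids, max_length=MAX_LENGTH):
    """Wrap prompt tokens in BOS/EOS and force the result to max_length."""
    body = token_ids[: max_length - 2]  # leave room for BOS and EOS
    ids = [BOS_ID] + body + [EOS_ID]
    return ids + [EOS_ID] * (max_length - len(ids))


short_prompt = [320, 1125, 539]  # hypothetical IDs for a short prompt
long_prompt = list(range(1000, 1200))  # hypothetical IDs for a very long prompt

short_ids = pad_or_truncate(short_prompt)
long_ids = pad_or_truncate(long_prompt)
print(len(short_ids), len(long_ids))
```

Both prompts end up with exactly 77 token IDs, so the text encoder always returns a conditioning tensor with the same number of rows regardless of prompt length.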

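How do we add good inductive biases?

Since we are trying to generate something new (for example, a realistic Pokémon), we need a way to go beyond the images we have seen before (for example, an anime Pokémon). This is where the U-Net and attention come into play. Given a noisy version of an image, the model's task is to predict the denoised version using additional clues such as a text description of the image. So how do we actually feed this conditioning information into the U-Net so that it can use it when making predictions? The answer is cross-attention. Cross-attention layers are scattered throughout the U-Net, and each spatial location in the U-Net can "attend" to different tokens in the text conditioning, pulling in relevant information from the prompt.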
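To make the cross-attention idea concrete, here is a minimal sketch in plain Python. It is not the U-Net's implementation: real layers apply learned projections to produce queries, keys, and values and run as batched tensor operations. Here the queries stand in for image spatial locations and the keys and values stand in for text tokens, and all numbers are made up for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values by similarity to the keys."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


# Four image locations (queries) attend to three text tokens (keys/values).
image_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
text_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
text_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

outputs, weights = cross_attention(image_queries, text_keys, text_values)
print(len(outputs), len(outputs[0]))
```

Each row of `weights` sums to 1 and says how strongly one image location attends to each text token. The output has one vector per image location, so information from the prompt is injected without changing the spatial layout.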

How to use text-to-image, image-to-image, and inpainting models in Diffusers

This section covers useful use cases and shows how to perform these tasks with the Diffusers library.

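Steps for text-to-image inference: the idea is to pass in a text prompt and convert it into an output image.

With the diffusers library, you can get text-to-image working in 2 steps.

Let's first install the diffusers library, together with transformers (needed for the CLIP text encoder) and accelerate (needed for the CPU offloading used later in this section).

pip install diffusers transformers accelerate

Now we initialize the pipeline, pass in our prompt, and run inference.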

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
generator = torch.Generator(device="cuda").manual_seed(31)
image = pipeline(
    "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    generator=generator,
).images[0]

Steps for image-to-image inference: in a similar way, we can initialize the pipeline, but this time we pass both an image and a text prompt.

import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid

pipeline = AutoPipelineForImage2Image.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()

# Load an image to pass to the pipeline:
init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
)

# Pass a prompt and image to the pipeline to generate an image:
prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
image = pipeline(prompt, image=init_image).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

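Steps for inpainting: for the inpainting pipeline, we need to pass an image, a text prompt, and a mask based on an object in that image, which indicates what should be inpainted. In this example we also pass a negative prompt to steer inference away from content we want to avoid.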

# Load the pipeline
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image, make_image_grid

pipeline = AutoPipelineForInpainting.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()

# Load the base and mask images:
init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
)
mask_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png"
)

# Create a prompt to inpaint the image with and pass it to the pipeline with the base and mask images:
prompt = (
    "a black cat with glowing eyes, cute, adorable, disney, pixar, highly detailed, 8k"
)
negative_prompt = "bad anatomy, deformed, ugly, disfigured"
image = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image=init_image,
    mask_image=mask_image,
).images[0]
make_image_grid([init_image, mask_image, image], rows=1, cols=3)
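To clarify what the mask means, the sketch below composites two tiny grayscale "images" with a binary mask. It only illustrates the mask semantics and is not how the pipeline works internally, since the actual inpainting happens during denoising. In the Diffusers inpainting pipelines, white mask pixels mark the region to repaint and black pixels are kept. The `composite` helper and the pixel values are made up for this example.

```python
def composite(original, generated, mask):
    """Keep original pixels where mask is 0 and take generated pixels where mask is 1."""
    return [
        [m * g + (1 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


original = [[10, 10, 10], [10, 10, 10]]   # pixels from the input image
generated = [[99, 99, 99], [99, 99, 99]]  # pixels proposed by the model
mask = [[0, 1, 1], [0, 0, 1]]             # 1 = repaint, 0 = keep

result = composite(original, generated, mask)
print(result)
```

Only the positions where the mask is 1 change. This is why the mask has to line up with the object you want to replace.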

Further Reading
