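Fine-Tuning and Guidance

In this notebook, we're going to cover two main approaches for adapting existing diffusion models:

- With fine-tuning, we'll re-train existing models on new data to change the type of output they produce
- With guidance, we'll take an existing model and steer the generation process at inference time for additional control

What You Will Learn:

By the end of this notebook, you will know how to:

- Create a sampling loop and generate samples faster using a new scheduler
- Fine-tune an existing diffusion model on new data, including:
  - Using gradient accumulation to get around some of the issues with small batches
  - Logging samples to Weights and Biases during training to monitor progress (via the accompanying example script)
  - Saving the resulting pipeline and uploading it to the hub
- Guide the sampling process with additional loss functions to add control over existing models, including:
  - Exploring different guidance approaches with a simple color-based loss
  - Using CLIP to guide generation using a text prompt
  - Sharing a custom sampling loop using Gradio and 🤗 Spaces

❓If you have any questions, please post them on the #diffusion-models-class channel on the Hugging Face Discord server. If you haven't signed up yet, you can do so here: https://huggingface.co/join/discord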

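Setup and Imports

To save your fine-tuned models to the Hugging Face Hub, you'll need to log in with a token that has write access. The code below will prompt you for this and link to the relevant tokens page of your account. You'll also need a Weights and Biases account if you'd like to use the training script to log samples as the model trains - again, the code should prompt you to sign in where needed.

Apart from that, the only setup is installing a few dependencies, importing everything we'll need, and specifying which device we'll use: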

%pip install -qq diffusers datasets accelerate wandb open-clip-torch

>>> # Code to log in to the Hugging Face Hub, needed for sharing models
>>> # Make sure you use a token with WRITE access
>>> from huggingface_hub import notebook_login

>>> notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful

import numpy as np
import torch
import torch.nn.functional as F
import torchvision
from datasets import load_dataset
from diffusers import DDIMScheduler, DDPMPipeline
from matplotlib import pyplot as plt
from PIL import Image
from torchvision import transforms
from tqdm.auto import tqdm

device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"

Loading a Pre-Trained Pipeline

To begin this notebook, let's load an existing pipeline and see what we can do with it:

image_pipe = DDPMPipeline.from_pretrained("google/ddpm-celebahq-256")
image_pipe.to(device)

Generating images is as simple as running the pipeline's __call__ method, which you do by calling it like a function:

>>> images = image_pipe().images
>>> images[0]

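Neat, but SLOW! So, before we get to the main topics of today, let's take a peek at the actual sampling loop and see how we can use a fancier sampler to speed this up: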

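Faster Sampling with DDIM

At every step, the model is fed a noisy input and asked to predict the noise (and thus an estimate of what the fully denoised image might look like). Initially these predictions are not very good, which is why we break the process down into many steps. However, using 1000+ steps has been found to be unnecessary, and a flurry of recent research has explored how to achieve good samples with as few steps as possible.

In the 🤗 Diffusers library, these sampling methods are handled by a scheduler, which must perform each update via the step() function. To generate an image, we begin with random noise $x$. Then, for every timestep in the scheduler's noise schedule, we feed the noisy input $x$ to the model and pass the resulting prediction to the step() function. This returns an output with a prev_sample attribute - "previous" because we're going "backwards" in time from high noise to low noise (the opposite of the forward diffusion process).

Let's see this in action! First, we load a scheduler, here a DDIMScheduler based on the paper Denoising Diffusion Implicit Models, which can give decent samples in far fewer steps than the original DDPM implementation: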

# Create new scheduler and set num inference steps
scheduler = DDIMScheduler.from_pretrained("google/ddpm-celebahq-256")
scheduler.set_timesteps(num_inference_steps=40)

You can see that this model does 40 steps in total, each jumping the equivalent of 25 steps of the original 1000-step schedule:

scheduler.timesteps
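
To make the 25-step jump concrete, here is a small sketch in plain Python of how 40 evenly spaced inference timesteps can be laid out on a 1000-step training schedule. It assumes simple "leading" spacing with no offset, and the helper name `spaced_timesteps` is hypothetical (it is not part of Diffusers); the values actually stored in scheduler.timesteps depend on the scheduler's configuration.

```python
def spaced_timesteps(num_train_timesteps: int, num_inference_steps: int) -> list[int]:
    """Evenly spaced timesteps from high noise to low noise (illustrative helper)."""
    stride = num_train_timesteps // num_inference_steps  # 1000 // 40 = 25
    # Multiples of the stride, reversed so that sampling starts at the noisiest step
    return [i * stride for i in range(num_inference_steps)][::-1]


timesteps = spaced_timesteps(1000, 40)
print(timesteps[:3], "...", timesteps[-3:])  # [975, 950, 925] ... [50, 25, 0]
```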

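Let's create 4 random images and run through the sampling loop, viewing both the current $x$ and the predicted denoised version as the process progresses: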

>>> # The random starting point
>>> x = torch.randn(4, 3, 256, 256).to(device)  # Batch of 4, 3-channel 256 x 256 px images

>>> # Loop through the sampling timesteps
>>> for i, t in tqdm(enumerate(scheduler.timesteps)):

...     # Prepare model input
...     model_input = scheduler.scale_model_input(x, t)

...     # Get the prediction
...     with torch.no_grad():
...         noise_pred = image_pipe.unet(model_input, t)["sample"]

...     # Calculate what the updated sample should look like with the scheduler
...     scheduler_output = scheduler.step(noise_pred, t, x)

...     # Update x
...     x = scheduler_output.prev_sample

...     # Occasionally display both x and the predicted denoised images
...     if i % 10 == 0 or i == len(scheduler.timesteps) - 1:
...         fig, axs = plt.subplots(1, 2, figsize=(12, 5))

...         grid = torchvision.utils.make_grid(x, nrow=4).permute(1, 2, 0)
...         axs[0].imshow(grid.cpu().clip(-1, 1) * 0.5 + 0.5)
...         axs[0].set_title(f"Current x (step {i})")

...         pred_x0 = scheduler_output.pred_original_sample  # Not available for all schedulers
...         grid = torchvision.utils.make_grid(pred_x0, nrow=4).permute(1, 2, 0)
...         axs[1].imshow(grid.cpu().clip(-1, 1) * 0.5 + 0.5)
...         axs[1].set_title(f"Predicted denoised images (step {i})")
...         plt.show()

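As you can see, the initial predictions are not great, but as the process goes on the predicted outputs get more and more refined. If you're curious what maths is happening inside that step() function, inspect the (well-commented) code with: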

# ??scheduler.step
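
As a rough guide to the core of that update, the following plain-Python sketch works through the deterministic (eta = 0) DDIM step from the paper for a single scalar "pixel". The alpha_bar values are made up for illustration and the function names are hypothetical; the real step() operates on tensors and handles extra options such as clipping and added variance.

```python
import math


def predict_x0(x_t: float, eps: float, alpha_bar_t: float) -> float:
    """Estimate the clean sample from a noisy one and a noise prediction."""
    return (x_t - math.sqrt(1 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)


def ddim_step(x_t: float, eps: float, alpha_bar_t: float, alpha_bar_prev: float) -> tuple[float, float]:
    """One deterministic DDIM update; returns (prev_sample, pred_original_sample)."""
    x0 = predict_x0(x_t, eps, alpha_bar_t)
    # Re-noise the x0 estimate to the (lower) noise level of the previous timestep
    prev = math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1 - alpha_bar_prev) * eps
    return prev, x0


# Forward process: mix a clean value with noise at noise level alpha_bar_t
x0_true, noise = 0.5, -1.2
alpha_bar_t, alpha_bar_prev = 0.3, 0.6  # illustrative values only
x_t = math.sqrt(alpha_bar_t) * x0_true + math.sqrt(1 - alpha_bar_t) * noise

# With a perfect noise prediction, the x0 estimate recovers the clean value
prev_sample, pred_x0 = ddim_step(x_t, noise, alpha_bar_t, alpha_bar_prev)
print(round(pred_x0, 6))  # 0.5
```

In the sampling loop above, the model's noise prediction plays the role of `noise`, which is why the early x0 estimates are blurry: the prediction is far from perfect at high noise levels.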

You can also drop in this new scheduler in place of the original one that came with the pipeline, and sample like so:

>>> image_pipe.scheduler = scheduler
>>> images = image_pipe(num_inference_steps=40).images
>>> images[0]

Alright - we can now get samples in a reasonable time! This should speed things up as we move through the rest of this notebook :)

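Fine-Tuning

Now for the fun bit! Given this pre-trained pipeline, how might we re-train the model to generate images based on new training data?

It turns out that this looks nearly identical to training a model from scratch (as we saw in Unit 1), except that we begin with the existing model. Let's see this in action and talk about a few additional considerations as we go.

First, the dataset: you could try this vintage faces dataset or these anime faces for something closer to the original training data of this faces model, but just for fun let's instead use the same small butterflies dataset we used to train from scratch in Unit 1. Run the code below to download the butterflies dataset and create a dataloader from which we can sample a batch of images: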

>>> # @markdown load and prepare a dataset:
>>> # Not on Colab? Comments with #@ enable UI tweaks like headings or user inputs
>>> # but can safely be ignored if you're working on a different platform.

>>> dataset_name = "huggan/smithsonian_butterflies_subset"  # @param
>>> dataset = load_dataset(dataset_name, split="train")
>>> image_size = 256  # @param
>>> batch_size = 4  # @param
>>> preprocess = transforms.Compose(
...     [
...         transforms.Resize((image_size, image_size)),
...         transforms.RandomHorizontalFlip(),
...         transforms.ToTensor(),
...         transforms.Normalize([0.5], [0.5]),
...     ]
... )


>>> def transform(examples):
...     images = [preprocess(image.convert("RGB")) for image in examples["image"]]
...     return {"images": images}


>>> dataset.set_transform(transform)

>>> train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

>>> print("Previewing batch:")
>>> batch = next(iter(train_dataloader))
>>> grid = torchvision.utils.make_grid(batch["images"], nrow=4)
>>> plt.imshow(grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5)

Previewing batch:

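Consideration 1: our batch size here (4) is quite small, since we're training at a large image size (256px) using a fairly large model, and we'll run out of GPU RAM if we push the batch size too high. You can reduce the image size to speed things up and allow for larger batches, but these models were designed and originally trained for 256px generation.

Now for the training loop. We'll update the weights of the pre-trained model by setting the optimization target to image_pipe.unet.parameters(). The rest is nearly identical to the example training loop from Unit 1. This takes about 10 minutes to run on Colab, so now is a good time to grab a coffee or tea while you wait: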

>>> num_epochs = 2  # @param
>>> lr = 1e-5  # @param
>>> grad_accumulation_steps = 2  # @param

>>> optimizer = torch.optim.AdamW(image_pipe.unet.parameters(), lr=lr)

>>> losses = []

>>> for epoch in range(num_epochs):
...     for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):
...         clean_images = batch["images"].to(device)
...         # Sample noise to add to the images
...         noise = torch.randn(clean_images.shape).to(clean_images.device)
...         bs = clean_images.shape[0]

...         # Sample a random timestep for each image
...         timesteps = torch.randint(
...             0,
...             image_pipe.scheduler.num_train_timesteps,
...             (bs,),
...             device=clean_images.device,
...         ).long()

...         # Add noise to the clean images according to the noise magnitude at each timestep
...         # (this is the forward diffusion process)
...         noisy_images = image_pipe.scheduler.add_noise(clean_images, noise, timesteps)

...         # Get the model prediction for the noise
...         noise_pred = image_pipe.unet(noisy_images, timesteps, return_dict=False)[0]

...         # Compare the prediction with the actual noise:
...         loss = F.mse_loss(
...             noise_pred, noise
...         )  # NB - trying to predict noise (eps) not (noisy_ims-clean_ims) or just (clean_ims)

...         # Store for later plotting
...         losses.append(loss.item())

...         # Update the model parameters with the optimizer based on this loss
...         loss.backward()

...         # Gradient accumulation:
...         if (step + 1) % grad_accumulation_steps == 0:
...             optimizer.step()
...             optimizer.zero_grad()

...     print(f"Epoch {epoch} average loss: {sum(losses[-len(train_dataloader):])/len(train_dataloader)}")

>>> # Plot the loss curve:
>>> plt.plot(losses)

Epoch 0 average loss: 0.013324214214226231

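Consideration 2: our loss signal is extremely noisy, since we're only working with four examples at random noise levels for each step. This is not ideal for training. One fix is to use an extremely low learning rate to limit the size of the update at each step. It would be even better if we could find some way to get the same benefits we would get from using a larger batch size without the memory requirements skyrocketing...

Enter gradient accumulation. If we call loss.backward() multiple times before running optimizer.step() and optimizer.zero_grad(), then PyTorch accumulates (sums) the gradients, effectively merging the signal from several batches to give a single (better) estimate which is then used to update the parameters. This results in fewer total updates being made, just like we'd see if we used a larger batch size. This is something many frameworks will handle for you (for example, 🤗 Accelerate makes this easy), but it is nice to see it implemented from scratch since this is a useful technique for dealing with training under GPU memory constraints! As you can see from the code above (after the # Gradient accumulation comment), there really isn't much code needed.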
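
To see why summing gradients over several small batches behaves like one larger batch, here is a toy sketch in plain Python using a one-parameter linear model with a hand-written MSE gradient (the data and the helper name are made up for illustration). Because the gradients are summed, the accumulated value is grad_accumulation_steps times the large-batch gradient; dividing each loss by the number of accumulation steps (or adjusting the learning rate) recovers the average.

```python
def mse_grad(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.0
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]

# One "large" batch of four examples
full_batch_grad = mse_grad(w, xs, ys)

# Two micro-batches of two examples each, accumulated by summing
# (this mirrors calling backward() twice before optimizer.step())
accumulated = 0.0
for start in range(0, len(xs), 2):
    accumulated += mse_grad(w, xs[start : start + 2], ys[start : start + 2])

grad_accumulation_steps = 2
print(full_batch_grad, accumulated, accumulated / grad_accumulation_steps)  # -30.0 -60.0 -30.0
```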

# Exercise: See if you can add gradient accumulation to the training loop in Unit 1.
# How does it perform? Think how you might adjust the learning rate based on the
# number of gradient accumulation steps - should it stay the same as before?

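Consideration 3: this still takes a lot of time, and printing out a one-line update every epoch is not enough feedback to give us a good idea of what is going on. We should probably:

- Generate some samples occasionally to examine the performance qualitatively as the model trains
- Log things like the loss and sample generations during training, perhaps using something like Weights and Biases or tensorboard.

I created a quick script (finetune_model.py) that takes the training code above and adds minimal logging functionality. You can see the logs from one training run below: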

%wandb johnowhitaker/dm_finetune/2upaa341 # You'll need a W&B account for this to work - skip if you don't want to log in

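It's fun to see how the generated samples change as training progresses - even though the loss doesn't appear to be improving much, we can see a progression away from the original domain (images of bedrooms) towards the new training data (wikiart). Further down in this notebook there is commented-out code for fine-tuning a model using this script, as an alternative to running the cell above.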

# Exercise: see if you can modify the official example training script we saw
# in Unit 1 to begin with a pre-trained model rather than training from scratch.
# Compare it to the minimal script linked above - what extra features is the minimal script missing?

Generating some images with this model, we can see that these faces are already looking mighty strange!

>>> # @markdown Generate and plot some images:
>>> x = torch.randn(8, 3, 256, 256).to(device)  # Batch of 8
>>> for i, t in tqdm(enumerate(scheduler.timesteps)):
...     model_input = scheduler.scale_model_input(x, t)
...     with torch.no_grad():
...         noise_pred = image_pipe.unet(model_input, t)["sample"]
...     x = scheduler.step(noise_pred, t, x).prev_sample
>>> grid = torchvision.utils.make_grid(x, nrow=4)
>>> plt.imshow(grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5)

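Consideration 4: fine-tuning can be quite unpredictable! If we trained for longer, we might see some perfect butterflies. But the intermediate steps can be extremely interesting in their own right, especially if your interests are more towards the artistic side! Explore training for very short or very long periods of time, and varying the learning rate, to see how this affects the kinds of output the final model produces.

Code for fine-tuning a model using the minimal example script we used on the WikiArt demo model

If you'd like to train a similar model to the one I made on WikiArt, you can uncomment and run the cells below. Since this takes a while and may exhaust your GPU memory, I recommend doing this after working through the rest of this notebook.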

## To download the fine-tuning script:
# !wget https://github.com/huggingface/diffusion-models-class/raw/main/unit2/finetune_model.py

## To run the script, training the face model on some vintage faces
## (ideally run this in a terminal):
# !python finetune_model.py --image_size 128 --batch_size 8 --num_epochs 16\
#     --grad_accumulation_steps 2 --start_model "google/ddpm-celebahq-256"\
#     --dataset_name "Norod78/Vintage-Faces-FFHQAligned" --wandb_project 'dm-finetune'\
#     --log_samples_every 100 --save_model_every 1000 --model_save_name 'vintageface'

Saving and Loading Fine-Tuned Pipelines

Now that we've fine-tuned the U-Net in our diffusion model, let's save it to a local folder by running:

image_pipe.save_pretrained("my-finetuned-model")

As we saw in Unit 1, this will save the config, model, and scheduler:

>>> !ls {"my-finetuned-model"}

model_index.json  scheduler  unet

Next, you can follow the same steps outlined in Unit 1's Introduction to Diffusers to push the model to the Hub for later use:

# @title Upload a locally saved pipeline to the hub

# Code to upload a pipeline saved locally to the hub
from huggingface_hub import HfApi, ModelCard, create_repo, get_full_repo_name

# Set up repo and upload files
model_name = "ddpm-celebahq-finetuned-butterflies-2epochs"  # @param What you want it called on the hub
local_folder_name = (
    "my-finetuned-model"  # @param Created by the script or one you created via image_pipe.save_pretrained('save_name')
)
description = "Describe your model here"  # @param
hub_model_id = get_full_repo_name(model_name)
create_repo(hub_model_id)
api = HfApi()
api.upload_folder(folder_path=f"{local_folder_name}/scheduler", path_in_repo="", repo_id=hub_model_id)
api.upload_folder(folder_path=f"{local_folder_name}/unet", path_in_repo="", repo_id=hub_model_id)
api.upload_file(
    path_or_fileobj=f"{local_folder_name}/model_index.json",
    path_in_repo="model_index.json",
    repo_id=hub_model_id,
)

# Add a model card (optional but nice!)
content = f"""
---
license: mit
tags:
- pytorch
- diffusers
- unconditional-image-generation
- diffusion-models-class
---

# Example Fine-Tuned Model for Unit 2 of the [Diffusion Models Class 🧨](https://github.com/huggingface/diffusion-models-class)

{description}

## Usage

```python
from diffusers import DDPMPipeline

pipeline = DDPMPipeline.from_pretrained('{hub_model_id}')
image = pipeline().images[0]
image
```
"""

card = ModelCard(content)
card.push_to_hub(hub_model_id)

Congratulations, you've now fine-tuned your first diffusion model!

For the rest of this notebook we'll use a [model](https://huggingface.co/johnowhitaker/sd-class-wikiart-from-bedrooms) I fine-tuned from [this model trained on LSUN bedrooms](https://huggingface.co/google/ddpm-bedroom-256) for approximately one epoch on the [WikiArt dataset](https://huggingface.co/datasets/huggan/wikiart). If you'd prefer, you can skip this cell and use the faces/butterflies pipeline we fine-tuned in the previous section or load one from the Hub instead:

>>> # Load the pretrained pipeline
>>> pipeline_name = "johnowhitaker/sd-class-wikiart-from-bedrooms"
>>> image_pipe = DDPMPipeline.from_pretrained(pipeline_name).to(device)

>>> # Sample some images with a DDIM Scheduler over 40 steps
>>> scheduler = DDIMScheduler.from_pretrained(pipeline_name)
>>> scheduler.set_timesteps(num_inference_steps=40)

>>> # Random starting point (batch of 8 images)
>>> x = torch.randn(8, 3, 256, 256).to(device)

>>> # Minimal sampling loop
>>> for i, t in tqdm(enumerate(scheduler.timesteps)):
...     model_input = scheduler.scale_model_input(x, t)
...     with torch.no_grad():
...         noise_pred = image_pipe.unet(model_input, t)["sample"]
...     x = scheduler.step(noise_pred, t, x).prev_sample

>>> # View the results
>>> grid = torchvision.utils.make_grid(x, nrow=4)
>>> plt.imshow(grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5)

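Consideration 5: it is often hard to tell how well fine-tuning is going, and what "good performance" means may vary by use case. For example, if you're fine-tuning a text-conditioned model like stable diffusion on a small dataset, you probably want it to retain most of its original training so that it can understand arbitrary prompts not covered by your new dataset, while adapting to better match the style of your new training data. This could mean using a low learning rate alongside something like exponential model averaging, as demonstrated in this great blog post about creating a pokemon version of stable diffusion. In a different situation, you may want to completely re-train a model on new data (such as our bedroom -> wikiart example), in which case a larger learning rate and more training make sense. Even though the loss plot is not showing much improvement, the samples clearly show a move away from the original data and towards more "artsy" outputs, although they remain mostly incoherent.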
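
For reference, exponential model averaging keeps a slowly moving copy of the weights alongside the ones being trained, and that averaged copy is typically the one used for sampling. A minimal sketch in plain Python, with the weights represented as a list of floats and a hypothetical `ema_update` helper (real implementations apply the same rule to every parameter tensor of the model, and the decay values here are only examples):

```python
def ema_update(ema_weights: list[float], new_weights: list[float], decay: float = 0.999) -> list[float]:
    """Blend the running average with the latest weights: ema = decay * ema + (1 - decay) * new."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema_weights, new_weights)]


ema = [0.0, 1.0]      # averaged copy, initialised from the pre-trained weights
trained = [1.0, 1.0]  # weights after a (pretend) optimizer step
for _ in range(3):
    ema = ema_update(ema, trained, decay=0.9)

print([round(v, 3) for v in ema])  # [0.271, 1.0]
```

A decay close to 1 means the averaged weights drift only slowly away from the original model, which is one way of retaining more of its original training.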

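Which leads us to the next section, as we examine how we might add additional guidance to such a model for better control over the outputs...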

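Guidance

What do we do if we want some control over the samples generated? For example, say we wanted to bias the generated images to be a specific color. How would we go about that? Enter guidance, a technique for adding additional control to the sampling process.

The first step is to create our conditioning function: some measure (loss) which we'd like to minimize. Here's one for the color example, which compares the pixels of an image to a target color (by default, a sort of light teal) and returns the average error: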

def color_loss(images, target_color=(0.1, 0.9, 0.5)):
    """Given a target color (R, G, B) return a loss for how far away on average
    the images' pixels are from that color. Defaults to a light teal: (0.1, 0.9, 0.5)"""
    target = torch.tensor(target_color).to(images.device) * 2 - 1  # Map target color to (-1, 1)
    target = target[None, :, None, None]  # Get shape right to work with the images (b, c, h, w)
    error = torch.abs(images - target).mean()  # Mean absolute difference between the image pixels and the target color
    return error
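
As a quick sanity check on the arithmetic, the sketch below redoes the same calculation in plain Python for a single pixel, so no tensors are needed. The default target (0.1, 0.9, 0.5) maps to (-0.8, 0.8, 0.0) in the model's (-1, 1) range; the helper name `pixel_color_error` is made up for this illustration.

```python
def pixel_color_error(pixel: tuple[float, float, float], target_color=(0.1, 0.9, 0.5)) -> float:
    """Mean absolute difference between one (-1, 1)-range pixel and a (0, 1)-range target color."""
    target = [c * 2 - 1 for c in target_color]  # Map target color to (-1, 1)
    return sum(abs(p - t) for p, t in zip(pixel, target)) / len(target)


mid_gray = (0.0, 0.0, 0.0)  # 0.0 is mid-gray once images are scaled to (-1, 1)
teal = (-0.8, 0.8, 0.0)     # the default target itself

print(round(pixel_color_error(mid_gray), 4))  # 0.5333
print(round(pixel_color_error(teal), 4))      # 0.0
```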

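Next, we'll make a modified version of the sampling loop where, at each step, we do the following:

- Create a new version of x that has requires_grad = True
- Calculate the denoised version (x0)
- Feed the predicted x0 through our loss function
- Find the gradient of this loss function with respect to x
- Use this conditioning gradient to modify x before we step with the scheduler, hopefully pushing x in a direction that will lead to lower loss according to our guidance function

There are two variants here that you can explore. In the first, we set requires_grad on x after we get our noise prediction from the UNet, which is more memory efficient (since we don't have to trace gradients back through the diffusion model), but gives a less accurate gradient. In the second, we set requires_grad on x first, then feed it through the UNet and calculate the predicted x0.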

>>> # Variant 1: shortcut method

>>> # The guidance scale determines the strength of the effect
>>> guidance_loss_scale = 40  # Explore changing this to 5, or 100

>>> x = torch.randn(8, 3, 256, 256).to(device)

>>> for i, t in tqdm(enumerate(scheduler.timesteps)):

...     # Prepare the model input
...     model_input = scheduler.scale_model_input(x, t)

...     # predict the noise residual
...     with torch.no_grad():
...         noise_pred = image_pipe.unet(model_input, t)["sample"]

...     # Set x.requires_grad to True
...     x = x.detach().requires_grad_()

...     # Get the predicted x0
...     x0 = scheduler.step(noise_pred, t, x).pred_original_sample

...     # Calculate loss
...     loss = color_loss(x0) * guidance_loss_scale
...     if i % 10 == 0:
...         print(i, "loss:", loss.item())

...     # Get gradient
...     cond_grad = -torch.autograd.grad(loss, x)[0]

...     # Modify x based on this gradient
...     x = x.detach() + cond_grad

...     # Now step with scheduler
...     x = scheduler.step(noise_pred, t, x).prev_sample

>>> # View the output
>>> grid = torchvision.utils.make_grid(x, nrow=4)
>>> im = grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5
>>> Image.fromarray(np.array(im * 255).astype(np.uint8))

0 loss: 27.279136657714844
10 loss: 11.286816596984863
20 loss: 10.683112144470215
30 loss: 10.942476272583008

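The second option requires nearly double the GPU RAM to run, even though we only generate a batch of four images instead of eight. See if you can spot the difference, and think through why this way is more "accurate":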

>>> # Variant 2: setting x.requires_grad before calculating the model predictions

>>> guidance_loss_scale = 40
>>> x = torch.randn(4, 3, 256, 256).to(device)

>>> for i, t in tqdm(enumerate(scheduler.timesteps)):

...     # Set requires_grad before the model forward pass
...     x = x.detach().requires_grad_()
...     model_input = scheduler.scale_model_input(x, t)

...     # predict (with grad this time)
...     noise_pred = image_pipe.unet(model_input, t)["sample"]

...     # Get the predicted x0:
...     x0 = scheduler.step(noise_pred, t, x).pred_original_sample

...     # Calculate loss
...     loss = color_loss(x0) * guidance_loss_scale
...     if i % 10 == 0:
...         print(i, "loss:", loss.item())

...     # Get gradient
...     cond_grad = -torch.autograd.grad(loss, x)[0]

...     # Modify x based on this gradient
...     x = x.detach() + cond_grad

...     # Now step with scheduler
...     x = scheduler.step(noise_pred, t, x).prev_sample


>>> grid = torchvision.utils.make_grid(x, nrow=4)
>>> im = grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5
>>> Image.fromarray(np.array(im * 255).astype(np.uint8))

0 loss: 30.750328063964844
10 loss: 18.550724029541016
20 loss: 17.515094757080078
30 loss: 17.55681037902832

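In this second variant, the memory requirements are higher and the effect is less pronounced, so you may think that this is inferior. However, the outputs are arguably closer to the types of images the model was trained on, and you can always increase the guidance scale for a stronger effect. Which approach you use will ultimately come down to what works best experimentally.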
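
One way to think about the difference between the two variants is with a toy scalar example. Here a made-up linear function stands in for the UNet (`toy_unet` and all constants are purely illustrative), and a squared error replaces the color loss so the derivative is easy to write by hand. Variant 1 treats the noise prediction as a constant, so it misses the term describing how the prediction itself changes with x; variant 2 includes that term and matches a finite-difference check.

```python
import math

ALPHA_BAR = 0.5  # illustrative noise level
K = 0.3          # slope of the toy "UNet"
TARGET = 1.0


def toy_unet(x: float) -> float:
    """Hypothetical stand-in for the noise prediction network."""
    return K * x


def loss_fn(x: float) -> float:
    """Squared error between the predicted x0 and the target."""
    x0 = (x - math.sqrt(1 - ALPHA_BAR) * toy_unet(x)) / math.sqrt(ALPHA_BAR)
    return (x0 - TARGET) ** 2


x = 2.0
x0 = (x - math.sqrt(1 - ALPHA_BAR) * toy_unet(x)) / math.sqrt(ALPHA_BAR)
dloss_dx0 = 2 * (x0 - TARGET)

# Variant 1: noise prediction held fixed, so d(x0)/dx = 1 / sqrt(alpha_bar)
grad_v1 = dloss_dx0 / math.sqrt(ALPHA_BAR)
# Variant 2: also differentiate through the model, where d(eps)/dx = K
grad_v2 = dloss_dx0 * (1 - math.sqrt(1 - ALPHA_BAR) * K) / math.sqrt(ALPHA_BAR)

# Central finite difference of the full loss, for comparison
h = 1e-5
grad_fd = (loss_fn(x + h) - loss_fn(x - h)) / (2 * h)

print(abs(grad_v2 - grad_fd) < 1e-6, abs(grad_v1 - grad_fd) < 1e-6)  # True False
```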

# Exercise: pick your favourite colour and look up its values in RGB space.
# Edit the `color_loss()` line in the cell above to receive these new RGB values and examine the outputs - do they match what you expect?

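CLIP Guidance

Guiding towards a color gives us a little bit of control, but what if we could just type some text describing what we want?

CLIP is a model created by OpenAI that allows us to compare images to text captions. This is extremely powerful, since it allows us to quantify how well an image matches a prompt. And since the process is differentiable, we can use this as a loss function to guide our diffusion model!

We won't go too much into the details here. The basic approach is as follows:

- Embed the text prompt to get a 512-dimensional CLIP embedding of the text
- For every step in the diffusion model process:
  - Make several variants of the predicted denoised image (having multiple variations gives a cleaner loss signal)
  - For each one, embed the image with CLIP and compare this embedding with the text embedding of the prompt (using a measure called "Great Circle Distance Squared")
- Calculate the gradient of this loss with respect to the current noisy x and use this gradient to modify x before updating it with the scheduler.

For a deeper explanation of CLIP, check out this lesson on the topic or this report on the OpenCLIP project which we're using to load the CLIP model. Run the next cell to load a CLIP model: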

# @markdown load a CLIP model and define the loss function
import open_clip

clip_model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
clip_model.to(device)

# Transforms to resize and augment an image + normalize to match CLIP's training data
tfms = torchvision.transforms.Compose(
    [
        torchvision.transforms.RandomResizedCrop(224),  # Random CROP each time
        torchvision.transforms.RandomAffine(5),  # One possible random augmentation: skews the image
        torchvision.transforms.RandomHorizontalFlip(),  # You can add additional augmentations if you like
        torchvision.transforms.Normalize(
            mean=(0.48145466, 0.4578275, 0.40821073),
            std=(0.26862954, 0.26130258, 0.27577711),
        ),
    ]
)


# And define a loss function that takes an image, embeds it and compares with
# the text features of the prompt
def clip_loss(image, text_features):
    image_features = clip_model.encode_image(tfms(image))  # Note: applies the above transforms
    input_normed = torch.nn.functional.normalize(image_features.unsqueeze(1), dim=2)
    embed_normed = torch.nn.functional.normalize(text_features.unsqueeze(0), dim=2)
    dists = input_normed.sub(embed_normed).norm(dim=2).div(2).arcsin().pow(2).mul(2)  # Squared Great Circle Distance
    return dists.mean()
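
The one-liner that computes dists packs several operations together. For unit vectors, arcsin(||a - b|| / 2) is half the angle between them, so the result equals the squared angle divided by two. The plain-Python sketch below (with a hypothetical helper name and tiny 2-D vectors instead of 512-dimensional CLIP embeddings) follows the same chain of operations:

```python
import math


def squared_great_circle_distance(a: list[float], b: list[float]) -> float:
    """Same chain as clip_loss: normalize, subtract, norm, / 2, arcsin, square, * 2."""
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    a_unit = [v / norm_a for v in a]
    b_unit = [v / norm_b for v in b]
    chord = math.sqrt(sum((p - q) ** 2 for p, q in zip(a_unit, b_unit)))
    # Guard against values a hair above 1.0 caused by floating point rounding
    return 2 * math.asin(min(chord / 2, 1.0)) ** 2


print(squared_great_circle_distance([1.0, 0.0], [2.0, 0.0]))             # 0.0 (same direction)
print(round(squared_great_circle_distance([1.0, 0.0], [0.0, 3.0]), 4))   # 1.2337 (90 degrees: pi**2 / 8)
print(round(squared_great_circle_distance([1.0, 0.0], [-1.0, 0.0]), 4))  # 4.9348 (opposite: pi**2 / 2)
```

Only the direction of each embedding matters here, not its length, which is why both inputs are normalized first.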

With a loss function defined, our guided sampling loop looks similar to the previous examples, replacing color_loss() with our new clip-based loss function:

>>> # @markdown applying guidance using CLIP

>>> prompt = "Red Rose (still life), red flower painting"  # @param

>>> # Explore changing this
>>> guidance_scale = 8  # @param
>>> n_cuts = 4  # @param

>>> # More steps -> more time for the guidance to have an effect
>>> scheduler.set_timesteps(50)

>>> # We embed a prompt with CLIP as our target
>>> text = open_clip.tokenize([prompt]).to(device)
>>> with torch.no_grad(), torch.cuda.amp.autocast():
...     text_features = clip_model.encode_text(text)


>>> x = torch.randn(4, 3, 256, 256).to(device)  # RAM usage is high, you may want only 1 image at a time

>>> for i, t in tqdm(enumerate(scheduler.timesteps)):

...     model_input = scheduler.scale_model_input(x, t)

...     # predict the noise residual
...     with torch.no_grad():
...         noise_pred = image_pipe.unet(model_input, t)["sample"]

...     cond_grad = 0

...     for cut in range(n_cuts):

...         # Set requires grad on x
...         x = x.detach().requires_grad_()

...         # Get the predicted x0:
...         x0 = scheduler.step(noise_pred, t, x).pred_original_sample

...         # Calculate loss
...         loss = clip_loss(x0, text_features) * guidance_scale

...         # Get gradient (scale by n_cuts since we want the average)
...         cond_grad -= torch.autograd.grad(loss, x)[0] / n_cuts

...     if i % 25 == 0:
...         print("Step:", i, ", Guidance loss:", loss.item())

...     # Modify x based on this gradient
...     alpha_bar = scheduler.alphas_cumprod[i]
...     x = x.detach() + cond_grad * alpha_bar.sqrt()  # Note the additional scaling factor here!

...     # Now step with scheduler
...     x = scheduler.step(noise_pred, t, x).prev_sample


>>> grid = torchvision.utils.make_grid(x.detach(), nrow=4)
>>> im = grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5
>>> Image.fromarray(np.array(im * 255).astype(np.uint8))

Step: 0 , Guidance loss: 7.437869548797607
Step: 25 , Guidance loss: 7.174620628356934

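Those look sort of like roses! It's not perfect, but if you play around with the settings you can get some pleasing images with this.

If you examine the code above you'll see I'm scaling the conditioning gradient by a factor of alpha_bar.sqrt(). There is some theory showing the "right" way of scaling these gradients, but in practice this is also something you can experiment with. For some types of guidance, you might want most of the effect concentrated in the early steps; for others (say, a style loss focused on textures) you might prefer that they only kick in towards the end of the generation process. Some possible schedules are shown below: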

>>> # @markdown Plotting some possible schedules:
>>> plt.plot([1 for a in scheduler.alphas_cumprod], label="no scaling")
>>> plt.plot([a for a in scheduler.alphas_cumprod], label="alpha_bar")
>>> plt.plot([a.sqrt() for a in scheduler.alphas_cumprod], label="alpha_bar.sqrt()")
>>> plt.plot([(1 - a).sqrt() for a in scheduler.alphas_cumprod], label="(1-alpha_bar).sqrt()")
>>> plt.legend()
>>> plt.title("Possible guidance scaling schedules")
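
To get a feel for the numbers behind these curves without plotting, the sketch below rebuilds alphas_cumprod in plain Python. It assumes a linear beta schedule from 0.0001 to 0.02 over 1000 steps (a common DDPM default; check scheduler.config for the values your scheduler actually uses) and prints each candidate scaling factor at low, medium and high noise.

```python
import math

num_train_timesteps = 1000
beta_start, beta_end = 1e-4, 0.02  # assumed linear schedule

# alphas_cumprod[t] is the running product of (1 - beta) up to timestep t
alphas_cumprod = []
running = 1.0
for t in range(num_train_timesteps):
    beta = beta_start + (beta_end - beta_start) * t / (num_train_timesteps - 1)
    running *= 1 - beta
    alphas_cumprod.append(running)

schedules = {
    "no scaling": lambda a: 1.0,
    "alpha_bar": lambda a: a,
    "alpha_bar.sqrt()": lambda a: math.sqrt(a),
    "(1-alpha_bar).sqrt()": lambda a: math.sqrt(1 - a),
}

for name, fn in schedules.items():
    # Timestep 0 is almost noise-free, timestep 999 is almost pure noise
    values = [round(fn(alphas_cumprod[t]), 3) for t in (0, 500, 999)]
    print(f"{name:>22}: {values}")
```

Under this assumption alpha_bar starts just below 1 and decays towards 0, so the first two non-trivial schedules emphasise low-noise timesteps while (1-alpha_bar).sqrt() emphasises high-noise ones.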

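Experiment with different schedules, guidance scales and any other tricks you can think of (clipping the gradients within some range is a popular modification) to see how good you can get this! Also make sure you try swapping in other models. Perhaps the faces model we loaded at the start - can you reliably guide it to produce a male face? What if you combine CLIP guidance with the color loss we used earlier? And so on.

If you check out some code for CLIP-guided diffusion in practice, you'll see a more complex approach with a better class for picking random cutouts from the images and lots of additional tweaks to the loss function for better performance. Before text-conditioned diffusion models came along, this was the best text-to-image system there was! Our little toy version here has lots of room to improve, but it captures the core idea: thanks to guidance plus the amazing capabilities of CLIP, we can add text control to an unconditional diffusion model 🎨.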

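Sharing a Custom Sampling Loop as a Gradio Demo

Perhaps you've figured out a fun loss to guide generation with, and you now want to share both your fine-tuned model and this custom sampling strategy with the world...

Enter Gradio. Gradio is a free and open-source tool that lets users easily create and share interactive machine learning models through a simple web interface. With Gradio, users can build custom interfaces for their machine learning models, which can then be shared with others through a unique URL. It is also integrated into 🤗 Spaces, which makes it easy to host demos and share them with others.

We'll put our core logic in a function that takes some inputs and produces an image as the output. This can then be wrapped in a simple interface that allows the user to specify some parameters (which are passed as inputs to the main generate function). There are many components available - for this example we'll use a slider for the guidance scale and a color picker to define the target color.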

%pip install -q gradio # Install the library

import gradio as gr
from PIL import Image, ImageColor


# The function that does the hard work
def generate(color, guidance_loss_scale):
    target_color = ImageColor.getcolor(color, "RGB")  # Target color as RGB
    target_color = [a / 255 for a in target_color]  # Rescale from (0, 255) to (0, 1)
    x = torch.randn(1, 3, 256, 256).to(device)
    for i, t in tqdm(enumerate(scheduler.timesteps)):
        model_input = scheduler.scale_model_input(x, t)
        with torch.no_grad():
            noise_pred = image_pipe.unet(model_input, t)["sample"]
        x = x.detach().requires_grad_()
        x0 = scheduler.step(noise_pred, t, x).pred_original_sample
        loss = color_loss(x0, target_color) * guidance_loss_scale
        cond_grad = -torch.autograd.grad(loss, x)[0]
        x = x.detach() + cond_grad
        x = scheduler.step(noise_pred, t, x).prev_sample
    grid = torchvision.utils.make_grid(x, nrow=4)
    im = grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5
    im = Image.fromarray(np.array(im * 255).astype(np.uint8))
    im.save("test.jpeg")
    return im


# See the gradio docs for the types of inputs and outputs available
inputs = [
    gr.ColorPicker(label="color", value="#55FFAA"),  # Add any inputs you need here
    gr.Slider(label="guidance_scale", minimum=0, maximum=30, value=3),
]
outputs = gr.Image(label="result")

# And the minimal interface
demo = gr.Interface(
    fn=generate,
    inputs=inputs,
    outputs=outputs,
    examples=[
        ["#BB2266", 3],
        ["#44CCAA", 5],  # You can provide some example inputs to get people started
    ],
)
demo.launch(debug=True)  # debug=True allows you to see errors and output in Colab
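
The color picker hands generate() a hex string such as "#BB2266", which is converted to RGB values in (0, 1) before being passed to color_loss. The sketch below shows that conversion in plain Python, as a stand-in for the ImageColor.getcolor call above; `hex_to_unit_rgb` is a hypothetical helper, and it assumes the six-digit #RRGGBB form (some Gradio versions may return other formats, which would need extra handling).

```python
def hex_to_unit_rgb(color: str) -> list[float]:
    """Convert a '#RRGGBB' string to [r, g, b] floats in the range (0, 1)."""
    digits = color.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"Expected a color like '#55FFAA', got {color!r}")
    # Each pair of hex digits is one channel in the range 0..255
    return [int(digits[i : i + 2], 16) / 255 for i in (0, 2, 4)]


print([round(c, 3) for c in hex_to_unit_rgb("#55FFAA")])  # [0.333, 1.0, 0.667]
```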

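It is possible to build much more complicated interfaces, with fancy styling and a large array of possible inputs, but for this demo we keep it as simple as possible.

Demos on 🤗 Spaces run on CPU by default, so it is nice to prototype your interface in Colab (as above) before migrating. When you're ready to share your demo, you'll create a space, set up a requirements.txt file listing the libraries your code will use, and then place all of the code in an app.py file that defines the relevant functions and the interface.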

[Screenshot: the demo Space]

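Fortunately for you, there is also the option to "duplicate" a space. You can visit my demo space here (shown above) and click "Duplicate this Space" to get a template that you can then modify to use your own model and guidance function.

In the settings, you can configure your space to run on fancier hardware (charged per hour). Made something amazing and want to share it on better hardware, but don't have the money? Let us know via Discord and we'll see if we can help!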

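Summary and Next Steps

We've covered a lot in this notebook! Let's recap the core ideas:

- It's relatively easy to load in existing models and sample them with different schedulers
- Fine-tuning looks just like training from scratch, except that by starting from an existing model we hope to get better results more quickly
- To fine-tune large models on big images, we can use tricks like gradient accumulation to get around batch size limitations
- Logging sample images is important for fine-tuning, where a loss curve might not show much useful information
- Guidance allows us to take an unconditional model and steer the generation process based on some guidance/loss function, where at each step we find the gradient of the loss with respect to the noisy image x and update it according to this gradient before moving on to the next timestep
- Guiding with CLIP lets us control unconditional models with text!

To put this into practice, here are some specific next steps you can take:

- Fine-tune your own model and push it to the hub. This will involve picking a starting point (e.g. a model trained on faces, bedrooms, cats or the wikiart example above) and a dataset (perhaps these animal faces or your own images) and then running either the code in this notebook or the example script (see the demo usage earlier in this notebook).
- Explore guidance using your fine-tuned model, either using one of the example guidance functions (color_loss or CLIP) or inventing your own.
- Share a demo based on this using Gradio, either modifying the example space to use your own model or creating your own custom version with more functionality.

We look forward to seeing your results on Discord, Twitter, and elsewhere 🤗!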
