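DreamBooth Hackathon 🏆

Welcome to the DreamBooth Hackathon! In this competition, you'll personalize a Stable Diffusion model by fine-tuning it on a handful of your own images. To do so, we'll use a technique called DreamBooth, which lets you implant a subject (e.g. your pet or favourite dish) into the output domain of the model so that it can be synthesized using a unique identifier in the prompt.

Let's dive in!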

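Prerequisites

Before diving into this notebook, you should read the following:

The Unit 3 README, which gives an in-depth overview of Stable Diffusion
The DreamBooth blog post, to get a sense of what's possible with this technique
Hugging Face's blog post on best practices for fine-tuning Stable Diffusion with DreamBooth

🚨 Note: the code in this notebook requires at least 14GB of GPU vRAM and is a simplified version of the official training script provided in 🤗 Diffusers. It produces decent models for most applications, but if you have at least 24GB of vRAM available, we recommend experimenting with advanced features such as class-wise prior preservation loss and fine-tuning the text encoder. Check out the 🤗 Diffusers docs for more details.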

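What is DreamBooth?

DreamBooth is a technique for teaching new concepts to Stable Diffusion using a specialized form of fine-tuning. If you're on Twitter or Reddit, you may have seen people using this technique to create (often hilarious) avatars of themselves. For example, here's what Andrej Karpathy would look like as a cowboy (you may need to run the cell to see the output):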

%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Stableboost auto-suggests a few hundred prompts by default but you can generate additional variations for any one prompt that seems to be giving fun/interesting results, or adjust it in any way: <a href="https://#/qWmadiXftP">pic.twitter.com/qWmadiXftP</a></p>&mdash; Andrej Karpathy (@karpathy) <a href="https://twitter.com/karpathy/status/1600578187141840896?ref_src=twsrc%5Etfw">December 7, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

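DreamBooth works as follows:

Collect around 10-20 input images of a subject (e.g. your dog) and define a unique identifier [V] that refers to the subject. This identifier is usually a made-up word like flffydog, which is implanted in different text prompts at inference time to place the subject in different contexts.
Fine-tune the diffusion model by providing the images together with a text prompt like "A photo of a [V] dog" that contains the unique identifier and the class name (i.e. "dog" in this example).
(Optionally) Apply a special class-specific prior preservation loss, which leverages the semantic prior that the model has on the class and encourages it to generate diverse instances belonging to the subject's class by injecting the class name in the text prompt. In practice, this step is only really needed for human faces and can be skipped for the themes we'll be exploring in this hackathon.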
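
The optional prior-preservation step can be pictured as adding a second term to the training loss. The following is a minimal sketch in plain Python, assuming mean-squared-error losses on noise predictions. The helper names and the `prior_loss_weight` parameter are illustrative assumptions and do not appear in this notebook, which skips this step.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(instance_pred, instance_noise, class_pred, class_noise, prior_loss_weight=1.0):
    """Instance loss plus a weighted prior-preservation loss.

    instance_*: predictions/targets for images of the subject ("a photo of [V] dog")
    class_*:    predictions/targets for generic class images ("a photo of a dog")
    """
    instance_loss = mse(instance_pred, instance_noise)
    prior_loss = mse(class_pred, class_noise)
    return instance_loss + prior_loss_weight * prior_loss


# Toy numbers, purely for illustration (not real model outputs)
loss = dreambooth_loss(
    instance_pred=[0.1, 0.2],
    instance_noise=[0.0, 0.0],
    class_pred=[0.5, 0.5],
    class_noise=[0.5, 0.0],
    prior_loss_weight=0.5,
)
print(loss)
```

Setting `prior_loss_weight=0.0` in this sketch recovers the plain instance loss, which is the setting this notebook effectively uses.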

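An overview of the DreamBooth technique is shown in the image below:

What can DreamBooth do?

Besides putting your subject in interesting locations, DreamBooth can be used for text-guided view synthesis, where the subject is viewed from different viewpoints, as shown in the following example:

DreamBooth can also be used to modify properties of the subject, such as its colour, or to mix up animal species!

Now that we've seen some of the cool things DreamBooth can do, let's start training our own models!

Step 1: Setup

If you're running this notebook on Google Colab or Kaggle, run the cell below to install the required libraries: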

%pip install -qqU diffusers transformers bitsandbytes accelerate ftfy datasets

If you're running on Kaggle, you'll need to install the latest version of PyTorch to work with 🤗 Accelerate:

# Uncomment and run if using Kaggle's notebooks. You may need to restart the notebook afterwards
# %pip install -U torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

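To be able to push your models to the Hub and have them appear on the DreamBooth Leaderboard, there are a few more steps to follow. First, create an access token with write permissions from your Hugging Face account, then execute the following cell and enter your token: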

from huggingface_hub import notebook_login

notebook_login()

The final step is to install Git LFS:

%%capture
!sudo apt -qq install git-lfs
!git config --global credential.helper store

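Step 2: Pick a theme

This competition is composed of 5 themes, each of which collects models belonging to the following categories:

Animal 🐨: Use this theme to generate images of your pet or favourite animal hanging out in the Acropolis, swimming, or flying in space.
Science 🔬: Use this theme to generate cool synthetic images of galaxies, proteins, or any domain of the natural and medical sciences.
Food 🍔: Use this theme to tune Stable Diffusion on your favourite dish or cuisine.
Landscape 🏔: Use this theme to generate beautiful landscape images of your favourite mountain, lake, or garden.
Wildcard 🔥: Use this theme to go wild and create Stable Diffusion models for any category of your choosing!

We'll be giving out prizes to the top 3 most liked models per theme, and you're encouraged to submit as many models as you want! Run the cell below to create a dropdown widget where you can select the theme you wish to submit to: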

import ipywidgets as widgets

theme = "animal"
drop_down = widgets.Dropdown(
    options=["animal", "science", "food", "landscape", "wildcard"],
    description="Pick a theme",
    disabled=False,
)


def dropdown_handler(change):
    global theme
    theme = change.new


drop_down.observe(dropdown_handler, names="value")
display(drop_down)

>>> print(f"You've selected the {theme} theme!")

You've selected the animal theme!

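Step 3: Create an image dataset and upload it to the Hub

Once you've picked a theme, the next step is to create a dataset of images for that theme and upload it to the Hugging Face Hub:

You'll need around 10-20 images of the subject that you wish to implant in your model. These can be photos you've taken or downloaded from platforms like Unsplash. Alternatively, you can take a look at any of the image datasets on the Hugging Face Hub for inspiration.
For best results, we recommend using images of your subject from different angles and perspectives.

Once you've collected your images in a folder, you can upload them to the Hub by dragging and dropping them in the UI. See this guide for more details, or watch the video below: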

>>> from IPython.display import YouTubeVideo

>>> YouTubeVideo("HaN6qCr_Afc")

Alternatively, you can load your dataset locally using the imagefolder feature of 🤗 Datasets and then push it to the Hub:

from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="your_folder_of_images")
# Push to Hub
dataset.push_to_hub("dreambooth-hackathon-images")
dataset = dataset['train']

Once you've created your dataset, you can download it with the load_dataset() function as follows:

from datasets import load_dataset

dataset_id = "lewtun/corgi"  # CHANGE THIS TO YOUR {hub_username}/{dataset_id}
dataset = load_dataset(dataset_id, split="train")
dataset

Now that we have our dataset, let's define a helper function to view a few of the images:

>>> from PIL import Image


>>> def image_grid(imgs, rows, cols):
...     assert len(imgs) == rows * cols
...     w, h = imgs[0].size
...     grid = Image.new("RGB", size=(cols * w, rows * h))
...     grid_w, grid_h = grid.size
...     for i, img in enumerate(imgs):
...         grid.paste(img, box=(i % cols * w, i // cols * h))
...     return grid


>>> num_samples = 4
>>> image_grid(dataset["image"][:num_samples], rows=1, cols=num_samples)

If this looks good, you can move on to the next step: creating a PyTorch dataset for training with DreamBooth.
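
The paste coordinates computed inside image_grid() can be checked in isolation. This is a small sketch that reuses the same index arithmetic; the function name `grid_positions` is a hypothetical helper introduced only for illustration.

```python
def grid_positions(num_images, cols, w, h):
    """Top-left (x, y) paste coordinates for images laid out row by row."""
    return [(i % cols * w, i // cols * h) for i in range(num_images)]


# Six 512x512 images arranged in 2 rows of 3 columns
positions = grid_positions(num_images=6, cols=3, w=512, h=512)
for index, (x, y) in enumerate(positions):
    print(f"image {index} -> x={x}, y={y}")
```

The column index is `i % cols` and the row index is `i // cols`, so images fill the grid left to right, then top to bottom.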

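Step 3: Create a training dataset

To create a training set for our images we need a few components:

An instance prompt that is used to prime the model at the start of training. In most cases, using "a photo of [identifier] [class noun]" works quite well, e.g. "a photo of ccorgi dog" for our cute Corgi pictures.
- Note: it is recommended that you pick a unique/made-up word like ccorgi to describe your subject. This ensures that a common word in the model's vocabulary isn't overwritten.
A tokenizer that is used to convert the instance prompt into input IDs that can be fed to the text encoder of Stable Diffusion.
A set of image transforms, notably resizing the images to a common shape and normalizing the pixel values to a common mean and standard deviation.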
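
To make the normalization step concrete, here is a small sketch of the arithmetic behind normalizing with a mean of 0.5 and a standard deviation of 0.5, the values used for the image transforms in this notebook. It assumes pixel values have already been scaled from 8-bit integers to the range [0, 1]; the helper names are illustrative.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Apply (pixel - mean) / std to a value already scaled to [0, 1]."""
    return (pixel - mean) / std


def denormalize(value, mean=0.5, std=0.5):
    """Invert normalize(), mapping [-1, 1] back to [0, 1]."""
    return value * std + mean


for raw in (0, 128, 255):
    scaled = raw / 255  # scale an 8-bit pixel value to [0, 1]
    print(raw, round(normalize(scaled), 3))
```

With these settings the range [0, 1] is mapped to [-1, 1], which is the input range the VAE of Stable Diffusion expects in this training setup.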

With that in mind, let's start by defining the instance prompt:

>>> name_of_your_concept = "ccorgi"  # CHANGE THIS ACCORDING TO YOUR SUBJECT
>>> type_of_thing = "dog"  # CHANGE THIS ACCORDING TO YOUR SUBJECT
>>> instance_prompt = f"a photo of {name_of_your_concept} {type_of_thing}"
>>> print(f"Instance prompt: {instance_prompt}")

Instance prompt: a photo of ccorgi dog

Next, we need to create a PyTorch Dataset object that implements the __len__ and __getitem__ dunder methods:

from torch.utils.data import Dataset
from torchvision import transforms


class DreamBoothDataset(Dataset):
    def __init__(self, dataset, instance_prompt, tokenizer, size=512):
        self.dataset = dataset
        self.instance_prompt = instance_prompt
        self.tokenizer = tokenizer
        self.size = size
        self.transforms = transforms.Compose(
            [
                transforms.Resize(size),
                transforms.CenterCrop(size),
                transforms.ToTensor(),
                transforms.Normalize([0.5], [0.5]),
            ]
        )

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        example = {}
        image = self.dataset[index]["image"]
        example["instance_images"] = self.transforms(image)
        example["instance_prompt_ids"] = self.tokenizer(
            self.instance_prompt,
            padding="do_not_pad",
            truncation=True,
            max_length=self.tokenizer.model_max_length,
        ).input_ids
        return example

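Great, let's now check that this works by loading the CLIP tokenizer associated with the text encoder of the original Stable Diffusion model, and then creating the training dataset: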

from transformers import CLIPTokenizer

# The Stable Diffusion checkpoint we'll fine-tune
model_id = "CompVis/stable-diffusion-v1-4"
tokenizer = CLIPTokenizer.from_pretrained(
    model_id,
    subfolder="tokenizer",
)

train_dataset = DreamBoothDataset(dataset, instance_prompt, tokenizer)
train_dataset[0]

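Step 4: Define a data collator

Now that we have a training dataset, the next thing we need is a data collator. A data collator is a function that collects the elements in a batch of data and applies some logic to form a single tensor we can provide to the model. If you'd like to learn more, you can check out this video from the Hugging Face Course: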

>>> YouTubeVideo("-RPeakdlHYo")

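For DreamBooth, our data collator needs to provide the model with the input IDs from the tokenizer and the pixel values from the images as stacked tensors. The following function does the trick: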

import torch


def collate_fn(examples):
    input_ids = [example["instance_prompt_ids"] for example in examples]
    pixel_values = [example["instance_images"] for example in examples]
    pixel_values = torch.stack(pixel_values)
    pixel_values = pixel_values.to(memory_format=torch.contiguous_format).float()

    input_ids = tokenizer.pad({"input_ids": input_ids}, padding=True, return_tensors="pt").input_ids

    batch = {
        "input_ids": input_ids,
        "pixel_values": pixel_values,
    }
    return batch
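
To see what the padding step in a collator accomplishes, here is a minimal sketch in plain Python that right-pads lists of token IDs to the length of the longest one in the batch. The function `pad_input_ids`, the token IDs, and the `pad_id` value are all made up for illustration; a real tokenizer supplies its own pad token.

```python
def pad_input_ids(batch_ids, pad_id):
    """Right-pad each list of token IDs to the length of the longest one.

    Returns the padded IDs and a mask with 1 for real tokens and 0 for padding.
    """
    max_len = max(len(ids) for ids in batch_ids)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
    return padded, mask


# Made-up token IDs of different lengths
padded, mask = pad_input_ids([[1, 7, 8, 2], [1, 9, 2]], pad_id=0)
print(padded)
print(mask)
```

In this notebook every example shares the same instance prompt, so all sequences in a batch already have the same length; padding matters once prompts of different lengths are mixed.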

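Step 5: Load the components of the Stable Diffusion pipeline

We nearly have all the pieces ready for training! As you saw in the Unit 3 notebook on Stable Diffusion, the pipeline is composed of several models:

A text encoder that converts the prompts into text embeddings. Here we're using CLIP since it's the encoder used to train Stable Diffusion v1-4.
A VAE, or variational autoencoder, that converts the images to compressed representations (i.e. latents) and decompresses them at inference time.
A UNet that applies a denoising operation on the latents of the VAE.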
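
To get a feel for how much the VAE compresses an image, here is a short sketch of the shape arithmetic. It assumes the Stable Diffusion v1 VAE configuration of 4 latent channels and 8x spatial downsampling; these numbers are not stated in this notebook, and `latent_shape` is a hypothetical helper.

```python
def latent_shape(resolution, batch_size=1, latent_channels=4, downsample_factor=8):
    """Shape of the latents for a square RGB image of the given resolution."""
    if resolution % downsample_factor != 0:
        raise ValueError("resolution must be divisible by the downsample factor")
    side = resolution // downsample_factor
    return (batch_size, latent_channels, side, side)


resolution = 512
shape = latent_shape(resolution)
pixel_values = 3 * resolution * resolution       # RGB values per image
latent_values = shape[1] * shape[2] * shape[3]    # values per latent
print(shape, pixel_values // latent_values)
```

Under these assumptions the UNet works on tensors that hold far fewer values than the original image, which is what makes training in latent space comparatively cheap.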

We can load all these components using the 🤗 Diffusers and 🤗 Transformers libraries as follows:

from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPFeatureExtractor, CLIPTextModel

text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
feature_extractor = CLIPFeatureExtractor.from_pretrained("openai/clip-vit-base-patch32")

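Step 6: Fine-tune the model

Now comes the fun part: training our model with DreamBooth! As shown in Hugging Face's blog post, the most essential hyperparameters to tune are the learning rate and the number of training steps.

In general, you'll get better results with a lower learning rate, at the expense of needing more training steps. The values below are a good starting point, but you may need to adjust them according to your dataset: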

learning_rate = 2e-06
max_train_steps = 400

Next, let's wrap the other hyperparameters we need in a Namespace object to make it easier to configure the training run:

from argparse import Namespace

args = Namespace(
    pretrained_model_name_or_path=model_id,
    resolution=512,  # Reduce this if you want to save some memory
    train_dataset=train_dataset,
    instance_prompt=instance_prompt,
    learning_rate=learning_rate,
    max_train_steps=max_train_steps,
    train_batch_size=1,
    gradient_accumulation_steps=1,  # Increase this if you want to lower memory usage
    max_grad_norm=1.0,
    gradient_checkpointing=True,  # Set this to True to lower the memory usage
    use_8bit_adam=True,  # Use 8bit optimizer from bitsandbytes
    seed=3434554,
    sample_batch_size=2,
    output_dir="my-dreambooth",  # Where to save the pipeline
)
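
To see how these hyperparameters interact, the sketch below works out how many passes over the dataset a given number of training steps corresponds to. It assumes a single process (one GPU) and a hypothetical dataset of 15 images; `training_schedule` is an illustrative helper, not part of any library.

```python
import math


def training_schedule(num_images, train_batch_size, gradient_accumulation_steps, max_train_steps):
    """Derive epoch counts from the step budget, assuming a single process."""
    batches_per_epoch = math.ceil(num_images / train_batch_size)
    updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
    num_train_epochs = math.ceil(max_train_steps / updates_per_epoch)
    # Number of images that contribute to each optimizer update
    total_batch_size = train_batch_size * gradient_accumulation_steps
    return {
        "updates_per_epoch": updates_per_epoch,
        "num_train_epochs": num_train_epochs,
        "total_batch_size": total_batch_size,
    }


schedule = training_schedule(
    num_images=15,  # hypothetical dataset size
    train_batch_size=1,
    gradient_accumulation_steps=1,
    max_train_steps=400,
)
print(schedule)
```

With a small dataset, a budget of a few hundred steps means each image is seen many times, which is one reason a low learning rate tends to help.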

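The final step is to define a training_function() function that wraps the training logic and can be passed to 🤗 Accelerate to handle training on 1 or more GPUs. If this is the first time you're using 🤗 Accelerate, check out this video to get a quick overview of what it can do: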

>>> YouTubeVideo("s7dy8QRgjJ0")

The details should look similar to what we saw in Units 1 and 2 when we trained our own diffusion models from scratch:

import math

import torch.nn.functional as F
from accelerate import Accelerator
from accelerate.utils import set_seed
from diffusers import DDPMScheduler, PNDMScheduler, StableDiffusionPipeline
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from torch.utils.data import DataLoader
from tqdm.auto import tqdm


def training_function(text_encoder, vae, unet):

    accelerator = Accelerator(
        gradient_accumulation_steps=args.gradient_accumulation_steps,
    )

    set_seed(args.seed)

    if args.gradient_checkpointing:
        unet.enable_gradient_checkpointing()

    # Use 8-bit Adam for lower memory usage or to fine-tune the model in 16GB GPUs
    if args.use_8bit_adam:
        import bitsandbytes as bnb

        optimizer_class = bnb.optim.AdamW8bit
    else:
        optimizer_class = torch.optim.AdamW

    optimizer = optimizer_class(
        unet.parameters(),  # Only optimize unet
        lr=args.learning_rate,
    )

    noise_scheduler = DDPMScheduler(
        beta_start=0.00085,
        beta_end=0.012,
        beta_schedule="scaled_linear",
        num_train_timesteps=1000,
    )

    train_dataloader = DataLoader(
        args.train_dataset,
        batch_size=args.train_batch_size,
        shuffle=True,
        collate_fn=collate_fn,
    )

    unet, optimizer, train_dataloader = accelerator.prepare(unet, optimizer, train_dataloader)

    # Move text_encoder and vae to the GPU
    text_encoder.to(accelerator.device)
    vae.to(accelerator.device)

    # We need to recalculate our total training steps as the size of the training dataloader may have changed
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
    num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)

    # Train!
    total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
    # Only show the progress bar once on each machine
    progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)
    progress_bar.set_description("Steps")
    global_step = 0

    for epoch in range(num_train_epochs):
        unet.train()
        for step, batch in enumerate(train_dataloader):
            with accelerator.accumulate(unet):
                # Convert images to latent space
                with torch.no_grad():
                    latents = vae.encode(batch["pixel_values"]).latent_dist.sample()
                    latents = latents * 0.18215

                # Sample noise that we'll add to the latents
                noise = torch.randn(latents.shape).to(latents.device)
                bsz = latents.shape[0]
                # Sample a random timestep for each image
                timesteps = torch.randint(
                    0,
                    noise_scheduler.config.num_train_timesteps,
                    (bsz,),
                    device=latents.device,
                ).long()

                # Add noise to the latents according to the noise magnitude at each timestep
                # (this is the forward diffusion process)
                noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

                # Get the text embedding for conditioning
                with torch.no_grad():
                    encoder_hidden_states = text_encoder(batch["input_ids"])[0]

                # Predict the noise residual
                noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
                loss = F.mse_loss(noise_pred, noise, reduction="none").mean([1, 2, 3]).mean()

                accelerator.backward(loss)
                if accelerator.sync_gradients:
                    accelerator.clip_grad_norm_(unet.parameters(), args.max_grad_norm)
                optimizer.step()
                optimizer.zero_grad()

            # Checks if the accelerator has performed an optimization step behind the scenes
            if accelerator.sync_gradients:
                progress_bar.update(1)
                global_step += 1

            logs = {"loss": loss.detach().item()}
            progress_bar.set_postfix(**logs)

            if global_step >= args.max_train_steps:
                break

        accelerator.wait_for_everyone()

    # Create the pipeline using the trained modules and save it
    if accelerator.is_main_process:
        print(f"Loading pipeline and saving to {args.output_dir}...")
        scheduler = PNDMScheduler(
            beta_start=0.00085,
            beta_end=0.012,
            beta_schedule="scaled_linear",
            skip_prk_steps=True,
            steps_offset=1,
        )
        pipeline = StableDiffusionPipeline(
            text_encoder=text_encoder,
            vae=vae,
            unet=accelerator.unwrap_model(unet),
            tokenizer=tokenizer,
            scheduler=scheduler,
            safety_checker=StableDiffusionSafetyChecker.from_pretrained("CompVis/stable-diffusion-safety-checker"),
            feature_extractor=feature_extractor,
        )
        pipeline.save_pretrained(args.output_dir)
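
The forward diffusion step in the training loop (the call to add_noise) can be sketched in plain Python. The code below rebuilds a "scaled_linear" beta schedule with the same beta_start, beta_end, and number of timesteps as above, then mixes a latent with noise. It is a simplified illustration on lists of floats, written under the assumption that the schedule is linear in the square root of beta; it is not the library implementation.

```python
import math
import random


def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_train_timesteps=1000):
    """Betas spaced linearly in sqrt-space, then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    step = (hi - lo) / (num_train_timesteps - 1)
    return [(lo + i * step) ** 2 for i in range(num_train_timesteps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal variance kept at each timestep."""
    products, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        products.append(running)
    return products


def add_noise(latent, noise, t, alphas_cumprod):
    """Mix signal and noise: sqrt(a_t) * latent + sqrt(1 - a_t) * noise."""
    signal_scale = math.sqrt(alphas_cumprod[t])
    noise_scale = math.sqrt(1.0 - alphas_cumprod[t])
    return [signal_scale * x + noise_scale * n for x, n in zip(latent, noise)]


betas = scaled_linear_betas()
alphas_cumprod = cumulative_alphas(betas)

rng = random.Random(0)
latent = [1.0, -1.0, 0.5]  # toy "latent" values
noise = [rng.gauss(0.0, 1.0) for _ in latent]
for t in (0, 500, 999):
    print(t, round(math.sqrt(alphas_cumprod[t]), 4), add_noise(latent, noise, t, alphas_cumprod))
```

At small timesteps the output is almost the original latent, while at the last timestep it is almost pure noise. The UNet is trained to predict the noise that was added, which is why the loss compares its prediction with the sampled noise.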

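Now that we've defined the function, let's run it to train the model! Depending on the size of your dataset and the type of GPU, this can take anywhere from 5 minutes to 1 hour: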

>>> from accelerate import notebook_launcher

>>> num_of_gpus = 1  # CHANGE THIS TO MATCH THE NUMBER OF GPUS YOU HAVE
>>> notebook_launcher(training_function, args=(text_encoder, vae, unet), num_processes=num_of_gpus)

Launching training on one GPU.

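If you're running on a single GPU, you can free up some memory for the next section by copying the code below into a new cell and running it. For multi-GPU machines, 🤗 Accelerate doesn't allow any cell to directly access the GPU with torch.cuda, so we don't recommend using this trick in those cases: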

with torch.no_grad():
    torch.cuda.empty_cache()

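Step 7: Run inference and inspect generations

Now that we've trained the model, let's generate some images with it to see how it fares! First, we'll load the pipeline from the output directory we saved the model to: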

pipe = StableDiffusionPipeline.from_pretrained(
    args.output_dir,
    torch_dtype=torch.float16,
).to("cuda")

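Next, let's generate a few images. The prompt variable will later be used to set the default on the Hugging Face Hub widget, so experiment a bit to find a good one. You might also want to try creating elaborate prompts with CLIP Interrogator: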

>>> # Pick a funny prompt here and it will be used as the widget's default
>>> # when we push to the Hub in the next section
>>> prompt = f"a photo of {name_of_your_concept} {type_of_thing} in the Acropolis"

>>> # Tune the guidance to control how closely the generations follow the prompt
>>> # Values between 7-11 usually work best
>>> guidance_scale = 7

>>> num_cols = 2
>>> all_images = []
>>> for _ in range(num_cols):
...     images = pipe(prompt, guidance_scale=guidance_scale).images
...     all_images.extend(images)

>>> image_grid(all_images, 1, num_cols)

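The guidance_scale setting is commonly implemented with classifier-free guidance, where the model's noise prediction for an empty prompt is pushed towards its prediction for the text prompt. The following is a simplified sketch on plain lists of floats; `apply_guidance` and the toy numbers are illustrative assumptions, not pipeline code.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


noise_uncond = [0.0, 1.0]  # toy prediction for an empty prompt
noise_text = [1.0, 1.0]    # toy prediction for the actual prompt

for scale in (0.0, 1.0, 7.0):
    print(scale, apply_guidance(noise_uncond, noise_text, scale))
```

In this sketch a scale of 1 returns the text-conditioned prediction unchanged, and larger values exaggerate the difference between the two predictions, which is why higher guidance follows the prompt more closely.
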
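Step 8: Push your model to the Hub

If you're happy with your model, the final step is to push it to the Hub and view it on the DreamBooth Leaderboard!

First, you'll need to define a name for your model repo. By default, we use the unique identifier and class name, but feel free to change this if you want: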

# Create a name for your model on the Hub. No spaces allowed.
model_name = f"{name_of_your_concept}-{type_of_thing}"
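
Since spaces are not allowed in the model name, a small helper can clean up concept or class names that contain them. This is a hypothetical sketch: `make_repo_name` is not part of any library, and the set of characters it keeps (letters, digits, `.`, `_`, `-`) is an assumption about what the Hub accepts.

```python
import re


def make_repo_name(concept, thing):
    """Join the two parts with '-' and replace runs of other characters with '-'."""
    raw = f"{concept}-{thing}".strip()
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "-", raw)
    return cleaned.strip("-")


print(make_repo_name("ccorgi", "dog"))
print(make_repo_name("my cat", "maine coon"))
```

If you use a helper like this, assign its result to model_name before running the upload cell.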

Next, add a brief description of the type of model you've trained or any other information you'd like to share:

# Describe the theme and model you've trained
description = f"""
This is a Stable Diffusion model fine-tuned on `{type_of_thing}` images for the {theme} theme.
"""

Finally, run the cell below to create a repo on the Hub and push all our files with a nice model card to boot:

>>> # Code to upload a pipeline saved locally to the hub
>>> from huggingface_hub import HfApi, ModelCard, create_repo, get_full_repo_name

>>> # Set up repo and upload files
>>> hub_model_id = get_full_repo_name(model_name)
>>> create_repo(hub_model_id)
>>> api = HfApi()
>>> api.upload_folder(folder_path=args.output_dir, path_in_repo="", repo_id=hub_model_id)

>>> content = f"""
... ---
... license: creativeml-openrail-m
... tags:
... - pytorch
... - diffusers
... - stable-diffusion
... - text-to-image
... - diffusion-models-class
... - dreambooth-hackathon
... - {theme}
... widget:
... - text: {prompt}
... ---

... # DreamBooth model for the {name_of_your_concept} concept trained by {api.whoami()["name"]} on the {dataset_id} dataset.

... This is a Stable Diffusion model fine-tuned on the {name_of_your_concept} concept with DreamBooth. It can be used by modifying the `instance_prompt`: **{instance_prompt}**

... This model was created as part of the DreamBooth Hackathon 🔥. Visit the [organisation page](https://huggingface.co/dreambooth-hackathon) for instructions on how to take part!

... ## Description

... {description}

... ## Usage

... ```python
... from diffusers import StableDiffusionPipeline

... pipeline = StableDiffusionPipeline.from_pretrained('{hub_model_id}')
... image = pipeline('{instance_prompt}').images[0]
... image
... ```
... """

>>> card = ModelCard(content)
>>> hub_url = card.push_to_hub(hub_model_id)
>>> print(f"Upload successful! Model can be found here: {hub_url}")
>>> print(
...     f"View your submission on the public leaderboard here: https://huggingface.co/spaces/dreambooth-hackathon/leaderboard"
... )

Upload successful! Model can be found here: https://huggingface.co/lewtun/test-dogs/blob/main/README.md
View your submission on the public leaderboard here: https://huggingface.co/spaces/dreambooth-hackathon/leaderboard

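Step 9: Celebrate 🥳

Congratulations, you've trained your first DreamBooth model! You can train as many models as you want for the competition. The important thing is that the most liked models will win prizes, so don't forget to share your creation far and wide to get the most votes!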