Evaluating Diffusion Models

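Evaluating generative models such as Stable Diffusion is inherently subjective. As practitioners and researchers, however, we often have to choose carefully among many alternatives. So when working with different generative models (GANs, diffusion models, and so on), how do we pick one over another?

Qualitative evaluation of such models can be error-prone and may bias a decision. Quantitative metrics, on the other hand, do not necessarily correspond to image quality. In practice, combining qualitative and quantitative evaluation gives a stronger signal when choosing between models.

This document provides a non-exhaustive overview of qualitative and quantitative methods for evaluating diffusion models. For the quantitative methods, we focus on how to implement them with diffusers.

The methods shown here can also be used to evaluate different noise schedulers while keeping the underlying generative model fixed.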

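Scenarios

We cover diffusion models with the following pipelines:

Text-guided image generation (for example, StableDiffusionPipeline).
Text-guided image generation that is additionally conditioned on an input image (for example, StableDiffusionImg2ImgPipeline and StableDiffusionInstructPix2PixPipeline).
Class-conditioned image generation models (for example, DiTPipeline).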

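Qualitative Evaluation

Qualitative evaluation typically involves human assessment of generated images. Quality is measured across aspects such as compositionality, image-text alignment, and spatial relations. Using a common set of prompts brings a degree of uniformity to these subjective metrics. DrawBench and PartiPrompts are prompt datasets used for qualitative benchmarking; they were introduced by Imagen and Parti, respectively.

From the official Parti website:

PartiPrompts (P2) is a rich set of over 1600 prompts in English that we release as part of this work. P2 can be used to measure model capabilities across various categories and challenge aspects.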

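[Figure: parti-prompts]

PartiPrompts has the following columns:

Prompt
Category of the prompt (for example, "Abstract", "World Knowledge")
Challenge reflecting the difficulty (for example, "Basic", "Complex", "Writing & Symbols")

These benchmarks allow side-by-side human evaluation of different image generation models.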

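To that end, the 🧨 Diffusers team built Open Parti Prompts, a community-driven qualitative benchmark based on Parti Prompts for comparing state-of-the-art open-source diffusion models:

Open Parti Prompts Game: for 10 Parti prompts, 4 generated images are shown and the user selects the image that best fits the prompt.
Open Parti Prompts Leaderboard: a leaderboard comparing the currently best open-source diffusion models.

To compare images manually, let's see how to use diffusers on a few PartiPrompts.

Below are some prompts sampled across different challenges: Basic, Complex, Linguistic Structures, Imagination, and Writing & Symbols. Here we use PartiPrompts as the dataset.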

from datasets import load_dataset

# prompts = load_dataset("nateraw/parti-prompts", split="train")
# prompts = prompts.shuffle()
# sample_prompts = [prompts[i]["Prompt"] for i in range(5)]

# Fixing these sample prompts in the interest of reproducibility.
sample_prompts = [
    "a corgi",
    "a hot air balloon with a yin-yang symbol, with the moon visible in the daytime sky",
    "a car with no windows",
    "a cube made of porcupine",
    'The saying "BE EXCELLENT TO EACH OTHER" written on a red brick wall with a graffiti image of a green alien wearing a tuxedo. A yellow fire hydrant is on a sidewalk in the foreground.',
]
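The commented-out lines above draw prompts at random from the whole dataset. To cover every challenge instead, one option is to group the rows by challenge and pick one prompt from each group. The sketch below is only an illustration under stated assumptions: it works on plain dictionaries rather than a loaded dataset, the helper name `sample_one_prompt_per_challenge` is hypothetical, the `Challenge` column name is assumed from the column list given earlier, and the example rows (including their challenge labels) are made up.

```python
import random
from collections import defaultdict


def sample_one_prompt_per_challenge(rows, seed=0):
    """Return {challenge: prompt}, drawing one prompt at random per challenge."""
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    prompts_by_challenge = defaultdict(list)
    for row in rows:
        prompts_by_challenge[row["Challenge"]].append(row["Prompt"])
    # Visit the challenges in sorted order so the draws happen in a stable sequence.
    return {
        challenge: rng.choice(prompts_by_challenge[challenge])
        for challenge in sorted(prompts_by_challenge)
    }


# Made-up rows for illustration; real rows would come from the dataset.
example_rows = [
    {"Prompt": "a corgi", "Challenge": "Basic"},
    {"Prompt": "a red apple", "Challenge": "Basic"},
    {"Prompt": "a cube made of porcupine", "Challenge": "Imagination"},
]
picked = sample_one_prompt_per_challenge(example_rows)
print(picked)
```

Fixing the seed keeps the selection reproducible, in the same spirit as the hard-coded sample_prompts list.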

Now we can use these prompts to generate some images with Stable Diffusion (v1-4 checkpoint):

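import torch
from diffusers import StableDiffusionPipeline

# Load the v1-4 checkpoint; the same checkpoint is loaded again in the quantitative section.
sd_pipeline = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

seed = 0
generator = torch.manual_seed(seed)

images = sd_pipeline(sample_prompts, num_images_per_prompt=1, generator=generator).images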

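[Figure: parti-prompts-14]

We can also set num_images_per_prompt accordingly to compare different images for the same prompt. Running the same pipeline with a different checkpoint (v1-5) yields:

[Figure: parti-prompts-15]

Once several images have been generated from all the prompts with each model under evaluation, the results are presented to human evaluators for scoring. For more details on the DrawBench and PartiPrompts benchmarks, refer to their respective papers.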
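Once the ratings are in, they still need to be aggregated. A minimal sketch of one possible aggregation follows, assuming each rating is recorded as a (prompt, preferred model) pair. The helper name `preference_shares`, the model labels, and the votes are hypothetical placeholders, not collected ratings.

```python
from collections import Counter


def preference_shares(votes):
    """Return the share of votes each model received, most preferred first."""
    counts = Counter(model for _prompt, model in votes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {model: count / total for model, count in counts.most_common()}


# Placeholder votes: (prompt, model whose image the rater preferred).
example_votes = [
    ("a corgi", "model_a"),
    ("a corgi", "model_b"),
    ("a car with no windows", "model_b"),
    ("a cube made of porcupine", "model_b"),
]
shares = preference_shares(example_votes)
print(shares)
```

Shares computed this way only say which model was preferred more often; they say nothing about why, so keeping the per-prompt votes around is useful for later inspection.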

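It is useful to look at some inference samples while a model is training to gauge training progress. Our training scripts support this, with additional support for logging to TensorBoard and Weights & Biases.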

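Quantitative Evaluation

In this section, we walk through how to evaluate three different diffusion pipelines using:

CLIP score
CLIP directional similarity
FID

Text-guided image generation

CLIP score measures the compatibility of image-caption pairs. A higher CLIP score implies higher compatibility 🔼. It is a quantitative measure of the qualitative notion of "compatibility". Image-caption compatibility can also be thought of as the semantic similarity between the image and the caption. CLIP score has been found to correlate highly with human judgment.

Let's first load a StableDiffusionPipeline: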

from diffusers import StableDiffusionPipeline
import torch

model_ckpt = "CompVis/stable-diffusion-v1-4"
sd_pipeline = StableDiffusionPipeline.from_pretrained(model_ckpt, torch_dtype=torch.float16).to("cuda")

Generate some images with multiple prompts:

prompts = [
    "a photo of an astronaut riding a horse on mars",
    "A high tech solarpunk utopia in the Amazon rainforest",
    "A pikachu fine dining with a view to the Eiffel Tower",
    "A mecha robot in a favela in expressionist style",
    "an insect robot preparing a delicious meal",
    "A small cabin on top of a snowy mountain in the style of Disney, artstation",
]

images = sd_pipeline(prompts, num_images_per_prompt=1, output_type="np").images

print(images.shape)
# (6, 512, 512, 3)

Then, we calculate the CLIP score.

from torchmetrics.functional.multimodal import clip_score
from functools import partial

clip_score_fn = partial(clip_score, model_name_or_path="openai/clip-vit-base-patch16")

def calculate_clip_score(images, prompts):
    # The pipeline returns floats in [0, 1] in NHWC layout; convert to uint8 in NCHW layout for the metric.
    images_int = (images * 255).astype("uint8")
    score = clip_score_fn(torch.from_numpy(images_int).permute(0, 3, 1, 2), prompts).detach()
    return round(float(score), 4)

sd_clip_score = calculate_clip_score(images, prompts)
print(f"CLIP score: {sd_clip_score}")
# CLIP score: 35.7038

In the example above, we generated one image per prompt. If we generated multiple images per prompt, we would have to average the scores of the images generated for each prompt.
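The averaging step can be sketched as follows. This is a minimal illustration, assuming a score has already been computed for every image separately and that the images are ordered prompt by prompt (all images of the first prompt, then all images of the second, and so on). The helper name `average_scores_per_prompt` is hypothetical and the scores are placeholder numbers, not measurements.

```python
from statistics import mean


def average_scores_per_prompt(per_image_scores, num_images_per_prompt):
    """Average a flat list of per-image scores within each prompt, then overall."""
    if num_images_per_prompt <= 0:
        raise ValueError("num_images_per_prompt must be positive")
    if len(per_image_scores) % num_images_per_prompt != 0:
        raise ValueError("number of scores is not a multiple of num_images_per_prompt")
    # Consecutive chunks of size num_images_per_prompt belong to the same prompt.
    per_prompt = [
        mean(per_image_scores[start:start + num_images_per_prompt])
        for start in range(0, len(per_image_scores), num_images_per_prompt)
    ]
    return per_prompt, mean(per_prompt)


# Placeholder scores for 2 prompts with 2 images each.
per_prompt, overall = average_scores_per_prompt([30.0, 32.0, 20.0, 24.0], 2)
print(per_prompt, overall)
```

Reporting the per-prompt averages next to the overall value makes it easier to spot prompts on which a model does consistently poorly.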

Now, if we want to compare two checkpoints compatible with StableDiffusionPipeline, we should pass a generator when calling the pipeline. First, we generate images with a fixed seed using the v1-4 Stable Diffusion checkpoint:

seed = 0
generator = torch.manual_seed(seed)

images = sd_pipeline(prompts, num_images_per_prompt=1, generator=generator, output_type="np").images

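Then we load the v1-5 checkpoint to generate images. Note that the generator below continues from the state left by the previous call; call torch.manual_seed(seed) again first if both checkpoints should start from the same random state: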

model_ckpt_1_5 = "stable-diffusion-v1-5/stable-diffusion-v1-5"
sd_pipeline_1_5 = StableDiffusionPipeline.from_pretrained(model_ckpt_1_5, torch_dtype=torch.float16).to("cuda")

images_1_5 = sd_pipeline_1_5(prompts, num_images_per_prompt=1, generator=generator, output_type="np").images

Finally, we compare their CLIP scores:

sd_clip_score_1_4 = calculate_clip_score(images, prompts)
print(f"CLIP Score with v-1-4: {sd_clip_score_1_4}")
# CLIP Score with v-1-4: 34.9102

sd_clip_score_1_5 = calculate_clip_score(images_1_5, prompts)
print(f"CLIP Score with v-1-5: {sd_clip_score_1_5}")
# CLIP Score with v-1-5: 36.2137

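The v1-5 checkpoint appears to perform better than its predecessor. Note, however, that the number of prompts we used to compute the CLIP scores is quite low. For a more practical evaluation, this number should be much higher and the prompts should be diverse.

By construction, this score has some limitations. The captions in the training dataset were crawled from the web and extracted from alt and similar tags associated with images on the internet. They are not necessarily representative of what a human being would use to describe an image. Hence we had to "engineer" some prompts here.

Image-conditioned text-to-image generation

In this case, we condition the generation pipeline on an input image as well as a text prompt. Let's take StableDiffusionInstructPix2PixPipeline as an example. It takes an edit instruction as the input prompt, together with the input image to be edited.

Here is one example: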

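[Figure: edit-instruction]

One strategy for evaluating such a model is to measure how consistent the change between the two images (in CLIP space) is with the change between the two image captions (as shown in CLIP-Guided Domain Adaptation of Image Generators). This is referred to as "CLIP directional similarity".

Caption 1 corresponds to the input image (image 1) that is to be edited.
Caption 2 corresponds to the edited image (image 2). It should reflect the edit instruction.

Here is a pictorial overview:

[Figure: edit-consistency]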
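Put differently, the metric is the cosine similarity between two difference vectors: the edited-image embedding minus the input-image embedding, and the caption 2 embedding minus the caption 1 embedding. The toy sketch below illustrates the arithmetic only. It assumes hand-written 2-D vectors in place of real CLIP embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def directional_similarity(img_one, img_two, text_one, text_two):
    """Compare the image edit direction with the caption edit direction."""
    image_direction = [b - a for a, b in zip(img_one, img_two)]
    text_direction = [b - a for a, b in zip(text_one, text_two)]
    return cosine_similarity(image_direction, text_direction)


# Toy vectors: the image and the caption both move along the first axis.
aligned = directional_similarity([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [2.0, 0.0])
# Toy vectors: the image moves along the first axis, the caption along the second.
unrelated = directional_similarity([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 3.0])
print(aligned, unrelated)
```

A value near 1 means the image changed in the direction the captions describe, a value near 0 means the two changes are unrelated, and a negative value means they point in opposite directions.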

We have prepared a mini dataset to implement this metric. Let's first load the dataset.

from datasets import load_dataset

dataset = load_dataset("sayakpaul/instructpix2pix-demo", split="train")
dataset.features

{'input': Value(dtype='string', id=None),
 'edit': Value(dtype='string', id=None),
 'output': Value(dtype='string', id=None),
 'image': Image(decode=True, id=None)}

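Here we have:

input is a caption that corresponds to the image.
edit denotes the edit instruction.
output denotes the modified caption reflecting the edit instruction.

Let's take a look at a sample.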

idx = 0
print(f"Original caption: {dataset[idx]['input']}")
print(f"Edit instruction: {dataset[idx]['edit']}")
print(f"Modified caption: {dataset[idx]['output']}")

Original caption: 2. FAROE ISLANDS: An archipelago of 18 mountainous isles in the North Atlantic Ocean between Norway and Iceland, the Faroe Islands has 'everything you could hope for', according to Big 7 Travel. It boasts 'crystal clear waterfalls, rocky cliffs that seem to jut out of nowhere and velvety green hills'
Edit instruction: make the isles all white marble
Modified caption: 2. WHITE MARBLE ISLANDS: An archipelago of 18 mountainous white marble isles in the North Atlantic Ocean between Norway and Iceland, the White Marble Islands has 'everything you could hope for', according to Big 7 Travel. It boasts 'crystal clear waterfalls, rocky cliffs that seem to jut out of nowhere and velvety green hills'

And here is the image:

dataset[idx]["image"]

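[Figure: edit-dataset]

We will first edit the images of our dataset with the edit instruction and then compute the directional similarity.

Let's first load the StableDiffusionInstructPix2PixPipeline: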

from diffusers import StableDiffusionInstructPix2PixPipeline

instruct_pix2pix_pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

Now, we perform the edits:

import numpy as np


def edit_image(input_image, instruction):
    image = instruct_pix2pix_pipeline(
        instruction,
        image=input_image,
        output_type="np",
        generator=generator,
    ).images[0]
    return image

input_images = []
original_captions = []
modified_captions = []
edited_images = []

for idx in range(len(dataset)):
    input_image = dataset[idx]["image"]
    edit_instruction = dataset[idx]["edit"]
    edited_image = edit_image(input_image, edit_instruction)

    input_images.append(np.array(input_image))
    original_captions.append(dataset[idx]["input"])
    modified_captions.append(dataset[idx]["output"])
    edited_images.append(edited_image)

To measure the directional similarity, we first load CLIP's image and text encoders:

from transformers import (
    CLIPTokenizer,
    CLIPTextModelWithProjection,
    CLIPVisionModelWithProjection,
    CLIPImageProcessor,
)

clip_id = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(clip_id)
text_encoder = CLIPTextModelWithProjection.from_pretrained(clip_id).to("cuda")
image_processor = CLIPImageProcessor.from_pretrained(clip_id)
image_encoder = CLIPVisionModelWithProjection.from_pretrained(clip_id).to("cuda")

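Notice that we are using a particular CLIP checkpoint, openai/clip-vit-large-patch14. This is because Stable Diffusion pre-training was performed with this CLIP variant. For more details, refer to the documentation.

Next, we prepare a PyTorch nn.Module to compute directional similarity: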

import torch.nn as nn
import torch.nn.functional as F


class DirectionalSimilarity(nn.Module):
    def __init__(self, tokenizer, text_encoder, image_processor, image_encoder):
        super().__init__()
        self.tokenizer = tokenizer
        self.text_encoder = text_encoder
        self.image_processor = image_processor
        self.image_encoder = image_encoder

    def preprocess_image(self, image):
        image = self.image_processor(image, return_tensors="pt")["pixel_values"]
        return {"pixel_values": image.to("cuda")}

    def tokenize_text(self, text):
        inputs = self.tokenizer(
            text,
            max_length=self.tokenizer.model_max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return {"input_ids": inputs.input_ids.to("cuda")}

    def encode_image(self, image):
        preprocessed_image = self.preprocess_image(image)
        image_features = self.image_encoder(**preprocessed_image).image_embeds
        image_features = image_features / image_features.norm(dim=1, keepdim=True)
        return image_features

    def encode_text(self, text):
        tokenized_text = self.tokenize_text(text)
        text_features = self.text_encoder(**tokenized_text).text_embeds
        text_features = text_features / text_features.norm(dim=1, keepdim=True)
        return text_features

    def compute_directional_similarity(self, img_feat_one, img_feat_two, text_feat_one, text_feat_two):
        sim_direction = F.cosine_similarity(img_feat_two - img_feat_one, text_feat_two - text_feat_one)
        return sim_direction

    def forward(self, image_one, image_two, caption_one, caption_two):
        img_feat_one = self.encode_image(image_one)
        img_feat_two = self.encode_image(image_two)
        text_feat_one = self.encode_text(caption_one)
        text_feat_two = self.encode_text(caption_two)
        directional_similarity = self.compute_directional_similarity(
            img_feat_one, img_feat_two, text_feat_one, text_feat_two
        )
        return directional_similarity

Let's put DirectionalSimilarity to use now.

dir_similarity = DirectionalSimilarity(tokenizer, text_encoder, image_processor, image_encoder)
scores = []

for i in range(len(input_images)):
    original_image = input_images[i]
    original_caption = original_captions[i]
    edited_image = edited_images[i]
    modified_caption = modified_captions[i]

    similarity_score = dir_similarity(original_image, edited_image, original_caption, modified_caption)
    scores.append(float(similarity_score.detach().cpu()))

print(f"CLIP directional similarity: {np.mean(scores)}")
# CLIP directional similarity: 0.0797976553440094

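As with the CLIP score, the higher the CLIP directional similarity, the better.

Note that StableDiffusionInstructPix2PixPipeline exposes two arguments, image_guidance_scale and guidance_scale, that let you control the quality of the final edited image. We encourage you to experiment with both and see how they affect the directional similarity.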
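One way to organize such an experiment is a small grid sweep over both arguments. The sketch below is a hedged outline, not the pipeline itself: `sweep_guidance_scales` is a hypothetical helper, and `evaluate_fn` is assumed to be a function you supply that edits the dataset with the given guidance_scale and image_guidance_scale and returns the mean directional similarity. The stand-in evaluator and its numbers exist only so the sketch runs without a GPU; they are not measurements.

```python
from itertools import product


def sweep_guidance_scales(evaluate_fn, guidance_scales, image_guidance_scales):
    """Score every (guidance_scale, image_guidance_scale) pair, best first."""
    results = []
    for guidance_scale, image_guidance_scale in product(guidance_scales, image_guidance_scales):
        score = evaluate_fn(
            guidance_scale=guidance_scale,
            image_guidance_scale=image_guidance_scale,
        )
        results.append(
            {
                "guidance_scale": guidance_scale,
                "image_guidance_scale": image_guidance_scale,
                "score": score,
            }
        )
    # Higher directional similarity is better, so sort in descending order.
    return sorted(results, key=lambda result: result["score"], reverse=True)


def stand_in_evaluate(guidance_scale, image_guidance_scale):
    """Placeholder for a real evaluation run; returns an arbitrary score."""
    return -abs(guidance_scale - 7.5) - abs(image_guidance_scale - 1.5)


ranking = sweep_guidance_scales(stand_in_evaluate, [5.0, 7.5, 10.0], [1.0, 1.5, 2.0])
print(ranking[0])
```

In a real run, evaluate_fn would pass both values through to the pipeline call and reuse the same seed for every setting, so that differences in the score come from the guidance values rather than from the noise.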

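We can extend the idea of this metric to measure how similar the original image and the edited version are. To do that, we can simply compute F.cosine_similarity(img_feat_two, img_feat_one). For these kinds of edits, we still want the primary semantics of the images to be preserved as much as possible, that is, a high similarity score.

We can use these metrics for similar pipelines such as StableDiffusionPix2PixZeroPipeline.

Both CLIP score and CLIP directional similarity rely on the CLIP model, which can bias the evaluations.

Extending metrics like IS, FID (discussed later), or KID can be difficult when the model under evaluation was pre-trained on a large image-captioning dataset (such as the LAION-5B dataset). This is because these metrics rely on an InceptionNet (pre-trained on the ImageNet-1k dataset) to extract intermediate image features. The pre-training dataset of Stable Diffusion may have limited overlap with that of InceptionNet, so InceptionNet is not a good candidate for feature extraction here.

These metrics are more helpful for evaluating class-conditioned models such as DiT, which was pre-trained conditioned on the ImageNet-1k classes.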

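Class-conditioned image generation

Class-conditioned generative models are usually pre-trained on a class-labeled dataset such as ImageNet-1k. Popular metrics for evaluating these models include Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Inception Score (IS). In this document, we focus on FID (Heusel et al.). We show how to compute it with the DiTPipeline, which uses the DiT model under the hood.

FID aims to measure how similar two datasets of images are. As per this resource:

Fréchet Inception Distance (FID) is a measure of similarity between two datasets of images. It was shown to correlate well with the human judgment of visual quality and is most often used to evaluate the quality of samples of Generative Adversarial Networks. FID is calculated by computing the Fréchet distance between two Gaussians fitted to feature representations of the Inception network.

These two datasets are essentially the dataset of real images and the dataset of fake images (generated images in our case). FID is usually calculated with two large datasets. For this document, however, we will work with two mini datasets.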
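For reference, the Fréchet distance between two Gaussians has a closed form. The notation below is our own shorthand rather than the source's: the subscripts r and g denote the real and generated sets, and the mean and covariance are those of the Inception features of each set.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The first term compares the feature means and the second compares the feature covariances. Because both are estimated from samples, very small image sets give noisy estimates, which is one reason FID is normally computed on large datasets.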

Let's first download a few images from the ImageNet-1k training set:

from zipfile import ZipFile
import requests


def download(url, local_filepath):
    r = requests.get(url)
    r.raise_for_status()  # fail early instead of writing an error page to disk
    with open(local_filepath, "wb") as f:
        f.write(r.content)
    return local_filepath

dummy_dataset_url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/sample-imagenet-images.zip"
local_filepath = download(dummy_dataset_url, dummy_dataset_url.split("/")[-1])

with ZipFile(local_filepath, "r") as zipper:
    zipper.extractall(".")

from PIL import Image
import os
import numpy as np

dataset_path = "sample-imagenet-images"
image_paths = sorted([os.path.join(dataset_path, x) for x in os.listdir(dataset_path)])

real_images = [np.array(Image.open(path).convert("RGB")) for path in image_paths]

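These are 10 images from the following ImageNet-1k classes: "cassette player", "chainsaw" (x2), "church", "gas pump" (x3), "parachute" (x2), and "tench".

[Figure: real images]

Now that the images are loaded, let's apply some lightweight pre-processing so they can be used for FID calculation.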

# Imported as TF so it does not shadow torch.nn.functional, imported as F earlier.
from torchvision.transforms import functional as TF
import torch


def preprocess_image(image):
    image = torch.tensor(image).unsqueeze(0)
    image = image.permute(0, 3, 1, 2) / 255.0
    return TF.center_crop(image, (256, 256))

real_images = torch.cat([preprocess_image(image) for image in real_images])
print(real_images.shape)
# torch.Size([10, 3, 256, 256])

We now load the DiTPipeline to generate images conditioned on the classes mentioned above.

from diffusers import DiTPipeline, DPMSolverMultistepScheduler

dit_pipeline = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16)
dit_pipeline.scheduler = DPMSolverMultistepScheduler.from_config(dit_pipeline.scheduler.config)
dit_pipeline = dit_pipeline.to("cuda")

seed = 0
generator = torch.manual_seed(seed)


words = [
    "cassette player",
    "chainsaw",
    "chainsaw",
    "church",
    "gas pump",
    "gas pump",
    "gas pump",
    "parachute",
    "parachute",
    "tench",
]

class_ids = dit_pipeline.get_label_ids(words)
output = dit_pipeline(class_labels=class_ids, generator=generator, output_type="np")

fake_images = output.images
fake_images = torch.tensor(fake_images)
fake_images = fake_images.permute(0, 3, 1, 2)
print(fake_images.shape)
# torch.Size([10, 3, 256, 256])

Now, we can compute the FID using torchmetrics.

from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(normalize=True)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)

print(f"FID: {float(fid.compute())}")
# FID: 177.7147216796875

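The lower the FID, the better. Several things can influence FID here:

Number of images (both real and fake)
Randomness induced in the diffusion process
Number of inference steps in the diffusion process
The scheduler used in the diffusion process

For the last two points, it is therefore good practice to run the evaluation across different seeds and inference steps, and then report an average result.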
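Reporting such an average could look like the sketch below. It assumes the metric has already been computed once per (seed, number of inference steps) combination and stored in a dictionary. The helper name `summarize_runs` is hypothetical and the values are placeholders, not FID measurements.

```python
from statistics import mean, stdev


def summarize_runs(metric_by_run):
    """Summarize one metric over several runs keyed by (seed, num_inference_steps)."""
    values = list(metric_by_run.values())
    if not values:
        raise ValueError("at least one run is required")
    # The sample standard deviation needs at least two values.
    spread = stdev(values) if len(values) > 1 else 0.0
    return {"mean": mean(values), "std": spread, "num_runs": len(values)}


# Placeholder values keyed by (seed, num_inference_steps).
example_runs = {(0, 25): 10.0, (1, 25): 14.0, (2, 25): 12.0}
summary = summarize_runs(example_runs)
print(summary)
```

Reporting the spread alongside the mean shows whether a difference between two models is larger than the run-to-run variation.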

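FID results tend to be fragile as they depend on many factors:

The specific Inception model used during computation.
The implementation accuracy of the computation.
The image format (results differ depending on whether we start from PNGs or JPGs).

Keeping that in mind, FID is often most useful when comparing similar runs, but it is hard to reproduce paper results unless the authors carefully disclose the FID measurement code.

These points apply to other related metrics too, such as KID and IS.

As a final step, let's visually inspect the fake_images.

[Figure: fake images]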
