
Image Interpolation Using Stable Diffusion

Authored by: Rustam Akimov

This notebook shows how to use Stable Diffusion to interpolate between images. Image interpolation with Stable Diffusion is the process of using a diffusion-based generative model to create intermediate images that transition smoothly from one given image to another.

Here are some different use cases for image interpolation with Stable Diffusion:

  • Data augmentation: Stable Diffusion can augment training data for machine learning models by generating synthetic images that lie between existing data points. This can improve the generalization and robustness of machine learning models, especially in tasks such as image generation, classification, or object detection.
  • Product design and prototyping: Stable Diffusion can assist product design by generating variations of product designs or prototypes with subtle differences. This can be useful for exploring design alternatives, conducting user studies, or visualizing design iterations before committing to physical prototypes.
  • Content generation for media production: in media production such as film and video editing, Stable Diffusion can be used to generate intermediate frames between key frames, enabling smoother transitions and enhancing visual storytelling. This can save time and resources compared to manual frame-by-frame editing.

In the context of image interpolation, a Stable Diffusion model is typically used to navigate a high-dimensional latent space, where each dimension represents a specific feature learned by the model. By walking through this latent space and interpolating between the latent representations of different images, the model is able to generate a sequence of intermediate images that show a smooth transition between the originals. There are two types of latents in Stable Diffusion: prompt latents (text embeddings) and image latents.
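
To make these two latent types concrete, here is a minimal, purely illustrative sketch of their typical shapes for SD1.5 (the exact sizes depend on the model and the output resolution):

import torch

# Illustrative only: typical shapes of the two latent types in SD1.5.
# Prompt latents: one 768-dim CLIP embedding per token (77 tokens after padding).
prompt_latents = torch.randn(1, 77, 768)
# Image latents: a 4-channel tensor at 1/8 of the output resolution (512 / 8 = 64).
image_latents = torch.randn(1, 4, 64, 64)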

Latent space walking involves moving through the latent space along a path defined by two or more points (each representing an image). By carefully selecting these points and the path between them, it is possible to control the characteristics of the generated images, such as style, content, and other visual aspects.

In this notebook, we will explore examples of image interpolation using Stable Diffusion and demonstrate how latent space walking can be implemented and utilized to create smooth transitions between images. We'll provide code snippets and visualizations that illustrate this process in action, offering a deeper understanding of how generative models can manipulate and morph image representations in meaningful ways.

First, let's install all the required modules.

!pip install -q diffusers transformers xformers accelerate
!pip install -q numpy scipy ftfy Pillow

Import modules

import torch
import numpy as np
import os

import time

from PIL import Image
from IPython import display as IPdisplay
from tqdm.auto import tqdm

from diffusers import StableDiffusionPipeline
from diffusers import (
    DDIMScheduler,
    PNDMScheduler,
    LMSDiscreteScheduler,
    DPMSolverMultistepScheduler,
    EulerAncestralDiscreteScheduler,
    EulerDiscreteScheduler,
)
from transformers import logging

logging.set_verbosity_error()

Let's check if CUDA is available.

print(torch.cuda.is_available())

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

These settings are used to optimize the performance of PyTorch models on CUDA-enabled GPUs, especially when using mixed precision training or inference, which can benefit both speed and memory usage.
Source: https://huggingface.co/docs/diffusers/optimization/fp16#memory-efficient-attention

torch.backends.cudnn.benchmark = True
torch.backends.cuda.matmul.allow_tf32 = True

Model

The runwayml/stable-diffusion-v1-5 model and the LMSDiscreteScheduler scheduler were chosen to generate images. Despite being an older technology, this model remains popular due to its fast performance, minimal memory requirements, and the availability of many community fine-tuned models built on top of SD1.5. However, you are free to experiment with other models and schedulers to compare the results.

model_name_or_path = "runwayml/stable-diffusion-v1-5"

scheduler = LMSDiscreteScheduler(
    beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000
)


pipe = StableDiffusionPipeline.from_pretrained(
    model_name_or_path,
    scheduler=scheduler,
    torch_dtype=torch.float32,
).to(device)

# Disable image generation progress bar, we'll display our own
pipe.set_progress_bar_config(disable=True)
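
If you are low on VRAM, one common alternative (a sketch; it requires a CUDA GPU and replaces the float32 load above) is to load the weights in half precision, which roughly halves memory usage:

pipe = StableDiffusionPipeline.from_pretrained(
    model_name_or_path,
    scheduler=scheduler,
    torch_dtype=torch.float16,  # half precision instead of float32
).to(device)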

The methods below are designed to reduce the memory consumed by the GPU. If you have enough VRAM, you can skip this cell.

More detailed information can be found here: https://huggingface.co/docs/diffusers/en/optimization/opt_overview
In particular, information about the following methods can be found here: https://huggingface.co/docs/diffusers/optimization/memory

# Offloading the weights to the CPU and only loading them on the GPU can reduce memory consumption to less than 3GB.
pipe.enable_model_cpu_offload()

# Tighter ordering of memory tensors.
pipe.unet.to(memory_format=torch.channels_last)

# Decode large batches of images with limited VRAM, or batches of 32 images or more, by decoding the batches of latents one image at a time.
pipe.enable_vae_slicing()

# Splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image.
pipe.enable_vae_tiling()

# Using Flash Attention; If you have PyTorch >= 2.0 installed, you should not expect a speed-up for inference when enabling xformers.
pipe.enable_xformers_memory_efficient_attention()

The display_images function converts a list of image arrays into a GIF, saves it to the specified path, and returns the GIF object for display in the notebook. The GIF file is named using the current time, and any errors are handled by printing them out.

def display_images(images, save_path):
    # Generate a file name based on the current time, replacing colons with hyphens
    # to ensure the filename is valid for file systems that don't allow colons.
    filename = time.strftime("%H:%M:%S", time.localtime()).replace(":", "-")

    try:
        # Convert each image in the 'images' list from an array to an Image object.
        images = [Image.fromarray(np.array(image[0], dtype=np.uint8)) for image in images]

        # Save the first image in the list as a GIF file at the 'save_path' location.
        # The rest of the images in the list are added as subsequent frames to the GIF.
        # The GIF will play each frame for 100 milliseconds and will loop indefinitely.
        images[0].save(
            f"{save_path}/{filename}.gif",
            save_all=True,
            append_images=images[1:],
            duration=100,
            loop=0,
        )
    except Exception as e:
        # If there is an error during the process, print the exception message
        # and return None instead of referencing a GIF that was never saved.
        print(e)
        return None

    # Return the saved GIF as an IPython display object so it can be displayed in a notebook.
    return IPdisplay.Image(f"{save_path}/{filename}.gif")

Generation parameters

  • seed: This variable is used to set a specific random seed for reproducibility.
  • generator: This is set to a PyTorch random number generator object if a seed is provided; otherwise it is None. It ensures that operations using it have reproducible outcomes.
  • guidance_scale: This parameter controls how strongly the model should follow the prompt in text-to-image generation tasks; higher values lead to stronger adherence to the prompt.
  • num_inference_steps: This specifies the number of steps the model takes to generate an image. More steps can produce a higher-quality image but take longer to generate.
  • num_interpolation_steps: This determines the number of steps used when interpolating between two points in the latent space, which affects the smoothness of the transitions in the generated animation.
  • height: The height of the generated images in pixels.
  • width: The width of the generated images in pixels.
  • save_path: The file system path where the generated GIFs will be saved.

# The seed is set to "None", because we want different results each time we run the generation.
seed = None

if seed is not None:
    generator = torch.manual_seed(seed)
else:
    generator = None

# The guidance scale is set to its normal range (7 - 10).
guidance_scale = 8

# The number of inference steps was chosen empirically to generate an acceptable picture within an acceptable time.
num_inference_steps = 15

# The higher you set this value, the smoother the interpolations will be. However, the generation time will increase. This value was chosen empirically.
num_interpolation_steps = 30

# I would not recommend less than 512 on either dimension. This is because this model was trained on 512x512 image resolution.
height = 512
width = 512

# The path where the generated GIFs will be saved
save_path = "/output"

if not os.path.exists(save_path):
    os.makedirs(save_path)
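
If you would rather have reproducible results across runs, a common diffusers pattern (shown here as a sketch with a hypothetical seed of 42; assign the result to generator to use it) is to pin the random number generator to the target device:

# Hypothetical alternative: pin the RNG to the target device for
# reproducible results across runs (assign to `generator` to use it).
reproducible_generator = torch.Generator(device=device).manual_seed(42)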

Example 1: Prompt interpolation

In this example, interpolating between the positive and negative prompt embeddings allows exploration of the space between the two conceptual points defined by the prompts, potentially leading to a variety of images that gradually blend the characteristics dictated by them. In this case, the interpolation involves adding scaled increments to the original embeddings, creating a series of new embeddings that will later be used to generate images with smooth transitions between the different states based on the original prompt.

Example 1

First, we need to tokenize the positive and negative text prompts and obtain their embeddings. The positive prompt steers the image generation toward the desired characteristics, while the negative prompt steers it away from unwanted features.

# The text prompt that describes the desired output image.
prompt = "Epic shot of Sweden, ultra detailed lake with an ren dear, nostalgic vintage, ultra cozy and inviting, wonderful light atmosphere, fairy, little photorealistic, digital painting, sharp focus, ultra cozy and inviting, wish to be there. very detailed, arty, should rank high on youtube for a dream trip."
# A negative prompt that can be used to steer the generation away from certain features.
negative_prompt = "poorly drawn,cartoon, 2d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry"

# The step size for the interpolation in the latent space.
step_size = 0.001

# Tokenizing and encoding the prompt into embeddings.
prompt_tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
prompt_embeds = pipe.text_encoder(prompt_tokens.input_ids.to(device))[0]


# Tokenizing and encoding the negative prompt into embeddings.
if negative_prompt is None:
    negative_prompt = [""]

negative_prompt_tokens = pipe.tokenizer(
    negative_prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
negative_prompt_embeds = pipe.text_encoder(negative_prompt_tokens.input_ids.to(device))[0]

Now let's look at the part of the code that generates a random initial vector from a normal distribution, structured to match the dimensions expected by the diffusion model (UNet). This allows the results to be reproduced by optionally using a random number generator. After creating the initial vector, the code performs a series of small steps on the two embeddings (positive and negative prompts), gradually adding a scaled increment at each iteration. The results are stored in a list named "walked_embeddings".

# Generating initial latent vectors from a random normal distribution, with the option to use a generator for reproducibility.
latents = torch.randn(
    (1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)

walked_embeddings = []

# Interpolating between embeddings for the given number of interpolation steps.
for i in range(num_interpolation_steps):
    walked_embeddings.append([prompt_embeds + step_size * i, negative_prompt_embeds + step_size * i])

Finally, let's generate a series of images from the interpolated embeddings and then display them. We'll iterate over the array of embeddings, using each one to generate an image with the specified characteristics, such as height, width, and other parameters relevant to image generation, and collect these images into a list. Once generation is complete, we'll call the display_images function to save and display these images as a GIF at the given save path.

# Generating images using the interpolated embeddings.
images = []
for latent in tqdm(walked_embeddings):
    images.append(
        pipe(
            height=height,
            width=width,
            num_images_per_prompt=1,
            prompt_embeds=latent[0],
            negative_prompt_embeds=latent[1],
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latents,
        ).images
    )

# Display of saved generated images.
display_images(images, save_path)

Example 2: Interpolation in the diffusion latent space for a single prompt

Unlike the first example, in this one we perform an interpolation between two embeddings of the diffusion model itself, not of the prompts. Note that in this case we use the slerp function for the interpolation. However, there is nothing stopping us from adding a constant value to one embedding instead.

Example 2

The function presented below implements spherical linear interpolation (slerp), a method of interpolating on the surface of a sphere. It is commonly used in computer graphics to animate rotations in a smooth manner, and it can also be used to interpolate between high-dimensional data points in machine learning, such as the latent vectors used in generative models.

The function comes from Andrej Karpathy's gist: https://gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355.
A more detailed explanation of this method can be found at: https://en.wikipedia.org/wiki/Slerp.

def slerp(v0, v1, num, t0=0, t1=1):
    v0 = v0.detach().cpu().numpy()
    v1 = v1.detach().cpu().numpy()

    def interpolation(t, v0, v1, DOT_THRESHOLD=0.9995):
        """helper function to spherically interpolate two arrays v1 v2"""
        dot = np.sum(v0 * v1 / (np.linalg.norm(v0) * np.linalg.norm(v1)))
        if np.abs(dot) > DOT_THRESHOLD:
            v2 = (1 - t) * v0 + t * v1
        else:
            theta_0 = np.arccos(dot)
            sin_theta_0 = np.sin(theta_0)
            theta_t = theta_0 * t
            sin_theta_t = np.sin(theta_t)
            s0 = np.sin(theta_0 - theta_t) / sin_theta_0
            s1 = sin_theta_t / sin_theta_0
            v2 = s0 * v0 + s1 * v1
        return v2

    t = np.linspace(t0, t1, num)

    v3 = torch.tensor(np.array([interpolation(t[i], v0, v1) for i in range(num)]))

    return v3
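
As a quick sanity check of the helper above, we can interpolate between two small random vectors and confirm the shape of the output (illustrative values only):

# Illustrative check: interpolating between two 4-dim vectors in 5 steps
# should yield a (5, 4) tensor whose endpoints match the inputs.
a, b = torch.randn(4), torch.randn(4)
path = slerp(a, b, num=5)
print(path.shape)  # torch.Size([5, 4])
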
# The text prompt that describes the desired output image.
prompt = (
    "Sci-fi digital painting of an alien landscape with otherworldly plants, strange creatures, and distant planets."
)
# A negative prompt that can be used to steer the generation away from certain features.
negative_prompt = "poorly drawn,cartoon, 3d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry"

# Generating initial latent vectors from a random normal distribution. In this example two latent vectors are generated, which will serve as start and end points for the interpolation.
# These vectors are shaped to fit the input requirements of the diffusion model's U-Net architecture.
latents = torch.randn(
    (2, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)

# Getting our latent embeddings
interpolated_latents = slerp(latents[0], latents[1], num_interpolation_steps)

# Generating images using the interpolated embeddings.
images = []
for latent_vector in tqdm(interpolated_latents):
    images.append(
        pipe(
            prompt,
            height=height,
            width=width,
            negative_prompt=negative_prompt,
            num_images_per_prompt=1,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latent_vector[None, ...],
        ).images
    )

# Display of saved generated images.
display_images(images, save_path)

Example 3: Interpolation between multiple prompts

In contrast to the first example, where we walked away from a single prompt, in this example we will interpolate between any number of prompts. To do so, we will take consecutive pairs of prompts and create smooth transitions between them. We will then combine the interpolations of these consecutive pairs and instruct the model to generate images based on them. For the interpolation we will use the slerp function, as in the second example.

Example 3

Once again, let's tokenize multiple positive and negative text prompts and obtain their embeddings.

# Text prompts that describe the desired output image.
prompts = [
    "A cute dog in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain",
    "A cute cat in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain",
]
# Negative prompts that can be used to steer the generation away from certain features.
negative_prompts = [
    "poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry",
    "poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry",
]

# NOTE: The number of prompts must match the number of negative prompts

batch_size = len(prompts)

# Tokenizing and encoding prompts into embeddings.
prompts_tokens = pipe.tokenizer(
    prompts,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
prompts_embeds = pipe.text_encoder(prompts_tokens.input_ids.to(device))[0]

# Tokenizing and encoding negative prompts into embeddings.
if negative_prompts is None:
    negative_prompts = [""] * batch_size

negative_prompts_tokens = pipe.tokenizer(
    negative_prompts,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
negative_prompts_embeds = pipe.text_encoder(negative_prompts_tokens.input_ids.to(device))[0]

As mentioned earlier, we will take consecutive pairs of prompts and create smooth transitions between them with the slerp function.

# Generating initial U-Net latent vectors from a random normal distribution.
latents = torch.randn(
    (1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)

# Interpolating between embeddings pairs for the given number of interpolation steps.
interpolated_prompt_embeds = []
interpolated_negative_prompts_embeds = []
for i in range(batch_size - 1):
    interpolated_prompt_embeds.append(slerp(prompts_embeds[i], prompts_embeds[i + 1], num_interpolation_steps))
    interpolated_negative_prompts_embeds.append(
        slerp(
            negative_prompts_embeds[i],
            negative_prompts_embeds[i + 1],
            num_interpolation_steps,
        )
    )

interpolated_prompt_embeds = torch.cat(interpolated_prompt_embeds, dim=0).to(device)

interpolated_negative_prompts_embeds = torch.cat(interpolated_negative_prompts_embeds, dim=0).to(device)

Finally, we need to generate images based on the embeddings.

# Generating images using the interpolated embeddings.
images = []
for prompt_embeds, negative_prompt_embeds in tqdm(
    zip(interpolated_prompt_embeds, interpolated_negative_prompts_embeds),
    total=len(interpolated_prompt_embeds),
):
    images.append(
        pipe(
            height=height,
            width=width,
            num_images_per_prompt=1,
            prompt_embeds=prompt_embeds[None, ...],
            negative_prompt_embeds=negative_prompt_embeds[None, ...],
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latents,
        ).images
    )

# Display of saved generated images.
display_images(images, save_path)

Example 4: Circular walk through the diffusion latent space for a single prompt

This example was taken from: https://keras.org.cn/examples/generative/random_walks_with_stable_diffusion/

Let's imagine that we have two noise components, which we'll call x and y. We move t from 0 to 2π, and at each step we add cos(t)·x + sin(t)·y to the result. With this approach, at the end of the walk we arrive at the same noise values we started with, so the path closes into a loop and the resulting animation repeats seamlessly.

Example 4

# The text prompt that describes the desired output image.
prompt = "Beautiful sea sunset, warm light, Aivazovsky style"
# A negative prompt that can be used to steer the generation away from certain features
negative_prompt = "picture frames"

# Generating initial latent vectors from a random normal distribution to create a loop interpolation between them.
latents = torch.randn(
    (2, 1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)


# Calculation of looped embeddings
walk_noise_x = latents[0].to(device)
walk_noise_y = latents[1].to(device)

# Walking on a trigonometric circle
walk_scale_x = torch.cos(torch.linspace(0, 2, num_interpolation_steps) * np.pi).to(device)
walk_scale_y = torch.sin(torch.linspace(0, 2, num_interpolation_steps) * np.pi).to(device)

# Applying interpolation to noise
noise_x = torch.tensordot(walk_scale_x, walk_noise_x, dims=0)
noise_y = torch.tensordot(walk_scale_y, walk_noise_y, dims=0)

circular_latents = noise_x + noise_y

# Generating images using the interpolated embeddings.
images = []
for latent_vector in tqdm(circular_latents):
    images.append(
        pipe(
            prompt,
            height=height,
            width=width,
            negative_prompt=negative_prompt,
            num_images_per_prompt=1,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latent_vector,
        ).images
    )

# Display of saved generated images.
display_images(images, save_path)

Next Steps

Next, you can explore various parameters such as the guidance scale, seed, and number of interpolation steps to observe how they affect the generated images. Additionally, consider trying out different prompts and schedulers to further improve your results. Another valuable step would be to implement linear interpolation (linspace) instead of spherical linear interpolation (slerp) and compare the results to gain deeper insight into the interpolation process; a sketch is given below.
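
As a starting point for that comparison, here is a minimal linear-interpolation helper (a sketch built on torch.lerp; it mirrors the slerp signature above so it can be swapped in wherever slerp is called):

import torch

def lerp(v0, v1, num, t0=0, t1=1):
    # Linearly interpolate between v0 and v1 over `num` evenly spaced steps.
    t = torch.linspace(t0, t1, num)
    return torch.stack([torch.lerp(v0, v1, float(weight)) for weight in t])

Note that a straight line between two Gaussian noise vectors passes through points of lower norm than its endpoints, which is one reason slerp is usually preferred for interpolating image latents.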
