Image Interpolation with Stable Diffusion
This notebook shows how to use Stable Diffusion to interpolate between images. Image interpolation with Stable Diffusion is the process of creating intermediate images that smoothly transition from one given image to another, using a diffusion-based generative model.
Here are some use cases for image interpolation with Stable Diffusion:
- Data augmentation: Stable Diffusion can augment training data for machine learning models by generating synthetic images that lie between existing data points. This can improve the generalization and robustness of machine learning models, especially in tasks such as image generation, classification, or object detection.
- Product design and prototyping: Stable Diffusion can assist product design by generating variations of a design or prototype with subtle differences. This is useful for exploring design alternatives, conducting user studies, or visualizing design iterations before committing to physical prototypes.
- Content generation for media production: In media production, such as film and video editing, Stable Diffusion can generate intermediate frames between key frames, enabling smoother transitions and enhancing visual storytelling. This can save time and resources compared with manual frame-by-frame editing.
In the context of image interpolation, Stable Diffusion models are typically used to navigate a high-dimensional latent space. Each dimension represents a specific feature learned by the model. By walking through this latent space and interpolating between different latent representations of images, the model can generate a sequence of intermediate images that show a smooth transition between the originals. There are two kinds of latent representations in Stable Diffusion: prompt latents and image latents.
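To make these two kinds of latents concrete, here is a minimal sketch. It assumes the StableDiffusionPipeline that is loaded as pipe later in this notebook, along with the device variable defined below; the prompt text is purely illustrative.
# Prompt latents: tokenize the text and encode it with the pipeline's text encoder.
tokens = pipe.tokenizer(
    "a photo of a lake",
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
prompt_latents = pipe.text_encoder(tokens.input_ids.to(device))[0]  # (1, 77, 768) for SD 1.5
# Image latents: Gaussian noise at the U-Net's latent resolution (1/8 of the pixel resolution).
image_latents = torch.randn((1, pipe.unet.config.in_channels, 512 // 8, 512 // 8))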
Latent space walking involves moving through the latent space along a path defined by two or more points (each representing an image). By carefully choosing these points and the path between them, it is possible to control the characteristics of the generated images, such as style, content, and other visual aspects.
In this notebook, we will explore examples of image interpolation with Stable Diffusion and demonstrate how latent space walking can be implemented and used to create smooth transitions between images. We will provide code snippets and visualizations that illustrate this process in practice, giving a deeper understanding of how generative models can manipulate and morph image representations in meaningful ways.
First, let's install all the required modules.
!pip install -q diffusers transformers xformers accelerate
!pip install -q numpy scipy ftfy Pillow
Import the modules.
import torch
import numpy as np
import os
import time
from PIL import Image
from IPython import display as IPdisplay
from tqdm.auto import tqdm
from diffusers import StableDiffusionPipeline
from diffusers import (
DDIMScheduler,
PNDMScheduler,
LMSDiscreteScheduler,
DPMSolverMultistepScheduler,
EulerAncestralDiscreteScheduler,
EulerDiscreteScheduler,
)
from transformers import logging
logging.set_verbosity_error()
Let's check whether CUDA is available.
print(torch.cuda.is_available())
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
These settings are used to optimize the performance of PyTorch models on CUDA-enabled GPUs, especially when using mixed precision for training or inference, which can be beneficial in terms of speed and memory usage.
Source: https://huggingface.co/docs/diffusers/optimization/fp16#memory-efficient-attention
torch.backends.cudnn.benchmark = True
torch.backends.cuda.matmul.allow_tf32 = True
Model

The runwayml/stable-diffusion-v1-5 model and the LMSDiscreteScheduler scheduler are chosen to generate images. Despite being older technology, they remain popular because of their fast performance, minimal memory requirements, and the availability of many community fine-tuned models built on top of SD1.5. However, feel free to try other models and schedulers and compare the results.
model_name_or_path = "runwayml/stable-diffusion-v1-5"
scheduler = LMSDiscreteScheduler(
    beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000
)
pipe = StableDiffusionPipeline.from_pretrained(
    model_name_or_path,
    scheduler=scheduler,
    torch_dtype=torch.float32,
).to(device)
# Disable image generation progress bar, we'll display our own
pipe.set_progress_bar_config(disable=True)
These methods are designed to reduce the memory consumed by the GPU. If you have enough VRAM, you can skip this cell.
More details can be found here: https://huggingface.co/docs/diffusers/en/optimization/opt_overview
In particular, information about the methods below can be found here: https://huggingface.co/docs/diffusers/optimization/memory
# Offloading the weights to the CPU and only loading them on the GPU can reduce memory consumption to less than 3GB.
pipe.enable_model_cpu_offload()
# Tighter ordering of memory tensors.
pipe.unet.to(memory_format=torch.channels_last)
# Decoding large batches of images with limited VRAM or batches with 32 images or more by decoding the batches of latents one image at a time.
pipe.enable_vae_slicing()
# Splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image.
pipe.enable_vae_tiling()
# Using Flash Attention; If you have PyTorch >= 2.0 installed, you should not expect a speed-up for inference when enabling xformers.
pipe.enable_xformers_memory_efficient_attention()
The display_images function converts a list of image arrays into a GIF, saves it to the specified path, and returns the GIF object for display. It names the GIF file using the current time and handles any errors by printing them.
def display_images(images, save_path):
    try:
        # Convert each image in the 'images' list from an array to an Image object.
        images = [Image.fromarray(np.array(image[0], dtype=np.uint8)) for image in images]

        # Generate a file name based on the current time, replacing colons with hyphens
        # to ensure the filename is valid for file systems that don't allow colons.
        filename = time.strftime("%H:%M:%S", time.localtime()).replace(":", "-")
        # Save the first image in the list as a GIF file at the 'save_path' location.
        # The rest of the images in the list are added as subsequent frames to the GIF.
        # The GIF will play each frame for 100 milliseconds and will loop indefinitely.
        images[0].save(
            f"{save_path}/{filename}.gif",
            save_all=True,
            append_images=images[1:],
            duration=100,
            loop=0,
        )
    except Exception as e:
        # If there is an error during the process, print the exception message.
        print(e)

    # Return the saved GIF as an IPython display object so it can be displayed in a notebook.
    return IPdisplay.Image(f"{save_path}/{filename}.gif")
Generation parameters

- seed: This variable sets a specific random seed for reproducibility.
- generator: Set to a PyTorch random number generator object if a seed is provided, otherwise None. It ensures that operations using it have reproducible outcomes.
- guidance_scale: This parameter controls how closely the model follows the prompt in text-to-image generation; higher values lead to stronger adherence to the prompt.
- num_inference_steps: The number of steps the model takes to generate an image. More steps can produce a higher quality image but take longer to generate.
- num_interpolation_steps: The number of steps used when interpolating between two points in the latent space, which affects the smoothness of the transitions in the generated animation.
- height: The height of the generated images in pixels.
- width: The width of the generated images in pixels.
- save_path: The file system path where the generated GIFs will be saved.
# The seed is set to "None", because we want different results each time we run the generation.
seed = None
if seed is not None:
    generator = torch.manual_seed(seed)
else:
    generator = None
# The guidance scale is set to its normal range (7 - 10).
guidance_scale = 8
# The number of inference steps was chosen empirically to generate an acceptable picture within an acceptable time.
num_inference_steps = 15
# The higher you set this value, the smoother the interpolations will be. However, the generation time will increase. This value was chosen empirically.
num_interpolation_steps = 30
# I would not recommend less than 512 on either dimension. This is because this model was trained on 512x512 image resolution.
height = 512
width = 512
# The path where the generated GIFs will be saved
save_path = "/output"
if not os.path.exists(save_path):
    os.makedirs(save_path)
Example 1: Prompt interpolation

In this example, interpolating between the positive and negative prompt embeddings allows us to explore the space between the two conceptual points defined by the prompts, potentially yielding a variety of images that gradually blend the characteristics dictated by those prompts. In this case, the interpolation adds a scaled delta to the original embeddings, creating a series of new embeddings that will later be used to generate images with smooth transitions between the different states based on the original prompt.

First, we need to tokenize the positive and negative text prompts and obtain their embeddings. A positive prompt steers the image generation toward the desired features, while a negative prompt steers it away from unwanted features.
# The text prompt that describes the desired output image.
prompt = "Epic shot of Sweden, ultra detailed lake with an ren dear, nostalgic vintage, ultra cozy and inviting, wonderful light atmosphere, fairy, little photorealistic, digital painting, sharp focus, ultra cozy and inviting, wish to be there. very detailed, arty, should rank high on youtube for a dream trip."
# A negative prompt that can be used to steer the generation away from certain features; here, it is empty.
negative_prompt = "poorly drawn,cartoon, 2d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry"
# The step size for the interpolation in the latent space.
step_size = 0.001
# Tokenizing and encoding the prompt into embeddings.
prompt_tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
prompt_embeds = pipe.text_encoder(prompt_tokens.input_ids.to(device))[0]
# Tokenizing and encoding the negative prompt into embeddings.
if negative_prompt is None:
    negative_prompt = [""]

negative_prompt_tokens = pipe.tokenizer(
    negative_prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
negative_prompt_embeds = pipe.text_encoder(negative_prompt_tokens.input_ids.to(device))[0]
Now let's look at the part of the code that generates a random initial vector from a normal distribution, structured to match the dimensions expected by the diffusion model (the U-Net). A random number generator can optionally be passed in to make the results reproducible. After creating the initial vector, the code performs a series of interpolations between the two embeddings (the positive and negative prompts) by incrementally adding a small step size at each iteration. The results are stored in a list called "walked_embeddings".
# Generating initial latent vectors from a random normal distribution, with the option to use a generator for reproducibility.
latents = torch.randn(
    (1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)
walked_embeddings = []
# Interpolating between embeddings for the given number of interpolation steps.
for i in range(num_interpolation_steps):
    walked_embeddings.append([prompt_embeds + step_size * i, negative_prompt_embeds + step_size * i])
Finally, let's generate a series of images based on the interpolated embeddings and then display them. We will iterate over the array of embeddings, using each one to generate an image with the specified characteristics (such as height, width, and other parameters relevant to image generation), and collect the images in a list. Once generation is complete, we will call the display_images function to save and display these images as a GIF at the given save path.
# Generating images using the interpolated embeddings.
images = []
for latent in tqdm(walked_embeddings):
    images.append(
        pipe(
            height=height,
            width=width,
            num_images_per_prompt=1,
            prompt_embeds=latent[0],
            negative_prompt_embeds=latent[1],
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latents,
        ).images
    )
# Display of saved generated images.
display_images(images, save_path)
Example 2: Interpolation of diffusion latents for a single prompt

Unlike the first example, in this one we perform interpolation between two latent vectors of the diffusion model itself rather than between prompts. Note that in this case we use the slerp function for the interpolation. However, nothing prevents us from simply adding a constant value to a latent instead.

The function presented below implements spherical linear interpolation (slerp), a way of interpolating along the surface of a sphere. It is commonly used in computer graphics to animate rotations smoothly, and it can also be used to interpolate between high-dimensional data points in machine learning, such as the latent vectors used in generative models.

The source is Andrej Karpathy's gist: https://gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355.
A more detailed explanation of this method can be found at: https://en.wikipedia.org/wiki/Slerp.
def slerp(v0, v1, num, t0=0, t1=1):
    v0 = v0.detach().cpu().numpy()
    v1 = v1.detach().cpu().numpy()

    def interpolation(t, v0, v1, DOT_THRESHOLD=0.9995):
        """helper function to spherically interpolate two arrays v1 v2"""
        dot = np.sum(v0 * v1 / (np.linalg.norm(v0) * np.linalg.norm(v1)))
        if np.abs(dot) > DOT_THRESHOLD:
            v2 = (1 - t) * v0 + t * v1
        else:
            theta_0 = np.arccos(dot)
            sin_theta_0 = np.sin(theta_0)
            theta_t = theta_0 * t
            sin_theta_t = np.sin(theta_t)
            s0 = np.sin(theta_0 - theta_t) / sin_theta_0
            s1 = sin_theta_t / sin_theta_0
            v2 = s0 * v0 + s1 * v1
        return v2

    t = np.linspace(t0, t1, num)
    v3 = torch.tensor(np.array([interpolation(t[i], v0, v1) for i in range(num)]))
    return v3
# The text prompt that describes the desired output image.
prompt = (
    "Sci-fi digital painting of an alien landscape with otherworldly plants, strange creatures, and distant planets."
)
# A negative prompt that can be used to steer the generation away from certain features.
negative_prompt = "poorly drawn,cartoon, 3d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry"
# Generating initial latent vectors from a random normal distribution. In this example two latent vectors are generated, which will serve as start and end points for the interpolation.
# These vectors are shaped to fit the input requirements of the diffusion model's U-Net architecture.
latents = torch.randn(
    (2, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)
# Getting our latent embeddings
interpolated_latents = slerp(latents[0], latents[1], num_interpolation_steps)
# Generating images using the interpolated embeddings.
images = []
for latent_vector in tqdm(interpolated_latents):
    images.append(
        pipe(
            prompt,
            height=height,
            width=width,
            negative_prompt=negative_prompt,
            num_images_per_prompt=1,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latent_vector[None, ...],
        ).images
    )
# Display of saved generated images.
display_images(images, save_path)
Example 3: Interpolation between multiple prompts

In contrast to the first example, where we moved away from a single prompt, in this example we will interpolate between any number of prompts. To do so, we take consecutive pairs of prompts and create smooth transitions between them. We then concatenate the interpolations of these consecutive pairs and ask the model to generate images based on them. For the interpolation we use the slerp function, just as in the second example.

Once again, let's tokenize the prompts and obtain the embeddings, this time for multiple positive and negative text prompts.
# Text prompts that describe the desired output images.
prompts = [
    "A cute dog in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain",
    "A cute cat in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain",
]
# Negative prompts that can be used to steer the generation away from certain features.
negative_prompts = [
    "poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry",
    "poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry",
]
# NOTE: The number of prompts must match the number of negative prompts
batch_size = len(prompts)
# Tokenizing and encoding prompts into embeddings.
prompts_tokens = pipe.tokenizer(
    prompts,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
prompts_embeds = pipe.text_encoder(prompts_tokens.input_ids.to(device))[0]
# Tokenizing and encoding negative prompts into embeddings.
if negative_prompts is None:
    negative_prompts = [""] * batch_size

negative_prompts_tokens = pipe.tokenizer(
    negative_prompts,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
negative_prompts_embeds = pipe.text_encoder(negative_prompts_tokens.input_ids.to(device))[0]
As mentioned earlier, we will take consecutive pairs of prompts and use the slerp function to create smooth transitions between them.
# Generating initial U-Net latent vectors from a random normal distribution.
latents = torch.randn(
    (1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)
# Interpolating between embeddings pairs for the given number of interpolation steps.
interpolated_prompt_embeds = []
interpolated_negative_prompts_embeds = []
for i in range(batch_size - 1):
    interpolated_prompt_embeds.append(slerp(prompts_embeds[i], prompts_embeds[i + 1], num_interpolation_steps))
    interpolated_negative_prompts_embeds.append(
        slerp(
            negative_prompts_embeds[i],
            negative_prompts_embeds[i + 1],
            num_interpolation_steps,
        )
    )
interpolated_prompt_embeds = torch.cat(interpolated_prompt_embeds, dim=0).to(device)
interpolated_negative_prompts_embeds = torch.cat(interpolated_negative_prompts_embeds, dim=0).to(device)
Finally, we need to generate images based on these embeddings.
# Generating images using the interpolated embeddings.
images = []
for prompt_embeds, negative_prompt_embeds in tqdm(
    zip(interpolated_prompt_embeds, interpolated_negative_prompts_embeds),
    total=len(interpolated_prompt_embeds),
):
    images.append(
        pipe(
            height=height,
            width=width,
            num_images_per_prompt=1,
            prompt_embeds=prompt_embeds[None, ...],
            negative_prompt_embeds=negative_prompt_embeds[None, ...],
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latents,
        ).images
    )
# Display of saved generated images.
display_images(images, save_path)
Example 4: Circular walk through the diffusion latent space for a single prompt

This example was taken from: https://keras.org.cn/examples/generative/random_walks_with_stable_diffusion/

Let's say we have two noise components, which we will call x and y. We move from 0 to 2π and at each step add the cosine of our x component and the sine of our y component to the result. With this approach, at the end of the walk we arrive back at the same noise values we started with, so the path returns to its starting point and closes the loop.
# The text prompt that describes the desired output image.
prompt = "Beautiful sea sunset, warm light, Aivazovsky style"
# A negative prompt that can be used to steer the generation away from certain features
negative_prompt = "picture frames"
# Generating initial latent vectors from a random normal distribution to create a loop interpolation between them.
latents = torch.randn(
    (2, 1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)
# Calculation of looped embeddings
walk_noise_x = latents[0].to(device)
walk_noise_y = latents[1].to(device)
# Walking on a trigonometric circle
walk_scale_x = torch.cos(torch.linspace(0, 2, num_interpolation_steps) * np.pi).to(device)
walk_scale_y = torch.sin(torch.linspace(0, 2, num_interpolation_steps) * np.pi).to(device)
# Applying interpolation to noise
noise_x = torch.tensordot(walk_scale_x, walk_noise_x, dims=0)
noise_y = torch.tensordot(walk_scale_y, walk_noise_y, dims=0)
circular_latents = noise_x + noise_y
# Generating images using the interpolated embeddings.
images = []
for latent_vector in tqdm(circular_latents):
    images.append(
        pipe(
            prompt,
            height=height,
            width=width,
            negative_prompt=negative_prompt,
            num_images_per_prompt=1,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latent_vector,
        ).images
    )
# Display of saved generated images.
display_images(images, save_path)
Next steps

Moving forward, you can explore various parameters such as the guidance scale, the seed, and the number of interpolation steps to observe how they affect the generated images. Additionally, consider trying different prompts and schedulers to further improve your results. Another valuable step would be to implement linear interpolation (linspace) instead of spherical linear interpolation (slerp) and compare the results to gain deeper insight into the interpolation process.
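As a starting point, here is a minimal sketch of such a lerp function. It mirrors the signature of the slerp function defined earlier, and you could swap it in (for example, in Example 2) to compare linear and spherical interpolation directly.
def lerp(v0, v1, num, t0=0, t1=1):
    # Linear interpolation between two latent tensors, with the same signature as slerp above.
    v0 = v0.detach().cpu().numpy()
    v1 = v1.detach().cpu().numpy()
    t = np.linspace(t0, t1, num)
    return torch.tensor(np.array([(1 - ti) * v0 + ti * v1 for ti in t]))
# For example, in Example 2 you could replace the slerp call with:
# interpolated_latents = lerp(latents[0], latents[1], num_interpolation_steps)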