Open-Source AI Cookbook
Image Interpolation with Stable Diffusion
This notebook shows how to interpolate between images with Stable Diffusion. Image interpolation with Stable Diffusion is the process of using a diffusion-based generative model to create intermediate images that transition smoothly from one given image to another.
Here are some use cases for image interpolation with Stable Diffusion:
- Data augmentation: Stable Diffusion can augment training data for machine learning models by generating synthetic images that lie between existing data points. This can improve the generalization and robustness of machine learning models, especially for tasks such as image generation, classification, or object detection.
- Product design and prototyping: Stable Diffusion can assist product design by generating variations of a design, or prototypes with subtle differences. This is useful for exploring design options, running user studies, or visualizing design iterations before committing to physical prototypes.
- Content generation for media production: In media production such as film and video editing, Stable Diffusion can generate intermediate frames between keyframes, enabling smoother transitions and enhancing visual storytelling. This saves time and resources compared with manual frame-by-frame editing.
In the context of image interpolation, a Stable Diffusion model is typically used to navigate a high-dimensional latent space, where each dimension represents a feature the model has learned. By walking through this latent space and interpolating between the latent representations of different images, the model can generate a sequence of intermediate images that show a smooth transition between the originals. There are two kinds of latent space in Stable Diffusion: the prompt latent space and the image latent space.
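To make these two spaces concrete, here is a minimal illustration. It assumes the SD 1.5 pipeline object pipe and the device variable created later in this notebook, so it can only be run after the setup cells; the prompt text is arbitrary.
# Minimal illustration of the two latent spaces (assumes `pipe` and `device` from the setup cells below).
# Prompt latent space: the text encoder maps a tokenized prompt to a (batch, 77, 768) tensor for SD 1.5.
tokens = pipe.tokenizer("a quiet lake at dawn", padding="max_length",
                        max_length=pipe.tokenizer.model_max_length, return_tensors="pt")
prompt_embedding = pipe.text_encoder(tokens.input_ids.to(device))[0]
print(prompt_embedding.shape)  # torch.Size([1, 77, 768])
# Image latent space: the U-Net denoises a (batch, 4, height/8, width/8) tensor that the VAE decodes to pixels.
image_latent = torch.randn(1, pipe.unet.config.in_channels, 512 // 8, 512 // 8)
print(image_latent.shape)  # torch.Size([1, 4, 64, 64])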
Latent space walking means moving through the latent space along a path defined by two or more points (each representing an image). By carefully choosing these points and the path between them, you can control the characteristics of the generated images, such as style, content, and other visual aspects.
In this notebook, we will explore examples of image interpolation with Stable Diffusion and show how to implement and use latent space walking to create smooth transitions between images. We provide code snippets and visualizations that illustrate the process in practice, giving a deeper understanding of how generative models can manipulate and morph image representations in meaningful ways.
First, let's install all the required modules.
!pip install -q diffusers transformers xformers accelerate
!pip install -q numpy scipy ftfy Pillow
Import the modules
import torch
import numpy as np
import os
import time
from PIL import Image
from IPython import display as IPdisplay
from tqdm.auto import tqdm
from diffusers import StableDiffusionPipeline
from diffusers import (
DDIMScheduler,
PNDMScheduler,
LMSDiscreteScheduler,
DPMSolverMultistepScheduler,
EulerAncestralDiscreteScheduler,
EulerDiscreteScheduler,
)
from transformers import logging
logging.set_verbosity_error()
Let's check whether CUDA is available.
print(torch.cuda.is_available())
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
These settings optimize the performance of PyTorch models on CUDA-enabled GPUs, especially when using mixed precision for training or inference, which can be beneficial in terms of speed and memory usage.
Source: https://huggingface.co/docs/diffusers/optimization/fp16#memory-efficient-attention
torch.backends.cudnn.benchmark = True
torch.backends.cuda.matmul.allow_tf32 = True
Model
We pick the runwayml/stable-diffusion-v1-5 model together with the LMSDiscreteScheduler scheduler to generate images. Although it is an older technology, SD 1.5 remains popular thanks to its fast performance, minimal memory requirements, and the large number of community fine-tuned models built on top of it. You are free to try other models and schedulers and compare the results.
model_name_or_path = "runwayml/stable-diffusion-v1-5"
scheduler = LMSDiscreteScheduler(
beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000
)
pipe = StableDiffusionPipeline.from_pretrained(
model_name_or_path,
scheduler=scheduler,
torch_dtype=torch.float32,
).to(device)
# Disable image generation progress bar, we'll display our own
pipe.set_progress_bar_config(disable=True)
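As noted above, you can experiment with other schedulers. A minimal sketch of swapping the scheduler on the already-created pipeline is shown below; EulerDiscreteScheduler is only an example, and the line is left commented out because the rest of this notebook keeps LMSDiscreteScheduler.
# Optional: try a different scheduler and compare the results.
# Any of the schedulers imported above can be plugged in the same way.
# pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)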
The methods below reduce the amount of GPU memory consumed. If you have enough VRAM, you can skip this cell.
More details can be found here: https://huggingface.co/docs/diffusers/en/optimization/opt_overview
In particular, information about the following methods is available here: https://huggingface.co/docs/diffusers/optimization/memory
# Offloading the weights to the CPU and only loading them on the GPU can reduce memory consumption to less than 3GB.
pipe.enable_model_cpu_offload()
# Tighter ordering of memory tensors.
pipe.unet.to(memory_format=torch.channels_last)
# Decoding large batches of images with limited VRAM or batches with 32 images or more by decoding the batches of latents one image at a time.
pipe.enable_vae_slicing()
# Splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image.
pipe.enable_vae_tiling()
# Using Flash Attention; If you have PyTorch >= 2.0 installed, you should not expect a speed-up for inference when enabling xformers.
pipe.enable_xformers_memory_efficient_attention()
The display_images function converts a list of image arrays into a GIF, saves it to the specified path, and returns the GIF object for display. It names the GIF file with the current time and prints any errors that occur.
def display_images(images, save_path):
    try:
        # Convert each image in the 'images' list from an array to an Image object.
        images = [Image.fromarray(np.array(image[0], dtype=np.uint8)) for image in images]

        # Generate a file name based on the current time, replacing colons with hyphens
        # to ensure the filename is valid for file systems that don't allow colons.
        filename = time.strftime("%H:%M:%S", time.localtime()).replace(":", "-")
        # Save the first image in the list as a GIF file at the 'save_path' location.
        # The rest of the images in the list are added as subsequent frames to the GIF.
        # The GIF will play each frame for 100 milliseconds and will loop indefinitely.
        images[0].save(
            f"{save_path}/{filename}.gif",
            save_all=True,
            append_images=images[1:],
            duration=100,
            loop=0,
        )
    except Exception as e:
        # If there is an error during the process, print the exception message.
        print(e)

    # Return the saved GIF as an IPython display object so it can be displayed in a notebook.
    return IPdisplay.Image(f"{save_path}/{filename}.gif")
Generation parameters
- seed: Sets a specific random seed for reproducibility.
- generator: If a seed is provided, this is set to a PyTorch random number generator object; otherwise it is None. It ensures that operations using it produce reproducible results.
- guidance_scale: Controls how closely the model follows the prompt in text-to-image generation; higher values mean stronger adherence to the prompt (see the note after this list).
- num_inference_steps: The number of steps the model takes to generate an image. More steps can yield higher-quality images but take longer to generate.
- num_interpolation_steps: The number of steps used when interpolating between two points in latent space, which determines how smooth the transitions in the generated animation are.
- height: The height of the generated images, in pixels.
- width: The width of the generated images, in pixels.
- save_path: The filesystem path where the generated GIFs are saved.
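To give a sense of what guidance_scale controls: pipelines with classifier-free guidance combine an unconditional (negative-prompt) noise prediction with a prompt-conditioned one at every denoising step, roughly as sketched below. The variable names are illustrative, mirroring how diffusers pipelines apply it internally.
# Illustrative sketch of classifier-free guidance (the idea, not code to run):
# noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
# A larger guidance_scale pushes the prediction further toward the prompt-conditioned direction.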
# The seed is set to "None", because we want different results each time we run the generation.
seed = None
if seed is not None:
    generator = torch.manual_seed(seed)
else:
    generator = None
# The guidance scale is set to its normal range (7 - 10).
guidance_scale = 8
# The number of inference steps was chosen empirically to generate an acceptable picture within an acceptable time.
num_inference_steps = 15
# The higher you set this value, the smoother the interpolations will be. However, the generation time will increase. This value was chosen empirically.
num_interpolation_steps = 30
# I would not recommend less than 512 on either dimension. This is because this model was trained on 512x512 image resolution.
height = 512
width = 512
# The path where the generated GIFs will be saved
save_path = "/output"
if not os.path.exists(save_path):
    os.makedirs(save_path)
Example 1: Prompt interpolation
In this example, interpolating between the positive and negative prompt embeddings lets us explore the space between the two conceptual points defined by the prompts, potentially yielding a variety of images that gradually blend the characteristics dictated by the prompts. Here the interpolation works by adding a scaled increment to the original embeddings, creating a series of new embeddings that are later used to generate images with smooth transitions between different states based on the original prompt.
First, we need to tokenize the positive and negative text prompts and obtain their embeddings. The positive prompt steers the image generation toward the desired features, while the negative prompt steers it away from unwanted ones.
# The text prompt that describes the desired output image.
prompt = "Epic shot of Sweden, ultra detailed lake with an ren dear, nostalgic vintage, ultra cozy and inviting, wonderful light atmosphere, fairy, little photorealistic, digital painting, sharp focus, ultra cozy and inviting, wish to be there. very detailed, arty, should rank high on youtube for a dream trip."
# A negative prompt that can be used to steer the generation away from certain features.
negative_prompt = "poorly drawn,cartoon, 2d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry"
# The step size for the interpolation in the latent space.
step_size = 0.001
# Tokenizing and encoding the prompt into embeddings.
prompt_tokens = pipe.tokenizer(
prompt,
padding="max_length",
max_length=pipe.tokenizer.model_max_length,
truncation=True,
return_tensors="pt",
)
prompt_embeds = pipe.text_encoder(prompt_tokens.input_ids.to(device))[0]
# Tokenizing and encoding the negative prompt into embeddings.
if negative_prompt is None:
    negative_prompt = [""]
negative_prompt_tokens = pipe.tokenizer(
negative_prompt,
padding="max_length",
max_length=pipe.tokenizer.model_max_length,
truncation=True,
return_tensors="pt",
)
negative_prompt_embeds = pipe.text_encoder(negative_prompt_tokens.input_ids.to(device))[0]
Now let's look at the part of the code that generates a random initial latent vector from a normal distribution, shaped to match the dimensions expected by the diffusion model (the U-Net); an optional random generator can be used for reproducibility. Once the initial vector is created, the code adds a small scaled step to both embeddings (the positive and the negative prompt) at each iteration, producing a series of interpolated embeddings. The results are stored in a list called walked_embeddings.
# Generating initial latent vectors from a random normal distribution, with the option to use a generator for reproducibility.
latents = torch.randn(
(1, pipe.unet.config.in_channels, height // 8, width // 8),
generator=generator,
)
walked_embeddings = []
# Interpolating between embeddings for the given number of interpolation steps.
for i in range(num_interpolation_steps):
    walked_embeddings.append([prompt_embeds + step_size * i, negative_prompt_embeds + step_size * i])
Finally, let's generate a series of images from the interpolated embeddings and display them. We iterate over the array of embeddings, use each one to generate an image with the specified characteristics (such as height, width, and the other generation parameters), and collect the images into a list. Once generation is complete, we call the display_images function to save the images as a GIF at the given save path and display it.
# Generating images using the interpolated embeddings.
images = []
for latent in tqdm(walked_embeddings):
    images.append(
        pipe(
            height=height,
            width=width,
            num_images_per_prompt=1,
            prompt_embeds=latent[0],
            negative_prompt_embeds=latent[1],
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latents,
        ).images
    )
# Display of saved generated images.
display_images(images, save_path)
Example 2: Diffusion latent space interpolation for a single prompt
Unlike the first example, in this one we interpolate between two latent vectors of the diffusion model itself rather than between prompt embeddings. Note that in this case we use the slerp function for the interpolation; however, nothing prevents us from adding a constant value to an embedding instead.
The function introduced below implements spherical linear interpolation (slerp), a method for interpolating on the surface of a sphere. It is commonly used in computer graphics to animate rotations smoothly, and it can also be used to interpolate between high-dimensional data points in machine learning, such as the latent vectors used in generative models.
The source is a gist by Andrej Karpathy: https://gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355.
A more detailed explanation of this method can be found at: https://en.wikipedia.org/wiki/Slerp.
def slerp(v0, v1, num, t0=0, t1=1):
    v0 = v0.detach().cpu().numpy()
    v1 = v1.detach().cpu().numpy()

    def interpolation(t, v0, v1, DOT_THRESHOLD=0.9995):
        """helper function to spherically interpolate two arrays v1 v2"""
        dot = np.sum(v0 * v1 / (np.linalg.norm(v0) * np.linalg.norm(v1)))
        if np.abs(dot) > DOT_THRESHOLD:
            v2 = (1 - t) * v0 + t * v1
        else:
            theta_0 = np.arccos(dot)
            sin_theta_0 = np.sin(theta_0)
            theta_t = theta_0 * t
            sin_theta_t = np.sin(theta_t)
            s0 = np.sin(theta_0 - theta_t) / sin_theta_0
            s1 = sin_theta_t / sin_theta_0
            v2 = s0 * v0 + s1 * v1
        return v2

    t = np.linspace(t0, t1, num)
    v3 = torch.tensor(np.array([interpolation(t[i], v0, v1) for i in range(num)]))
    return v3
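As a quick optional sanity check on slerp (using small illustrative random tensors, not the pipeline latents), the returned path should contain num frames whose endpoints match the inputs:
# Optional sanity check on slerp with small illustrative random vectors.
v_a, v_b = torch.randn(8), torch.randn(8)
path = slerp(v_a, v_b, num=5)
print(path.shape)  # torch.Size([5, 8])
print(np.allclose(path[0].numpy(), v_a.numpy(), atol=1e-5))   # True: the path starts at v_a
print(np.allclose(path[-1].numpy(), v_b.numpy(), atol=1e-5))  # True: and ends at v_b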
# The text prompt that describes the desired output image.
prompt = (
"Sci-fi digital painting of an alien landscape with otherworldly plants, strange creatures, and distant planets."
)
# A negative prompt that can be used to steer the generation away from certain features.
negative_prompt = "poorly drawn,cartoon, 3d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry"
# Generating initial latent vectors from a random normal distribution. In this example two latent vectors are generated, which will serve as start and end points for the interpolation.
# These vectors are shaped to fit the input requirements of the diffusion model's U-Net architecture.
latents = torch.randn(
(2, pipe.unet.config.in_channels, height // 8, width // 8),
generator=generator,
)
# Getting our latent embeddings
interpolated_latents = slerp(latents[0], latents[1], num_interpolation_steps)
# Generating images using the interpolated embeddings.
images = []
for latent_vector in tqdm(interpolated_latents):
    images.append(
        pipe(
            prompt,
            height=height,
            width=width,
            negative_prompt=negative_prompt,
            num_images_per_prompt=1,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latent_vector[None, ...],
        ).images
    )
# Display of saved generated images.
display_images(images, save_path)
Example 3: Interpolation between multiple prompts
In contrast to the first example, where we walked away from a single prompt, in this example we interpolate between any number of prompts. To do so, we take consecutive pairs of prompts and create smooth transitions between them. We then combine the interpolations of these consecutive pairs and instruct the model to generate images based on them. For the interpolation we use the slerp function, as in the second example.
Once again, let's tokenize the prompts and obtain embeddings, but this time for multiple positive and negative text prompts.
# Text prompts that describe the desired output images.
prompts = [
"A cute dog in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain",
"A cute cat in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain",
]
# Negative prompts that can be used to steer the generation away from certain features.
negative_prompts = [
"poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry",
"poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry",
]
# NOTE: The number of prompts must match the number of negative prompts
batch_size = len(prompts)
# Tokenizing and encoding prompts into embeddings.
prompts_tokens = pipe.tokenizer(
prompts,
padding="max_length",
max_length=pipe.tokenizer.model_max_length,
truncation=True,
return_tensors="pt",
)
prompts_embeds = pipe.text_encoder(prompts_tokens.input_ids.to(device))[0]
# Tokenizing and encoding negative prompts into embeddings.
if negative_prompts is None:
    negative_prompts = [""] * batch_size
negative_prompts_tokens = pipe.tokenizer(
negative_prompts,
padding="max_length",
max_length=pipe.tokenizer.model_max_length,
truncation=True,
return_tensors="pt",
)
negative_prompts_embeds = pipe.text_encoder(negative_prompts_tokens.input_ids.to(device))[0]
As described above, we take consecutive pairs of prompts and create smooth transitions between them with the slerp function.
# Generating initial U-Net latent vectors from a random normal distribution.
latents = torch.randn(
(1, pipe.unet.config.in_channels, height // 8, width // 8),
generator=generator,
)
# Interpolating between embeddings pairs for the given number of interpolation steps.
interpolated_prompt_embeds = []
interpolated_negative_prompts_embeds = []
for i in range(batch_size - 1):
    interpolated_prompt_embeds.append(slerp(prompts_embeds[i], prompts_embeds[i + 1], num_interpolation_steps))
    interpolated_negative_prompts_embeds.append(
        slerp(
            negative_prompts_embeds[i],
            negative_prompts_embeds[i + 1],
            num_interpolation_steps,
        )
    )
interpolated_prompt_embeds = torch.cat(interpolated_prompt_embeds, dim=0).to(device)
interpolated_negative_prompts_embeds = torch.cat(interpolated_negative_prompts_embeds, dim=0).to(device)
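# After concatenation, each tensor holds one text embedding per interpolation frame:
# shape ((len(prompts) - 1) * num_interpolation_steps, 77, 768) for the SD 1.5 text encoder.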
Finally, we need to generate images based on these embeddings.
# Generating images using the interpolated embeddings.
images = []
for prompt_embeds, negative_prompt_embeds in tqdm(
    zip(interpolated_prompt_embeds, interpolated_negative_prompts_embeds),
    total=len(interpolated_prompt_embeds),
):
    images.append(
        pipe(
            height=height,
            width=width,
            num_images_per_prompt=1,
            prompt_embeds=prompt_embeds[None, ...],
            negative_prompt_embeds=negative_prompt_embeds[None, ...],
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latents,
        ).images
    )
# Display of saved generated images.
display_images(images, save_path)
Example 4: Circular walk through the diffusion latent space for a single prompt
This example is taken from: https://keras.org.cn/examples/generative/random_walks_with_stable_diffusion/
Imagine that we have two noise components, which we will call x and y. We walk an angle from 0 to 2π, and at each step we add the cosine of the angle times x and the sine of the angle times y to the result. With this approach, at the end of the walk we arrive at the same noise values we started with (since cos(0) = cos(2π) and sin(0) = sin(2π)), so the walk closes into a loop.
# The text prompt that describes the desired output image.
prompt = "Beautiful sea sunset, warm light, Aivazovsky style"
# A negative prompt that can be used to steer the generation away from certain features
negative_prompt = "picture frames"
# Generating initial latent vectors from a random normal distribution to create a loop interpolation between them.
latents = torch.randn(
(2, 1, pipe.unet.config.in_channels, height // 8, width // 8),
generator=generator,
)
# Calculation of looped embeddings
walk_noise_x = latents[0].to(device)
walk_noise_y = latents[1].to(device)
# Walking on a trigonometric circle
walk_scale_x = torch.cos(torch.linspace(0, 2, num_interpolation_steps) * np.pi).to(device)
walk_scale_y = torch.sin(torch.linspace(0, 2, num_interpolation_steps) * np.pi).to(device)
# Applying interpolation to noise
noise_x = torch.tensordot(walk_scale_x, walk_noise_x, dims=0)
noise_y = torch.tensordot(walk_scale_y, walk_noise_y, dims=0)
circular_latents = noise_x + noise_y
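# Because cos and sin are 2*pi-periodic, the first and last latents coincide (up to float error),
# so the resulting GIF loops seamlessly. Optional check:
print(torch.allclose(circular_latents[0], circular_latents[-1], atol=1e-5))  # True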
# Generating images using the interpolated embeddings.
images = []
for latent_vector in tqdm(circular_latents):
    images.append(
        pipe(
            prompt,
            height=height,
            width=width,
            negative_prompt=negative_prompt,
            num_images_per_prompt=1,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latent_vector,
        ).images
    )
# Display of saved generated images.
display_images(images, save_path)
Next steps
Going forward, you can explore various parameters, such as the guidance scale, the seed, and the number of interpolation steps, to observe how they affect the generated images. Additionally, consider trying different prompts and schedulers to further improve your results. Another valuable step is to implement linear interpolation (linspace) instead of spherical linear interpolation (slerp) and compare the results to gain deeper insight into the interpolation process.
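As a starting point for that comparison, here is a minimal linear-interpolation sketch. lerp is a hypothetical helper (not part of the original notebook) that mirrors the signature of the slerp function above, so it can be dropped in as a replacement:
# Hypothetical linear-interpolation helper mirroring the slerp signature above; a sketch for experimentation.
def lerp(v0, v1, num, t0=0, t1=1):
    v0 = v0.detach().cpu().numpy()
    v1 = v1.detach().cpu().numpy()
    t = np.linspace(t0, t1, num)
    # Straight-line interpolation: (1 - t) * v0 + t * v1 for each step t.
    v3 = torch.tensor(np.array([(1 - t[i]) * v0 + t[i] * v1 for i in range(num)]))
    return v3

# Example usage (as in Example 2): interpolated_latents = lerp(latents[0], latents[1], num_interpolation_steps)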