Diffusers


Stable Video Diffusion

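Stable Video Diffusion (SVD) is a powerful image-to-video generation model that can generate 2-4 second high-resolution (576x1024) videos conditioned on an input image.

This guide shows you how to use SVD to generate short videos from images.

Before you begin, make sure you have the following libraries installed: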

# Install the required libraries (run this in Colab or another notebook)
!pip install -q -U diffusers transformers accelerate

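There are two variants of this model, SVD and SVD-XT. The SVD checkpoint is trained to generate 14 frames, and the SVD-XT checkpoint is further fine-tuned to generate 25 frames.

You'll use the SVD-XT checkpoint in this guide.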

import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

export_to_video(frames, "generated.mp4", fps=7)

"Source image of a rocket"

"Video generated from the source image"
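The 2-4 second range quoted at the top of this guide can be checked from the frame counts and the playback frame rate. The sketch below is only arithmetic: `clip_seconds` is a hypothetical helper, not part of Diffusers, and it assumes the clip is exported at the same fps=7 used in the export_to_video call above.

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    """Clip length in seconds = number of frames / playback frame rate."""
    return num_frames / fps


# Frame counts are the ones stated for each checkpoint in this guide;
# fps=7 is assumed to match the export_to_video call above.
for name, num_frames in [("SVD", 14), ("SVD-XT", 25)]:
    seconds = clip_seconds(num_frames, fps=7)
    print(f"{name}: {num_frames} frames at 7 fps = {seconds:.2f} s")
# SVD: 14 frames at 7 fps = 2.00 s
# SVD-XT: 25 frames at 7 fps = 3.57 s
```

Exporting the same frames at a different fps changes the playback length but not the number of frames the model generates.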

torch.compile

You can gain a 20-25% speedup by compiling the UNet, at the cost of slightly higher memory use.

- pipe.enable_model_cpu_offload()
+ pipe.to("cuda")
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

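Reduce memory usage

Video generation is very memory intensive because you are essentially generating num_frames all at once, similar to text-to-image generation with a high batch size. To reduce the memory requirement, several options trade inference speed for lower memory use:

Enable model offloading: each component of the pipeline is offloaded to the CPU once it is no longer needed.
Enable feed-forward chunking: the feed-forward layer runs in a loop instead of as a single feed-forward pass with a huge batch size.
Reduce decode_chunk_size: the VAE decodes frames in chunks instead of decoding them all together. Setting decode_chunk_size=1 decodes one frame at a time and uses the least memory (we recommend adjusting this value to your GPU memory), but the video might show some flickering.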

- pipe.enable_model_cpu_offload()
- frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
+ pipe.enable_model_cpu_offload()
+ pipe.unet.enable_forward_chunking()
+ frames = pipe(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]

Using all of these tricks together should lower the memory requirement to less than 8GB of VRAM.
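To make the decode_chunk_size trade-off concrete, the sketch below splits 25 frame indices into consecutive chunks and counts how many decode passes each setting would need. It illustrates the idea only: `decode_chunks` is a hypothetical helper, not the pipeline's internal code, and the actual memory saved depends on your GPU and resolution.

```python
def decode_chunks(num_frames, decode_chunk_size):
    """Return the consecutive index ranges a chunked decoder would process."""
    if decode_chunk_size < 1:
        raise ValueError("decode_chunk_size must be at least 1")
    return [
        range(start, min(start + decode_chunk_size, num_frames))
        for start in range(0, num_frames, decode_chunk_size)
    ]


# Smaller chunks mean fewer frames held in memory per pass, but more passes.
for size in (8, 2, 1):
    chunks = decode_chunks(25, size)
    largest = max(len(chunk) for chunk in chunks)
    print(f"decode_chunk_size={size}: {len(chunks)} passes, up to {largest} frames each")
# decode_chunk_size=8: 4 passes, up to 8 frames each
# decode_chunk_size=2: 13 passes, up to 2 frames each
# decode_chunk_size=1: 25 passes, up to 1 frames each
```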

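Micro-conditioning

In addition to the conditioning image, Stable Video Diffusion accepts micro-conditioning, which gives you more control over the generated video:

fps: the frames per second of the generated video.
motion_bucket_id: the motion bucket id to use for the generated video. This controls the amount of motion in the generated video; increasing the motion bucket id increases the motion.
noise_aug_strength: the amount of noise added to the conditioning image. The higher the value, the less the video resembles the conditioning image. Increasing this value also increases the motion of the generated video.

For example, to generate a video with more motion, use the motion_bucket_id and noise_aug_strength micro-conditioning parameters: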

import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator, motion_bucket_id=180, noise_aug_strength=0.1).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
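To compare the effect of motion_bucket_id, it can help to render the same image several times while changing only that one parameter. The sketch below builds one set of keyword arguments per setting. `motion_sweep` is a hypothetical helper, the values 60 and 120 are arbitrary illustration choices, and the commented loop assumes `pipe` and `image` from the example above.

```python
def motion_sweep(motion_bucket_ids, noise_aug_strength=0.1, decode_chunk_size=8):
    """Build one keyword-argument dict per motion setting (hypothetical helper)."""
    return [
        {
            "motion_bucket_id": motion_bucket_id,
            "noise_aug_strength": noise_aug_strength,
            "decode_chunk_size": decode_chunk_size,
        }
        for motion_bucket_id in motion_bucket_ids
    ]


settings = motion_sweep([60, 120, 180])
for kwargs in settings:
    print(kwargs)

# With `pipe` and `image` defined as in the example above:
# for kwargs in settings:
#     generator = torch.manual_seed(42)  # reseed so only the micro-conditioning differs
#     frames = pipe(image, generator=generator, **kwargs).frames[0]
#     export_to_video(frames, f"motion_{kwargs['motion_bucket_id']}.mp4", fps=7)
```

Reseeding before each call keeps the initial noise identical across runs, so differences between the clips come from the micro-conditioning rather than from the random seed.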
