Text-to-Video Generation with AnimateDiff

Overview

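AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning, by Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai.

The abstract of the paper is the following:

With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. Subsequently, there is a great demand for image animation techniques to further combine generated static images with motion dynamics. In this report, we propose a practical framework to animate most of the existing personalized text-to-image models once and for all, saving efforts in model-specific tuning. At the core of the proposed framework is to insert a newly initialized motion modeling module into the frozen text-to-image model and train it on video clips to distill reasonable motion priors. Once trained, by simply injecting this motion modeling module, all personalized versions derived from the same base T2I readily become text-driven models that produce diverse and personalized animated images. We conduct our evaluation on several public representative personalized text-to-image models across anime pictures and realistic photographs, and demonstrate that our proposed framework helps these models generate temporally smooth animation clips while preserving the domain and diversity of their outputs. Code and pre-trained weights will be publicly available at this https URL.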

Available Pipelines

Pipeline	Tasks	Demo
AnimateDiffPipeline	Text-to-video generation with AnimateDiff
AnimateDiffControlNetPipeline	Controlled video-to-video generation with AnimateDiff using ControlNet
AnimateDiffSparseControlNetPipeline	Controlled video-to-video generation with AnimateDiff using SparseCtrl
AnimateDiffSDXLPipeline	Text-to-video generation with AnimateDiff using SDXL
AnimateDiffVideoToVideoPipeline	Video-to-video generation with AnimateDiff
AnimateDiffVideoToVideoControlNetPipeline	Video-to-video generation with AnimateDiff using ControlNet

Available checkpoints

Motion Adapter checkpoints can be found under guoyww. These checkpoints are meant to work with any model based on Stable Diffusion 1.4/1.5.

Usage example

AnimateDiffPipeline

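AnimateDiff works with a MotionAdapter checkpoint and a Stable Diffusion model checkpoint. The MotionAdapter is a collection of Motion Modules that are responsible for adding coherent motion across image frames. These modules are applied after the Resnet and Attention blocks in the Stable Diffusion UNet.

The following example demonstrates how to use a MotionAdapter checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5.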

import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
# load SD 1.5 based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)
scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    clip_sample=False,
    timestep_spacing="linspace",
    beta_schedule="linear",
    steps_offset=1,
)
pipe.scheduler = scheduler

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

output = pipe(
    prompt=(
        "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
        "orange sky, warm lighting, fishing boats, ocean waves seagulls, "
        "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
        "golden hour, coastal landscape, seaside scenery"
    ),
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")

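Here are some sample outputs:

masterpiece, bestquality, sunset

AnimateDiff tends to work better with finetuned Stable Diffusion models. If you plan on using a scheduler that can clip samples, make sure to disable clipping by setting clip_sample=False in the scheduler, as clipping can have an adverse effect on generated samples. Additionally, the AnimateDiff checkpoints can be sensitive to the beta schedule of the scheduler. We recommend setting it to linear.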
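
To make the beta-schedule recommendation above more concrete, the following plain-Python sketch compares a "linear" schedule with a "scaled_linear" one. It is only an illustration and does not call Diffusers: the helper names linspace and make_betas are hypothetical, and the values beta_start=0.00085, beta_end=0.012 and 1000 training timesteps are assumed as typical Stable Diffusion settings rather than read from any checkpoint.

```python
def linspace(start: float, stop: float, num: int) -> list[float]:
    """Return `num` evenly spaced values from start to stop (inclusive)."""
    if num == 1:
        return [start]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def make_betas(
    schedule: str,
    beta_start: float = 0.00085,  # assumed typical value
    beta_end: float = 0.012,      # assumed typical value
    num_train_timesteps: int = 1000,
) -> list[float]:
    """Build a beta schedule by name ("linear" or "scaled_linear")."""
    if schedule == "linear":
        # Betas grow evenly from beta_start to beta_end.
        return linspace(beta_start, beta_end, num_train_timesteps)
    if schedule == "scaled_linear":
        # Interpolate in sqrt(beta) space, then square each value.
        roots = linspace(beta_start ** 0.5, beta_end ** 0.5, num_train_timesteps)
        return [r * r for r in roots]
    raise ValueError(f"unknown schedule: {schedule}")


linear = make_betas("linear")
scaled = make_betas("scaled_linear")
mid = len(linear) // 2
print(f"linear beta at step {mid}:        {linear[mid]:.6f}")
print(f"scaled_linear beta at step {mid}: {scaled[mid]:.6f}")
```

Both schedules share the same endpoints but differ in between, which is one plausible reason a motion module trained with one schedule may behave differently under the other.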

AnimateDiffControlNetPipeline

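AnimateDiff can also be used with ControlNets. ControlNet was introduced in Adding Conditional Control to Text-to-Image Diffusion Models by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide depth maps, the ControlNet model generates a video that preserves the spatial information from the depth maps. It is a more flexible and accurate way to control the video generation process.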

import torch
from diffusers import AnimateDiffControlNetPipeline, AutoencoderKL, ControlNetModel, MotionAdapter, LCMScheduler
from diffusers.utils import export_to_gif, load_video

# Additionally, you will need to preprocess videos before they can be used with the ControlNet
# HF maintains just the right package for it: `pip install controlnet_aux`
from controlnet_aux.processor import ZoeDetector

# Download controlnets from https://huggingface.co/lllyasviel/ControlNet-v1-1 to use .from_single_file
# Download Diffusers-format controlnets, such as https://huggingface.co/lllyasviel/sd-controlnet-depth, to use .from_pretrained()
controlnet = ControlNetModel.from_single_file("control_v11f1p_sd15_depth.pth", torch_dtype=torch.float16)

# We use AnimateLCM for this example but one can use the original motion adapters as well (for example, https://huggingface.co/guoyww/animatediff-motion-adapter-v1-5-3)
motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM")

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)
pipe: AnimateDiffControlNetPipeline = AnimateDiffControlNetPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    vae=vae,
).to(device="cuda", dtype=torch.float16)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")
pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm-lora")
pipe.set_adapters(["lcm-lora"], [0.8])

depth_detector = ZoeDetector.from_pretrained("lllyasviel/Annotators").to("cuda")
video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif")
conditioning_frames = []

with pipe.progress_bar(total=len(video)) as progress_bar:
    for frame in video:
        conditioning_frames.append(depth_detector(frame))
        progress_bar.update()

prompt = "a panda, playing a guitar, sitting in a pink boat, in the ocean, mountains in background, realistic, high quality"
negative_prompt = "bad quality, worst quality"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_frames=len(video),
    num_inference_steps=10,
    guidance_scale=2.0,
    conditioning_frames=conditioning_frames,
    generator=torch.Generator().manual_seed(42),
).frames[0]

export_to_gif(video, "animatediff_controlnet.gif", fps=8)

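Here are some sample outputs:

Source Video	Output Video
raccoon playing a guitar	a panda, playing a guitar, sitting in a pink boat, in the ocean, mountains in background, realistic, high quality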

AnimateDiffSparseControlNetPipeline

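SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models, by Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai, enables controlled generation in text-to-video diffusion models.

The abstract of the paper is the following:

The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages the dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, whose collection accordingly increases the burden of inference. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available at this https URL.

SparseCtrl introduces the following checkpoints for controlled text-to-video generation: SparseCtrl Scribble and SparseCtrl RGB.

Using SparseCtrl Scribble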

import torch

from diffusers import AnimateDiffSparseControlNetPipeline
from diffusers.models import AutoencoderKL, MotionAdapter, SparseControlNetModel
from diffusers.schedulers import DPMSolverMultistepScheduler
from diffusers.utils import export_to_gif, load_image


model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-3"
controlnet_id = "guoyww/animatediff-sparsectrl-scribble"
lora_adapter_id = "guoyww/animatediff-motion-lora-v1-5-3"
vae_id = "stabilityai/sd-vae-ft-mse"
device = "cuda"

motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id, torch_dtype=torch.float16).to(device)
controlnet = SparseControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16).to(device)
vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16).to(device)
scheduler = DPMSolverMultistepScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    beta_schedule="linear",
    algorithm_type="dpmsolver++",
    use_karras_sigmas=True,
)
pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
    model_id,
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    vae=vae,
    scheduler=scheduler,
    torch_dtype=torch.float16,
).to(device)
pipe.load_lora_weights(lora_adapter_id, adapter_name="motion_lora")
pipe.fuse_lora(lora_scale=1.0)

prompt = "an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality"
negative_prompt = "low quality, worst quality, letterboxed"

image_files = [
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-1.png",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-2.png",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-3.png"
]
condition_frame_indices = [0, 8, 15]
conditioning_frames = [load_image(img_file) for img_file in image_files]

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,
    conditioning_frames=conditioning_frames,
    controlnet_conditioning_scale=1.0,
    controlnet_frame_indices=condition_frame_indices,
    generator=torch.Generator().manual_seed(1337),
).frames[0]
export_to_gif(video, "output.gif")

Here are some sample outputs:

an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality
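
In the scribble example above, three conditioning images are paired with controlnet_frame_indices = [0, 8, 15]. One way to picture this pairing is as scattering a few control images into an otherwise empty per-frame timeline, together with a 0/1 mask that marks which frames carry a condition. The sketch below is only an illustration of that idea in plain Python; scatter_conditions is a hypothetical helper, not part of Diffusers, and the strings stand in for real images.

```python
from typing import Optional


def scatter_conditions(
    num_frames: int,
    conditioning_frames: list[str],
    frame_indices: list[int],
) -> tuple[list[Optional[str]], list[int]]:
    """Place sparse conditions on a timeline and return (timeline, mask)."""
    if len(conditioning_frames) != len(frame_indices):
        raise ValueError("need exactly one frame index per conditioning frame")

    # Start with no condition on any frame.
    timeline: list[Optional[str]] = [None] * num_frames
    for frame, index in zip(conditioning_frames, frame_indices):
        if not 0 <= index < num_frames:
            raise IndexError(f"frame index {index} is outside 0..{num_frames - 1}")
        timeline[index] = frame

    # 1 where a condition is present, 0 where the model is unconstrained.
    mask = [0 if entry is None else 1 for entry in timeline]
    return timeline, mask


timeline, mask = scatter_conditions(
    num_frames=16,
    conditioning_frames=["scribble-1", "scribble-2", "scribble-3"],
    frame_indices=[0, 8, 15],
)
print(mask)
```

Frames without a condition are left to the motion prior and the text prompt, which is what makes the control temporally sparse.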

Using SparseCtrl RGB

import torch

from diffusers import AnimateDiffSparseControlNetPipeline
from diffusers.models import AutoencoderKL, MotionAdapter, SparseControlNetModel
from diffusers.schedulers import DPMSolverMultistepScheduler
from diffusers.utils import export_to_gif, load_image


model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-3"
controlnet_id = "guoyww/animatediff-sparsectrl-rgb"
lora_adapter_id = "guoyww/animatediff-motion-lora-v1-5-3"
vae_id = "stabilityai/sd-vae-ft-mse"
device = "cuda"

motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id, torch_dtype=torch.float16).to(device)
controlnet = SparseControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16).to(device)
vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16).to(device)
scheduler = DPMSolverMultistepScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    beta_schedule="linear",
    algorithm_type="dpmsolver++",
    use_karras_sigmas=True,
)
pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
    model_id,
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    vae=vae,
    scheduler=scheduler,
    torch_dtype=torch.float16,
).to(device)
pipe.load_lora_weights(lora_adapter_id, adapter_name="motion_lora")

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-firework.png")

video = pipe(
    prompt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background",
    negative_prompt="low quality, worst quality",
    num_inference_steps=25,
    conditioning_frames=image,
    controlnet_frame_indices=[0],
    controlnet_conditioning_scale=1.0,
    generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_gif(video, "output.gif")

Here are some sample outputs:

closeup face photo of man in black clothes, night city street, bokeh, fireworks in background

AnimateDiffSDXLPipeline

AnimateDiff can also be used with SDXL models. This is currently an experimental feature, as only a beta release of the motion adapter checkpoint is available.

import torch
from diffusers.models import MotionAdapter
from diffusers import AnimateDiffSDXLPipeline, DDIMScheduler
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-sdxl-beta", torch_dtype=torch.float16)

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    clip_sample=False,
    timestep_spacing="linspace",
    beta_schedule="linear",
    steps_offset=1,
)
pipe = AnimateDiffSDXLPipeline.from_pretrained(
    model_id,
    motion_adapter=adapter,
    scheduler=scheduler,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()

output = pipe(
    prompt="a panda surfing in the ocean, realistic, high quality",
    negative_prompt="low quality, worst quality",
    num_inference_steps=20,
    guidance_scale=8,
    width=1024,
    height=1024,
    num_frames=16,
)

frames = output.frames[0]
export_to_gif(frames, "animation.gif")

AnimateDiffVideoToVideoPipeline

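AnimateDiff can also be used to generate visually similar videos, or to apply style, character, background, or other edits starting from an initial video, allowing you to seamlessly explore creative possibilities.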

import imageio
import requests
import torch
from diffusers import AnimateDiffVideoToVideoPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif
from io import BytesIO
from PIL import Image

# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
# load SD 1.5 based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffVideoToVideoPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)
scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    clip_sample=False,
    timestep_spacing="linspace",
    beta_schedule="linear",
    steps_offset=1,
)
pipe.scheduler = scheduler

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

# helper function to load videos
def load_video(file_path: str):
    images = []

    if file_path.startswith(('http://', 'https://')):
        # If the file_path is a URL
        response = requests.get(file_path)
        response.raise_for_status()
        content = BytesIO(response.content)
        vid = imageio.get_reader(content)
    else:
        # Assuming it's a local file path
        vid = imageio.get_reader(file_path)

    for frame in vid:
        pil_image = Image.fromarray(frame)
        images.append(pil_image)

    return images

video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif")

output = pipe(
    video=video,
    prompt="panda playing a guitar, on a boat, in the ocean, high quality",
    negative_prompt="bad quality, worse quality",
    guidance_scale=7.5,
    num_inference_steps=25,
    strength=0.5,
    generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")

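Here are some sample outputs:

Source Video	Output Video
raccoon playing a guitar	panda playing a guitar
closeup of margot robbie, fireworks in the background, high quality	closeup of tony stark, robert downey jr, fireworks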
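
The strength=0.5 argument in the example above controls how far the result may drift from the source video. Assuming the pipeline follows the usual Diffusers image-to-image convention (an assumption, not something this page states), strength decides how many of the scheduled denoising steps are actually run. The helper steps_after_strength below is a hypothetical plain-Python sketch of that arithmetic.

```python
def steps_after_strength(num_inference_steps: int, strength: float) -> tuple[int, int]:
    """Return (steps_skipped, steps_run) for a strength value in [0, 1]."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Only the last `steps_run` steps of the schedule are executed.
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    steps_skipped = max(num_inference_steps - steps_run, 0)
    return steps_skipped, steps_run


skipped, run = steps_after_strength(num_inference_steps=25, strength=0.5)
print(f"skipped {skipped} steps, ran {run} steps")
```

Under this assumption, a lower strength keeps more of the source video and also costs fewer denoising steps, while strength=1.0 runs the full schedule.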

AnimateDiffVideoToVideoControlNetPipeline

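AnimateDiff can be used together with ControlNets to enhance video-to-video generation by allowing precise control over the output. ControlNet was introduced in Adding Conditional Control to Text-to-Image Diffusion Models by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala, and allows you to condition Stable Diffusion on an additional control image so that spatial information is preserved throughout the video.

This pipeline allows you to condition your generation on both the original video and a sequence of control images.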

import torch
from PIL import Image
from tqdm.auto import tqdm

from controlnet_aux.processor import OpenposeDetector
from diffusers import AnimateDiffVideoToVideoControlNetPipeline
from diffusers.utils import export_to_gif, load_video
from diffusers import AutoencoderKL, ControlNetModel, MotionAdapter, LCMScheduler

# Load the ControlNet
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
# Load the motion adapter
motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM")
# Load SD 1.5 based finetuned model
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)
pipe = AnimateDiffVideoToVideoControlNetPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    vae=vae,
).to(device="cuda", dtype=torch.float16)

# Enable LCM to speed up inference
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")
pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm-lora")
pipe.set_adapters(["lcm-lora"], [0.8])

video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/dance.gif")
video = [frame.convert("RGB") for frame in video]

prompt = "astronaut in space, dancing"
negative_prompt = "bad quality, worst quality, jpeg artifacts, ugly"

# Create controlnet preprocessor
open_pose = OpenposeDetector.from_pretrained("lllyasviel/Annotators").to("cuda")

# Preprocess controlnet images
conditioning_frames = []
for frame in tqdm(video):
    conditioning_frames.append(open_pose(frame))

strength = 0.8
with torch.inference_mode():
    video = pipe(
        video=video,
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=10,
        guidance_scale=2.0,
        controlnet_conditioning_scale=0.75,
        conditioning_frames=conditioning_frames,
        strength=strength,
        generator=torch.Generator().manual_seed(42),
    ).frames[0]

video = [frame.resize(conditioning_frames[0].size) for frame in video]
export_to_gif(video, "animatediff_vid2vid_controlnet.gif", fps=8)

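Here are some sample outputs:

Source Video	Output Video
anime girl, dancing	astronaut in space, dancing

The lighting and composition were transferred from the source video.

Using Motion LoRAs

Motion LoRAs are a collection of LoRAs that work with the guoyww/animatediff-motion-adapter-v1-5-2 checkpoint. These LoRAs are responsible for adding specific types of motion to the animations.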

import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
# load SD 1.5 based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)
pipe.load_lora_weights(
    "guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out"
)

scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    clip_sample=False,
    beta_schedule="linear",
    timestep_spacing="linspace",
    steps_offset=1,
)
pipe.scheduler = scheduler

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

output = pipe(
    prompt=(
        "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
        "orange sky, warm lighting, fishing boats, ocean waves seagulls, "
        "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
        "golden hour, coastal landscape, seaside scenery"
    ),
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")

masterpiece, bestquality, sunset

Using Motion LoRAs with PEFT

You can also leverage the PEFT backend to combine Motion LoRAs and create more complex animations.

First install PEFT with:

pip install peft

Then you can use the following code to combine Motion LoRAs.

import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
# load SD 1.5 based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)

pipe.load_lora_weights(
    "diffusers/animatediff-motion-lora-zoom-out", adapter_name="zoom-out",
)
pipe.load_lora_weights(
    "diffusers/animatediff-motion-lora-pan-left", adapter_name="pan-left",
)
pipe.set_adapters(["zoom-out", "pan-left"], adapter_weights=[1.0, 1.0])

scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    clip_sample=False,
    timestep_spacing="linspace",
    beta_schedule="linear",
    steps_offset=1,
)
pipe.scheduler = scheduler

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

output = pipe(
    prompt=(
        "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
        "orange sky, warm lighting, fishing boats, ocean waves seagulls, "
        "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
        "golden hour, coastal landscape, seaside scenery"
    ),
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")

masterpiece, bestquality, sunset
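
The adapter_weights passed to set_adapters in the example above can be read as scaling each LoRA's low-rank update before the updates are added to the frozen base weight. The following toy sketch shows that idea with tiny rank-1 matrices in plain Python. It is a simplification under stated assumptions: merge_loras and the other helpers are hypothetical, the per-adapter alpha/rank scaling used by real LoRA implementations is omitted, and PEFT may apply adapters at runtime rather than merging them.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two small dense matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add_scaled(base: Matrix, delta: Matrix, scale: float) -> Matrix:
    """Return base + scale * delta, element by element."""
    return [
        [b + scale * d for b, d in zip(base_row, delta_row)]
        for base_row, delta_row in zip(base, delta)
    ]


def merge_loras(
    base_weight: Matrix,
    adapters: list[tuple[Matrix, Matrix]],
    adapter_weights: list[float],
) -> Matrix:
    """Add each adapter's low-rank update (up @ down), scaled by its weight."""
    merged = [row[:] for row in base_weight]
    for (up, down), weight in zip(adapters, adapter_weights):
        merged = add_scaled(merged, matmul(up, down), weight)
    return merged


base = [[1.0, 0.0], [0.0, 1.0]]
zoom_out = ([[1.0], [0.0]], [[0.0, 1.0]])  # toy rank-1 update
pan_left = ([[0.0], [1.0]], [[1.0, 0.0]])  # toy rank-1 update
merged = merge_loras(base, [zoom_out, pan_left], adapter_weights=[1.0, 0.5])
print(merged)
```

Lowering one adapter's weight shrinks only that adapter's contribution, which is why the weights can be used to balance two motion types against each other.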

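Using FreeInit

FreeInit: Bridging Initialization Gap in Video Diffusion Models, by Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu.

FreeInit is an effective method that improves the temporal consistency and overall quality of videos generated with video diffusion models, without any additional training. It can be applied to AnimateDiff, ModelScope, VideoCrafter, and various other video generation models seamlessly at inference time, and works by iteratively refining the latent initialization noise. More details can be found in the paper.

The following example demonstrates the usage of FreeInit.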

import torch
from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16).to("cuda")
pipe.scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    beta_schedule="linear",
    clip_sample=False,
    timestep_spacing="linspace",
    steps_offset=1
)

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()

# enable FreeInit
# Refer to the enable_free_init documentation for a full list of configurable parameters
pipe.enable_free_init(method="butterworth", use_fast_sampling=True)

# run inference
output = pipe(
    prompt="a panda playing a guitar, on a boat, in the ocean, high quality",
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=20,
    generator=torch.Generator("cpu").manual_seed(666),
)

# disable FreeInit
pipe.disable_free_init()

frames = output.frames[0]
export_to_gif(frames, "animation.gif")

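FreeInit is not really free: the improved quality comes at the cost of extra computation. It requires sampling a few extra times, depending on the num_iters parameter that is set when enabling it. Setting the use_fast_sampling parameter to True can improve overall performance (at the cost of lower quality compared to use_fast_sampling=False, but still with better results than vanilla video generation models).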
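
The cost described above can be estimated with simple arithmetic. The sketch below assumes num_iters=3 and, for fast sampling, a coarse-to-fine plan in which each iteration runs a proportional share of the steps and only the last iteration runs all of them. Both the value and the plan are assumptions made for illustration, and free_init_step_plan is a hypothetical helper rather than a Diffusers function.

```python
def free_init_step_plan(
    num_inference_steps: int,
    num_iters: int,
    use_fast_sampling: bool,
) -> list[int]:
    """Return the number of denoising steps run in each FreeInit iteration."""
    if use_fast_sampling:
        # Assumed coarse-to-fine plan: iteration i runs (i + 1) / num_iters
        # of the requested steps, with at least one step per iteration.
        return [
            max(1, num_inference_steps * (i + 1) // num_iters)
            for i in range(num_iters)
        ]
    # Without fast sampling, every iteration runs the full schedule.
    return [num_inference_steps] * num_iters


full = free_init_step_plan(20, num_iters=3, use_fast_sampling=False)
fast = free_init_step_plan(20, num_iters=3, use_fast_sampling=True)
print(f"full sampling: {full} -> {sum(full)} steps in total")
print(f"fast sampling: {fast} -> {sum(fast)} steps in total")
```

Compared with a single 20-step run, both plans cost more, which matches the trade-off stated above.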

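Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the section on reusing components across pipelines to learn how to efficiently load the same components into multiple pipelines.

Without FreeInit enabled	With FreeInit enabled
panda playing a guitar	panda playing a guitar

Using AnimateLCM

AnimateLCM is a motion module checkpoint and an LCM LoRA that have been created using a consistency learning strategy that decouples the distillation of the image generation priors and the motion generation priors.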

import torch
from diffusers import AnimateDiffPipeline, LCMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM")
pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")

pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm-lora")

pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

output = pipe(
    prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution",
    negative_prompt="bad quality, worse quality, low resolution",
    num_frames=16,
    guidance_scale=1.5,
    num_inference_steps=6,
    generator=torch.Generator("cpu").manual_seed(0),
)
frames = output.frames[0]
export_to_gif(frames, "animatelcm.gif")

A space rocket, 4K

AnimateLCM is also compatible with existing Motion LoRAs.

import torch
from diffusers import AnimateDiffPipeline, LCMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM")
pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")

pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm-lora")
pipe.load_lora_weights("guoyww/animatediff-motion-lora-tilt-up", adapter_name="tilt-up")

pipe.set_adapters(["lcm-lora", "tilt-up"], [1.0, 0.8])
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

output = pipe(
    prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution",
    negative_prompt="bad quality, worse quality, low resolution",
    num_frames=16,
    guidance_scale=1.5,
    num_inference_steps=6,
    generator=torch.Generator("cpu").manual_seed(0),
)
frames = output.frames[0]
export_to_gif(frames, "animatelcm-motion-lora.gif")

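A space rocket, 4K

Using FreeNoise

FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling, by Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu.

FreeNoise is a sampling mechanism that can generate longer videos with short-video generation models by employing noise rescheduling, temporal attention over sliding windows, and weighted averaging of latent frames. It can also be used with multiple prompts to allow for interpolated video generation. More details are available in the paper.

FreeNoise is currently supported by a subset of the AnimateDiff pipelines, including AnimateDiffPipeline, which is used in the full example in this section.

To use FreeNoise, add a single line to the inference code after loading your pipeline.

+ pipe.enable_free_noise()

After this, you can either use a single prompt or pass multiple prompts as a dictionary of integer-string pairs. The integer keys of the dictionary correspond to the frame index at which the influence of that prompt is at its maximum. Each frame index should map to a single string prompt. The prompts for intermediate frame indices that are not passed in the dictionary are created by interpolating between the frame prompts that are passed. By default, simple linear interpolation is used. However, you can customize this behaviour by passing a callback to the `prompt_interpolation_callback` parameter when enabling FreeNoise.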
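
The default linear interpolation between keyed prompts described above can be sketched as follows. This is a minimal plain-Python illustration and not the library's implementation: interpolate_prompt_embeddings is a hypothetical helper, short lists of floats stand in for real prompt embeddings, and holding the last prompt for any trailing frames is an assumption.

```python
def interpolate_prompt_embeddings(
    keyframes: dict[int, list[float]],
    num_frames: int,
) -> list[list[float]]:
    """Linearly interpolate per-frame embeddings between keyed frame indices."""
    if 0 not in keyframes:
        raise ValueError("a prompt for frame 0 is required in this sketch")
    indices = sorted(keyframes)
    if indices[-1] >= num_frames:
        raise ValueError("every key must be smaller than num_frames")

    per_frame: list[list[float]] = []
    for frame in range(num_frames):
        # Nearest keyed frame at or before this frame.
        left = max(i for i in indices if i <= frame)
        # Nearest keyed frame at or after this frame, if there is one.
        later = [i for i in indices if i >= frame]
        right = min(later) if later else left
        if left == right:
            # On a keyed frame, or past the last key: use that prompt as-is.
            per_frame.append(list(keyframes[left]))
            continue
        t = (frame - left) / (right - left)
        per_frame.append(
            [(1 - t) * a + t * b for a, b in zip(keyframes[left], keyframes[right])]
        )
    return per_frame


embeddings = interpolate_prompt_embeddings({0: [0.0, 1.0], 4: [1.0, 0.0]}, num_frames=6)
for index, embedding in enumerate(embeddings):
    print(index, embedding)
```

A custom interpolation callback would replace the linear blend in the loop with another weighting, for example an ease-in curve.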

Full example:

import torch
from diffusers import AutoencoderKL, AnimateDiffPipeline, LCMScheduler, MotionAdapter
from diffusers.utils import export_to_video, load_image

# Load pipeline
dtype = torch.float16
motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM", torch_dtype=dtype)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=dtype)

pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=motion_adapter, vae=vae, torch_dtype=dtype)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")

pipe.load_lora_weights(
    "wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm_lora"
)
pipe.set_adapters(["lcm_lora"], [0.8])

# Enable FreeNoise for long prompt generation
pipe.enable_free_noise(context_length=16, context_stride=4)
pipe.to("cuda")

# Can be a single prompt, or a dictionary with frame timesteps
prompt = {
    0: "A caterpillar on a leaf, high quality, photorealistic",
    40: "A caterpillar transforming into a cocoon, on a leaf, near flowers, photorealistic",
    80: "A cocoon on a leaf, flowers in the background, photorealistic",
    120: "A cocoon maturing and a butterfly being born, flowers and leaves visible in the background, photorealistic",
    160: "A beautiful butterfly, vibrant colors, sitting on a leaf, flowers in the background, photorealistic",
    200: "A beautiful butterfly, flying away in a forest, photorealistic",
    240: "A cyberpunk butterfly, neon lights, glowing",
}
negative_prompt = "bad quality, worst quality, jpeg artifacts"

# Run inference
output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_frames=256,
    guidance_scale=2.5,
    num_inference_steps=10,
    generator=torch.Generator("cpu").manual_seed(0),
)

# Save video
frames = output.frames[0]
export_to_video(frames, "output.mp4", fps=16)

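FreeNoise memory savings

Since FreeNoise processes multiple frames together, there are parts of the model where the memory required exceeds what is available on typical consumer GPUs. The main memory bottlenecks that we identified are the spatial and temporal attention blocks, the upsampling and downsampling blocks, the ResNet blocks, and the feed-forward layers. Since most of these blocks operate effectively only on the channel/embedding dimension, chunked inference can be performed across the batch dimension. The batch dimension in AnimateDiff is either spatial ([B x F, H x W, C]) or temporal ([B x H x W, F, C]) in nature. This may seem counter-intuitive, but the batch dimensions here are correct, because spatial blocks process across the B x F dimension while temporal blocks process across the B x H x W dimension. We introduce a SplitInferenceModule that makes it easier to chunk across any dimension and perform inference. This saves a lot of memory but comes at the cost of longer inference time.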

# Load pipeline and adapters
# ...
+ pipe.enable_free_noise_split_inference()
+ pipe.unet.enable_forward_chunking(16)

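The pipe.enable_free_noise_split_inference method accepts two parameters: spatial_split_size (defaults to 256) and temporal_split_size (defaults to 16). These can be configured based on how much VRAM you have available. A smaller split size results in lower memory usage but slower inference, whereas a larger split size results in faster inference at the cost of more memory.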
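
The chunking idea behind split inference can be sketched in plain Python. The example below is not the SplitInferenceModule itself: split_apply, num_chunks and scale_rows are hypothetical names, nested lists stand in for tensors, and the small B, F, H, W values are chosen only to keep the toy example short. It shows that a function acting independently on each batch row gives the same result whether it sees the whole batch or one chunk at a time.

```python
import math
from typing import Callable

Rows = list[list[float]]


def split_apply(fn: Callable[[Rows], Rows], batch: Rows, split_size: int) -> Rows:
    """Apply fn to consecutive chunks of the batch and concatenate the results."""
    outputs: Rows = []
    for start in range(0, len(batch), split_size):
        chunk = batch[start:start + split_size]  # at most split_size rows in memory
        outputs.extend(fn(chunk))
    return outputs


def num_chunks(batch_size: int, split_size: int) -> int:
    """Number of forward passes needed for a given split size."""
    return math.ceil(batch_size / split_size)


def scale_rows(rows: Rows) -> Rows:
    """Stand-in for a block that only mixes values within each row."""
    return [[2.0 * value for value in row] for row in rows]


B, F, H, W = 1, 32, 8, 8      # toy sizes, in latent resolution
spatial_batch = B * F          # spatial blocks see [B x F, H x W, C]
temporal_batch = B * H * W     # temporal blocks see [B x H x W, F, C]

rows = [[float(i)] for i in range(temporal_batch)]
chunked = split_apply(scale_rows, rows, split_size=16)
unchunked = scale_rows(rows)

spatial_chunks = num_chunks(spatial_batch, 256)
temporal_chunks = num_chunks(temporal_batch, 16)
print(f"spatial chunks: {spatial_chunks}, temporal chunks: {temporal_chunks}")
print("chunked result matches full-batch result:", chunked == unchunked)
```

Each chunk is a separate forward pass, so a smaller split size lowers peak memory but increases the number of passes.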

Using from_single_file with the MotionAdapter

diffusers>=0.30.0 supports loading AnimateDiff checkpoints in their original format into the MotionAdapter via from_single_file.

import torch
from diffusers import AnimateDiffPipeline, MotionAdapter

ckpt_path = "https://huggingface.co/Lightricks/LongAnimateDiff/blob/main/lt_long_mm_32_frames.ckpt"

adapter = MotionAdapter.from_single_file(ckpt_path, torch_dtype=torch.float16)
pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter)

AnimateDiffPipeline

class diffusers.AnimateDiffPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: typing.Union[diffusers.models.unets.unet_2d_condition.UNet2DConditionModel, diffusers.models.unets.unet_motion_model.UNetMotionModel] motion_adapter: MotionAdapter scheduler: typing.Union[diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_pndm.PNDMScheduler, diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler, diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler, diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler] feature_extractor: CLIPImageProcessor = None image_encoder: CLIPVisionModelWithProjection = None )

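Parameters

vae (AutoencoderKL): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel): Frozen text encoder (clip-vit-large-patch14).
tokenizer (CLIPTokenizer): A CLIPTokenizer to tokenize text.
unet (UNet2DConditionModel): A UNet2DConditionModel used to create a UNetMotionModel to denoise the encoded video latents.
motion_adapter (MotionAdapter): A MotionAdapter to be used in combination with unet to denoise the encoded video latents.
scheduler (SchedulerMixin): A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.

Pipeline for text-to-video generation.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

load_textual_inversion() for loading textual inversion embeddings
load_lora_weights() for loading LoRA weights
save_lora_weights() for saving LoRA weights
load_ip_adapter() for loading IP Adapters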

__call__

< source >

( prompt: typing.Union[str, typing.List[str], NoneType] = None num_frames: typing.Optional[int] = 16 height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_videos_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] decode_chunk_size: int = 16 **kwargs ) → AnimateDiffPipelineOutput or tuple

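Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated video.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated video.
num_frames (int, optional, defaults to 16) — The number of video frames to generate. The default of 16 frames amounts to 2 seconds of video at 8 frames per second.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality video at the expense of slower inference.
guidance_scale (float, optional, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what to exclude from image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
eta (float, optional, defaults to 0.0) — Corresponds to the parameter eta (η) from the DDIM paper. Only applies to DDIMScheduler and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator. Latents should be of shape (batch_size, num_channel, num_frames, height, width).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list with the same length as the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between torch.Tensor, PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return an AnimateDiffPipelineOutput. Otherwise a tuple is returned where the first element is a list with the generated frames.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
decode_chunk_size (int, defaults to 16) — The number of frames to decode at a time when calling the decode_latents method.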
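
As a rough illustration of what decode_chunk_size controls, the sketch below splits a frame sequence into chunks and decodes each chunk separately so that peak memory stays bounded. The names decode_in_chunks and fake_decode are hypothetical stand-ins, not Diffusers functions; the real pipeline decodes latent tensors with its VAE.

```python
from typing import Callable, List, Sequence


def decode_in_chunks(
    latent_frames: Sequence[float],
    decode: Callable[[Sequence[float]], List[float]],
    decode_chunk_size: int = 16,
) -> List[float]:
    """Decode frames a few at a time (illustrative sketch only)."""
    if decode_chunk_size < 1:
        raise ValueError("decode_chunk_size must be at least 1")
    decoded: List[float] = []
    for start in range(0, len(latent_frames), decode_chunk_size):
        # Only this slice is handed to the decoder at once.
        chunk = latent_frames[start : start + decode_chunk_size]
        decoded.extend(decode(chunk))
    return decoded


def fake_decode(chunk: Sequence[float]) -> List[float]:
    # Hypothetical stand-in for a VAE decode call.
    return [value * 2.0 for value in chunk]


frames = decode_in_chunks([float(i) for i in range(20)], fake_decode, decode_chunk_size=8)
```

A smaller chunk size lowers peak memory in exchange for more decode calls; the output is the same either way.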

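AnimateDiffPipelineOutput or tuple

If return_dict is True, an AnimateDiffPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated frames.

The call function to the pipeline for generation.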
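
To make the role of guidance_scale concrete, here is a minimal sketch of the standard classifier-free guidance combination, written with plain Python lists instead of tensors. The function name apply_guidance is hypothetical, and the code is an illustration of the usual formula rather than an excerpt from the pipeline.

```python
from typing import List


def apply_guidance(
    noise_uncond: List[float],
    noise_text: List[float],
    guidance_scale: float,
) -> List[float]:
    """Combine unconditional and text-conditioned noise predictions."""
    # Move away from the unconditional prediction, toward the text-conditioned one.
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


guided = apply_guidance([0.0, 1.0], [1.0, 3.0], guidance_scale=7.5)
unguided = apply_guidance([0.0, 1.0], [1.0, 3.0], guidance_scale=1.0)
```

With guidance_scale equal to 1 the result is just the text-conditioned prediction, which is why guidance is described as enabled only when guidance_scale > 1.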

Example

>>> import torch
>>> from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
>>> from diffusers.utils import export_to_gif

>>> adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
>>> pipe = AnimateDiffPipeline.from_pretrained("frankjoshua/toonyou_beta6", motion_adapter=adapter)
>>> pipe.scheduler = DDIMScheduler(beta_schedule="linear", steps_offset=1, clip_sample=False)
>>> output = pipe(prompt="A corgi walking in the park")
>>> frames = output.frames[0]
>>> export_to_gif(frames, "animation.gif")
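
The example above relies on the default num_frames, height, and width. The sketch below works out the clip duration and the latent shape those defaults imply. It assumes values typical of Stable Diffusion 1.5 checkpoints (sample_size of 64, vae_scale_factor of 8, 4 latent channels); these numbers and both helper names are assumptions for illustration and should be checked against the loaded model's config.

```python
from typing import Tuple


def video_duration_seconds(num_frames: int, fps: int = 8) -> float:
    """Duration of a clip when played back at the given frame rate."""
    return num_frames / fps


def latent_shape(
    batch_size: int,
    num_frames: int,
    height: int,
    width: int,
    num_channels: int = 4,      # assumed latent channel count
    vae_scale_factor: int = 8,  # assumed VAE downsampling factor
) -> Tuple[int, int, int, int, int]:
    """Shape (batch_size, num_channel, num_frames, height, width) in latent space."""
    return (
        batch_size,
        num_channels,
        num_frames,
        height // vae_scale_factor,
        width // vae_scale_factor,
    )


default_side = 64 * 8  # assumed sample_size * vae_scale_factor
duration = video_duration_seconds(16)
shape = latent_shape(1, 16, default_side, default_side)
```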

encode_prompt

< source >

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

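Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.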
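
The effect of clip_skip can be pictured as choosing which hidden state of the text encoder feeds the prompt embeddings. The sketch below uses strings in place of tensors and assumes the encoder exposes one hidden state for the embeddings plus one per layer; select_hidden_state is a hypothetical helper, not a Diffusers function.

```python
from typing import Optional, Sequence


def select_hidden_state(hidden_states: Sequence[str], clip_skip: Optional[int] = None) -> str:
    """Pick the hidden state used to build prompt embeddings."""
    if clip_skip is None:
        return hidden_states[-1]  # final layer output
    # clip_skip=1 selects the pre-final layer, clip_skip=2 the one before it, and so on.
    return hidden_states[-(clip_skip + 1)]


# Assumed layout: embeddings followed by 12 transformer layers.
layers = ["embeddings"] + [f"layer_{i}" for i in range(1, 13)]
final = select_hidden_state(layers)
pre_final = select_hidden_state(layers, clip_skip=1)
```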

AnimateDiffControlNetPipeline

class diffusers.AnimateDiffControlNetPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: typing.Union[diffusers.models.unets.unet_2d_condition.UNet2DConditionModel, diffusers.models.unets.unet_motion_model.UNetMotionModel] motion_adapter: MotionAdapter controlnet: typing.Union[diffusers.models.controlnets.controlnet.ControlNetModel, typing.List[diffusers.models.controlnets.controlnet.ControlNetModel], typing.Tuple[diffusers.models.controlnets.controlnet.ControlNetModel], diffusers.models.controlnets.multicontrolnet.MultiControlNetModel] scheduler: KarrasDiffusionSchedulers feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] = None image_encoder: typing.Optional[transformers.models.clip.modeling_clip.CLIPVisionModelWithProjection] = None )

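Parameters

vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text encoder (clip-vit-large-patch14).
tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize text.
unet (UNet2DConditionModel) — A UNet2DConditionModel used to create a UNetMotionModel to denoise the encoded video latents.
motion_adapter (MotionAdapter) — A MotionAdapter to be used in combination with unet to denoise the encoded video latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.

Pipeline for text-to-video generation with ControlNet guidance.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

load_textual_inversion() for loading textual inversion embeddings
load_lora_weights() for loading LoRA weights
save_lora_weights() for saving LoRA weights
load_ip_adapter() for loading IP Adapters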
__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None num_frames: typing.Optional[int] = 16 height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_videos_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None conditioning_frames: typing.Optional[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None controlnet_conditioning_scale: typing.Union[float, typing.List[float]] = 1.0 guess_mode: bool = False control_guidance_start: typing.Union[float, typing.List[float]] = 0.0 control_guidance_end: typing.Union[float, typing.List[float]] = 1.0 clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] decode_chunk_size: int = 16 ) → AnimateDiffPipelineOutput or tuple

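Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated video.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated video.
num_frames (int, optional, defaults to 16) — The number of video frames to generate. The default of 16 frames amounts to 2 seconds of video at 8 frames per second.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality video at the expense of slower inference.
guidance_scale (float, optional, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what to exclude from image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
eta (float, optional, defaults to 0.0) — Corresponds to the parameter eta (η) from the DDIM paper. Only applies to DDIMScheduler and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator. Latents should be of shape (batch_size, num_channel, num_frames, height, width).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list with the same length as the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
conditioning_frames (List[PipelineImageInput], optional) — The ControlNet input condition that provides guidance to the unet for generation. If multiple ControlNets are specified, images must be passed as a list such that each element of the list can be correctly batched for input to a single ControlNet.
output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between torch.Tensor, PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return an AnimateDiffPipelineOutput instead of a plain tuple.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
controlnet_conditioning_scale (float or List[float], optional, defaults to 1.0) — The outputs of the ControlNet are multiplied by controlnet_conditioning_scale before they are added to the residual in the original unet. If multiple ControlNets are specified in init, you can set the corresponding scale as a list.
guess_mode (bool, optional, defaults to False) — The ControlNet encoder tries to recognize the content of the input image even if you remove all prompts. A guidance_scale value between 3.0 and 5.0 is recommended.
control_guidance_start (float or List[float], optional, defaults to 0.0) — The percentage of total steps at which the ControlNet starts applying.
control_guidance_end (float or List[float], optional, defaults to 1.0) — The percentage of total steps at which the ControlNet stops applying.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.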
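
The control_guidance_start and control_guidance_end fractions can be read as a per-step on/off schedule for the ControlNet. The sketch below shows one way such a schedule could be computed for a single ControlNet. It is modeled on the gating commonly used by ControlNet pipelines and is an assumption about this pipeline's behavior; controlnet_keep_schedule is a hypothetical name.

```python
from typing import List


def controlnet_keep_schedule(num_steps: int, start: float = 0.0, end: float = 1.0) -> List[float]:
    """Return 1.0 for steps where the ControlNet applies and 0.0 otherwise."""
    keep: List[float] = []
    for i in range(num_steps):
        # A step is skipped if it begins before `start` or finishes after `end`.
        outside = (i / num_steps) < start or ((i + 1) / num_steps) > end
        keep.append(0.0 if outside else 1.0)
    return keep


always_on = controlnet_keep_schedule(4)
middle_only = controlnet_keep_schedule(10, start=0.2, end=0.8)
```

In this reading, the per-step value would multiply controlnet_conditioning_scale, so a 0.0 entry disables ControlNet residuals for that step.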

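AnimateDiffPipelineOutput or tuple

If return_dict is True, an AnimateDiffPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated frames.

The call function to the pipeline for generation.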
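
The callback_on_step_end contract described in the parameters can be sketched without loading any model. Below, FakePipeline and run_fake_denoising are hypothetical stand-ins that only mimic the calling convention (pipeline, step, timestep, callback_kwargs); returning the callback_kwargs dictionary from the callback is an assumption based on how step-end callbacks are typically written for Diffusers pipelines.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

seen_steps: List[Tuple[int, int]] = []


class FakePipeline:
    """Hypothetical stand-in that mimics only the callback contract."""

    _callback_tensor_inputs = ["latents"]


def record_step(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    # Record progress, then hand the (possibly modified) tensors back.
    seen_steps.append((step, timestep))
    return callback_kwargs


def run_fake_denoising(
    callback: Callable[[Any, int, int, Dict[str, Any]], Dict[str, Any]],
    timesteps: Sequence[int],
    tensor_inputs: Sequence[str] = ("latents",),
) -> List[float]:
    pipe = FakePipeline()
    for name in tensor_inputs:
        if name not in pipe._callback_tensor_inputs:
            raise ValueError(f"{name} is not listed in _callback_tensor_inputs")
    state: Dict[str, Any] = {"latents": [0.0]}
    for step, timestep in enumerate(timesteps):
        # Placeholder for one denoising update.
        state["latents"] = [value + 1.0 for value in state["latents"]]
        callback_kwargs = {name: state[name] for name in tensor_inputs}
        outputs = callback(pipe, step, timestep, callback_kwargs)
        state["latents"] = outputs.pop("latents", state["latents"])
    return state["latents"]


final_latents = run_fake_denoising(record_step, timesteps=[999, 500, 1])
```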

Example

encode_prompt

< source >

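Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.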

AnimateDiffSparseControlNetPipeline

class diffusers.AnimateDiffSparseControlNetPipeline

< source >

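Parameters

vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text encoder (clip-vit-large-patch14).
tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize text.
unet (UNet2DConditionModel) — A UNet2DConditionModel used to create a UNetMotionModel to denoise the encoded video latents.
motion_adapter (MotionAdapter) — A MotionAdapter to be used in combination with unet to denoise the encoded video latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.

Pipeline for controlled text-to-video generation using the method described in SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

load_textual_inversion() for loading textual inversion embeddings
load_lora_weights() for loading LoRA weights
save_lora_weights() for saving LoRA weights
load_ip_adapter() for loading IP Adapters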

__call__

< source >

( prompt: typing.Union[str, typing.List[str], NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_frames: int = 16 num_inference_steps: int = 50 guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_videos_per_prompt: int = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None conditioning_frames: typing.Optional[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None output_type: str = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None controlnet_conditioning_scale: typing.Union[float, typing.List[float]] = 1.0 controlnet_frame_indices: typing.List[int] = [0] guess_mode: bool = False clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] ) → AnimateDiffPipelineOutput or tuple

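Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated video.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated video.
num_frames (int, optional, defaults to 16) — The number of video frames to generate. The default of 16 frames amounts to 2 seconds of video at 8 frames per second.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality video at the expense of slower inference.
guidance_scale (float, optional, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what to exclude from image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
eta (float, optional, defaults to 0.0) — Corresponds to the parameter eta (η) from the DDIM paper. Only applies to DDIMScheduler and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator. Latents should be of shape (batch_size, num_channel, num_frames, height, width).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list with the same length as the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
conditioning_frames (List[PipelineImageInput], optional) — The SparseControlNet input that provides guidance to the unet for generation.
output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between torch.Tensor, PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return an AnimateDiffPipelineOutput instead of a plain tuple.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
controlnet_conditioning_scale (float or List[float], optional, defaults to 1.0) — The outputs of the ControlNet are multiplied by controlnet_conditioning_scale before they are added to the residual in the original unet. If multiple ControlNets are specified in init, you can set the corresponding scale as a list.
controlnet_frame_indices (List[int]) — The indices of the frames at which the conditioning frames must be applied during generation. Multiple frames can be provided to guide the model toward structurally similar outputs, where the unet can "fill in the gaps" for interpolation videos, or a single frame can be provided to convey the general expected structure. Must have the same length as conditioning_frames.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.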

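AnimateDiffPipelineOutput or tuple

If return_dict is True, an AnimateDiffPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated frames.

The call function to the pipeline for generation.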
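
The relationship between conditioning_frames and controlnet_frame_indices can be pictured as placing a few given frames into an otherwise empty timeline, together with a mask marking which positions are conditioned. The sketch below uses strings in place of images; place_conditioning_frames is a hypothetical helper written for illustration, not the pipeline's internal code.

```python
from typing import List, Optional, Sequence, Tuple


def place_conditioning_frames(
    conditioning_frames: Sequence[str],
    controlnet_frame_indices: Sequence[int],
    num_frames: int = 16,
) -> Tuple[List[Optional[str]], List[int]]:
    """Return (timeline, mask) with each conditioning frame at its index."""
    if len(conditioning_frames) != len(controlnet_frame_indices):
        raise ValueError("conditioning_frames and controlnet_frame_indices must have the same length")
    timeline: List[Optional[str]] = [None] * num_frames
    mask = [0] * num_frames
    for frame, index in zip(conditioning_frames, controlnet_frame_indices):
        if not 0 <= index < num_frames:
            raise ValueError(f"frame index {index} is outside 0..{num_frames - 1}")
        timeline[index] = frame
        mask[index] = 1  # this position carries a sparse control signal
    return timeline, mask


timeline, mask = place_conditioning_frames(["scribble-1", "scribble-2", "scribble-3"], [0, 8, 15])
```

Positions left as None are the frames the model has to fill in on its own.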

Example

>>> import torch
>>> from diffusers import AnimateDiffSparseControlNetPipeline
>>> from diffusers.models import AutoencoderKL, MotionAdapter, SparseControlNetModel
>>> from diffusers.schedulers import DPMSolverMultistepScheduler
>>> from diffusers.utils import export_to_gif, load_image

>>> model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
>>> motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-3"
>>> controlnet_id = "guoyww/animatediff-sparsectrl-scribble"
>>> lora_adapter_id = "guoyww/animatediff-motion-lora-v1-5-3"
>>> vae_id = "stabilityai/sd-vae-ft-mse"
>>> device = "cuda"

>>> motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id, torch_dtype=torch.float16).to(device)
>>> controlnet = SparseControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16).to(device)
>>> vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16).to(device)
>>> scheduler = DPMSolverMultistepScheduler.from_pretrained(
...     model_id,
...     subfolder="scheduler",
...     beta_schedule="linear",
...     algorithm_type="dpmsolver++",
...     use_karras_sigmas=True,
... )
>>> pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
...     model_id,
...     motion_adapter=motion_adapter,
...     controlnet=controlnet,
...     vae=vae,
...     scheduler=scheduler,
...     torch_dtype=torch.float16,
... ).to(device)
>>> pipe.load_lora_weights(lora_adapter_id, adapter_name="motion_lora")
>>> pipe.fuse_lora(lora_scale=1.0)

>>> prompt = "an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality"
>>> negative_prompt = "low quality, worst quality, letterboxed"

>>> image_files = [
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-1.png",
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-2.png",
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-3.png",
... ]
>>> condition_frame_indices = [0, 8, 15]
>>> conditioning_frames = [load_image(img_file) for img_file in image_files]

>>> video = pipe(
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     num_inference_steps=25,
...     conditioning_frames=conditioning_frames,
...     controlnet_conditioning_scale=1.0,
...     controlnet_frame_indices=condition_frame_indices,
...     generator=torch.Generator().manual_seed(1337),
... ).frames[0]
>>> export_to_gif(video, "output.gif")

encode_prompt

< source >

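Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.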

AnimateDiffSDXLPipeline

class diffusers.AnimateDiffSDXLPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: typing.Union[diffusers.models.unets.unet_2d_condition.UNet2DConditionModel, diffusers.models.unets.unet_motion_model.UNetMotionModel] motion_adapter: MotionAdapter scheduler: typing.Union[diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_pndm.PNDMScheduler, diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler, diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler, diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler] image_encoder: CLIPVisionModelWithProjection = None feature_extractor: CLIPImageProcessor = None force_zeros_for_empty_prompt: bool = True )

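Parameters

vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text encoder. Stable Diffusion XL uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
text_encoder_2 (CLIPTextModelWithProjection) — Second frozen text encoder. Stable Diffusion XL uses the text and pool portion of CLIP, specifically the laion/CLIP-ViT-bigG-14-laion2B-39B-b160k variant.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
tokenizer_2 (CLIPTokenizer) — Second tokenizer of class CLIPTokenizer.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
force_zeros_for_empty_prompt (bool, optional, defaults to True) — Whether the negative prompt embeddings shall always be forced to 0. Also see the config of stabilityai/stable-diffusion-xl-base-1-0.

Pipeline for text-to-video generation using Stable Diffusion XL.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

load_textual_inversion() for loading textual inversion embeddings
from_single_file() for loading .ckpt files
load_lora_weights() for loading LoRA weights
save_lora_weights() for saving LoRA weights
load_ip_adapter() for loading IP Adapters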

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_frames: int = 16 height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_videos_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Optional[typing.Tuple[int, int]] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Optional[typing.Tuple[int, int]] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] ) → 
AnimateDiffPipelineOutput or tuple

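Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, one has to pass prompt_embeds instead.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text encoders.
num_frames (int, defaults to 16) — The number of video frames to generate. The default of 16 frames amounts to 2 seconds of video at 8 frames per second.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated video. This is set to 1024 by default for the best results. Anything below 512 pixels won't work well for stabilityai/stable-diffusion-xl-base-1.0 and checkpoints that are not specifically fine-tuned on low resolutions.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated video. This is set to 1024 by default for the best results. Anything below 512 pixels won't work well for stabilityai/stable-diffusion-xl-base-1.0 and checkpoints that are not specifically fine-tuned on low resolutions.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality video at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers that support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers that support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
denoising_end (float, optional) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally terminated early. As a result, the returned sample will still retain a substantial amount of noise, as determined by the discrete timesteps selected by the scheduler. The denoising_end parameter should ideally be used when this pipeline forms part of a "Mixture of Denoisers" multi-pipeline setup, as elaborated in Refining the Image Output.
guidance_scale (float, optional, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower video quality.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the video generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
eta (float, optional, defaults to 0.0) — Corresponds to the parameter eta (η) in the DDIM paper: https://huggingface.co/papers/2010.02502. Only applies to schedulers.DDIMScheduler and is ignored for other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative text embeddings will be generated from the negative_prompt input argument.
pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from the prompt input argument.
negative_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative text embeddings will be generated from the negative_prompt input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. If not provided, embeddings are computed from the ip_adapter_image input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between PIL (PIL.Image.Image) or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return an AnimateDiffPipelineOutput instead of a plain tuple.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
guidance_rescale (float, optional, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed. guidance_rescale is defined as φ in equation 16 of Common Diffusion Noise Schedules and Sample Steps are Flawed. The guidance rescale factor should fix overexposure when using zero terminal SNR.
original_size (Tuple[int], optional, defaults to (1024, 1024)) — If original_size is not the same as target_size, the image will appear to be down- or upsampled. original_size defaults to (height, width) if not specified. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — crops_coords_top_left can be used to generate an image that appears to be "cropped" from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
target_size (Tuple[int], optional, defaults to (1024, 1024)) — For most cases, target_size should be set to the desired height and width of the generated image. If not specified, it defaults to (height, width). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952.
negative_original_size (Tuple[int], optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a specific image resolution. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
negative_crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — To negatively condition the generation process based on specific crop coordinates. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
negative_target_size (Tuple[int], optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a target image resolution. It should be the same as target_size for most cases. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.ac.cn/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.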
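
The three micro-conditioning arguments (original_size, crops_coords_top_left, target_size) are small tuples that end up as one flat list of six numbers. The sketch below shows that packing, along with the documented fallback of original_size and target_size to (height, width). The helper name build_add_time_ids and the exact packing order are assumptions modeled on how SDXL pipelines commonly handle these values.

```python
from typing import List, Optional, Tuple


def build_add_time_ids(
    height: int,
    width: int,
    original_size: Optional[Tuple[int, int]] = None,
    crops_coords_top_left: Tuple[int, int] = (0, 0),
    target_size: Optional[Tuple[int, int]] = None,
) -> List[int]:
    """Flatten SDXL micro-conditioning values into a single list."""
    # Both sizes fall back to the requested output resolution.
    original_size = original_size or (height, width)
    target_size = target_size or (height, width)
    return list(original_size + crops_coords_top_left + target_size)


default_ids = build_add_time_ids(1024, 1024)
upscaled_ids = build_add_time_ids(1024, 1024, original_size=(512, 512))
```

Passing an original_size smaller than target_size, as in the second call, is the case the parameter description refers to when it says the image will appear upsampled.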

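AnimateDiffPipelineOutput or tuple

If return_dict is True, an AnimateDiffPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated frames.

Function invoked when calling the pipeline for generation.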
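
The guidance_rescale factor can be illustrated with a small numeric sketch: the guided prediction is rescaled so that its standard deviation matches that of the text-conditioned prediction, and the result is blended with the original guided prediction. This plain-Python version uses lists and a population standard deviation instead of per-sample tensor statistics; rescale_noise_cfg here is an illustrative reimplementation under those assumptions, not the library function.

```python
import statistics
from typing import List


def rescale_noise_cfg(
    noise_cfg: List[float],
    noise_pred_text: List[float],
    guidance_rescale: float = 0.0,
) -> List[float]:
    """Blend the guided prediction with a std-matched version of itself."""
    std_text = statistics.pstdev(noise_pred_text)
    std_cfg = statistics.pstdev(noise_cfg)
    if std_cfg == 0.0:
        raise ValueError("noise_cfg must not be constant")
    # Match the spread of the text-conditioned prediction.
    rescaled = [value * (std_text / std_cfg) for value in noise_cfg]
    return [
        guidance_rescale * r + (1.0 - guidance_rescale) * value
        for r, value in zip(rescaled, noise_cfg)
    ]


unchanged = rescale_noise_cfg([-4.0, 0.0, 4.0], [-1.0, 0.0, 1.0], guidance_rescale=0.0)
matched = rescale_noise_cfg([-4.0, 0.0, 4.0], [-1.0, 0.0, 1.0], guidance_rescale=1.0)
```

With the default of 0.0 the guided prediction passes through untouched, so the rescaling only takes effect when a positive value is set.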

Example

>>> import torch
>>> from diffusers.models import MotionAdapter
>>> from diffusers import AnimateDiffSDXLPipeline, DDIMScheduler
>>> from diffusers.utils import export_to_gif

>>> adapter = MotionAdapter.from_pretrained(
...     "a-r-r-o-w/animatediff-motion-adapter-sdxl-beta", torch_dtype=torch.float16
... )

>>> model_id = "stabilityai/stable-diffusion-xl-base-1.0"
>>> scheduler = DDIMScheduler.from_pretrained(
...     model_id,
...     subfolder="scheduler",
...     clip_sample=False,
...     timestep_spacing="linspace",
...     beta_schedule="linear",
...     steps_offset=1,
... )
>>> pipe = AnimateDiffSDXLPipeline.from_pretrained(
...     model_id,
...     motion_adapter=adapter,
...     scheduler=scheduler,
...     torch_dtype=torch.float16,
...     variant="fp16",
... ).to("cuda")

>>> # enable memory savings
>>> pipe.enable_vae_slicing()
>>> pipe.enable_vae_tiling()

>>> output = pipe(
...     prompt="a panda surfing in the ocean, realistic, high quality",
...     negative_prompt="low quality, worst quality",
...     num_inference_steps=20,
...     guidance_scale=8,
...     width=1024,
...     height=1024,
...     num_frames=16,
... )

>>> frames = output.frames[0]
>>> export_to_gif(frames, "animation.gif")

encode_prompt

< source >

( prompt: str prompt_2: typing.Optional[str] = None device: typing.Optional[torch.device] = None num_videos_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None negative_prompt_2: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

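Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text encoders.
device (torch.device) — The torch device.
num_videos_per_prompt (int) — The number of videos that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative text embeddings will be generated from the negative_prompt input argument.
pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from the prompt input argument.
negative_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative text embeddings will be generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.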

get_guidance_scale_embedding

< source >

( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

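Parameters

w (torch.Tensor) — The guidance scale values for which to generate embedding vectors. The embeddings are subsequently used to enrich the timestep embeddings.
embedding_dim (int, optional, defaults to 512) — Dimension of the embeddings to generate.
dtype (torch.dtype, optional, defaults to torch.float32) — Data type of the generated embeddings.

torch.Tensor

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298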
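
For readers who want to see the shape contract in action, the sketch below builds a sinusoidal embedding of guidance scale values in plain Python. It follows the usual sine/cosine scheme with log-spaced frequencies; the scaling of w by 1000 and the zero padding for odd dimensions are assumptions about the details, and guidance_scale_embedding is an illustrative stand-in that returns nested lists rather than a torch.Tensor.

```python
import math
from typing import List, Sequence


def guidance_scale_embedding(w: Sequence[float], embedding_dim: int = 512) -> List[List[float]]:
    """Return len(w) rows of embedding_dim sinusoidal features."""
    half_dim = embedding_dim // 2
    if half_dim < 2:
        raise ValueError("embedding_dim must be at least 4")
    # Log-spaced frequencies from 1 down to 1/10000.
    step = math.log(10000.0) / (half_dim - 1)
    freqs = [math.exp(-step * i) for i in range(half_dim)]
    rows: List[List[float]] = []
    for value in w:
        scaled = value * 1000.0  # assumed scaling before embedding
        row = [math.sin(scaled * f) for f in freqs] + [math.cos(scaled * f) for f in freqs]
        if embedding_dim % 2 == 1:
            row.append(0.0)  # pad odd dimensions with a zero
        rows.append(row)
    return rows


embedding = guidance_scale_embedding([0.0, 6.5], embedding_dim=8)
```

The result has one row per guidance value, which matches the documented (len(w), embedding_dim) shape.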

AnimateDiffVideoToVideoPipeline

class diffusers.AnimateDiffVideoToVideoPipeline

< source >

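Parameters

vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text encoder (clip-vit-large-patch14).
tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize text.
unet (UNet2DConditionModel) — A UNet2DConditionModel used to create a UNetMotionModel to denoise the encoded video latents.
motion_adapter (MotionAdapter) — A MotionAdapter to be used in combination with unet to denoise the encoded video latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.

Pipeline for video-to-video generation.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

load_textual_inversion() for loading textual inversion embeddings
load_lora_weights() for loading LoRA weights
save_lora_weights() for saving LoRA weights
load_ip_adapter() for loading IP Adapters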

__call__

< source >

( video: typing.List[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None prompt: typing.Union[str, typing.List[str], NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 enforce_inference_steps: bool = False timesteps: typing.Optional[typing.List[int]] = None sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 7.5 strength: float = 0.8 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_videos_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] decode_chunk_size: int = 16 ) → pipelines.animatediff.pipeline_output.AnimateDiffPipelineOutput or tuple

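Parameters

video (List[PipelineImageInput]) — The input video to condition the generation on. Must be a list of images/frames of the video.
prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated video.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated video.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers whose set_timesteps method supports a timesteps argument. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers whose set_timesteps method supports a sigmas argument. If not defined, the default behavior when num_inference_steps is passed will be used.
strength (float, optional, defaults to 0.8) — Higher strength leads to more differences between the original video and the generated video.
guidance_scale (float, optional, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
negative_prompt (str or List[str], optional) — The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler, and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator. Latents should be of shape (batch_size, num_channel, num_frames, height, width).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list with the same length as the number of IP Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between torch.Tensor, PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return an AnimateDiffPipelineOutput instead of a plain tuple.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
decode_chunk_size (int, defaults to 16) — The number of frames to decode at a time when calling the decode_latents method.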
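
The effect of strength on the amount of denoising can be sketched with plain arithmetic. The helper below is a hypothetical illustration, not a library function: it assumes this pipeline follows the convention common to Diffusers image-to-image style pipelines, where only the last int(num_inference_steps * strength) scheduler steps are run. It ignores enforce_inference_steps.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Return how many denoising steps run for a given strength (assumed convention)."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    # Number of steps kept from the end of the schedule.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    # Index of the first scheduler step that is actually executed.
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start


for strength in (0.25, 0.5, 1.0):
    print(strength, effective_steps(50, strength))
```

Under this assumption, a strength of 0.5 with 50 requested steps runs 25 steps, which is why lower strength values are faster and stay closer to the input video.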

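pipelines.animatediff.pipeline_output.AnimateDiffPipelineOutput or tuple

If return_dict is True, pipelines.animatediff.pipeline_output.AnimateDiffPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated frames.

The call function to the pipeline for generation.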
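
The callback contract for callback_on_step_end can be illustrated without loading a model. log_step and seen_steps are hypothetical names, and the loop is only a stand-in for the pipeline's denoising loop. The sketch assumes, as in other Diffusers pipelines, that the callback returns the callback_kwargs dictionary so the pipeline can read back any tensors it modified.

```python
from typing import Any, Dict, List

seen_steps: List[int] = []


def log_step(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Record the step index and hand the tensors back unchanged."""
    seen_steps.append(step)
    # With the default callback_on_step_end_tensor_inputs, only "latents" is present.
    assert "latents" in callback_kwargs
    return callback_kwargs


# Stand-in for what a pipeline would do at the end of each denoising step.
for step, timestep in enumerate([999, 666, 333]):
    returned = log_step(None, step, timestep, {"latents": [0.0]})
```

In real use the function would be passed as pipe(..., callback_on_step_end=log_step).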

Examples
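
A minimal sketch of a video-to-video call, modelled on the AnimateDiffPipeline example earlier on this page. The function name run_video_to_video, the negative prompt, and the specific values for strength, num_inference_steps, and the seed are illustrative assumptions. The caller is expected to supply frames as a list of PIL images, and enable_model_cpu_offload() assumes the accelerate package is installed.

```python
from typing import List


def run_video_to_video(frames: List, prompt: str, out_path: str = "animation.gif") -> str:
    """Restyle `frames` (a list of PIL images) and save the result as a GIF."""
    # Imported lazily so the module can be loaded without a GPU stack installed.
    import torch
    from diffusers import AnimateDiffVideoToVideoPipeline, DDIMScheduler, MotionAdapter
    from diffusers.utils import export_to_gif

    adapter = MotionAdapter.from_pretrained(
        "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
    )
    model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
    pipe = AnimateDiffVideoToVideoPipeline.from_pretrained(
        model_id, motion_adapter=adapter, torch_dtype=torch.float16
    )
    pipe.scheduler = DDIMScheduler.from_pretrained(
        model_id,
        subfolder="scheduler",
        clip_sample=False,
        timestep_spacing="linspace",
        beta_schedule="linear",
        steps_offset=1,
    )
    pipe.enable_vae_slicing()
    pipe.enable_model_cpu_offload()

    output = pipe(
        video=frames,
        prompt=prompt,
        negative_prompt="bad quality, worse quality",
        guidance_scale=7.5,
        num_inference_steps=25,
        strength=0.5,
        generator=torch.Generator("cpu").manual_seed(42),
    )
    # frames has one entry per generated video; take the first.
    export_to_gif(output.frames[0], out_path)
    return out_path
```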

encode_prompt

< source >

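Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.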
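
The meaning of clip_skip can be shown as a simple index into the list of per-layer outputs. select_hidden_state is a hypothetical helper, and the string labels stand in for real tensors. The sketch assumes the encoder returns one hidden state per layer, preceded by the embedding output.

```python
from typing import List, Optional, TypeVar

T = TypeVar("T")


def select_hidden_state(hidden_states: List[T], clip_skip: Optional[int] = None) -> T:
    """Pick the text-encoder layer output that clip_skip refers to."""
    if clip_skip is None:
        return hidden_states[-1]  # final layer output
    if not 0 <= clip_skip < len(hidden_states):
        raise ValueError("clip_skip is out of range for this encoder")
    return hidden_states[-(clip_skip + 1)]


# Labels stand in for the per-layer tensors of a 12-layer encoder.
layers = ["embeddings"] + [f"layer_{i}" for i in range(1, 13)]
print(select_hidden_state(layers))               # layer_12
print(select_hidden_state(layers, clip_skip=1))  # layer_11
```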

AnimateDiffVideoToVideoControlNetPipeline

class diffusers.AnimateDiffVideoToVideoControlNetPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: typing.Union[diffusers.models.unets.unet_2d_condition.UNet2DConditionModel, diffusers.models.unets.unet_motion_model.UNetMotionModel] motion_adapter: MotionAdapter controlnet: typing.Union[diffusers.models.controlnets.controlnet.ControlNetModel, typing.List[diffusers.models.controlnets.controlnet.ControlNetModel], typing.Tuple[diffusers.models.controlnets.controlnet.ControlNetModel], diffusers.models.controlnets.multicontrolnet.MultiControlNetModel] scheduler: typing.Union[diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_pndm.PNDMScheduler, diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler, diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler, diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler] feature_extractor: CLIPImageProcessor = None image_encoder: CLIPVisionModelWithProjection = None )

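Parameters

vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text encoder (clip-vit-large-patch14).
tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize text.
unet (UNet2DConditionModel) — A UNet2DConditionModel used to create a UNetMotionModel to denoise the encoded video latents.
motion_adapter (MotionAdapter) — A MotionAdapter to be used in combination with unet to denoise the encoded video latents.
controlnet (ControlNetModel or List[ControlNetModel] or Tuple[ControlNetModel] or MultiControlNetModel) — Provides additional conditioning to the unet during the denoising process. If you set multiple ControlNets as a list, the outputs from each ControlNet are added together to create one combined additional conditioning.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.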
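
How several ControlNets combine can be sketched with toy numbers. The function combine_controlnet_residuals and the flat float lists are hypothetical stand-ins for the real residual tensors. The sketch assumes each ControlNet output is multiplied by its own scale (the controlnet_conditioning_scale argument of __call__) before the outputs are summed.

```python
from typing import List, Sequence


def combine_controlnet_residuals(
    residuals_per_controlnet: Sequence[Sequence[float]],
    scales: Sequence[float],
) -> List[float]:
    """Scale each ControlNet's residual, then sum element-wise."""
    if len(residuals_per_controlnet) != len(scales):
        raise ValueError("need exactly one scale per ControlNet")
    combined = [0.0] * len(residuals_per_controlnet[0])
    for residual, scale in zip(residuals_per_controlnet, scales):
        for i, value in enumerate(residual):
            combined[i] += scale * value
    return combined


pose = [1.0, 2.0, 3.0]   # toy residual from a first ControlNet
depth = [0.5, 0.5, 0.5]  # toy residual from a second ControlNet
print(combine_controlnet_residuals([pose, depth], scales=[1.0, 2.0]))  # [2.0, 3.0, 4.0]
```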

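Pipeline for video-to-video generation with ControlNet guidance.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

load_textual_inversion() for loading textual inversion embeddings
load_lora_weights() for loading LoRA weights
save_lora_weights() for saving LoRA weights
load_ip_adapter() for loading IP Adapters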

__call__

< source >

( video: typing.List[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None prompt: typing.Union[str, typing.List[str], NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 enforce_inference_steps: bool = False timesteps: typing.Optional[typing.List[int]] = None sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 7.5 strength: float = 0.8 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_videos_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None conditioning_frames: typing.Optional[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None controlnet_conditioning_scale: typing.Union[float, typing.List[float]] = 1.0 guess_mode: bool = False control_guidance_start: typing.Union[float, typing.List[float]] = 0.0 control_guidance_end: typing.Union[float, typing.List[float]] = 1.0 clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] decode_chunk_size: int = 16 ) 
→ pipelines.animatediff.pipeline_output.AnimateDiffPipelineOutput or tuple

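Parameters

video (List[PipelineImageInput]) — The input video to condition the generation on. Must be a list of images/frames of the video.
prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated video.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated video.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers whose `set_timesteps` method supports a `timesteps` argument. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers whose `set_timesteps` method supports a `sigmas` argument. If not defined, the default behavior when `num_inference_steps` is passed will be used.
strength (float, optional, defaults to 0.8) — Higher strength leads to more differences between the original video and the generated video.
guidance_scale (float, optional, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
negative_prompt (str or List[str], optional) — The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler, and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random `generator`. Latents should be of shape `(batch_size, num_channel, num_frames, height, width)`.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list with the same length as the number of IP Adapters. Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should contain the negative image embedding if `do_classifier_free_guidance` is set to `True`. If not provided, embeddings are computed from the `ip_adapter_image` input argument.
conditioning_frames (List[PipelineImageInput], optional) — The ControlNet input condition to provide guidance to the `unet` for generation. If multiple ControlNets are specified, images must be passed as a list such that each element of the list can be correctly batched for input to a single ControlNet.
output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between torch.Tensor, PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return an AnimateDiffPipelineOutput instead of a plain tuple.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined in self.processor.
controlnet_conditioning_scale (float or List[float], optional, defaults to 1.0) — The outputs of the ControlNet are multiplied by `controlnet_conditioning_scale` before they are added to the residual in the original `unet`. If multiple ControlNets are specified in `init`, you can set the corresponding scale as a list.
guess_mode (bool, optional, defaults to False) — The ControlNet encoder tries to recognize the content of the input image even if you remove all prompts. A `guidance_scale` value between 3.0 and 5.0 is recommended.
control_guidance_start (float or List[float], optional, defaults to 0.0) — The percentage of total steps at which the ControlNet starts applying.
control_guidance_end (float or List[float], optional, defaults to 1.0) — The percentage of total steps at which the ControlNet stops applying.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as the `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
decode_chunk_size (int, defaults to 16) — The number of frames to decode at a time when calling the `decode_latents` method.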
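
The window defined by control_guidance_start and control_guidance_end can be sketched as a per-step weight. controlnet_keep is a hypothetical helper. It assumes the rule used by other Diffusers ControlNet pipelines, where a step is skipped if it begins before the start fraction or finishes after the end fraction.

```python
from typing import List


def controlnet_keep(num_steps: int, start: float, end: float, scale: float = 1.0) -> List[float]:
    """Per-step ControlNet weight: `scale` inside the [start, end] window, else 0."""
    weights = []
    for i in range(num_steps):
        outside = i / num_steps < start or (i + 1) / num_steps > end
        weights.append(0.0 if outside else scale)
    return weights


print(controlnet_keep(8, start=0.25, end=0.75))
# [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
```

With the defaults (0.0 and 1.0) every step keeps the full controlnet_conditioning_scale.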

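pipelines.animatediff.pipeline_output.AnimateDiffPipelineOutput or tuple

If return_dict is True, pipelines.animatediff.pipeline_output.AnimateDiffPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated frames.

The call function to the pipeline for generation.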
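
The role of decode_chunk_size can be sketched as plain slicing. iter_chunks is a hypothetical helper, and the integers stand in for latent frames. The sketch assumes decode_latents walks the frames in consecutive slices, so a smaller chunk size lowers peak memory at the cost of more decoder calls.

```python
from typing import Iterator, List, Sequence


def iter_chunks(frames: Sequence, decode_chunk_size: int = 16) -> Iterator[Sequence]:
    """Yield consecutive slices of at most decode_chunk_size frames."""
    if decode_chunk_size < 1:
        raise ValueError("decode_chunk_size must be positive")
    for start in range(0, len(frames), decode_chunk_size):
        yield frames[start : start + decode_chunk_size]


latent_frames = list(range(40))  # stand-ins for 40 latent frames
chunk_sizes: List[int] = [len(chunk) for chunk in iter_chunks(latent_frames)]
print(chunk_sizes)  # [16, 16, 8]
```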

Examples

encode_prompt

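< source >

Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, `negative_prompt_embeds` must be passed instead. Ignored when not using guidance (i.e., if `guidance_scale` is less than `1`).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input argument.
lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.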
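
The reason encode_prompt produces both positive and negative embeddings is that classifier-free guidance blends two noise predictions. The sketch below shows the standard blending rule on toy numbers. apply_guidance and the float lists are hypothetical stand-ins, and the exact tensor layout inside the pipeline is not modelled.

```python
from typing import List, Sequence


def apply_guidance(uncond: Sequence[float], text: Sequence[float], guidance_scale: float) -> List[float]:
    """Classifier-free guidance: move from the unconditional prediction toward the text one."""
    return [u + guidance_scale * (t - u) for u, t in zip(uncond, text)]


noise_uncond = [0.0, 1.0]  # toy prediction for the negative prompt
noise_text = [1.0, 3.0]    # toy prediction for the prompt
print(apply_guidance(noise_uncond, noise_text, guidance_scale=7.5))  # [7.5, 16.0]
print(apply_guidance(noise_uncond, noise_text, guidance_scale=1.0))  # [1.0, 3.0]
```

At a scale of 1.0 the result equals the text-conditioned prediction, so the negative prompt has no effect.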

AnimateDiffPipelineOutput

class diffusers.pipelines.animatediff.AnimateDiffPipelineOutput

< source >

( frames: typing.Union[torch.Tensor, numpy.ndarray, typing.List[typing.List[PIL.Image.Image]]] )

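Parameters

frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing the denoised PIL image sequence of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for AnimateDiff pipelines.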
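
The nesting of frames is easy to get wrong, so here is a small sketch of how the output is typically indexed. FakeOutput is a hypothetical stand-in for AnimateDiffPipelineOutput, and the strings stand in for PIL images.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeOutput:
    """Hypothetical stand-in for AnimateDiffPipelineOutput; only `frames` is modelled."""
    frames: List[List[str]]


# Two videos (batch_size = 2) of three frames each; strings stand in for PIL images.
output = FakeOutput(frames=[["a0", "a1", "a2"], ["b0", "b1", "b2"]])

first_video = output.frames[0]  # frames of the first generated video
batch_size = len(output.frames)
num_frames = len(first_video)

# With return_dict=False the pipeline returns a tuple whose first element is the frames.
as_tuple = (output.frames,)
print(batch_size, num_frames)  # 2 3
```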
