Wan2.1

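Wan-2.1 by the Wan Team.

This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features:

Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority.
Comprehensiveness: Wan offers two capable models, with 1.3B and 14B parameters, aimed at efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks.
Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB of VRAM, making it compatible with a wide range of consumer-grade GPUs.
Openness: We open-source the entire Wan series, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models.

All the code and models are available at this link.

You can find all the original Wan2.1 checkpoints under the Wan-AI organization.

The following Wan models are supported in Diffusers

Click on the Wan2.1 models in the right sidebar for more examples of video generation.

Text-to-Video Generation

The example below demonstrates how to generate a video from text, optimized for memory or inference speed.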
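
A minimal memory-oriented sketch, assuming the 1.3B checkpoint and the model CPU offloading call that also appears in the Notes section of this page. The names T2VSettings and generate_video are hypothetical and introduced only for illustration; the defaults mirror the __call__ signature of WanPipeline, and flow_shift=3.0 follows the 480P hint in the example comments. The heavy imports are deferred into the function so the settings can be inspected without a GPU stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class T2VSettings:
    """Hypothetical container for the generation settings used below."""

    model_id: str = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
    height: int = 480
    width: int = 832
    num_frames: int = 81
    guidance_scale: float = 5.0
    flow_shift: float = 3.0  # 3.0 for 480P, 5.0 for 720P
    fps: int = 16

    def duration_seconds(self) -> float:
        """Length of the exported clip in seconds."""
        return self.num_frames / self.fps


def generate_video(
    prompt: str,
    settings: T2VSettings = T2VSettings(),
    output_path: str = "output.mp4",
) -> str:
    """Generate a clip with model CPU offloading to reduce peak VRAM."""
    # Imported lazily so the dataclass above works without torch/diffusers.
    import torch
    from diffusers import AutoModel, WanPipeline
    from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
    from diffusers.utils import export_to_video

    # Keep the VAE in float32 for decoding quality; run the rest in bfloat16.
    vae = AutoModel.from_pretrained(
        settings.model_id, subfolder="vae", torch_dtype=torch.float32
    )
    pipeline = WanPipeline.from_pretrained(
        settings.model_id, vae=vae, torch_dtype=torch.bfloat16
    )
    pipeline.scheduler = UniPCMultistepScheduler.from_config(
        pipeline.scheduler.config, flow_shift=settings.flow_shift
    )
    # Moves each sub-model to the GPU only while it is needed.
    pipeline.enable_model_cpu_offload()

    frames = pipeline(
        prompt=prompt,
        height=settings.height,
        width=settings.width,
        num_frames=settings.num_frames,
        guidance_scale=settings.guidance_scale,
    ).frames[0]
    export_to_video(frames, output_path, fps=settings.fps)
    return output_path
```

Calling generate_video("a white ferret playing on a log") would write output.mp4; with the defaults above the clip lasts 81 / 16 seconds.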

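First-Last-Frame-to-Video Generation

The example below demonstrates how to use the image-to-video pipeline to generate a video from a text description, a starting frame, and an ending frame.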
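
A hedged sketch of this workflow, built on WanImageToVideoPipeline and the last_image argument listed in its __call__ signature. The checkpoint name, the helper names fit_resolution and generate_between_frames, and the default mod_value of 16 are assumptions made for illustration; the resolution arithmetic follows the image-to-video example further down this page.

```python
import math
from typing import Tuple


def fit_resolution(
    image_width: int, image_height: int, max_area: int, mod_value: int = 16
) -> Tuple[int, int]:
    """Return (width, height) close to max_area that keeps the aspect ratio
    and is a multiple of mod_value in both dimensions."""
    aspect_ratio = image_height / image_width
    height = round(math.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(math.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    return width, height


def generate_between_frames(
    first_frame_url: str,
    last_frame_url: str,
    prompt: str,
    # Assumed checkpoint name; verify it on the Hub before use.
    model_id: str = "Wan-AI/Wan2.1-FLF2V-14B-720P-diffusers",
    output_path: str = "output.mp4",
) -> str:
    # Imported lazily so fit_resolution can be used without a GPU stack.
    import torch
    from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
    from diffusers.utils import export_to_video, load_image
    from transformers import CLIPVisionModel

    image_encoder = CLIPVisionModel.from_pretrained(
        model_id, subfolder="image_encoder", torch_dtype=torch.float32
    )
    vae = AutoencoderKLWan.from_pretrained(
        model_id, subfolder="vae", torch_dtype=torch.float32
    )
    pipe = WanImageToVideoPipeline.from_pretrained(
        model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")

    first_frame = load_image(first_frame_url)
    last_frame = load_image(last_frame_url)
    mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
    width, height = fit_resolution(
        first_frame.width, first_frame.height, 720 * 1280, mod_value
    )
    # Both frames are resized to the same size; this may distort the last
    # frame if its aspect ratio differs from the first.
    first_frame = first_frame.resize((width, height))
    last_frame = last_frame.resize((width, height))

    frames = pipe(
        image=first_frame,
        last_image=last_frame,
        prompt=prompt,
        height=height,
        width=width,
        guidance_scale=5.0,
    ).frames[0]
    export_to_video(frames, output_path, fps=16)
    return output_path
```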

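Any-to-Video Controllable Generation

Wan VACE supports various generation techniques that enable controllable video generation. Some of its capabilities include:

Control to Video (Depth, Pose, Sketch, Flow, Grayscale, Scribble, Layout, Bounding Box, etc.). Recommended library for preprocessing videos to obtain control videos: huggingface/controlnet_aux
Image/Video to Video (first frame, last frame, starting clip, ending clip, random clips)
Inpainting and Outpainting
Subject to Video (faces, objects, characters, etc.)
Composition to Video (reference anything, animate anything, swap anything, expand anything, move anything, etc.)

The code snippets available in this pull request demonstrate some examples of how videos can be generated with controllability signals.

The general rule of thumb when preparing inputs for the VACE pipeline is that the input images, or the frames of a video that you want to use for conditioning, should have a corresponding black mask. A black mask signifies that the model will not generate new content for that area and will only use those parts to condition the generation process. For parts or frames that should be generated by the model, the mask should be white.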
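
The masking rule can be sketched as a per-frame decision: frames supplied as conditioning get a black mask (0) and frames the model should generate get a white mask (255). The helper names mask_values and build_masks are hypothetical; the sketch covers whole-frame masks only, not region-level masks inside a frame.

```python
from typing import Iterable, List


def mask_values(num_frames: int, conditioning_frames: Iterable[int]) -> List[int]:
    """Return one grayscale fill value per frame.

    0 (black) marks a frame used only for conditioning,
    255 (white) marks a frame the model should generate.
    """
    keep = set(conditioning_frames)
    for index in keep:
        if not 0 <= index < num_frames:
            raise ValueError(f"frame index {index} is outside 0..{num_frames - 1}")
    return [0 if index in keep else 255 for index in range(num_frames)]


def build_masks(width: int, height: int, values: List[int]) -> list:
    """Turn per-frame fill values into full-frame PIL masks (mode "L")."""
    # Imported lazily so mask_values can be used without Pillow installed.
    import PIL.Image

    return [PIL.Image.new("L", (width, height), value) for value in values]
```

For first-and-last-frame conditioning over 81 frames, mask_values(81, [0, 80]) yields black masks at both ends and white masks in between, which matches the layout used by prepare_video_and_mask in the WanVACEPipeline example.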

Notes

Wan2.1 supports LoRAs with load_lora_weights(). The example below loads one:

# pip install ftfy
import torch
from diffusers import AutoModel, WanPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video

vae = AutoModel.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="vae", torch_dtype=torch.float32
)
pipeline = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", vae=vae, torch_dtype=torch.bfloat16
)
pipeline.scheduler = UniPCMultistepScheduler.from_config(
    pipeline.scheduler.config, flow_shift=5.0
)
pipeline.to("cuda")

pipeline.load_lora_weights("benjamin-paine/steamboat-willie-1.3b", adapter_name="steamboat-willie")
pipeline.set_adapters("steamboat-willie")

pipeline.enable_model_cpu_offload()

# use "steamboat willie style" to trigger the LoRA
prompt = """
steamboat willie style, golden era animation, The camera rushes from far to near in a low-angle shot, 
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in 
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. 
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic 
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""

output = pipeline(
    prompt=prompt,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)

WanTransformer3DModel and AutoencoderKLWan support loading from single files with from_single_file(). The example below loads both:

# pip install ftfy
import torch
from diffusers import WanPipeline, AutoModel

vae = AutoModel.from_single_file(
    "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/blob/main/split_files/vae/wan_2.1_vae.safetensors"
)
transformer = AutoModel.from_single_file(
    "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/blob/main/split_files/diffusion_models/wan2.1_t2v_1.3B_bf16.safetensors",
    torch_dtype=torch.bfloat16
)
pipeline = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    vae=vae,
    transformer=transformer,
    torch_dtype=torch.bfloat16
)

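Set the AutoencoderKLWan dtype to torch.float32 for better decoding quality.
The number of frames (num_frames) should have the form 4 * k + 1 for an integer k, for example 81 = 4 * 20 + 1.
For lower-resolution videos, try lower shift values (2.0 to 5.0); for higher-resolution videos, try higher shift values (7.0 to 12.0).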
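
The last two notes can be sketched as small helpers. This reads the 4 * k + 1 rule as a constraint on num_frames (an interpretation, consistent with the default of 81), and takes the shift values from the "5.0 for 720P, 3.0 for 480P" comment in the examples on this page. All three function names are hypothetical, and the rounding choice in snap_num_frames is an assumption rather than the pipeline's documented behavior.

```python
def is_valid_num_frames(num_frames: int, temporal_factor: int = 4) -> bool:
    """True when num_frames can be written as temporal_factor * k + 1."""
    return num_frames >= 1 and (num_frames - 1) % temporal_factor == 0


def snap_num_frames(num_frames: int, temporal_factor: int = 4) -> int:
    """Snap to the form temporal_factor * k + 1 by rounding down to a
    multiple of temporal_factor and adding one."""
    return max(num_frames // temporal_factor * temporal_factor + 1, 1)


def suggest_flow_shift(height: int) -> float:
    """Starting point for flow_shift: 3.0 around 480P, 5.0 from 720P upward."""
    return 5.0 if height >= 720 else 3.0
```

These are starting points only; the ranges in the note above leave room for tuning the shift per checkpoint and resolution.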

WanPipeline

class diffusers.WanPipeline

( tokenizer: AutoTokenizer text_encoder: UMT5EncoderModel transformer: WanTransformer3DModel vae: AutoencoderKLWan scheduler: FlowMatchEulerDiscreteScheduler )

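Parameters

tokenizer (T5Tokenizer) — Tokenizer from T5, specifically the google/umt5-xxl variant.
text_encoder (T5EncoderModel) — T5, specifically the google/umt5-xxl variant.
transformer (WanTransformer3DModel) — Conditional transformer used to denoise the input latents.
scheduler (UniPCMultistepScheduler) — A scheduler used in combination with transformer to denoise the encoded image latents.
vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model used to encode and decode videos to and from latent representations.

Pipeline for text-to-video generation using Wan.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).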

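__call__

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None height: int = 480 width: int = 832 num_frames: int = 81 num_inference_steps: int = 50 guidance_scale: float = 5.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'np' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~WanPipelineOutput or tuple

Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, pass prompt_embeds instead.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what to avoid in the generated video. If not defined, pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
height (int, defaults to 480) — The height in pixels of the generated video.
width (int, defaults to 832) — The width in pixels of the generated video.
num_frames (int, defaults to 81) — The number of frames in the generated video.
num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to higher quality output at the expense of slower inference.
guidance_scale (float, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages output that is closely linked to the text prompt, usually at the expense of lower visual quality.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator used to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
output_type (str, optional, defaults to "np") — The output format of the generated video. Choose between PIL.Image and np.array.
return_dict (bool, optional, defaults to True) — Whether to return a WanPipelineOutput instead of a plain tuple.
attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
max_sequence_length (int, defaults to 512) — The maximum sequence length of the text encoder. If the prompt is longer than this, it will be truncated. If the prompt is shorter, it will be padded to this length.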
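
The guidance_scale description can be made concrete with the usual classifier-free guidance combination, which is algebraically equivalent to equation 2 of the Imagen paper: prediction = unconditional + w * (conditional - unconditional). The sketch below uses plain Python lists in place of tensors. apply_guidance is a hypothetical helper, and that the pipeline combines its two predictions in exactly this way is an assumption.

```python
from typing import List, Sequence


def apply_guidance(
    noise_uncond: Sequence[float],
    noise_cond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Combine unconditional and conditional predictions element-wise.

    guidance_scale = 1 returns the conditional prediction unchanged;
    larger values push the result further along (cond - uncond).
    """
    if len(noise_uncond) != len(noise_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]
```

With guidance_scale = 1 the negative prompt has no effect on the result, which is one way to read why the negative prompt is ignored when guidance is not in use.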

Returns: ~WanPipelineOutput or tuple

If return_dict is True, a WanPipelineOutput is returned; otherwise a tuple is returned whose first element is the generated video frames.

The call function to the pipeline for generation.

Examples

>>> import torch
>>> from diffusers.utils import export_to_video
>>> from diffusers import AutoencoderKLWan, WanPipeline
>>> from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

>>> # Available models: Wan-AI/Wan2.1-T2V-14B-Diffusers, Wan-AI/Wan2.1-T2V-1.3B-Diffusers
>>> model_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"
>>> vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
>>> pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
>>> flow_shift = 5.0  # 5.0 for 720P, 3.0 for 480P
>>> pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=flow_shift)
>>> pipe.to("cuda")

>>> prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
>>> negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

>>> output = pipe(
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     height=720,
...     width=1280,
...     num_frames=81,
...     guidance_scale=5.0,
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=16)

encode_prompt

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 226 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )

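Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what to avoid in the generated video. If not defined, pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos that should be generated per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
dtype (torch.dtype, optional) — The torch dtype.

Encodes the prompt into text encoder hidden states.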

WanImageToVideoPipeline

class diffusers.WanImageToVideoPipeline

( tokenizer: AutoTokenizer text_encoder: UMT5EncoderModel image_encoder: CLIPVisionModel image_processor: CLIPImageProcessor transformer: WanTransformer3DModel vae: AutoencoderKLWan scheduler: FlowMatchEulerDiscreteScheduler )

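Parameters

tokenizer (T5Tokenizer) — Tokenizer from T5, specifically the google/umt5-xxl variant.
text_encoder (T5EncoderModel) — T5, specifically the google/umt5-xxl variant.
image_encoder (CLIPVisionModel) — CLIP, specifically the clip-vit-huge-patch14 variant.
transformer (WanTransformer3DModel) — Conditional transformer used to denoise the input latents.
scheduler (UniPCMultistepScheduler) — A scheduler used in combination with transformer to denoise the encoded image latents.
vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model used to encode and decode videos to and from latent representations.

Pipeline for image-to-video generation using Wan.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).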

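__call__

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None height: int = 480 width: int = 832 num_frames: int = 81 num_inference_steps: int = 50 guidance_scale: float = 5.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None image_embeds: typing.Optional[torch.Tensor] = None last_image: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'np' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~WanPipelineOutput or tuple

Parameters

image (PipelineImageInput) — The input image to condition the generation on. Must be an image, a list of images, or a torch.Tensor.
prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, pass prompt_embeds instead.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what to avoid in the generated video. If not defined, pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
height (int, defaults to 480) — The height of the generated video.
width (int, defaults to 832) — The width of the generated video.
num_frames (int, defaults to 81) — The number of frames in the generated video.
num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to higher quality output at the expense of slower inference.
guidance_scale (float, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages output that is closely linked to the text prompt, usually at the expense of lower visual quality.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator used to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, they are generated from the negative_prompt input argument.
image_embeds (torch.Tensor, optional) — Pre-generated image embeddings. Can be used to easily tweak image inputs (weighting). If not provided, image embeddings are generated from the image input argument.
output_type (str, optional, defaults to "np") — The output format of the generated video. Choose between PIL.Image and np.array.
return_dict (bool, optional, defaults to True) — Whether to return a WanPipelineOutput instead of a plain tuple.
attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
max_sequence_length (int, defaults to 512) — The maximum sequence length of the text encoder. If the prompt is longer than this, it will be truncated. If the prompt is shorter, it will be padded to this length.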
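
A minimal sketch of a step-end callback matching the callback_on_step_end signature above. StepLogger is a hypothetical class introduced for illustration. It records which tensors were passed at each step and returns callback_kwargs unchanged, on the assumption (true of Diffusers pipelines in general) that the pipeline reads any updated tensors back from the returned dictionary.

```python
from typing import Any, Dict, List, Tuple


class StepLogger:
    """Record (step, timestep, tensor names) at the end of each denoising step."""

    def __init__(self) -> None:
        self.records: List[Tuple[int, float, List[str]]] = []

    def __call__(
        self,
        pipeline: Any,
        step: int,
        timestep: Any,
        callback_kwargs: Dict[str, Any],
    ) -> Dict[str, Any]:
        # timestep may arrive as a scalar tensor; float() covers that case
        # as well as plain Python numbers.
        self.records.append((step, float(timestep), sorted(callback_kwargs)))
        # Hand the tensors back unmodified.
        return callback_kwargs
```

It would be passed as pipe(..., callback_on_step_end=logger, callback_on_step_end_tensor_inputs=["latents"]), after which logger.records holds one entry per denoising step.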

Returns: ~WanPipelineOutput or tuple

The call function to the pipeline for generation.

Examples

>>> import torch
>>> import numpy as np
>>> from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
>>> from diffusers.utils import export_to_video, load_image
>>> from transformers import CLIPVisionModel

>>> # Available models: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
>>> model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"
>>> image_encoder = CLIPVisionModel.from_pretrained(
...     model_id, subfolder="image_encoder", torch_dtype=torch.float32
... )
>>> vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
>>> pipe = WanImageToVideoPipeline.from_pretrained(
...     model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
... )
>>> pipe.to("cuda")

>>> image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
... )
>>> max_area = 480 * 832
>>> aspect_ratio = image.height / image.width
>>> mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
>>> height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
>>> width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
>>> image = image.resize((width, height))
>>> prompt = (
...     "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in "
...     "the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
... )
>>> negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

>>> output = pipe(
...     image=image,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     height=height,
...     width=width,
...     num_frames=81,
...     guidance_scale=5.0,
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=16)

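encode_prompt

Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what to avoid in the generated video. If not defined, pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos that should be generated per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
dtype (torch.dtype, optional) — The torch dtype.

Encodes the prompt into text encoder hidden states.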

WanVACEPipeline

class diffusers.WanVACEPipeline

( tokenizer: AutoTokenizer text_encoder: UMT5EncoderModel transformer: WanVACETransformer3DModel vae: AutoencoderKLWan scheduler: FlowMatchEulerDiscreteScheduler )

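Parameters

tokenizer (T5Tokenizer) — Tokenizer from T5, specifically the google/umt5-xxl variant.
text_encoder (T5EncoderModel) — T5, specifically the google/umt5-xxl variant.
transformer (WanVACETransformer3DModel) — Conditional transformer used to denoise the input latents.
scheduler (UniPCMultistepScheduler) — A scheduler used in combination with `transformer` to denoise the encoded image latents.
vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model used to encode and decode videos to and from latent representations.

Pipeline for controllable generation using Wan.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).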

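__call__

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None video: typing.Optional[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None mask: typing.Optional[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None reference_images: typing.Optional[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None conditioning_scale: typing.Union[float, typing.List[float], torch.Tensor] = 1.0 height: int = 480 width: int = 832 num_frames: int = 81 num_inference_steps: int = 50 guidance_scale: float = 5.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'np' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~WanPipelineOutput or tuple

Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, pass `prompt_embeds` instead.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what to avoid in the generated video. If not defined, pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., if `guidance_scale` is less than `1`).
video (List[PIL.Image.Image], optional) — The input video to be used as a starting point for the generation. The video should be a list of PIL images, a numpy array, or a torch tensor. Currently, the pipeline only supports generating one video at a time.
mask (List[PIL.Image.Image], optional) — The input mask defines which video regions to condition on and which to generate. Black areas in the mask indicate conditioning regions, while white areas indicate regions to generate. The mask should be a list of PIL images, a numpy array, or a torch tensor. Currently, only one video can be generated at a time.
reference_images (List[PIL.Image.Image], optional) — A list of one or more reference images used as extra conditioning for the generation. For example, if you are inpainting a video to change a character, you can pass reference images of the new character here. Refer to the Diffusers examples and the original user guide for a full list of supported tasks and use cases.
conditioning_scale (float, List[float], torch.Tensor, defaults to 1.0) — The conditioning scale applied when adding the control conditioning latent stream to the denoising latent stream in each control layer of the model. If a float is provided, it is applied uniformly to all layers. If a list or tensor is provided, its length should match the number of control layers in the model (`len(transformer.config.vace_layers)`).
height (int, defaults to 480) — The height in pixels of the generated video.
width (int, defaults to 832) — The width in pixels of the generated video.
num_frames (int, defaults to 81) — The number of frames in the generated video.
num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to higher quality output at the expense of slower inference.
guidance_scale (float, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. `guidance_scale` is defined as `w` in equation 2 of the Imagen paper. Guidance is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages output that is closely linked to the text `prompt`, usually at the expense of lower visual quality.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator used to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random `generator`.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the `prompt` input argument.
output_type (str, optional, defaults to "np") — The output format of the generated video. Choose between PIL.Image and np.array.
return_dict (bool, optional, defaults to True) — Whether to return a WanPipelineOutput instead of a plain tuple.
attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as the `callback_kwargs` argument. You can only include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
max_sequence_length (int, defaults to 512) — The maximum sequence length of the text encoder. If the prompt is longer than this, it will be truncated. If the prompt is shorter, it will be padded to this length.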
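
The two accepted forms of conditioning_scale can be sketched as a normalization step: a single number is repeated for every control layer, while a sequence must already have one entry per layer. expand_conditioning_scale is a hypothetical helper that works on plain Python numbers and lists; how the pipeline performs this step internally, and its handling of tensors, is not shown here.

```python
from typing import List, Sequence, Union


def expand_conditioning_scale(
    conditioning_scale: Union[float, Sequence[float]],
    num_control_layers: int,
) -> List[float]:
    """Return one scale per control layer."""
    if isinstance(conditioning_scale, (int, float)):
        # A scalar applies uniformly to all control layers.
        return [float(conditioning_scale)] * num_control_layers
    scales = [float(scale) for scale in conditioning_scale]
    if len(scales) != num_control_layers:
        raise ValueError(
            f"expected {num_control_layers} scales, got {len(scales)}"
        )
    return scales
```

In practice num_control_layers would be len(transformer.config.vace_layers), as stated in the parameter description.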

Returns: ~WanPipelineOutput or tuple

The call function to the pipeline for generation.

Examples

>>> import torch
>>> import PIL.Image
>>> from diffusers import AutoencoderKLWan, WanVACEPipeline
>>> from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
>>> from diffusers.utils import export_to_video, load_image

>>> def prepare_video_and_mask(first_img: PIL.Image.Image, last_img: PIL.Image.Image, height: int, width: int, num_frames: int):
...     first_img = first_img.resize((width, height))
...     last_img = last_img.resize((width, height))
...     frames = []
...     frames.append(first_img)
...     # Ideally, this should be 127.5 to match original code, but they perform computation on numpy arrays
...     # whereas we are passing PIL images. If you choose to pass numpy arrays, you can set it to 127.5 to
...     # match the original code.
...     frames.extend([PIL.Image.new("RGB", (width, height), (128, 128, 128))] * (num_frames - 2))
...     frames.append(last_img)
...     mask_black = PIL.Image.new("L", (width, height), 0)
...     mask_white = PIL.Image.new("L", (width, height), 255)
...     mask = [mask_black, *[mask_white] * (num_frames - 2), mask_black]
...     return frames, mask

>>> # Available checkpoints: Wan-AI/Wan2.1-VACE-1.3B-diffusers, Wan-AI/Wan2.1-VACE-14B-diffusers
>>> model_id = "Wan-AI/Wan2.1-VACE-1.3B-diffusers"
>>> vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
>>> pipe = WanVACEPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
>>> flow_shift = 3.0  # 5.0 for 720P, 3.0 for 480P
>>> pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=flow_shift)
>>> pipe.to("cuda")

>>> prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
>>> negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
>>> first_frame = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png"
... )
>>> last_frame = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png"
... )

>>> height = 512
>>> width = 512
>>> num_frames = 81
>>> video, mask = prepare_video_and_mask(first_frame, last_frame, height, width, num_frames)

>>> output = pipe(
...     video=video,
...     mask=mask,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     height=height,
...     width=width,
...     num_frames=num_frames,
...     num_inference_steps=30,
...     guidance_scale=5.0,
...     generator=torch.Generator().manual_seed(42),
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=16)

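encode_prompt

Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what to avoid in the generated video. If not defined, pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., if `guidance_scale` is less than `1`).
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos that should be generated per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
dtype (torch.dtype, optional) — The torch dtype.

Encodes the prompt into text encoder hidden states.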

WanVideoToVideoPipeline

class diffusers.WanVideoToVideoPipeline

( tokenizer: AutoTokenizer text_encoder: UMT5EncoderModel transformer: WanTransformer3DModel vae: AutoencoderKLWan scheduler: FlowMatchEulerDiscreteScheduler )

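Parameters

tokenizer (T5Tokenizer) — Tokenizer from T5, specifically the google/umt5-xxl variant.
text_encoder (T5EncoderModel) — T5, specifically the google/umt5-xxl variant.
transformer (WanTransformer3DModel) — Conditional transformer used to denoise the input latents.
scheduler (UniPCMultistepScheduler) — A scheduler used in combination with `transformer` to denoise the encoded image latents.
vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model used to encode and decode videos to and from latent representations.

Pipeline for video-to-video generation using Wan.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).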

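__call__

( video: typing.List[PIL.Image.Image] = None prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None height: int = 480 width: int = 832 num_inference_steps: int = 50 timesteps: typing.Optional[typing.List[int]] = None guidance_scale: float = 5.0 strength: float = 0.8 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'np' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~WanPipelineOutput or tuple

Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, pass `prompt_embeds` instead.
height (int, defaults to 480) — The height in pixels of the generated video.
width (int, defaults to 832) — The width in pixels of the generated video.
num_frames (int, defaults to 81) — The number of frames in the generated video.
num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to higher quality output at the expense of slower inference.
guidance_scale (float, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen paper. Guidance is enabled when guidance_scale > 1. A higher guidance scale encourages output that is closely linked to the text prompt, usually at the expense of lower visual quality.
strength (float, defaults to 0.8) — Higher strength leads to more differences between the original video and the generated video.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator used to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
output_type (str, optional, defaults to "np") — The output format of the generated video. Choose between PIL.Image and np.array.
return_dict (bool, optional, defaults to True) — Whether to return a WanPipelineOutput instead of a plain tuple.
attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
max_sequence_length (int, defaults to 512) — The maximum sequence length of the text encoder. If the prompt is longer than this, it will be truncated. If the prompt is shorter, it will be padded to this length.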
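
One way to build intuition for strength is the convention used by image-to-image style pipelines in Diffusers, where strength sets the fraction of the noise schedule that is actually run: the input is noised part of the way and only the remaining steps are denoised. The sketch below follows that convention; steps_for_strength is a hypothetical helper, and that WanVideoToVideoPipeline follows this arithmetic exactly is an assumption.

```python
from typing import Tuple


def steps_for_strength(num_inference_steps: int, strength: float) -> Tuple[int, int]:
    """Return (first_step_index, steps_actually_run) for a given strength.

    strength = 1.0 runs the full schedule (the input video is fully noised);
    strength = 0.0 runs no steps (the input video is returned unchanged).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return t_start, num_inference_steps - t_start
```

Under this reading, lowering strength both preserves more of the source video and shortens inference, since fewer denoising steps are executed.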

Returns: ~WanPipelineOutput or tuple

The call function to the pipeline for generation.

Examples

>>> import torch
>>> from diffusers.utils import export_to_video, load_video
>>> from diffusers import AutoencoderKLWan, WanVideoToVideoPipeline
>>> from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

>>> # Available models: Wan-AI/Wan2.1-T2V-14B-Diffusers, Wan-AI/Wan2.1-T2V-1.3B-Diffusers
>>> model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
>>> vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
>>> pipe = WanVideoToVideoPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
>>> flow_shift = 3.0  # 5.0 for 720P, 3.0 for 480P
>>> pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=flow_shift)
>>> pipe.to("cuda")

>>> prompt = "A robot standing on a mountain top. The sun is setting in the background"
>>> negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
>>> video = load_video(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4"
... )
>>> output = pipe(
...     video=video,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     height=480,
...     width=720,
...     guidance_scale=5.0,
...     strength=0.7,
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=16)

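encode_prompt

Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what to avoid in the generated video. If not defined, pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos that should be generated per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
dtype (torch.dtype, optional) — The torch dtype.

Encodes the prompt into text encoder hidden states.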

WanPipelineOutput

class diffusers.pipelines.wan.pipeline_output.WanPipelineOutput

( frames: Tensor )

Parameters

frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing a denoised PIL image sequence of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for Wan pipelines.
