HunyuanVideo

HunyuanVideo, by Tencent.

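Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at this https URL.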

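Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the section on reusing components across pipelines to learn how to efficiently load the same components into multiple pipelines.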

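Recommendations for inference

Both text encoders should be in torch.float16.
The Transformer should be in torch.bfloat16.
The VAE should be in torch.float16.
num_frames should be of the form 4 * k + 1, for example 49 or 129.
For smaller resolution videos, try lower values of shift (between 2.0 and 5.0) in the Scheduler. For larger resolution videos, try higher values (between 7.0 and 12.0). The default value for HunyuanVideo is 7.0.
For more information about supported resolutions and other details, please refer to the original repository.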
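
The num_frames constraint above can be checked before calling the pipeline. The following is a minimal sketch in plain Python; the helper names is_valid_num_frames and floor_to_valid_num_frames are hypothetical and are not part of Diffusers.

```python
def is_valid_num_frames(num_frames: int) -> bool:
    """Return True if num_frames has the form 4 * k + 1 for some integer k >= 0."""
    return num_frames >= 1 and (num_frames - 1) % 4 == 0


def floor_to_valid_num_frames(num_frames: int) -> int:
    """Round down to the nearest value of the form 4 * k + 1 (minimum 1)."""
    k = max(0, (num_frames - 1) // 4)
    return 4 * k + 1


print(is_valid_num_frames(49), is_valid_num_frames(129))  # True True
print(floor_to_valid_num_frames(60))  # 57
```

Rounding down rather than to the nearest value is a design choice made here so that the adjusted frame count never exceeds what was requested.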

HunyuanVideoPipeline

class diffusers.HunyuanVideoPipeline

( text_encoder: LlamaModel tokenizer: LlamaTokenizerFast transformer: HunyuanVideoTransformer3DModel vae: AutoencoderKLHunyuanVideo scheduler: FlowMatchEulerDiscreteScheduler text_encoder_2: CLIPTextModel tokenizer_2: CLIPTokenizer )

Parameters

text_encoder (LlamaModel) — Llava Llama3-8B.
tokenizer (LlamaTokenizerFast) — Tokenizer from Llava Llama3-8B.
transformer (HunyuanVideoTransformer3DModel) — Conditional Transformer used to denoise the encoded video latents.
scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler used in combination with transformer to denoise the encoded video latents.
vae (AutoencoderKLHunyuanVideo) — Variational Auto-Encoder (VAE) model used to encode videos to latent representations and decode them back to videos.
text_encoder_2 (CLIPTextModel) — CLIP, specifically the clip-vit-large-patch14 variant.
tokenizer_2 (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.

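Pipeline for text-to-video generation using HunyuanVideo.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).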

__call__

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str]] = None height: int = 720 width: int = 1280 num_frames: int = 129 num_inference_steps: int = 50 sigmas: typing.List[float] = None guidance_scale: float = 6.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] prompt_template: typing.Dict[str, typing.Any] = {'template': '<|start_header_id|>system<|end_header_id|>\n\nDescribe the video by detailing the following aspects: 1. The main content and theme of the video.2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.4. background environment, light, style and atmosphere.5. camera angles, movements, and transitions used in the video:<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>', 'crop_start': 95} max_sequence_length: int = 256 ) → ~HunyuanVideoPipelineOutput or tuple

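Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, prompt_embeds must be passed instead.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used instead.
height (int, defaults to 720) — The height in pixels of the generated video.
width (int, defaults to 1280) — The width in pixels of the generated video.
num_frames (int, defaults to 129) — The number of frames in the generated video.
num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers that support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.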
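guidance_scale (float, defaults to 6.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate output closely linked to the text prompt, usually at the expense of lower visual quality. Note that the only available HunyuanVideo model is CFG-distilled, which means that traditional guidance between unconditional and conditional latents is not applied.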
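num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator used to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated frames. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a HunyuanVideoPipelineOutput instead of a plain tuple.
attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.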
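clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings. Note that clip_skip does not appear in the call signature shown above.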
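callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include all tensors listed in callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of the pipeline class.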

Returns: ~HunyuanVideoPipelineOutput or tuple

If return_dict is True, a HunyuanVideoPipelineOutput is returned; otherwise a tuple is returned whose first element holds the generated video frames.

The call function of the pipeline, used for generation.

Examples

>>> import torch
>>> from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
>>> from diffusers.utils import export_to_video

>>> model_id = "hunyuanvideo-community/HunyuanVideo"
>>> transformer = HunyuanVideoTransformer3DModel.from_pretrained(
...     model_id, subfolder="transformer", torch_dtype=torch.bfloat16
... )
>>> pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.float16)
>>> pipe.vae.enable_tiling()
>>> pipe.to("cuda")

>>> output = pipe(
...     prompt="A cat walks on the grass, realistic",
...     height=320,
...     width=512,
...     num_frames=61,
...     num_inference_steps=30,
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=15)
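
The callback_on_step_end parameter described above can be illustrated with a small function. This is a minimal sketch, assuming the callback signature given in the parameter list; the name record_step and the seen_steps list are hypothetical, and returning callback_kwargs unchanged is an assumption based on the usual Diffusers callback convention.

```python
from typing import Any, Dict, List

# Hypothetical module-level log of the step indices the callback has seen.
seen_steps: List[int] = []


def record_step(
    pipe: Any, step: int, timestep: Any, callback_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """Record the current step index and hand the tensors back unchanged."""
    seen_steps.append(step)
    # callback_kwargs holds the tensors named in
    # callback_on_step_end_tensor_inputs (["latents"] by default).
    return callback_kwargs
```

Such a function would be passed as pipe(..., callback_on_step_end=record_step, callback_on_step_end_tensor_inputs=["latents"]).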

disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method goes back to computing decoding in one step.

disable_vae_tiling

( )

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method goes back to computing decoding in one step.

enable_vae_slicing

( )

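Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices and computes decoding in several steps. This is useful for saving some memory and allowing larger batch sizes.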

enable_vae_tiling

( )

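Enable tiled VAE decoding. When this option is enabled, the VAE splits the input tensor into tiles and computes decoding and encoding in several steps. This is useful for saving a large amount of memory and allowing larger images to be processed.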
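
The four VAE methods above can be toggled together. The following is a minimal sketch, assuming pipe is an already loaded HunyuanVideoPipeline; the helper name set_vae_memory_savers is hypothetical and only calls the methods documented in this section.

```python
from typing import Any


def set_vae_memory_savers(pipe: Any, slicing: bool = True, tiling: bool = True) -> None:
    """Switch sliced and tiled VAE decoding on or off on the given pipeline."""
    if slicing:
        pipe.enable_vae_slicing()
    else:
        pipe.disable_vae_slicing()
    if tiling:
        pipe.enable_vae_tiling()
    else:
        pipe.disable_vae_tiling()
```

Calling set_vae_memory_savers(pipe) before generation trades some speed for lower peak memory during decoding; set_vae_memory_savers(pipe, slicing=False, tiling=False) restores one-step decoding.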

HunyuanVideoPipelineOutput

class diffusers.pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput

( frames: Tensor )

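Parameters

frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing a denoised PIL image sequence of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for HunyuanVideo pipelines.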
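
Because frames can arrive in either of the two layouts described above, code that consumes the output may want to handle both. The following is a minimal sketch; the helper name batch_and_frame_count is hypothetical, and it relies only on the documented layouts (a nested list, or an array or tensor exposing a shape attribute).

```python
from typing import Any, Tuple


def batch_and_frame_count(frames: Any) -> Tuple[int, int]:
    """Return (batch_size, num_frames) for either documented frames layout."""
    shape = getattr(frames, "shape", None)
    if shape is not None:
        # Array or tensor layout: (batch_size, num_frames, channels, height, width).
        return int(shape[0]), int(shape[1])
    # Nested list layout: one list of frames per generated video.
    return len(frames), len(frames[0])
```

For example, with the default output_type of "pil", batch_and_frame_count(output.frames) would return the number of generated videos and the number of frames in the first one.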
