Framepack

Packing Input Frame Context in Next-Frame Prediction Models for Video Generation by Lvmin Zhang and Maneesh Agrawala. The abstract from the paper is:

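We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames so that the transformer context length stays fixed regardless of the video length. As a result, we can process a large number of frames using video diffusion with a computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.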
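
To make the fixed-context claim more concrete, the sketch below assumes a simple geometric schedule in which each older frame keeps half as many tokens as the next more recent one. The function names, the 1536-token frame size, and the halving rate are illustrative assumptions, not values taken from the FramePack code.

```python
def packed_context_length(num_history_frames, tokens_per_frame=1536, rate=2):
    """Context tokens when the i-th most recent frame keeps
    tokens_per_frame // rate**i tokens (hypothetical schedule)."""
    return sum(tokens_per_frame // rate**i for i in range(num_history_frames))


def unpacked_context_length(num_history_frames, tokens_per_frame=1536):
    """Context tokens when every history frame is kept at full size."""
    return num_history_frames * tokens_per_frame


# Compare how both context lengths grow with the number of history frames.
for n in (1, 10, 100, 1000):
    print(n, packed_context_length(n), unpacked_context_length(n))
```

Under this assumption the packed length approaches but never reaches twice the per-frame token count, while the unpacked length grows linearly with the number of frames.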

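Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.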

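Available models

Model name	Description
- `lllyasviel/FramePackI2V_HY`	Trained with the "inverted anti-drifting" strategy described in the paper. Inference requires setting `sampling_type` to `"inverted_anti_drifting"`.
- `lllyasviel/FramePack_F1_I2V_HY_20250503`	Trained with a novel anti-drifting strategy, but inference uses the "vanilla" strategy described in the paper. Inference requires setting `sampling_type` to `"vanilla"`.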
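
Because each checkpoint expects a specific `sampling_type`, it can help to keep the pairing in one place. The lookup below is a minimal sketch; the dictionary and the `sampling_type_for` helper are hypothetical conveniences, not part of Diffusers.

```python
# Checkpoint -> sampling_type pairs, copied from the table above.
SAMPLING_TYPE_BY_CHECKPOINT = {
    "lllyasviel/FramePackI2V_HY": "inverted_anti_drifting",
    "lllyasviel/FramePack_F1_I2V_HY_20250503": "vanilla",
}


def sampling_type_for(checkpoint):
    """Return the sampling_type documented for a Framepack checkpoint."""
    try:
        return SAMPLING_TYPE_BY_CHECKPOINT[checkpoint]
    except KeyError:
        raise ValueError(f"No documented sampling_type for {checkpoint!r}") from None


print(sampling_type_for("lllyasviel/FramePackI2V_HY"))
```

The returned string can then be passed as the `sampling_type` argument when calling the pipeline.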

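Usage

Refer to the pipeline documentation for basic usage examples. The following sections contain examples of offloading, different sampling methods, quantization, and more.

First and last frame to video

The following example shows how to use Framepack with start and end image controls, using the model trained with inverted anti-drifting sampling.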

import torch
from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel

transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
    "lllyasviel/FramePackI2V_HY", torch_dtype=torch.bfloat16
)
feature_extractor = SiglipImageProcessor.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
)
image_encoder = SiglipVisionModel.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
)

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
first_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png"
)
last_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png"
)
output = pipe(
    image=first_image,
    last_image=last_image,
    prompt=prompt,
    height=512,
    width=512,
    num_frames=91,
    num_inference_steps=30,
    guidance_scale=9.0,
    generator=torch.Generator().manual_seed(0),
    sampling_type="inverted_anti_drifting",
).frames[0]
export_to_video(output, "output.mp4", fps=30)
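
For intuition about what "inverted" means here, the sketch below lists the order in which video sections would be generated under each sampling type. It assumes each latent window covers `(latent_window_size - 1) * 4 + 1` video frames, with 4 standing in for the VAE's temporal compression factor. The formula and the helper names are assumptions for illustration, not the pipeline's actual code.

```python
def num_latent_sections(num_frames, latent_window_size=9, temporal_compression=4):
    """Number of sections needed to cover num_frames (assumed formula)."""
    window_num_frames = (latent_window_size - 1) * temporal_compression + 1
    return max(1, (num_frames + window_num_frames - 1) // window_num_frames)


def generation_order(num_frames, sampling_type, latent_window_size=9):
    """Section indices (0 = start of the video) in the order they are generated."""
    sections = list(range(num_latent_sections(num_frames, latent_window_size)))
    if sampling_type == "inverted_anti_drifting":
        return sections[::-1]  # the end of the video is generated first
    if sampling_type == "vanilla":
        return sections  # sections are generated in playback order
    raise ValueError(f"Unknown sampling_type: {sampling_type!r}")


print(generation_order(91, "inverted_anti_drifting"))
print(generation_order(91, "vanilla"))
```

With these assumptions, a 91-frame request is split into three sections, and inverted anti-drifting fills them in from the last section back to the first.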

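Vanilla sampling

The following example shows how to use Framepack with the F1 model, which was trained with vanilla sampling and a new anti-drifting conditioning method.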

import torch
from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel

transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
    "lllyasviel/FramePack_F1_I2V_HY_20250503", torch_dtype=torch.bfloat16
)
feature_extractor = SiglipImageProcessor.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
)
image_encoder = SiglipVisionModel.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
)

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png"
)
output = pipe(
    image=image,
    prompt="A penguin dancing in the snow",
    height=832,
    width=480,
    num_frames=91,
    num_inference_steps=30,
    guidance_scale=9.0,
    generator=torch.Generator().manual_seed(0),
    sampling_type="vanilla",
).frames[0]
export_to_video(output, "output.mp4", fps=30)
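
The clip length follows directly from `num_frames` and the `fps` passed to `export_to_video`. The helper below is a small illustrative calculation; its name is hypothetical.

```python
def video_duration_seconds(num_frames, fps):
    """Playback duration of a clip with num_frames frames shown at fps frames per second."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# The example above requests 91 frames and exports at 30 fps.
print(f"{video_duration_seconds(91, 30):.2f} s")  # prints "3.03 s"
```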

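Group offloading

Group offloading (apply_group_offloading()) provides aggressive memory optimizations by offloading internal parts of any model to the CPU, with potentially no additional overhead to generation time. If you have very little VRAM available, this approach may suit you, depending on how much CPU RAM is available.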

import torch
from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.hooks import apply_group_offloading
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel

transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
    "lllyasviel/FramePack_F1_I2V_HY_20250503", torch_dtype=torch.bfloat16
)
feature_extractor = SiglipImageProcessor.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
)
image_encoder = SiglipVisionModel.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
)

# Enable group offloading
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
for component in (pipe.text_encoder, pipe.text_encoder_2, pipe.transformer):
    apply_group_offloading(
        component, onload_device, offload_device,
        offload_type="leaf_level", use_stream=True, low_cpu_mem_usage=True,
    )
pipe.image_encoder.to(onload_device)
pipe.vae.to(onload_device)
pipe.vae.enable_tiling()

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png"
)
output = pipe(
    image=image,
    prompt="A penguin dancing in the snow",
    height=832,
    width=480,
    num_frames=91,
    num_inference_steps=30,
    guidance_scale=9.0,
    generator=torch.Generator().manual_seed(0),
    sampling_type="vanilla",
).frames[0]
print(f"Max memory: {torch.cuda.max_memory_allocated() / 1024**3:.3f} GB")
export_to_video(output, "output.mp4", fps=30)
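
To see why leaf-level offloading lowers peak memory, consider the toy model below. It tracks only weight memory for a sequential stack of layers and ignores activations as well as any prefetching that stream-based offloading may perform. The function name and layer sizes are hypothetical; this is not how Diffusers measures or implements offloading.

```python
def peak_resident_weights(layer_sizes, leaf_level_offload):
    """Peak weight memory on the accelerator during one forward pass (toy model)."""
    resident = 0 if leaf_level_offload else sum(layer_sizes)
    peak = resident
    for size in layer_sizes:
        if leaf_level_offload:
            resident += size  # onload the layer just before it runs
        peak = max(peak, resident)
        if leaf_level_offload:
            resident -= size  # offload it again right after it has run
    return peak


sizes_gb = [4, 2, 8, 1]  # made-up layer sizes
print(peak_resident_weights(sizes_gb, leaf_level_offload=False))  # prints 15
print(peak_resident_weights(sizes_gb, leaf_level_offload=True))   # prints 8
```

In this simplified picture the peak drops from the sum of all layers to the size of the largest single layer, at the cost of moving weights between CPU and GPU during the forward pass.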

HunyuanVideoFramepackPipeline

class diffusers.HunyuanVideoFramepackPipeline

( text_encoder: LlamaModel tokenizer: LlamaTokenizerFast transformer: HunyuanVideoFramepackTransformer3DModel vae: AutoencoderKLHunyuanVideo scheduler: FlowMatchEulerDiscreteScheduler text_encoder_2: CLIPTextModel tokenizer_2: CLIPTokenizer image_encoder: SiglipVisionModel feature_extractor: SiglipImageProcessor )

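Parameters

text_encoder (LlamaModel) — Llava Llama3-8B.
tokenizer (LlamaTokenizerFast) — Tokenizer from Llava Llama3-8B.
transformer (HunyuanVideoFramepackTransformer3DModel) — Conditional Transformer to denoise the encoded image latents.
scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
vae (AutoencoderKLHunyuanVideo) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
text_encoder_2 (CLIPTextModel) — CLIP, specifically the clip-vit-large-patch14 variant.
tokenizer_2 (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.

Pipeline for image-to-video generation using the Framepack variant of HunyuanVideo.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).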

__call__

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] last_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None negative_prompt_2: typing.Union[str, typing.List[str]] = None height: int = 720 width: int = 1280 num_frames: int = 129 latent_window_size: int = 9 num_inference_steps: int = 50 sigmas: typing.List[float] = None true_cfg_scale: float = 1.0 guidance_scale: float = 6.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None image_latents: typing.Optional[torch.Tensor] = None last_image_latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] prompt_template: typing.Dict[str, typing.Any] = {'template': '<|start_header_id|>system<|end_header_id|>\n\nDescribe the video by detailing the following aspects: 1. The main content and theme of the video.2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.4. background environment, light, style and atmosphere.5. camera angles, movements, and transitions used in the video:<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>', 'crop_start': 95} max_sequence_length: int = 256 sampling_type: FramepackSamplingType = <FramepackSamplingType.INVERTED_ANTI_DRIFTING: 'inverted_anti_drifting'> ) → ~HunyuanVideoFramepackPipelineOutput or tuple

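Parameters

image (PIL.Image.Image or np.ndarray or torch.Tensor) — The image to use as the starting point for video generation.
last_image (PIL.Image.Image or np.ndarray or torch.Tensor, optional) — An optional last image to use as the ending point for video generation. This is useful for generating a transition between two images.
prompt (str or List[str], optional) — The prompt or prompts to guide video generation. If not defined, prompt_embeds must be passed instead.
prompt_2 (str or List[str], optional) — The prompt or prompts to send to tokenizer_2 and text_encoder_2. If not defined, prompt is used instead.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide video generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when guidance is not used (i.e., ignored if true_cfg_scale is not greater than 1).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide video generation, sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used for all text encoders.
height (int, defaults to 720) — The height in pixels of the generated video.
width (int, defaults to 1280) — The width in pixels of the generated video.
num_frames (int, defaults to 129) — The number of frames in the generated video.
num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to higher-quality output at the expense of slower inference.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process, for schedulers whose set_timesteps method supports a sigmas argument. If not defined, the default behavior when num_inference_steps is passed is used.
true_cfg_scale (float, optional, defaults to 1.0) — When > 1.0 and a negative_prompt is provided, enables true classifier-free guidance.
guidance_scale (float, defaults to 6.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages output that is closely linked to the text prompt, usually at the expense of lower visual quality. Note that the only available HunyuanVideo model is CFG-distilled, which means that traditional guidance between unconditional and conditional latents is not applied.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator used to make generation deterministic.
image_latents (torch.Tensor, optional) — Pre-encoded image latents. If not provided, the image is encoded using the VAE.
last_image_latents (torch.Tensor, optional) — Pre-encoded last image latents. If not provided, the last image is encoded using the VAE.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative prompt embeddings are generated from the negative_prompt input argument.
negative_pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative prompt embeddings are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether to return a HunyuanVideoFramepackPipelineOutput instead of a plain tuple.
attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
clip_skip (int, optional) — Number of layers to skip from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs includes a list of all tensors specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list are passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.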
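
The callback signature described above can be sketched as follows. The `log_step` function is a hypothetical example; it only reads the `latents` entry and returns `callback_kwargs` unchanged so the pipeline can continue with the same tensors.

```python
def log_step(pipeline, step, timestep, callback_kwargs):
    """Print progress at the end of each denoising step (illustrative callback)."""
    latents = callback_kwargs.get("latents")
    # Tensors expose .shape; fall back to None for anything else.
    shape = tuple(latents.shape) if hasattr(latents, "shape") else None
    print(f"step={step} timestep={timestep} latents_shape={shape}")
    # Return the dict so any (unmodified) tensors are handed back to the pipeline.
    return callback_kwargs
```

It would be passed to the pipeline call as `callback_on_step_end=log_step`, together with `callback_on_step_end_tensor_inputs=["latents"]`.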

~HunyuanVideoFramepackPipelineOutput or tuple

If return_dict is True, a HunyuanVideoFramepackPipelineOutput is returned; otherwise a tuple is returned whose first element is the generated video frames.

The call function to the pipeline for generation.

Examples

Image to video

>>> import torch
>>> from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
>>> from diffusers.utils import export_to_video, load_image
>>> from transformers import SiglipImageProcessor, SiglipVisionModel

>>> transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
...     "lllyasviel/FramePackI2V_HY", torch_dtype=torch.bfloat16
... )
>>> feature_extractor = SiglipImageProcessor.from_pretrained(
...     "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
... )
>>> image_encoder = SiglipVisionModel.from_pretrained(
...     "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
... )
>>> pipe = HunyuanVideoFramepackPipeline.from_pretrained(
...     "hunyuanvideo-community/HunyuanVideo",
...     transformer=transformer,
...     feature_extractor=feature_extractor,
...     image_encoder=image_encoder,
...     torch_dtype=torch.float16,
... )
>>> pipe.vae.enable_tiling()
>>> pipe.to("cuda")

>>> image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png"
... )
>>> output = pipe(
...     image=image,
...     prompt="A penguin dancing in the snow",
...     height=832,
...     width=480,
...     num_frames=91,
...     num_inference_steps=30,
...     guidance_scale=9.0,
...     generator=torch.Generator().manual_seed(0),
...     sampling_type="inverted_anti_drifting",
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=30)

First and last image to video

>>> import torch
>>> from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
>>> from diffusers.utils import export_to_video, load_image
>>> from transformers import SiglipImageProcessor, SiglipVisionModel

>>> transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
...     "lllyasviel/FramePackI2V_HY", torch_dtype=torch.bfloat16
... )
>>> feature_extractor = SiglipImageProcessor.from_pretrained(
...     "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
... )
>>> image_encoder = SiglipVisionModel.from_pretrained(
...     "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
... )
>>> pipe = HunyuanVideoFramepackPipeline.from_pretrained(
...     "hunyuanvideo-community/HunyuanVideo",
...     transformer=transformer,
...     feature_extractor=feature_extractor,
...     image_encoder=image_encoder,
...     torch_dtype=torch.float16,
... )
>>> pipe.to("cuda")

>>> prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
>>> first_image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png"
... )
>>> last_image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png"
... )
>>> output = pipe(
...     image=first_image,
...     last_image=last_image,
...     prompt=prompt,
...     height=512,
...     width=512,
...     num_frames=91,
...     num_inference_steps=30,
...     guidance_scale=9.0,
...     generator=torch.Generator().manual_seed(0),
...     sampling_type="inverted_anti_drifting",
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=30)

disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method goes back to computing decoding in one step.

disable_vae_tiling

( )

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method goes back to computing decoding in one step.

enable_vae_slicing

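( )

Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices and computes decoding in several steps. This is useful to save some memory and allow larger batch sizes.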

enable_vae_tiling

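( )

Enable tiled VAE decoding. When this option is enabled, the VAE splits the input tensor into tiles and computes encoding and decoding in several steps. This is useful for saving a large amount of memory and for processing larger images.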
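
As a rough picture of what tiling means, the sketch below enumerates overlapping spans along one spatial axis. The helper name, tile size, and stride are hypothetical; the VAE's real tile sizes and its blending of overlapping regions are not shown.

```python
def tile_spans(length, tile_size, stride):
    """Return (start, end) index pairs of overlapping tiles covering [0, length)."""
    if not 0 < stride <= tile_size:
        raise ValueError("stride must be in (0, tile_size]")
    # Each tile starts one stride after the previous one and is clipped at the edge.
    return [(start, min(start + tile_size, length)) for start in range(0, length, stride)]


print(tile_spans(512, 256, 192))  # prints [(0, 256), (192, 448), (384, 512)]
```

Each span is small enough to process on its own, which is why tiling trades extra steps for lower peak memory.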

HunyuanVideoPipelineOutput

class diffusers.pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput

( frames: Tensor )

Parameters

frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing a denoised PIL image sequence of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for HunyuanVideo pipelines.
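
The two layouts described for frames can be inspected with a small helper such as the one below. The `summarize_frames` name is hypothetical, and the sketch relies only on the nested-list and `(batch_size, num_frames, ...)` layouts stated above.

```python
def summarize_frames(frames):
    """Return (batch_size, num_frames) for either documented frames layout."""
    if hasattr(frames, "shape"):
        # NumPy array or Torch tensor: (batch_size, num_frames, channels, height, width)
        batch_size, num_frames = frames.shape[:2]
        return int(batch_size), int(num_frames)
    # Nested list: one sub-list of frames per generated video
    return len(frames), len(frames[0])


# Stand-in for a batch containing one video with three frames.
print(summarize_frames([["frame0", "frame1", "frame2"]]))  # prints (1, 3)
```

This is also why the examples index `.frames[0]`: it selects the first video in the batch before exporting it.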
