Diffusers documentation

Image-to-Video Generation with PIA (Personalized Image Animator)

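Overview

PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models by Yiming Zhang, Zhening Xing, Yanhong Zeng, Youqing Fang, Kai Chen.

Recent advancements in personalized text-to-image (T2I) models have revolutionized content creation, empowering non-experts to generate stunning images with unique styles. While promising, adding realistic motions into these personalized images by text poses significant challenges in preserving distinct styles, high-fidelity details, and achieving motion controllability by text. In this paper, we present PIA, a Personalized Image Animator that excels in aligning with condition images, achieving motion controllability by text, and preserving compatibility with various personalized T2I models without specific tuning. To achieve these goals, PIA builds upon a base T2I model with well-trained temporal alignment layers, allowing for the seamless transformation of any personalized T2I model into an image animation model. A key component of PIA is the introduction of the condition module, which utilizes the condition frame and inter-frame affinity as input to transfer appearance information guided by the affinity hint for individual frame synthesis in the latent space. This design mitigates the challenges of appearance-related image alignment and allows for a stronger focus on aligning with motion-related guidance.

Project page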

Available Pipelines

Pipeline	Tasks	Demo
PIAPipeline	Image-to-Video Generation with PIA

Available checkpoints

Motion Adapter checkpoints for PIA can be found under the OpenMMLab org. These checkpoints are meant to work with any model based on Stable Diffusion 1.5.

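Usage example

PIA works with a MotionAdapter checkpoint and a Stable Diffusion 1.5 model checkpoint. The MotionAdapter is a collection of Motion Modules that are responsible for adding coherent motion across image frames. These modules are applied after the Resnet and Attention blocks in the Stable Diffusion UNet. In addition to the motion modules, PIA also replaces the input convolution layer of the SD 1.5 UNet model with a 9-channel input convolution layer.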
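
The 9-channel input can be pictured as a simple channel count. The sketch below is only an illustration: the split into 4 noisy-latent channels, 1 mask channel, and 4 condition-image latent channels is an assumption that happens to sum to 9, and the constant and function names are hypothetical rather than part of the Diffusers API.

```python
# Assumed channel breakdown for illustration only; the text above states
# just the total of 9 input channels, not how they are divided.
SD15_LATENT_CHANNELS = 4  # input width of a standard SD 1.5 UNet
MASK_CHANNELS = 1         # assumption: one conditioning-mask channel
CONDITION_CHANNELS = 4    # assumption: VAE latents of the condition image


def pia_input_channels() -> int:
    """Total channels expected by the replaced input convolution layer."""
    return SD15_LATENT_CHANNELS + MASK_CHANNELS + CONDITION_CHANNELS


def extra_channels_over_sd15() -> int:
    """Channels added on top of the original 4-channel SD 1.5 input."""
    return pia_input_channels() - SD15_LATENT_CHANNELS


print(pia_input_channels(), extra_channels_over_sd15())
```

Under this assumed split, the extra five channels carry the conditioning signal, which is why a plain SD 1.5 input convolution cannot be reused unchanged.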

The following example demonstrates how to use PIA to generate a video from a single image.

import torch
from diffusers import (
    EulerDiscreteScheduler,
    MotionAdapter,
    PIAPipeline,
)
from diffusers.utils import export_to_gif, load_image

adapter = MotionAdapter.from_pretrained("openmmlab/PIA-condition-adapter")
pipe = PIAPipeline.from_pretrained("SG161222/Realistic_Vision_V6.0_B1_noVAE", motion_adapter=adapter, torch_dtype=torch.float16)

pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()

image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
)
image = image.resize((512, 512))
prompt = "cat in a field"
negative_prompt = "wrong white balance, dark, sketches,worst quality,low quality"

generator = torch.Generator("cpu").manual_seed(0)
output = pipe(image=image, prompt=prompt, negative_prompt=negative_prompt, generator=generator)
frames = output.frames[0]
export_to_gif(frames, "pia-animation.gif")

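Here are some sample outputs:

cat in a field.

If you plan on using a scheduler that can clip samples, make sure to disable it by setting clip_sample=False in the scheduler, as clipping can have an adverse effect on generated samples. Additionally, the PIA checkpoints can be sensitive to the beta schedule of the scheduler. We recommend setting this to linear.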
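
One way to apply both recommendations is to pass them as overrides when rebuilding the scheduler from the pipeline's existing config. This is a minimal sketch, assuming a DDIMScheduler (which exposes both clip_sample and beta_schedule); the helper name use_recommended_ddim and the constant name are hypothetical.

```python
# Overrides taken from the note above.
RECOMMENDED_SCHEDULER_OVERRIDES = {
    "clip_sample": False,       # disable sample clipping
    "beta_schedule": "linear",  # recommended beta schedule for PIA checkpoints
}


def use_recommended_ddim(pipe):
    """Replace pipe.scheduler with a DDIMScheduler that uses the overrides."""
    # Imported lazily so this module can be loaded without diffusers installed.
    from diffusers import DDIMScheduler

    pipe.scheduler = DDIMScheduler.from_config(
        pipe.scheduler.config, **RECOMMENDED_SCHEDULER_OVERRIDES
    )
    return pipe
```

Calling use_recommended_ddim(pipe) right after loading the pipeline keeps the remaining scheduler settings from the checkpoint and changes only these two.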

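Using FreeInit

FreeInit: Bridging Initialization Gap in Video Diffusion Models by Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu.

FreeInit is an effective method that improves the temporal consistency and overall quality of videos generated with video diffusion models, without any additional training. It can be applied seamlessly at inference time to PIA, AnimateDiff, ModelScope, VideoCrafter, and various other video generation models, and it works by iteratively refining the latent initialization noise. More details can be found in the paper.

The following example demonstrates the usage of FreeInit.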

import torch
from diffusers import (
    DDIMScheduler,
    MotionAdapter,
    PIAPipeline,
)
from diffusers.utils import export_to_gif, load_image

adapter = MotionAdapter.from_pretrained("openmmlab/PIA-condition-adapter")
pipe = PIAPipeline.from_pretrained("SG161222/Realistic_Vision_V6.0_B1_noVAE", motion_adapter=adapter)

# enable FreeInit
# Refer to the enable_free_init documentation for a full list of configurable parameters
pipe.enable_free_init(method="butterworth", use_fast_sampling=True)

# Memory saving options
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()

pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
)
image = image.resize((512, 512))
prompt = "cat in a field"
negative_prompt = "wrong white balance, dark, sketches,worst quality,low quality"

generator = torch.Generator("cpu").manual_seed(0)

output = pipe(image=image, prompt=prompt, negative_prompt=negative_prompt, generator=generator)
frames = output.frames[0]
export_to_gif(frames, "pia-freeinit-animation.gif")

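cat in a field.

FreeInit is not really free: the improved quality comes at the cost of extra computation. It requires sampling a few extra times, depending on the num_iters parameter that is set when enabling it. Setting the use_fast_sampling parameter to True reduces the total sampling time (at the cost of slightly lower quality compared to use_fast_sampling=False, but still with better results than the vanilla video generation model).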
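
The extra cost can be estimated by counting denoising steps across FreeInit iterations. The sketch below is a rough model, not the library's code: it assumes that fast sampling runs proportionally fewer steps in earlier iterations and the full number only in the last one. The function name and the values num_iters=3 and 50 steps are chosen for illustration.

```python
def freeinit_total_steps(
    num_inference_steps: int, num_iters: int, use_fast_sampling: bool
) -> int:
    """Estimate the total denoising steps run across all FreeInit iterations."""
    total = 0
    for i in range(num_iters):
        if use_fast_sampling:
            # Assumed schedule: iteration i runs (i + 1) / num_iters of the steps.
            steps = max(1, num_inference_steps * (i + 1) // num_iters)
        else:
            # Without fast sampling every iteration runs the full schedule.
            steps = num_inference_steps
        total += steps
    return total


full = freeinit_total_steps(50, 3, use_fast_sampling=False)
fast = freeinit_total_steps(50, 3, use_fast_sampling=True)
print(full, fast)  # 150 99
```

Under these assumptions, three iterations cost three times a plain 50-step run, and fast sampling brings that down to roughly twice.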

PIAPipeline

class diffusers.PIAPipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: typing.Union[diffusers.models.unets.unet_2d_condition.UNet2DConditionModel, diffusers.models.unets.unet_motion_model.UNetMotionModel] scheduler: typing.Union[diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_pndm.PNDMScheduler, diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler, diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler, diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler] motion_adapter: typing.Optional[diffusers.models.unets.unet_motion_model.MotionAdapter] = None feature_extractor: CLIPImageProcessor = None image_encoder: CLIPVisionModelWithProjection = None )

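Parameters

vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text encoder (clip-vit-large-patch14).
tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize text.
unet (UNet2DConditionModel) — A UNet2DConditionModel used to create a UNetMotionModel to denoise the encoded video latents.
motion_adapter (MotionAdapter) — A MotionAdapter to be used in combination with unet to denoise the encoded video latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, PNDMScheduler, LMSDiscreteScheduler, EulerDiscreteScheduler, EulerAncestralDiscreteScheduler, or DPMSolverMultistepScheduler.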

Pipeline for image-to-video generation.

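This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

load_textual_inversion() for loading textual inversion embeddings
load_lora_weights() for loading LoRA weights
save_lora_weights() for saving LoRA weights
load_ip_adapter() for loading IP Adapters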

__call__

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] prompt: typing.Union[str, typing.List[str]] = None strength: float = 1.0 num_frames: typing.Optional[int] = 16 height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_videos_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None motion_scale: int = 0 output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] ) → PIAPipelineOutput or tuple

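Parameters

image (PipelineImageInput) — The input image to be used for video generation.
prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
strength (float, optional, defaults to 1.0) — Indicates the extent to which the reference image is transformed. Must be between 0 and 1.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated video.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated video.
num_frames (int, optional, defaults to 16) — The number of video frames that are generated. Defaults to 16 frames, which at 8 frames per second amounts to 2 seconds of video.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to higher quality videos at the expense of slower inference.
guidance_scale (float, optional, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
negative_prompt (str or List[str], optional) — The prompt or prompts to guide what not to include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale <= 1).
eta (float, optional, defaults to 0.0) — Corresponds to the parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator. Latents should be of shape (batch_size, num_channel, num_frames, height, width).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list with the same length as the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
motion_scale (int, optional, defaults to 0) — Parameter that controls the amount and type of motion that is added to the image. Increasing the value increases the amount of motion, while specific ranges of values control the type of motion that is added. Must be between 0 and 8. Set between 0-2 to only increase the amount of motion. Set between 3-5 to create looping motion. Set between 6-8 to perform motion with image style transfer.
output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between torch.Tensor, PIL.Image, or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a PIAPipelineOutput instead of a plain tuple.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.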
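
The motion_scale ranges and the num_frames arithmetic described above can be restated as two small helpers. This is only a restatement of the parameter descriptions for quick reference; the function names are hypothetical and not part of the Diffusers API.

```python
def describe_motion_scale(motion_scale: int) -> str:
    """Map a motion_scale value to the kind of motion documented for its range."""
    if not 0 <= motion_scale <= 8:
        raise ValueError("motion_scale must be between 0 and 8")
    if motion_scale <= 2:
        return "increased amount of motion"
    if motion_scale <= 5:
        return "looping motion"
    return "motion with image style transfer"


def clip_duration_seconds(num_frames: int = 16, fps: int = 8) -> float:
    """Length of the generated clip when played back at the given frame rate."""
    return num_frames / fps


print(describe_motion_scale(4))  # looping motion
print(clip_duration_seconds())   # 2.0
```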

PIAPipelineOutput or tuple

If return_dict is True, a PIAPipelineOutput is returned; otherwise a tuple is returned where the first element is a list with the generated frames.
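
Code that has to accept either return style can normalize the result before exporting. A minimal sketch, assuming the tuple's first element has the same nested layout as the frames attribute; first_video_frames and the stand-in class are hypothetical names used only for this illustration.

```python
def first_video_frames(result):
    """Return the frames of the first generated video from either return style."""
    if isinstance(result, tuple):
        # return_dict=False: the first tuple element holds the generated frames.
        return result[0][0]
    # return_dict=True: the output object exposes the same data as `.frames`.
    return result.frames[0]


class FakeOutput:
    """Hypothetical stand-in for PIAPipelineOutput, used only in this demo."""

    def __init__(self, frames):
        self.frames = frames


videos = [["frame0", "frame1"]]  # one video with two placeholder frames
print(first_video_frames(FakeOutput(videos)))  # ['frame0', 'frame1']
print(first_video_frames((videos,)))           # ['frame0', 'frame1']
```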

The call function to the pipeline for generation.
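
A per-step callback following the callback_on_step_end signature listed in the parameters might look like the sketch below. It only logs progress and hands the dictionary back unchanged; returning the dictionary is an assumption about what the pipeline expects, and the function name is hypothetical.

```python
def log_step_callback(pipe, step, timestep, callback_kwargs):
    """Print progress for each denoising step and return the kwargs untouched."""
    latents = callback_kwargs.get("latents")
    # Tensors expose `.shape`; fall back to None for anything else.
    shape = getattr(latents, "shape", None)
    print(f"step={step} timestep={timestep} latents_shape={shape}")
    # Assumption: the pipeline reads (possibly modified) tensors from this dict.
    return callback_kwargs


# Hypothetical usage with an already loaded pipeline:
# output = pipe(
#     image=image,
#     prompt=prompt,
#     callback_on_step_end=log_step_callback,
#     callback_on_step_end_tensor_inputs=["latents"],
# )
```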

Examples

>>> import torch
>>> from diffusers import EulerDiscreteScheduler, MotionAdapter, PIAPipeline
>>> from diffusers.utils import export_to_gif, load_image

>>> adapter = MotionAdapter.from_pretrained("openmmlab/PIA-condition-adapter")
>>> pipe = PIAPipeline.from_pretrained(
...     "SG161222/Realistic_Vision_V6.0_B1_noVAE", motion_adapter=adapter, torch_dtype=torch.float16
... )

>>> pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
>>> image = load_image(
...     "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
... )
>>> image = image.resize((512, 512))
>>> prompt = "cat in a hat"
>>> negative_prompt = "wrong white balance, dark, sketches, worst quality, low quality, deformed, distorted"
>>> generator = torch.Generator("cpu").manual_seed(0)
>>> output = pipe(image=image, prompt=prompt, negative_prompt=negative_prompt, generator=generator)
>>> frames = output.frames[0]
>>> export_to_gif(frames, "pia-animation.gif")

encode_prompt

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

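Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, you have to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.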

enable_freeu
disable_freeu
enable_free_init
disable_free_init
enable_vae_slicing
disable_vae_slicing
enable_vae_tiling
disable_vae_tiling

PIAPipelineOutput

class diffusers.pipelines.pia.PIAPipelineOutput

( frames: typing.Union[torch.Tensor, numpy.ndarray, typing.List[typing.List[PIL.Image.Image]]] )

Parameters

frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — Nested list of length batch_size with denoised PIL image sequences of length num_frames, NumPy array of shape (batch_size, num_frames, channels, height, width), or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for PIAPipeline.