ConsisID

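Identity-Preserving Text-to-Video Generation by Frequency Decomposition, by Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan, from Peking University, University of Rochester, and other institutions.

The abstract from the paper is:

Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved: (1) a tuning-free pipeline without tedious case-by-case finetuning, and (2) a frequency-aware heuristic identity-preserving control scheme based on the diffusion Transformer (DiT). To achieve these goals, we propose **ConsisID**, a tuning-free DiT-based controllable IPT2V model to keep human **identity** **consistent** in the generated video. Inspired by prior findings in frequency analysis of vision/diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features (e.g., profile, proportions) and high-frequency intrinsic features (e.g., identity markers that remain unaffected by pose changes). First, from a low-frequency perspective, we introduce a global facial extractor, which encodes the reference image and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into the transformer blocks, enhancing the model's ability to preserve fine-grained features. To leverage the frequency information for identity preservation, we propose a hierarchical training strategy, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our **ConsisID** achieves excellent results in generating high-quality, identity-preserving videos, making strides towards more effective IPT2V. The model weight of ConsisID is publicly available at https://github.com/PKU-YuanGroup/ConsisID.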

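Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

This pipeline was contributed by SHYuanBest. The original codebase can be found here. The original weights can be found under hf.co/BestWishYsh.

There are two official ConsisID checkpoints for identity-preserving text-to-video generation.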

Model checkpoint	Recommended inference dtype
`BestWishYsh/ConsisID-preview`	torch.bfloat16
`BestWishYsh/ConsisID-1.5`	torch.bfloat16
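
As a minimal sketch of how the table can be used, the helper below maps each checkpoint to its recommended dtype and loads the pipeline accordingly. The names `RECOMMENDED_DTYPE` and `load_consisid` are hypothetical, introduced here only for illustration; `torch` and `diffusers` are imported lazily so the mapping can be inspected without them.

```python
# Recommended inference dtype per official checkpoint (taken from the table above).
RECOMMENDED_DTYPE = {
    "BestWishYsh/ConsisID-preview": "bfloat16",
    "BestWishYsh/ConsisID-1.5": "bfloat16",
}


def load_consisid(checkpoint: str):
    """Load a ConsisID pipeline in its recommended dtype (hypothetical helper)."""
    if checkpoint not in RECOMMENDED_DTYPE:
        raise ValueError(f"Unknown ConsisID checkpoint: {checkpoint!r}")

    # Imported lazily: both packages are heavy and only needed when loading weights.
    import torch
    from diffusers import ConsisIDPipeline

    dtype = getattr(torch, RECOMMENDED_DTYPE[checkpoint])  # e.g. torch.bfloat16
    return ConsisIDPipeline.from_pretrained(checkpoint, torch_dtype=dtype)
```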

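Memory optimization

ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) at an output resolution of 720x480 (W x H), which makes it impossible to run on consumer GPUs or the free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint. To reproduce these numbers, you can refer to this script.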

Feature (overlay the previous)	Max memory allocated	Max memory reserved
-	37 GB	44 GB
enable_model_cpu_offload	22 GB	25 GB
enable_sequential_cpu_offload	16 GB	22 GB
vae.enable_slicing	16 GB	22 GB
vae.enable_tiling	5 GB	7 GB
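
The rows of the table can be applied with calls like the ones below. This is a sketch rather than the referenced script: `apply_memory_optimizations` and its `level` argument are hypothetical names, and it assumes that "overlay the previous" means the VAE options are stacked on top of sequential CPU offload, while sequential offload is used in place of model offload. `pipe` is expected to be a loaded `ConsisIDPipeline`.

```python
def apply_memory_optimizations(pipe, level: int):
    """Apply the first `level` rows of optimizations from the table (hypothetical helper).

    level 0: nothing, 1: model CPU offload, 2: sequential CPU offload,
    3: + VAE slicing, 4: + VAE tiling.
    """
    if not 0 <= level <= 4:
        raise ValueError("level must be between 0 and 4")

    if level == 1:
        # Moves whole sub-models to the GPU only while they are in use.
        pipe.enable_model_cpu_offload()
    elif level >= 2:
        # Assumption: sequential offload is used instead of model offload,
        # since both manage device placement of the same modules.
        pipe.enable_sequential_cpu_offload()

    if level >= 3:
        # Decode the latent batch in slices rather than all at once.
        pipe.vae.enable_slicing()
    if level >= 4:
        # Decode in spatial tiles to lower peak memory further.
        pipe.vae.enable_tiling()
    return pipe
```

When either offload mode is enabled, the offload hooks handle device placement, so the pipeline is typically not moved with `pipe.to("cuda")` as well.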

ConsisIDPipeline

class diffusers.ConsisIDPipeline

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel vae: AutoencoderKLCogVideoX transformer: ConsisIDTransformer3DModel scheduler: CogVideoXDPMScheduler )

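Parameters

vae (AutoencoderKLCogVideoX) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
text_encoder (T5EncoderModel) — Frozen text encoder. ConsisID uses T5, specifically the t5-v1_1-xxl variant.
tokenizer (T5Tokenizer) — Tokenizer of class T5Tokenizer.
transformer (ConsisIDTransformer3DModel) — A text-conditioned ConsisIDTransformer3DModel to denoise the encoded video latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with transformer to denoise the encoded video latents.

Pipeline for image-to-video generation using ConsisID.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).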

__call__

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 480 width: int = 720 num_frames: int = 49 num_inference_steps: int = 50 guidance_scale: float = 6.0 use_dynamic_cfg: bool = False num_videos_per_prompt: int = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: str = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 226 id_vit_hidden: typing.Optional[torch.Tensor] = None id_cond: typing.Optional[torch.Tensor] = None kps_cond: typing.Optional[torch.Tensor] = None ) → ConsisIDPipelineOutput or tuple

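Parameters

image (PipelineImageInput) — The input image to condition the generation on. Must be an image, a list of images, or a torch.Tensor.
prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, prompt_embeds must be passed instead.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
height (int, optional, defaults to self.transformer.config.sample_height * self.vae_scale_factor_spatial) — The height in pixels of the generated video. Set to 480 by default for best results.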
width (int, optional, defaults to self.transformer.config.sample_width * self.vae_scale_factor_spatial) — The width in pixels of the generated video. Set to 720 by default for best results.
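num_frames (int, defaults to 49) — Number of frames to generate. Must be divisible by self.vae_scale_factor_temporal. The generated video will contain 1 extra frame because ConsisID is conditioned with (num_seconds * fps + 1) frames, where num_seconds is 6 and fps is 8. However, since videos can be saved at any fps, the only condition that needs to be satisfied is the divisibility mentioned above.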
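num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality video at the expense of slower inference.
guidance_scale (float, optional, defaults to 6) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate videos closely linked to the text prompt, usually at the expense of lower visual quality.
use_dynamic_cfg (bool, optional, defaults to False) — If True, dynamically adjusts the guidance scale during inference. This allows the model to use a progressive guidance scale, balancing text-guided generation against visual quality over the course of the inference steps. Typically, early inference steps use a higher guidance scale for more faithful generation, while later steps reduce it for more diverse and natural results.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated frames. Choose between PIL (PIL.Image.Image) and np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a ConsisIDPipelineOutput instead of a plain tuple.
attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include all tensors specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of the pipeline class.
max_sequence_length (int, defaults to 226) — Maximum sequence length of the encoded prompt. Must be consistent with self.transformer.config.max_text_seq_length; otherwise the results may be poor.
id_vit_hidden (Optional[torch.Tensor], optional) — The tensor of hidden features extracted from the face model, used to condition the local facial extractor. This is crucial for the model to obtain high-frequency information about the face. If not provided, the local facial extractor will not run normally.
id_cond (Optional[torch.Tensor], optional) — The tensor of hidden features extracted from the CLIP model, used to condition the local facial extractor. This is crucial for the model to edit facial features. If not provided, the local facial extractor will not run normally.
kps_cond (Optional[torch.Tensor], optional) — A tensor that determines whether the global facial extractor uses keypoint information for conditioning. If provided, this tensor controls whether facial keypoints, such as eye, nose, and mouth landmarks, are used during generation. This helps the model retain more low-frequency facial information.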
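
The two callback parameters can be illustrated with a small sketch. The function name `log_latents` is hypothetical; it follows the documented signature and, as is the usual convention for Diffusers step-end callbacks, returns `callback_kwargs` so the pipeline can read back any tensors the callback modified.

```python
def log_latents(pipe, step: int, timestep: int, callback_kwargs: dict) -> dict:
    """Step-end callback that reports which tensors were passed in (hypothetical example)."""
    # Only tensors named in callback_on_step_end_tensor_inputs appear here;
    # with the default setting that is just "latents".
    names = sorted(callback_kwargs)
    print(f"step={step} timestep={timestep} tensors={names}")
    # Return the dict unchanged; a callback may also replace entries such as "latents".
    return callback_kwargs
```

Such a function would be passed as `callback_on_step_end=log_latents`, optionally together with `callback_on_step_end_tensor_inputs=["latents"]`.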

ConsisIDPipelineOutput or tuple

ConsisIDPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated frames.
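
A caller that has to handle both return styles can normalize them as sketched below. `extract_frames` is a hypothetical helper; it relies only on the behavior described here, namely that the output object exposes `frames` and that the first tuple element holds the generated frames.

```python
def extract_frames(result):
    """Return the generated frames from either return style (hypothetical helper)."""
    if isinstance(result, tuple):
        # return_dict=False: the frames are the first tuple element.
        return result[0]
    # return_dict=True: ConsisIDPipelineOutput stores them in `.frames`.
    return result.frames
```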

The function invoked when calling the pipeline for generation.

Examples

>>> import torch
>>> from diffusers import ConsisIDPipeline
>>> from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
>>> from diffusers.utils import export_to_video
>>> from huggingface_hub import snapshot_download

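>>> # Download the checkpoint to a local directory; the same path is passed to prepare_face_models below.
>>> snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")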
>>> (
...     face_helper_1,
...     face_helper_2,
...     face_clip_model,
...     face_main_model,
...     eva_transform_mean,
...     eva_transform_std,
... ) = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
>>> pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> # ConsisID works well with long and well-described prompts. Make sure the face in the image is clearly visible (e.g., preferably half-body or full-body).
>>> prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
>>> image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_input.png?download=true"

>>> # Extract the identity embeddings and facial keypoints from the reference image.
>>> id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(
...     face_helper_1,
...     face_clip_model,
...     face_helper_2,
...     eva_transform_mean,
...     eva_transform_std,
...     face_main_model,
...     "cuda",
...     torch.bfloat16,
...     image,
...     is_align_face=True,
... )

>>> # Generate the video, conditioned on the prompt and the identity features.
>>> video = pipe(
...     image=image,
...     prompt=prompt,
...     num_inference_steps=50,
...     guidance_scale=6.0,
...     use_dynamic_cfg=False,
...     id_vit_hidden=id_vit_hidden,
...     id_cond=id_cond,
...     kps_cond=face_kps,
...     generator=torch.Generator("cuda").manual_seed(42),
... )
>>> # Save the first generated video at 8 FPS.
>>> export_to_video(video.frames[0], "output.mp4", fps=8)
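
The relationship between frame count, frame rate, and clip length can be checked with a few lines of arithmetic. This sketch uses the `num_seconds * fps + 1` relation from the `num_frames` description; the function names are hypothetical.

```python
def frames_for(num_seconds: int, fps: int) -> int:
    """Frame count for a clip: num_seconds * fps frames plus one extra frame."""
    return num_seconds * fps + 1


def duration_seconds(num_frames: int, fps: int) -> float:
    """Approximate clip length when num_frames are saved at the given frame rate."""
    return (num_frames - 1) / fps


# The default 49 frames correspond to 6 seconds at 8 FPS.
print(frames_for(6, 8))         # 49
print(duration_seconds(49, 8))  # 6.0
```

Saving the same 49 frames with a different `fps` value in `export_to_video` only changes playback speed; at 16 FPS, for instance, the clip would last about 3 seconds.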

encode_prompt

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 226 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )

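Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos that should be generated per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
dtype (torch.dtype, optional) — The torch dtype.

Encodes the prompt into text encoder hidden states.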
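
When `do_classifier_free_guidance` is enabled, both the prompt and the negative prompt are encoded so that two noise predictions can be combined at each denoising step. The sketch below shows the standard classifier-free guidance combination on plain Python lists; it illustrates the formula rather than reproducing the pipeline's tensor code, and `cfg_combine` is a hypothetical name.

```python
def cfg_combine(noise_uncond, noise_text, guidance_scale: float):
    """Classifier-free guidance: uncond + w * (text - uncond), element-wise."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


# With guidance_scale = 1 the result equals the text-conditioned prediction;
# larger values push further away from the unconditional prediction.
print(cfg_combine([0.0, 1.0], [1.0, 1.0], 1.0))  # [1.0, 1.0]
print(cfg_combine([0.0, 1.0], [1.0, 1.0], 6.0))  # [6.0, 1.0]
```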

ConsisIDPipelineOutput

class diffusers.pipelines.consisid.pipeline_output.ConsisIDPipelineOutput

( frames: Tensor )

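Parameters

frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing a denoised PIL image sequence of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for ConsisID pipelines.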
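
Because `frames` may be a nested list or an array-like object, code that consumes it sometimes needs the batch size and frame count regardless of representation. A minimal sketch, with `batch_and_frame_count` as a hypothetical helper:

```python
def batch_and_frame_count(frames):
    """Return (batch_size, num_frames) for nested lists or array-like outputs."""
    if hasattr(frames, "shape"):
        # NumPy array or torch tensor: the first two dimensions are
        # batch_size and num_frames.
        return int(frames.shape[0]), int(frames.shape[1])
    # Nested list: one inner list of PIL images per generated video.
    return len(frames), len(frames[0])
```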
