EasyAnimate

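EasyAnimate by Alibaba PAI.

The description from its GitHub page: EasyAnimate is a pipeline based on the transformer architecture, designed for generating AI images and videos, and for training baseline models and LoRA models for Diffusion Transformer. We support direct prediction from pre-trained EasyAnimate models, allowing for the generation of videos at various resolutions, approximately 6 seconds in length, at 8 fps (EasyAnimateV5.1, 1 to 49 frames). Additionally, users can train their own baseline and LoRA models for specific style transformations.

This pipeline was contributed by bubbliiiing. The original codebase can be found here. The original weights can be found under hf.co/alibaba-pai.

There are two official EasyAnimate checkpoints for text-to-video and video-to-video.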

Model checkpoint	Recommended inference dtype
`alibaba-pai/EasyAnimateV5.1-12b-zh`	torch.float16
`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`	torch.float16

There is one official EasyAnimate checkpoint available for image-to-video and video-to-video.

Model checkpoint	Recommended inference dtype
`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`	torch.float16

There are two official EasyAnimate checkpoints available for control-to-video.

Model checkpoint	Recommended inference dtype
`alibaba-pai/EasyAnimateV5.1-12b-zh-Control`	torch.float16
`alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera`	torch.float16

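For the EasyAnimateV5.1 series:

Text-to-video (T2V) and image-to-video (I2V) work at multiple resolutions. The width and height can each range from 256 to 1024.
Both T2V and I2V models support generating 1 to 49 frames and work best within this range. Exporting videos at 8 FPS is recommended.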
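The limits above can be checked before a generation call is made. The following is a minimal sketch, not part of the library: the helper name `check_generation_settings` and the constants are hypothetical, and only the ranges stated above (256 to 1024 pixels, 1 to 49 frames, 8 FPS export) are encoded.

```python
# Hypothetical helper; the limits are the ones documented for EasyAnimateV5.1.
MIN_SIDE, MAX_SIDE = 256, 1024    # documented width/height range, in pixels
MIN_FRAMES, MAX_FRAMES = 1, 49    # documented frame-count range
EXPORT_FPS = 8                    # recommended export frame rate


def check_generation_settings(height: int, width: int, num_frames: int) -> float:
    """Validate settings against the documented limits.

    Returns the clip duration in seconds when exported at EXPORT_FPS.
    """
    for name, side in (("height", height), ("width", width)):
        if not MIN_SIDE <= side <= MAX_SIDE:
            raise ValueError(f"{name}={side} is outside {MIN_SIDE}..{MAX_SIDE}")
    if not MIN_FRAMES <= num_frames <= MAX_FRAMES:
        raise ValueError(f"num_frames={num_frames} is outside {MIN_FRAMES}..{MAX_FRAMES}")
    return num_frames / EXPORT_FPS


# 49 frames at 8 FPS is about six seconds of video.
print(check_generation_settings(512, 512, 49))  # 6.125
```

The pipeline may enforce further constraints of its own (for example on divisibility of the sizes); this sketch does not attempt to reproduce them.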

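Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower-precision data type. However, the impact of quantization on video quality may vary from one video model to another.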
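As a rough illustration of the saving, the weight memory can be estimated from the parameter count and the bytes stored per parameter. This is a back-of-the-envelope sketch under a stated assumption: that the "12b" in the checkpoint names means roughly 12 billion transformer parameters. It counts weights only and ignores activations, the text encoder, the VAE, and any quantization overhead.

```python
# Assumption: "12b" in the checkpoint name means about 12 billion parameters.
PARAMS = 12_000_000_000

# Bytes used to store one weight in each data type.
BYTES_PER_PARAM = {"float16": 2, "int8": 1}


def weight_gib(num_params: int, dtype: str) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_gib(PARAMS, dtype):.1f} GiB")
```

Under this assumption, 8-bit storage halves the weight footprint relative to float16.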

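Refer to the Quantization overview to learn more about the supported quantization backends and how to select one that fits your use case. The example below demonstrates how to load a quantized EasyAnimatePipeline for inference with bitsandbytes.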

import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, EasyAnimateTransformer3DModel, EasyAnimatePipeline
from diffusers.utils import export_to_video

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = EasyAnimateTransformer3DModel.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = EasyAnimatePipeline.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A cat walks on the grass, realistic style."
negative_prompt = "bad detailed"
video = pipeline(prompt=prompt, negative_prompt=negative_prompt, num_frames=49, num_inference_steps=30).frames[0]
export_to_video(video, "cat.mp4", fps=8)

EasyAnimatePipeline

class diffusers.EasyAnimatePipeline

( vae: AutoencoderKLMagvit text_encoder: typing.Union[transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLForConditionalGeneration, transformers.models.bert.modeling_bert.BertModel] tokenizer: typing.Union[transformers.models.qwen2.tokenization_qwen2.Qwen2Tokenizer, transformers.models.bert.tokenization_bert.BertTokenizer] transformer: EasyAnimateTransformer3DModel scheduler: FlowMatchEulerDiscreteScheduler )

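Parameters

vae (AutoencoderKLMagvit): Variational auto-encoder (VAE) model used to encode and decode video to and from latent representations.
text_encoder (Optional[~transformers.Qwen2VLForConditionalGeneration, ~transformers.BertModel]): The text encoder. EasyAnimate uses Qwen2-VL in V5.1.
tokenizer (Optional[~transformers.Qwen2Tokenizer, ~transformers.BertTokenizer]): A Qwen2Tokenizer or BertTokenizer used to tokenize text.
transformer (EasyAnimateTransformer3DModel): The EasyAnimate model designed by the EasyAnimate team.
scheduler (FlowMatchEulerDiscreteScheduler): A scheduler used in combination with EasyAnimate to denoise the encoded image latents.

Pipeline for text-to-video generation using EasyAnimate.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).

EasyAnimate uses a single text encoder, Qwen2-VL, in V5.1.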

__call__

( prompt: typing.Union[str, typing.List[str]] = None num_frames: typing.Optional[int] = 49 height: typing.Optional[int] = 512 width: typing.Optional[int] = 512 num_inference_steps: typing.Optional[int] = 50 guidance_scale: typing.Optional[float] = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: typing.Optional[float] = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None timesteps: typing.Optional[typing.List[int]] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] guidance_rescale: float = 0.0 ) → EasyAnimatePipelineOutput or tuple

EasyAnimatePipelineOutput or tuple

If return_dict is True, an EasyAnimatePipelineOutput is returned, and the generated videos are available through its frames attribute, as in the examples on this page. Otherwise a tuple is returned whose first element holds the generated frames.
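Code that has to accept either return form can normalize it in one place. The sketch below is illustrative only: `first_video` is a hypothetical helper, the stand-in objects replace a real pipeline call, and it assumes that with return_dict=False the first tuple element carries the same batch of videos that the examples on this page read from the frames attribute.

```python
from types import SimpleNamespace


def first_video(result):
    """Return the first generated video from either output form.

    Assumption: a tuple result stores the batch of videos in element 0,
    while an output object exposes the same batch as `.frames`.
    """
    if isinstance(result, tuple):   # return_dict=False
        return result[0][0]
    return result.frames[0]         # return_dict=True


# Stand-ins for a real pipeline result: one video made of two "frames".
batch = [["frame0", "frame1"]]
as_object = SimpleNamespace(frames=batch)
as_tuple = (batch,)

print(first_video(as_object) == first_video(as_tuple))  # True
```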

Generates images or videos with the EasyAnimate pipeline based on the provided prompts.

Examples

>>> import torch
>>> from diffusers import EasyAnimatePipeline
>>> from diffusers.utils import export_to_video

>>> # Models: "alibaba-pai/EasyAnimateV5.1-12b-zh"
>>> pipe = EasyAnimatePipeline.from_pretrained(
...     "alibaba-pai/EasyAnimateV5.1-7b-zh-diffusers", torch_dtype=torch.float16
... ).to("cuda")
>>> prompt = (
...     "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
...     "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
...     "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
...     "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
...     "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
...     "atmosphere of this unique musical performance."
... )
>>> sample_size = (512, 512)
>>> video = pipe(
...     prompt=prompt,
...     guidance_scale=6,
...     negative_prompt="bad detailed",
...     height=sample_size[0],
...     width=sample_size[1],
...     num_inference_steps=50,
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=8)

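prompt (str or List[str], optional): Text prompts that guide the image or video generation. If not provided, use prompt_embeds instead.
num_frames (int, optional, defaults to 49): Length of the generated video, in frames.
height (int, optional, defaults to 512): Height of the generated image, in pixels.
width (int, optional, defaults to 512): Width of the generated image, in pixels.
num_inference_steps (int, optional, defaults to 50): Number of denoising steps during generation. More steps generally yield higher-quality images but slow down inference.
guidance_scale (float, optional, defaults to 5.0): Encourages the model to align its output with the prompt. Higher values may reduce image quality.
negative_prompt (str or List[str], optional): Prompts describing what to exclude from the generation. If not specified, use negative_prompt_embeds.
num_images_per_prompt (int, optional, defaults to 1): Number of images to generate for each prompt.
eta (float, optional, defaults to 0.0): Applies to DDIM scheduling. Corresponds to the eta parameter from the related literature.
generator (torch.Generator or List[torch.Generator], optional): A generator used to make generation reproducible.
latents (torch.Tensor, optional): Predefined latent tensors on which to condition generation.
prompt_embeds (torch.Tensor, optional): Text embeddings for the prompt. Overrides the prompt string input for greater flexibility.
negative_prompt_embeds (torch.Tensor, optional): Embeddings for the negative prompt. Overrides the string input if defined.
prompt_attention_mask (torch.Tensor, optional): Attention mask for the primary prompt embeddings.
negative_prompt_attention_mask (torch.Tensor, optional): Attention mask for the negative prompt embeddings.
output_type (str, optional, defaults to "pil"): Format of the generated output, either PIL images or NumPy arrays.
return_dict (bool, optional, defaults to True): If True, returns a structured output. Otherwise returns a plain tuple.
callback_on_step_end (Callable, optional): A function called at the end of each denoising step.
callback_on_step_end_tensor_inputs (List[str], optional): Names of the tensors to include in the callback function call.
guidance_rescale (float, optional, defaults to 0.0): Adjusts noise levels based on the guidance scale.

The docstring additionally lists three parameters that do not appear in the signature above:

original_size (Tuple[int, int], optional, defaults to (1024, 1024)): The original dimensions of the output.
target_size (Tuple[int, int], optional): The desired output dimensions, used in calculations.
crops_coords_top_left (Tuple[int, int], optional, defaults to (0, 0)): Coordinates for cropping.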
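To make callback_on_step_end and callback_on_step_end_tensor_inputs more concrete, here is a minimal sketch of a logging callback. It assumes this pipeline follows the calling convention used by other Diffusers pipelines, in which the callback receives the pipeline, the step index, the timestep, and a dict of the requested tensors, and returns that dict. The names `log_step` and `step_log` are hypothetical, and the final line is a dry run with stand-in values rather than a real denoising step.

```python
step_log = []


def log_step(pipeline, step, timestep, callback_kwargs):
    """Record which tensors were passed at each denoising step.

    Assumed convention: called as callback(pipeline, step, timestep, kwargs)
    and expected to return the (possibly modified) kwargs dict.
    """
    step_log.append((step, timestep, sorted(callback_kwargs)))
    return callback_kwargs


# Dry run with stand-in values; a real call would receive tensors.
log_step(None, 0, 999, {"latents": object()})
print(step_log)  # [(0, 999, ['latents'])]
```

Under that assumption, it would be passed as `callback_on_step_end=log_step` together with `callback_on_step_end_tensor_inputs=["latents"]`.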

encode_prompt

( prompt: typing.Union[str, typing.List[str]] num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None max_sequence_length: int = 256 )

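Parameters

prompt (str or List[str], optional): The prompt to be encoded.
device (torch.device): The torch device.
dtype (torch.dtype): The torch dtype.
num_images_per_prompt (int): Number of images that should be generated per prompt.
do_classifier_free_guidance (bool): Whether to use classifier-free guidance.
negative_prompt (str or List[str], optional): The prompt or prompts that should not guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when guidance is not used (that is, when guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, for example prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, for example prompt weighting. If not provided, negative prompt embeddings are generated from the negative_prompt input argument.
prompt_attention_mask (torch.Tensor, optional): Attention mask for the prompt. Required when prompt_embeds is passed directly.
negative_prompt_attention_mask (torch.Tensor, optional): Attention mask for the negative prompt. Required when negative_prompt_embeds is passed directly.
max_sequence_length (int, optional): Maximum sequence length to use for the prompt.

Encodes the prompt into text encoder hidden states.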
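The reason both a prompt and a negative prompt are encoded becomes clearer with the usual classifier-free guidance rule, in which the two predictions are blended according to guidance_scale. The sketch below shows that common formulation on plain Python lists; it is an assumption that EasyAnimate combines the predictions in exactly this way, and `apply_cfg` is a hypothetical name.

```python
def apply_cfg(noise_uncond, noise_text, guidance_scale):
    """Common classifier-free guidance blend (assumed, not library code).

    noise_uncond: prediction conditioned on the negative (or empty) prompt
    noise_text:   prediction conditioned on the prompt
    """
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0]
text = [1.0, 3.0]

# A scale of 1 reproduces the prompt-conditioned prediction unchanged.
print(apply_cfg(uncond, text, 1.0))  # [1.0, 3.0]
# Larger scales push the result further away from the unconditional one.
print(apply_cfg(uncond, text, 5.0))  # [5.0, 11.0]
```

With a scale of 1 the negative branch has no effect on the result, which is why guidance is typically treated as disabled at or below that value.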

EasyAnimatePipelineOutput

class diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput

( frames: Tensor )

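Parameters

frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]): List of video outputs. It can be a nested list of length batch_size, where each sub-list contains a denoised PIL image sequence of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for EasyAnimate pipelines.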
