Hunyuan-DiT

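Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding, from Tencent Hunyuan.

The abstract from the paper is:

We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models.

You can find the original codebase at Tencent/HunyuanDiT and all the available checkpoints at Tencent-Hunyuan.

Highlights: HunyuanDiT supports Chinese/English-to-image, multi-resolution generation.

HunyuanDiT has the following components:

It uses a diffusion transformer as the backbone
It combines two text encoders, a bilingual CLIP and a multilingual T5 encoder

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

You can further improve generation quality by passing the image generated by HunyuanDiTPipeline to an SDXL refiner model.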
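The two-stage idea (generate with HunyuanDiT, then refine with SDXL) can be sketched as follows. This is a minimal sketch, not an official recipe: the refiner checkpoint name `stabilityai/stable-diffusion-xl-refiner-1.0`, the `strength` value, and the helper name `generate_and_refine` are assumptions chosen for illustration. An English prompt is used on the assumption that the refiner's text encoders handle English best.

```python
def generate_and_refine(prompt: str, strength: float = 0.3):
    """Generate an image with HunyuanDiT, then polish it with an SDXL refiner.

    `generate_and_refine` is a hypothetical helper, not part of diffusers.
    """
    # Imported lazily so that defining the helper needs neither a GPU nor the weights.
    import torch
    from diffusers import HunyuanDiTPipeline, StableDiffusionXLImg2ImgPipeline

    base = HunyuanDiTPipeline.from_pretrained(
        "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
    ).to("cuda")
    image = base(prompt=prompt).images[0]

    # Assumption: the public SDXL refiner checkpoint is used as the second stage.
    refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
    ).to("cuda")

    # A low strength keeps the composition and only re-denoises fine detail.
    return refiner(prompt=prompt, image=image, strength=strength).images[0]
```

Calling `generate_and_refine("An astronaut riding a horse")` returns a PIL image. Holding both pipelines on the GPU at once costs memory, so on smaller cards it may be preferable to free the base pipeline before loading the refiner.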

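Optimization

You can optimize the pipeline's runtime and memory consumption with torch.compile and feed-forward chunking. To learn about other optimization methods, check out the Speed up inference and Reduce memory usage guides.

Inference

Use torch.compile to reduce the inference latency.

First, load the pipeline: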

from diffusers import HunyuanDiTPipeline
import torch

pipeline = HunyuanDiTPipeline.from_pretrained(
	"Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
).to("cuda")

Then change the memory layout of the pipeline's transformer and vae components to torch.channels_last:

pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)

Finally, compile the components and run inference:

pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True)
pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True)

image = pipeline(prompt="一个宇航员在骑马").images[0]

The benchmark results on an 80GB A100 machine are:

With torch.compile(): Average inference time: 12.470 seconds.
Without torch.compile(): Average inference time: 20.570 seconds.
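To reproduce a comparison like the one above on your own hardware, a small timing harness is enough. The following is a sketch under stated assumptions, not the script that produced the numbers above: `average_inference_time` and its parameters are hypothetical names. Warm-up runs matter with torch.compile because the first call includes compilation time.

```python
import time
from statistics import mean


def average_inference_time(run_once, runs=3, warmup=1, sync=None):
    """Return the mean wall-clock time of `run_once()` over `runs` timed calls.

    `sync` is an optional callable (for example torch.cuda.synchronize) that
    waits for outstanding asynchronous work before the clock is read.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")

    # Untimed warm-up calls absorb one-off costs such as compilation.
    for _ in range(warmup):
        run_once()

    timings = []
    for _ in range(runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        run_once()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return mean(timings)
```

With a CUDA pipeline this could be called as `average_inference_time(lambda: pipeline(prompt="一个宇航员在骑马"), sync=torch.cuda.synchronize)`. The figures reported above correspond to a speedup of roughly 1.65x (20.570 / 12.470).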

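Memory optimization

By loading the T5 text encoder in 8-bit, you can run the pipeline in just under 6 GB of GPU VRAM. Refer to this script for details.

Furthermore, you can use the enable_forward_chunking() method to reduce memory usage. Feed-forward chunking runs the feed-forward layers in a transformer block in a loop instead of all at once. This gives you a trade-off between memory consumption and inference runtime.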

+ pipeline.transformer.enable_forward_chunking(chunk_size=1, dim=1)
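The idea behind feed-forward chunking can be illustrated without any deep learning library. The sketch below is a toy, not the diffusers implementation: it uses plain Python lists, a made-up position-wise layer, and hypothetical function names. Because a feed-forward layer treats every token independently, processing the sequence in slices gives the same result while only one slice of intermediate activations exists at a time. The token axis here plays the role of `dim=1` in the call above.

```python
from typing import Callable, List

Vector = List[float]


def feed_forward(tokens: List[Vector]) -> List[Vector]:
    """Toy position-wise layer: every token is transformed independently."""
    return [[max(0.0, 2.0 * x - 1.0) for x in token] for token in tokens]


def chunked_feed_forward(
    ff: Callable[[List[Vector]], List[Vector]],
    tokens: List[Vector],
    chunk_size: int,
) -> List[Vector]:
    """Apply `ff` to `chunk_size` tokens at a time and concatenate the results."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    output: List[Vector] = []
    for start in range(0, len(tokens), chunk_size):
        # Only this slice's intermediate activations are alive at once.
        output.extend(ff(tokens[start:start + chunk_size]))
    return output


sequence = [[0.0, 1.0], [2.0, -1.0], [0.5, 3.0], [1.5, 0.25]]
# Chunked and unchunked evaluation agree; only peak memory and loop count differ.
assert chunked_feed_forward(feed_forward, sequence, chunk_size=1) == feed_forward(sequence)
```

A smaller chunk size lowers peak memory but adds more loop iterations, which is the runtime cost mentioned above.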

HunyuanDiTPipeline

class diffusers.HunyuanDiTPipeline

( vae: AutoencoderKL text_encoder: BertModel tokenizer: BertTokenizer transformer: HunyuanDiT2DModel scheduler: DDPMScheduler safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor requires_safety_checker: bool = True text_encoder_2 = <class 'transformers.models.t5.modeling_t5.T5EncoderModel'> tokenizer_2 = <class 'transformers.models.mt5.tokenization_mt5.MT5Tokenizer'> )

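Parameters

vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. We use sdxl-vae-fp16-fix.
text_encoder (Optional[~transformers.BertModel, ~transformers.CLIPTextModel]) — Frozen text encoder (clip-vit-large-patch14). HunyuanDiT uses a fine-tuned [bilingual CLIP].
tokenizer (Optional[~transformers.BertTokenizer, ~transformers.CLIPTokenizer]) — A BertTokenizer or CLIPTokenizer to tokenize text.
transformer (HunyuanDiT2DModel) — The HunyuanDiT model designed by Tencent Hunyuan.
text_encoder_2 (T5EncoderModel) — The mT5 embedder. Specifically, it is "t5-v1_1-xxl".
tokenizer_2 (MT5Tokenizer) — The tokenizer for the mT5 embedder.
scheduler (DDPMScheduler) — A scheduler to be used in combination with HunyuanDiT to denoise the encoded image latents.

Pipeline for English/Chinese-to-image generation using HunyuanDiT.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).

HunyuanDiT uses two text encoders: mT5 and [bilingual CLIP] (fine-tuned by ourselves).

__call__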

( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: typing.Optional[int] = 50 guidance_scale: typing.Optional[float] = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: typing.Optional[float] = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_embeds_2: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds_2: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None prompt_attention_mask_2: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask_2: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] guidance_rescale: float = 0.0 original_size: typing.Optional[typing.Tuple[int, int]] = (1024, 1024) target_size: typing.Optional[typing.Tuple[int, int]] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) use_resolution_binning: bool = True ) → StableDiffusionPipelineOutput or tuple

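Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
height (int) — The height in pixels of the generated image.
width (int) — The width in pixels of the generated image.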
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
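guidance_scale (float, optional, defaults to 5.0) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.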
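negative_prompt (str or List[str], optional) — The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler, and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
prompt_embeds_2 (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
negative_prompt_embeds_2 (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
prompt_attention_mask (torch.Tensor, optional) — Attention mask for the prompt. Required when prompt_embeds is passed directly.
prompt_attention_mask_2 (torch.Tensor, optional) — Attention mask for the prompt. Required when prompt_embeds_2 is passed directly.
negative_prompt_attention_mask (torch.Tensor, optional) — Attention mask for the negative prompt. Required when negative_prompt_embeds is passed directly.
negative_prompt_attention_mask_2 (torch.Tensor, optional) — Attention mask for the negative prompt. Required when negative_prompt_embeds_2 is passed directly.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a StableDiffusionPipelineOutput instead of a plain tuple.
callback_on_step_end (Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks, optional) — A callback function or a list of callback functions to be called at the end of each denoising step.
callback_on_step_end_tensor_inputs (List[str], optional) — A list of tensor inputs that should be passed to the callback function. If not defined, all tensor inputs will be passed.
guidance_rescale (float, optional, defaults to 0.0) — Rescale the noise_cfg according to guidance_rescale. Based on findings of Common Diffusion Noise Schedules and Sample Steps are Flawed. See Section 3.4.
original_size (Tuple[int, int], optional, defaults to (1024, 1024)) — The original size of the image. Used to calculate the time ids.
target_size (Tuple[int, int], optional) — The target size of the image. Used to calculate the time ids.
crops_coords_top_left (Tuple[int, int], optional, defaults to (0, 0)) — The top left coordinates of the crop. Used to calculate the time ids.
use_resolution_binning (bool, optional, defaults to True) — Whether to use resolution binning or not. If True, the input resolution will be mapped to the closest standard resolution. Supported resolutions are 1024x1024, 1280x1280, 1024x768, 1152x864, 1280x960, 768x1024, 864x1152, 960x1280, 1280x768, and 768x1280. It is recommended to set this to True.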
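To make "mapped to the closest standard resolution" concrete, here is one plausible way such a mapping could work. It is a sketch under assumptions, not the library's actual rule: it treats each listed pair as (width, height), picks the supported shape with the closest aspect ratio, and breaks ties by closest pixel area. The name `nearest_supported_resolution` is hypothetical.

```python
from typing import List, Tuple

# Supported shapes from the parameter description, read here as (width, height).
SUPPORTED_RESOLUTIONS: List[Tuple[int, int]] = [
    (1024, 1024), (1280, 1280),
    (1024, 768), (1152, 864), (1280, 960),
    (768, 1024), (864, 1152), (960, 1280),
    (1280, 768), (768, 1280),
]


def nearest_supported_resolution(width: int, height: int) -> Tuple[int, int]:
    """Pick the supported shape closest in aspect ratio, then closest in area."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    ratio = width / height
    area = width * height
    return min(
        SUPPORTED_RESOLUTIONS,
        key=lambda wh: (abs(wh[0] / wh[1] - ratio), abs(wh[0] * wh[1] - area)),
    )


# A 1200x900 request is 4:3; among the 4:3 shapes, 1152x864 is closest in area.
assert nearest_supported_resolution(1200, 900) == (1152, 864)
```

Under this rule a request for 1000x1000 maps to 1024x1024, and a tall request such as 700x1300 maps to 768x1280.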

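Returns

StableDiffusionPipelineOutput or tuple

If return_dict is True, a StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content.

The call function to the pipeline for generation with HunyuanDiT.

Examples: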

>>> import torch
>>> from diffusers import HunyuanDiTPipeline

>>> pipe = HunyuanDiTPipeline.from_pretrained(
...     "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
... )
>>> pipe.to("cuda")

>>> # You may also use English prompt as HunyuanDiT supports both English and Chinese
>>> # prompt = "An astronaut riding a horse"
>>> prompt = "一个宇航员在骑马"
>>> image = pipe(prompt).images[0]
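Several of the parameters documented above can be combined in one call. The sketch below is illustrative only: the helper name `generate_reproducible`, the negative prompt text, and the chosen size are assumptions. It fixes the seed through `generator`, requests one of the supported resolutions, and uses `return_dict=False` to get the plain tuple described under Returns.

```python
def generate_reproducible(prompt: str, seed: int = 0):
    """Generate one image deterministically and return it with the nsfw flags.

    `generate_reproducible` is a hypothetical helper, not part of diffusers.
    """
    # Imported lazily so that defining the helper needs neither a GPU nor the weights.
    import torch
    from diffusers import HunyuanDiTPipeline

    pipe = HunyuanDiTPipeline.from_pretrained(
        "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
    ).to("cuda")

    # The same seed with the same settings should reproduce the same image.
    generator = torch.Generator(device="cuda").manual_seed(seed)

    images, nsfw_flags = pipe(
        prompt=prompt,
        negative_prompt="blurry, low quality",  # illustrative value
        height=1024,
        width=768,
        num_inference_steps=50,
        guidance_scale=5.0,
        generator=generator,
        return_dict=False,  # plain tuple: (images, nsfw flags)
    )
    # nsfw_flags may be None if the pipeline was loaded without a safety checker.
    return images[0], nsfw_flags
```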

encode_prompt

( prompt: str device: device = None dtype: dtype = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None max_sequence_length: typing.Optional[int] = None text_encoder_index: int = 0 )

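Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
device (torch.device) — The torch device.
dtype (torch.dtype) — The torch dtype.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
prompt_attention_mask (torch.Tensor, optional) — Attention mask for the prompt. Required when prompt_embeds is passed directly.
negative_prompt_attention_mask (torch.Tensor, optional) — Attention mask for the negative prompt. Required when negative_prompt_embeds is passed directly.
max_sequence_length (int, optional) — The maximum sequence length to use for the prompt.
text_encoder_index (int, optional) — Index of the text encoder to use. 0 for CLIP and 1 for T5.

Encodes the prompt into text encoder hidden states.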
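Because the pipeline has two text encoders, precomputing embeddings means calling encode_prompt once per encoder and passing both sets to the pipeline. The sketch below shows one way this might look. It assumes that encode_prompt returns a 4-tuple in the order (prompt_embeds, negative_prompt_embeds, prompt_attention_mask, negative_prompt_attention_mask), which the signature above does not state, so verify it against your installed version. The helper name is hypothetical.

```python
def generate_from_precomputed_embeddings(pipe, prompt: str, negative_prompt: str = ""):
    """Encode the prompt with both text encoders, then generate from embeddings.

    `pipe` is an already loaded HunyuanDiTPipeline. The return order of
    encode_prompt used below is an assumption (see the lead-in).
    """
    call_kwargs = {}
    # text_encoder_index 0 selects CLIP, 1 selects T5; the "_2" suffix marks T5 inputs.
    for index, suffix in ((0, ""), (1, "_2")):
        embeds, negative_embeds, mask, negative_mask = pipe.encode_prompt(
            prompt=prompt,
            device=pipe.device,
            dtype=pipe.transformer.dtype,
            num_images_per_prompt=1,
            do_classifier_free_guidance=True,
            negative_prompt=negative_prompt,
            text_encoder_index=index,
        )
        call_kwargs["prompt_embeds" + suffix] = embeds
        call_kwargs["negative_prompt_embeds" + suffix] = negative_embeds
        call_kwargs["prompt_attention_mask" + suffix] = mask
        call_kwargs["negative_prompt_attention_mask" + suffix] = negative_mask

    # No text prompt is passed here; the embeddings and masks replace it.
    return pipe(**call_kwargs).images[0]
```

Keeping the embeddings around is mainly useful when the same prompt is rendered many times, or when the text encoders are loaded separately from the transformer to save memory.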
