UniDiffuser

The UniDiffuser model was proposed in One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu.

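The abstract from the paper is:

This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is that learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model: it perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks, and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to bespoke models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).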
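The abstract's central idea, that a single noise-prediction network covers marginal, conditional, and joint sampling by choosing a timestep per modality, can be sketched as a small lookup. This is an illustrative sketch based on the paper's description, not the pipeline's actual code: the function name `modality_timesteps`, the mode strings, and the default `max_t=1000` are assumptions made for this example.

```python
from typing import Tuple


def modality_timesteps(mode: str, t: int, max_t: int = 1000) -> Tuple[int, int]:
    """Return (image_timestep, text_timestep) for one denoising step.

    t     -- current timestep of the modality being generated
    max_t -- timestep that corresponds to pure noise (assumed value)
    """
    if mode == "joint":      # image and text are denoised together
        return t, t
    if mode == "text2img":   # text is clean (timestep 0), image is denoised
        return t, 0
    if mode == "img2text":   # image is clean (timestep 0), text is denoised
        return 0, t
    if mode == "img":        # marginal image: text is kept at pure noise
        return t, max_t
    if mode == "text":       # marginal text: image is kept at pure noise
        return max_t, t
    raise ValueError(f"unknown mode: {mode!r}")
```

A clean modality (timestep 0) acts as the condition, while a modality held at the maximum timestep carries no information and is effectively marginalized out.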

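You can find the original codebase at thu-ml/unidiffuser and additional checkpoints at thu-ml.

There is currently an issue on PyTorch 1.X where the output images are all black or the pixel values become NaNs. This issue can be mitigated by switching to PyTorch 2.X.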
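To guard against the PyTorch 1.X issue before loading the pipeline, a version check along these lines may help. It is a minimal sketch: `is_pytorch_1x` is a hypothetical helper name, and it only inspects the major component of the version string.

```python
def is_pytorch_1x(version: str) -> bool:
    """Return True if a version string such as "1.13.1+cu117" is a 1.X release."""
    major = version.split(".")[0]
    return major == "1"


# Typical use (requires PyTorch to be installed):
#   import torch
#   if is_pytorch_1x(torch.__version__):
#       print("Black or NaN images are possible; consider upgrading to PyTorch 2.X")
```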

This pipeline was contributed by dg845. ❤️

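Usage Examples

Because the UniDiffuser model is trained to model the joint distribution of (image, text) pairs, it is capable of performing a diverse range of generation tasks:

Unconditional Image and Text Generation

Unconditional generation from a UniDiffuserPipeline (where we start from only latents sampled from a standard Gaussian prior) produces an (image, text) pair: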

import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Unconditional image and text generation. The generation task is automatically inferred.
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image = sample.images[0]
text = sample.text[0]
image.save("unidiffuser_joint_sample_image.png")
print(text)

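This is also called "joint" generation in the UniDiffuser paper, since we are sampling from the joint image-text distribution.

Note that the generation task is inferred from the inputs used when calling the pipeline. It is also possible to manually specify the unconditional generation task ("mode") with UniDiffuserPipeline.set_joint_mode():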

# Equivalent to the above.
pipe.set_joint_mode()
sample = pipe(num_inference_steps=20, guidance_scale=8.0)

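When the mode is set manually, subsequent calls to the pipeline use the set mode without attempting to infer it. You can reset the mode with UniDiffuserPipeline.reset_mode(), after which the pipeline will once again infer the mode.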
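The interplay between inferred and manually set modes can be pictured with a toy model. This is a simplified sketch of the behaviour described on this page, not the library's implementation: `ModeSelector` and its methods are hypothetical, and raising an error when both a prompt and an image are given is a choice made only for this example.

```python
from typing import Optional


class ModeSelector:
    """Toy model: a manually set mode overrides the mode inferred from inputs."""

    def __init__(self) -> None:
        self.manual_mode: Optional[str] = None

    def set_mode(self, mode: str) -> None:
        # Mirrors the effect of the set_*_mode() methods: remember the choice.
        self.manual_mode = mode

    def reset_mode(self) -> None:
        # Mirrors reset_mode(): go back to inferring from the inputs.
        self.manual_mode = None

    def resolve(self, prompt: Optional[str] = None, image: Optional[object] = None) -> str:
        if self.manual_mode is not None:
            return self.manual_mode
        if prompt is not None and image is not None:
            # Assumption of this sketch only; the real pipeline may behave differently.
            raise ValueError("pass either a prompt or an image, not both")
        if prompt is not None:
            return "text2img"
        if image is not None:
            return "img2text"
        return "joint"  # no conditioning input: unconditional joint generation
```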

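You can also generate only an image or only text (which the UniDiffuser paper calls "marginal" generation, since we sample from the marginal distribution of images and text, respectively):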

# Unlike other generation tasks, image-only and text-only generation don't use classifier-free guidance
# Image-only generation
pipe.set_image_mode()
sample_image = pipe(num_inference_steps=20).images[0]
# Text-only generation
pipe.set_text_mode()
sample_text = pipe(num_inference_steps=20).text[0]

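Text-to-Image Generation

UniDiffuser is also capable of sampling from conditional distributions; that is, the distribution of images conditioned on a text prompt or the distribution of texts conditioned on an image. Here is an example of sampling from the conditional image distribution (text-to-image generation or text-conditioned image generation):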

import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Text-to-image generation
prompt = "an elephant under the sea"

sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
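# Save the result so the example also works outside a notebook.
t2i_image.save("unidiffuser_text2img_sample_image.png")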

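The text2img mode requires that either an input prompt or prompt_embeds be supplied. You can set the text2img mode manually with UniDiffuserPipeline.set_text_to_image_mode().

Image-to-Text Generation

Similarly, UniDiffuser can also produce text samples given an image (image-to-text or image-conditioned text generation):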

import torch

from diffusers import UniDiffuserPipeline
from diffusers.utils import load_image

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))

sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)

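The img2text mode requires that an input image be supplied. You can set the img2text mode manually with UniDiffuserPipeline.set_image_to_text_mode().

Image Variation

The UniDiffuser authors suggest performing image variation through a "round-trip" generation method: given an input image, we first perform an image-to-text generation, and then perform a text-to-image generation on the output of the first generation. This produces a new image that is semantically similar to the input image: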

import torch

from diffusers import UniDiffuserPipeline
from diffusers.utils import load_image

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
# 1. Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))

sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)

# 2. Text-to-image generation
sample = pipe(prompt=i2t_text, num_inference_steps=20, guidance_scale=8.0)
final_image = sample.images[0]
final_image.save("unidiffuser_image_variation_sample.png")

Text Variation

Similarly, text variation can be performed on an input prompt with a text-to-image generation followed by an image-to-text generation:

import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Text variation can be performed with a text-to-image generation followed by an image-to-text generation:
# 1. Text-to-image generation
prompt = "an elephant under the sea"

sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")

# 2. Image-to-text generation
sample = pipe(image=t2i_image, num_inference_steps=20, guidance_scale=8.0)
final_prompt = sample.text[0]
print(final_prompt)
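The two round trips can be wrapped in small helpers so that a loaded pipeline is reused for both. This is a sketch rather than part of the library: `image_variation` and `text_variation` are hypothetical names, and `pipe` is assumed to be an already loaded UniDiffuserPipeline whose mode is inferred from its inputs.

```python
def image_variation(pipe, image, num_inference_steps=20, guidance_scale=8.0):
    """Image -> caption -> new image. Returns (caption, new_image)."""
    settings = dict(num_inference_steps=num_inference_steps, guidance_scale=guidance_scale)
    caption = pipe(image=image, **settings).text[0]       # 1. image-to-text
    new_image = pipe(prompt=caption, **settings).images[0]  # 2. text-to-image
    return caption, new_image


def text_variation(pipe, prompt, num_inference_steps=20, guidance_scale=8.0):
    """Prompt -> image -> new prompt. Returns (image, new_prompt)."""
    settings = dict(num_inference_steps=num_inference_steps, guidance_scale=guidance_scale)
    image = pipe(prompt=prompt, **settings).images[0]     # 1. text-to-image
    new_prompt = pipe(image=image, **settings).text[0]    # 2. image-to-text
    return image, new_prompt
```

If a mode was set manually beforehand, calling pipe.reset_mode() first lets each step infer its mode from the inputs.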

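Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.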

UniDiffuserPipeline

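class diffusers.UniDiffuserPipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel image_encoder: CLIPVisionModelWithProjection clip_image_processor: CLIPImageProcessor clip_tokenizer: CLIPTokenizer text_decoder: UniDiffuserTextDecoder text_tokenizer: GPT2Tokenizer unet: UniDiffuserModel scheduler: KarrasDiffusionSchedulers )

Parameters

vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. This is part of the UniDiffuser image representation along with the CLIP vision encoding.
text_encoder (CLIPTextModel) — Frozen text encoder (clip-vit-large-patch14).
image_encoder (CLIPVisionModelWithProjection) — A CLIPVisionModelWithProjection to encode images as part of the image representation along with the VAE latent representation.
clip_image_processor (CLIPImageProcessor) — A CLIPImageProcessor to preprocess an image before CLIP-encoding it with image_encoder.
clip_tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize the prompt before encoding it with text_encoder.
text_decoder (UniDiffuserTextDecoder) — Frozen text decoder. This is a GPT-style model used to generate text from the UniDiffuser embedding.
text_tokenizer (GPT2Tokenizer) — A GPT2Tokenizer to decode text for text generation; used along with text_decoder.
unet (UniDiffuserModel) — A U-ViT model with UNet-style skip connections between transformer layers to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image and/or text latents. The original UniDiffuser paper uses the DPMSolverMultistepScheduler scheduler.

Pipeline for a bimodal image-text model which supports unconditional text and image generation, text-conditioned image generation, image-conditioned text generation, and joint image-text generation.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).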

__call__

( prompt: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[torch.Tensor, PIL.Image.Image, NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None data_type: typing.Optional[int] = 1 num_inference_steps: int = 50 guidance_scale: float = 8.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 num_prompts_per_image: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_latents: typing.Optional[torch.Tensor] = None vae_latents: typing.Optional[torch.Tensor] = None clip_latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 ) → ImageTextPipelineOutput or tuple

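Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds. Required for text-conditioned image generation (text2img) mode.
image (torch.Tensor or PIL.Image.Image, optional) — Image or tensor representing an image batch. Required for image-conditioned text generation (img2text) mode.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image.
data_type (int, optional, defaults to 1) — The data type (either 0 or 1). Only used if you are loading a checkpoint which supports a data type embedding; this is added for compatibility with the UniDiffuser-v1 checkpoint.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 8.0) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
negative_prompt (str or List[str], optional) — The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale <= 1). Used in text-conditioned image generation (text2img) mode.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt. Used in text2img (text-conditioned image generation) and img mode. If the mode is joint and both num_images_per_prompt and num_prompts_per_image are supplied, min(num_images_per_prompt, num_prompts_per_image) samples are generated.
num_prompts_per_image (int, optional, defaults to 1) — The number of prompts to generate per image. Used in img2text (image-conditioned text generation) and text mode. If the mode is joint and both num_images_per_prompt and num_prompts_per_image are supplied, min(num_images_per_prompt, num_prompts_per_image) samples are generated.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for joint image-text generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator. This assumes a full set of VAE, CLIP, and text latents; if supplied, it overrides the values of prompt_latents, vae_latents, and clip_latents.
prompt_latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for text generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
vae_latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
clip_latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument. Used in text-conditioned image generation (text2img) mode.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument. Used in text-conditioned image generation (text2img) mode.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return an ImageTextPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.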
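The interaction between num_images_per_prompt and num_prompts_per_image across modes can be summarized in a few lines. This is a sketch of the rule as stated in the parameter descriptions above, not the pipeline's own code; `samples_per_input` is a hypothetical helper name.

```python
def samples_per_input(mode: str, num_images_per_prompt: int = 1, num_prompts_per_image: int = 1) -> int:
    """Number of samples generated per input, following the documented rule."""
    if mode in ("text2img", "img"):
        # Image-producing modes use the image multiplier.
        return num_images_per_prompt
    if mode in ("img2text", "text"):
        # Text-producing modes use the prompt multiplier.
        return num_prompts_per_image
    if mode == "joint":
        # Joint mode produces (image, text) pairs, so the smaller value wins.
        return min(num_images_per_prompt, num_prompts_per_image)
    raise ValueError(f"unknown mode: {mode!r}")
```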

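ImageTextPipelineOutput or tuple

If return_dict is True, an ImageTextPipelineOutput is returned; otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of generated texts.

The call function to the pipeline for generation.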
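A callback matching the callback(step, timestep, latents) signature could look like the following. This is a minimal sketch: the `StepLogger` class is a hypothetical helper, not part of diffusers, and it only records which steps were reported.

```python
class StepLogger:
    """Collects (step, timestep) pairs each time the pipeline invokes the callback."""

    def __init__(self) -> None:
        self.records = []

    def __call__(self, step, timestep, latents) -> None:
        # latents is accepted to match the expected signature but is not stored,
        # which keeps memory use flat during long runs.
        self.records.append((step, timestep))
```

An instance could then be passed as pipe(prompt=prompt, callback=logger, callback_steps=5) so that it is invoked every 5 steps.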

disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method goes back to computing decoding in one step.

disable_vae_tiling

( )

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method goes back to computing decoding in one step.

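enable_vae_slicing

( )

Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.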
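The idea behind slicing can be illustrated without any model: decode the batch a slice at a time and stitch the results back together, so only one slice is in flight at once. This is a conceptual sketch using plain lists, not the diffusers implementation; `decode_in_slices` and `decode_fn` are hypothetical names.

```python
def decode_in_slices(decode_fn, batch, slice_size=1):
    """Apply decode_fn to consecutive slices of batch and concatenate the outputs."""
    outputs = []
    for start in range(0, len(batch), slice_size):
        chunk = batch[start:start + slice_size]
        # Peak memory is bounded by one slice instead of the whole batch.
        outputs.extend(decode_fn(chunk))
    return outputs
```

With the real pipeline there is nothing to implement: calling pipe.enable_vae_slicing() before running the pipeline turns the behaviour on.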

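enable_vae_tiling

( )

Enable tiled VAE decoding. When this option is enabled, the VAE splits the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and allows processing larger images.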

encode_prompt

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is not greater than 1).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that is applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — The number of layers to skip from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.
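The meaning of clip_skip is easiest to see as an index into the per-layer outputs of the text encoder. The sketch below shows only that selection step on a plain list and omits everything else encode_prompt does; `select_hidden_state` is a hypothetical helper name.

```python
def select_hidden_state(hidden_states, clip_skip=None):
    """Pick the layer output used for prompt embeddings.

    hidden_states -- per-layer outputs, ordered from the first layer to the final one
    clip_skip     -- number of layers to skip counting back from the final layer
    """
    if clip_skip is None:
        return hidden_states[-1]            # use the final layer
    return hidden_states[-(clip_skip + 1)]  # clip_skip=1 selects the pre-final layer
```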

reset_mode

( )

Removes a manually set mode; after calling this, the pipeline infers the mode from its inputs.

set_image_mode

( )

Manually set the generation mode to unconditional ("marginal") image generation.

set_image_to_text_mode

( )

Manually set the generation mode to image-conditioned text generation.

set_joint_mode

( )

Manually set the generation mode to unconditional joint image-text generation.

set_text_mode

( )

Manually set the generation mode to unconditional ("marginal") text generation.

set_text_to_image_mode

( )

Manually set the generation mode to text-conditioned image generation.

ImageTextPipelineOutput

class diffusers.ImageTextPipelineOutput

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray, NoneType] text: typing.Union[typing.List[str], typing.List[typing.List[str]], NoneType] )

Parameters

images (List[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size or NumPy array of shape (batch_size, height, width, num_channels).
text (List[str] or List[List[str]]) — List of generated text strings of length batch_size, or a list of lists of strings whose outer list has length batch_size.
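Because the text field may be either a flat list or a list of lists, downstream code sometimes needs to normalize it. A small helper along these lines could do so; this is a sketch and `flatten_text` is a hypothetical name, not a diffusers function.

```python
from typing import List, Union


def flatten_text(text: Union[List[str], List[List[str]]]) -> List[str]:
    """Return all generated strings as one flat list, preserving order."""
    flat: List[str] = []
    for item in text:
        if isinstance(item, str):
            flat.append(item)   # already a single string
        else:
            flat.extend(item)   # inner list of strings for one batch entry
    return flat
```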

Output class for joint image-text pipelines.
