unCLIP

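Hierarchical Text-Conditional Image Generation with CLIP Latents is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. The unCLIP model in 🤗 Diffusers comes from kakaobrain's karlo.

The abstract from the paper is:

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.

You can find lucidrains' DALL-E 2 recreation at lucidrains/DALLE2-pytorch.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.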

UnCLIPPipeline

class diffusers.UnCLIPPipeline

( prior: PriorTransformer decoder: UNet2DConditionModel text_encoder: CLIPTextModelWithProjection tokenizer: CLIPTokenizer text_proj: UnCLIPTextProjModel super_res_first: UNet2DModel super_res_last: UNet2DModel prior_scheduler: UnCLIPScheduler decoder_scheduler: UnCLIPScheduler super_res_scheduler: UnCLIPScheduler )

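Parameters

text_encoder (CLIPTextModelWithProjection): Frozen text encoder.
tokenizer (CLIPTokenizer): A CLIPTokenizer used to tokenize the text.
prior (PriorTransformer): The canonical unCLIP prior, which approximates the image embedding from the text embedding.
text_proj (UnCLIPTextProjModel): Utility class that prepares and combines the embeddings before they are passed to the decoder.
decoder (UNet2DConditionModel): The decoder, which inverts the image embedding into an image.
super_res_first (UNet2DModel): Super-resolution UNet. Used in every step of the super-resolution diffusion process except the last.
super_res_last (UNet2DModel): Super-resolution UNet. Used in the last step of the super-resolution diffusion process.
prior_scheduler (UnCLIPScheduler): Scheduler used in the prior denoising process (a modified DDPMScheduler).
decoder_scheduler (UnCLIPScheduler): Scheduler used in the decoder denoising process (a modified DDPMScheduler).
super_res_scheduler (UnCLIPScheduler): Scheduler used in the super-resolution denoising process (a modified DDPMScheduler).

Pipeline for text-to-image generation using unCLIP.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).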
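As a rough illustration of how this pipeline might be driven, the sketch below loads a checkpoint and generates images from a prompt. The checkpoint ID `kakaobrain/karlo-v1-alpha`, the half-precision dtype, and the CUDA device are assumptions rather than requirements stated on this page, and the helper names `build_call_kwargs` and `generate` are hypothetical. The heavy imports sit inside the function so the module can be imported on a machine without a GPU.

```python
from typing import Any, Dict, Optional

# Assumed checkpoint ID (the karlo weights mentioned above); swap in your own.
DEFAULT_MODEL_ID = "kakaobrain/karlo-v1-alpha"


def build_call_kwargs(prompt: str, num_images: int = 1) -> Dict[str, Any]:
    """Collect keyword arguments for UnCLIPPipeline.__call__.

    The step counts mirror the defaults in the documented signature.
    """
    return {
        "prompt": prompt,
        "num_images_per_prompt": num_images,
        "prior_num_inference_steps": 25,
        "decoder_num_inference_steps": 25,
        "super_res_num_inference_steps": 7,
    }


def generate(prompt: str, seed: Optional[int] = None, device: str = "cuda"):
    """Load the pipeline and return the generated images (hypothetical helper)."""
    # Imported lazily: torch and diffusers are only needed when generating.
    import torch
    from diffusers import UnCLIPPipeline

    pipe = UnCLIPPipeline.from_pretrained(DEFAULT_MODEL_ID, torch_dtype=torch.float16)
    pipe = pipe.to(device)

    # A seeded generator makes the run reproducible; None keeps it random.
    generator = None
    if seed is not None:
        generator = torch.Generator(device=device).manual_seed(seed)

    return pipe(generator=generator, **build_call_kwargs(prompt)).images
```

Calling `generate("a high-resolution photograph of a big red frog on a green leaf", seed=0)[0].save("frog.png")` would then write the first image to disk, assuming the checkpoint downloads successfully and a CUDA device is available.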

__call__

( prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: int = 1 prior_num_inference_steps: int = 25 decoder_num_inference_steps: int = 25 super_res_num_inference_steps: int = 7 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prior_latents: typing.Optional[torch.Tensor] = None decoder_latents: typing.Optional[torch.Tensor] = None super_res_latents: typing.Optional[torch.Tensor] = None text_model_output: typing.Union[transformers.models.clip.modeling_clip.CLIPTextModelOutput, typing.Tuple, NoneType] = None text_attention_mask: typing.Optional[torch.Tensor] = None prior_guidance_scale: float = 4.0 decoder_guidance_scale: float = 8.0 output_type: typing.Optional[str] = 'pil' return_dict: bool = True ) → ImagePipelineOutput or tuple

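Parameters

prompt (str or List[str]): The prompt or prompts to guide image generation. This can only be left undefined if text_model_output and text_attention_mask are passed.
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
prior_num_inference_steps (int, optional, defaults to 25): The number of denoising steps for the prior. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
decoder_num_inference_steps (int, optional, defaults to 25): The number of denoising steps for the decoder. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
super_res_num_inference_steps (int, optional, defaults to 7): The number of denoising steps for super resolution. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], optional): A torch.Generator used to make generation deterministic.
prior_latents (torch.Tensor of shape (batch size, embeddings dimension), optional): Pre-generated noisy latents to be used as input for the prior.
decoder_latents (torch.Tensor of shape (batch size, channels, height, width), optional): Pre-generated noisy latents to be used as input for the decoder.
super_res_latents (torch.Tensor of shape (batch size, channels, super res height, super res width), optional): Pre-generated noisy latents to be used as input for the super-resolution model.
prior_guidance_scale (float, optional, defaults to 4.0): A higher guidance scale value encourages the model to generate images closely linked to the text prompt, at the expense of lower image quality. Guidance is enabled when prior_guidance_scale > 1.
decoder_guidance_scale (float, optional, defaults to 8.0): A higher guidance scale value encourages the model to generate images closely linked to the text prompt, at the expense of lower image quality. Guidance is enabled when decoder_guidance_scale > 1.
text_model_output (CLIPTextModelOutput, optional): Pre-defined CLIPTextModel outputs that can be derived from the text encoder. Pre-defined text outputs can be passed for tasks such as text embedding interpolation. Make sure to also pass text_attention_mask in this case. prompt can then be left as None.
text_attention_mask (torch.Tensor, optional): Pre-defined CLIP text attention mask that can be derived from the tokenizer. A pre-defined text attention mask is required when passing text_model_output.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True): Whether to return an ImagePipelineOutput instead of a plain tuple.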
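The two guidance-scale parameters follow the usual classifier-free guidance rule. The sketch below applies that rule to plain Python lists to make the `> 1` threshold concrete. It illustrates the general technique rather than the pipeline's own code, and `apply_guidance` is a hypothetical name.

```python
from typing import List


def apply_guidance(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Combine unconditional and text-conditioned predictions.

    result = uncond + scale * (cond - uncond)
    scale == 1 returns the conditioned prediction unchanged;
    scale > 1 pushes the result further toward the prompt.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]
```

With `scale=1.0` the output equals `cond`, which is why guidance only has an effect above that value.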

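Returns

ImagePipelineOutput or tuple

If return_dict is True, an ImagePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images.

The call function to the pipeline for generation.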
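Because the return type depends on return_dict, calling code sometimes has to handle both shapes. A minimal sketch of one way to do that follows; `extract_images` is a hypothetical helper, not part of Diffusers.

```python
from typing import Any


def extract_images(result: Any) -> Any:
    """Return the generated images from either return style."""
    if isinstance(result, tuple):
        # return_dict=False: the images are the first tuple element.
        return result[0]
    # return_dict=True: ImagePipelineOutput exposes an `images` attribute.
    return result.images
```

Either `extract_images(pipe(prompt))` or `extract_images(pipe(prompt, return_dict=False))` would then give the same images, assuming `pipe` is a loaded pipeline.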

UnCLIPImageVariationPipeline

class diffusers.UnCLIPImageVariationPipeline

( decoder: UNet2DConditionModel text_encoder: CLIPTextModelWithProjection tokenizer: CLIPTokenizer text_proj: UnCLIPTextProjModel feature_extractor: CLIPImageProcessor image_encoder: CLIPVisionModelWithProjection super_res_first: UNet2DModel super_res_last: UNet2DModel decoder_scheduler: UnCLIPScheduler super_res_scheduler: UnCLIPScheduler )

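Parameters

text_encoder (CLIPTextModelWithProjection): Frozen text encoder.
tokenizer (CLIPTokenizer): A CLIPTokenizer used to tokenize the text.
feature_extractor (CLIPImageProcessor): Model that extracts features from generated images to be used as input for the image_encoder.
image_encoder (CLIPVisionModelWithProjection): Frozen CLIP image encoder (clip-vit-large-patch14).
text_proj (UnCLIPTextProjModel): Utility class that prepares and combines the embeddings before they are passed to the decoder.
decoder (UNet2DConditionModel): The decoder, which inverts the image embedding into an image.
super_res_first (UNet2DModel): Super-resolution UNet. Used in every step of the super-resolution diffusion process except the last.
super_res_last (UNet2DModel): Super-resolution UNet. Used in the last step of the super-resolution diffusion process.
decoder_scheduler (UnCLIPScheduler): Scheduler used in the decoder denoising process (a modified DDPMScheduler).
super_res_scheduler (UnCLIPScheduler): Scheduler used in the super-resolution denoising process (a modified DDPMScheduler).

Pipeline to generate image variations from an input image using UnCLIP.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).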
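The sketch below shows one way this pipeline might be used to produce variations of a local image. The checkpoint ID `kakaobrain/karlo-v1-alpha-image-variations`, the half-precision dtype, and the CUDA device are assumptions, and both helper names are hypothetical. The `ValueError` guard is a local check written for this example that mirrors the documented rule that image may only be None when image_embeddings is passed.

```python
from typing import Any, Dict

# Assumed checkpoint ID for the image-variation weights; swap in your own.
VARIATION_MODEL_ID = "kakaobrain/karlo-v1-alpha-image-variations"


def build_variation_kwargs(image=None, image_embeddings=None,
                           num_variations: int = 1) -> Dict[str, Any]:
    """Collect keyword arguments for UnCLIPImageVariationPipeline.__call__."""
    if image is None and image_embeddings is None:
        # Local guard for this sketch: one of the two inputs is required.
        raise ValueError("Pass either `image` or `image_embeddings`.")
    return {
        "image": image,
        "image_embeddings": image_embeddings,
        "num_images_per_prompt": num_variations,
        "decoder_num_inference_steps": 25,
        "super_res_num_inference_steps": 7,
    }


def generate_variations(image_path: str, num_variations: int = 1, device: str = "cuda"):
    """Load the pipeline and return variations of the image at image_path."""
    # Imported lazily: these packages are only needed when generating.
    import torch
    from PIL import Image
    from diffusers import UnCLIPImageVariationPipeline

    pipe = UnCLIPImageVariationPipeline.from_pretrained(
        VARIATION_MODEL_ID, torch_dtype=torch.float16
    )
    pipe = pipe.to(device)

    image = Image.open(image_path).convert("RGB")
    kwargs = build_variation_kwargs(image=image, num_variations=num_variations)
    return pipe(**kwargs).images
```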

__call__

( image: typing.Union[PIL.Image.Image, typing.List[PIL.Image.Image], torch.Tensor, NoneType] = None num_images_per_prompt: int = 1 decoder_num_inference_steps: int = 25 super_res_num_inference_steps: int = 7 generator: typing.Optional[torch._C.Generator] = None decoder_latents: typing.Optional[torch.Tensor] = None super_res_latents: typing.Optional[torch.Tensor] = None image_embeddings: typing.Optional[torch.Tensor] = None decoder_guidance_scale: float = 8.0 output_type: typing.Optional[str] = 'pil' return_dict: bool = True ) → ImagePipelineOutput or tuple

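Parameters

image (PIL.Image.Image or List[PIL.Image.Image] or torch.Tensor): Image or tensor representing an image batch to be used as the starting point. If you provide a tensor, it needs to be compatible with the CLIPImageProcessor configuration. Can be left as None only when image_embeddings are passed.
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
decoder_num_inference_steps (int, optional, defaults to 25): The number of denoising steps for the decoder. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
super_res_num_inference_steps (int, optional, defaults to 7): The number of denoising steps for super resolution. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
generator (torch.Generator, optional): A torch.Generator used to make generation deterministic.
decoder_latents (torch.Tensor of shape (batch size, channels, height, width), optional): Pre-generated noisy latents to be used as input for the decoder.
super_res_latents (torch.Tensor of shape (batch size, channels, super res height, super res width), optional): Pre-generated noisy latents to be used as input for the super-resolution model.
decoder_guidance_scale (float, optional, defaults to 8.0): A higher guidance scale value encourages the model to generate images closely linked to the text prompt, at the expense of lower image quality. Guidance is enabled when decoder_guidance_scale > 1.
image_embeddings (torch.Tensor, optional): Pre-defined image embeddings that can be derived from the image encoder. Pre-defined image embeddings can be passed for tasks such as image interpolation. image can then be left as None.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True): Whether to return an ImagePipelineOutput instead of a plain tuple.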
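The image_embeddings parameter mentions image interpolation. One common way to blend two embeddings is spherical linear interpolation (slerp). The sketch below implements it on plain Python lists so that it runs without dependencies. Whether slerp is the right choice for a given use is an assumption, and `slerp` is a hypothetical helper rather than a Diffusers function.

```python
import math
from typing import List


def slerp(v0: List[float], v1: List[float], t: float) -> List[float]:
    """Spherical linear interpolation between two vectors, with t in [0, 1]."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    dot = max(-1.0, min(1.0, dot))  # guard against rounding error
    theta = math.acos(dot)
    s = math.sin(theta)
    if s < 1e-6:
        # Nearly (anti)parallel vectors: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]
```

With the pipeline, the same weights would be applied to the two torch.Tensor embeddings produced by the image encoder, and the blended tensor passed as image_embeddings with image left as None.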

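Returns

ImagePipelineOutput or tuple

If return_dict is True, an ImagePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images.

The call function to the pipeline for generation.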

ImagePipelineOutput

class diffusers.ImagePipelineOutput

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )

Parameters

images (List[PIL.Image.Image] or np.ndarray): List of denoised PIL images of length batch_size or NumPy array of shape (batch_size, height, width, num_channels).

Output class for image pipelines.
