DeepFloyd IF

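Overview

DeepFloyd IF is a novel, state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding. The model is modular, composed of a frozen text encoder and three cascaded pixel diffusion modules:

Stage 1: a base model that generates a 64x64 px image from a text prompt,
Stage 2: a 64x64 px => 256x256 px super-resolution model, and
Stage 3: a 256x256 px => 1024x1024 px super-resolution model.

Stage 1 and Stage 2 use a frozen text encoder based on the T5 transformer to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention pooling. Stage 3 is Stability AI's x4 upscaling model. The result is a highly efficient model that outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset. Our work underscores the potential of larger UNet architectures in the first stage of cascaded diffusion models and depicts a promising future for text-to-image synthesis.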
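To make the cascade concrete, the following minimal sketch computes the image side length after each stage. The helper name cascade_resolutions and its parameters are illustrative and not part of diffusers; the base size and the two 4x factors are taken from the stage descriptions above.

```python
def cascade_resolutions(base_size=64, upscale_factors=(4, 4)):
    """Return the square image side length produced after each stage."""
    sizes = [base_size]
    for factor in upscale_factors:
        # Each super-resolution stage multiplies the side length.
        sizes.append(sizes[-1] * factor)
    return sizes


for stage, size in enumerate(cascade_resolutions(), start=1):
    print(f"stage {stage}: {size}x{size} px")
```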

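Usage

Before you can use IF, you need to accept its usage conditions. To do so:

Make sure you have a Hugging Face account and are logged in.
Accept the license on the model card of DeepFloyd/IF-I-XL-v1.0. Accepting the license on the stage I model card automatically accepts it for the other IF models.
Make sure you are logged in locally. Install huggingface_hub: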

pip install huggingface_hub --upgrade

Run the login function in a Python shell:

from huggingface_hub import login

login()

and enter your Hugging Face Hub access token.
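For scripts or notebooks where an interactive prompt is inconvenient, one possible alternative is to pass the token to login directly. The sketch below assumes the token is stored in an environment variable named HF_TOKEN; the helper login_from_env is hypothetical and only wraps the login function shown above.

```python
import os


def login_from_env(var_name="HF_TOKEN"):
    """Log in with a token read from an environment variable.

    Returns True if a token was found and passed to login, False otherwise.
    """
    token = os.environ.get(var_name)
    if not token:
        return False
    # Imported lazily so the helper can be defined without the package installed.
    from huggingface_hub import login

    login(token=token)
    return True
```

When the variable is unset, login_from_env() returns False and you can fall back to the interactive login() call.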

Next, install diffusers and its dependencies:

pip install -q diffusers accelerate transformers

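The following sections give more detailed examples of how to use IF. Specifically:

Text-to-image generation
Image-to-image generation
Inpainting
Reusing model weights
Speed optimization
Memory optimization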

Available checkpoints

Stage-1
- DeepFloyd/IF-I-XL-v1.0
- DeepFloyd/IF-I-L-v1.0
- DeepFloyd/IF-I-M-v1.0
Stage-2
- DeepFloyd/IF-II-L-v1.0
- DeepFloyd/IF-II-M-v1.0
Stage-3
- stabilityai/stable-diffusion-x4-upscaler

Text-to-image generation

By default, diffusers uses model CPU offloading to run the whole IF pipeline with as little as 14 GB of VRAM.

from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil, make_image_grid
import torch

# stage 1
stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
generator = torch.manual_seed(1)

# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
stage_1_output = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images
#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png")

# stage 2
stage_2_output = stage_2(
    image=stage_1_output,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png")

# stage 3
stage_3_output = stage_3(prompt=prompt, image=stage_2_output, noise_level=100, generator=generator).images
#stage_3_output[0].save("./if_stage_III.png")
make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=3)

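Text-guided image-to-image generation

The same IF model weights can be used for text-guided image-to-image translation or image variation. In this case, just make sure to load the weights using the IFImg2ImgPipeline and IFImg2ImgSuperResolutionPipeline pipelines.

Note: You can also move the weights of the text-to-image pipelines directly to the image-to-image pipelines without loading them twice. Just use the components property of the loaded pipeline, as described in the section on converting between different pipelines.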

from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil, load_image, make_image_grid
import torch

# download image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)
original_image = original_image.resize((768, 512))

# stage 1
stage_1 = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2
stage_2 = IFImg2ImgSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = "A fantasy landscape in style minecraft"
generator = torch.manual_seed(1)

# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
stage_1_output = stage_1(
    image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png")

# stage 2
stage_2_output = stage_2(
    image=stage_1_output,
    original_image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png")

# stage 3
stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images
#stage_3_output[0].save("./if_stage_III.png")
make_image_grid([original_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=4)

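Text-guided inpainting generation

The same IF model weights can also be used for text-guided inpainting. In this case, just make sure to load the weights using the IFInpaintingPipeline and IFInpaintingSuperResolutionPipeline pipelines.

Note: You can also move the weights of the text-to-image pipelines directly to the inpainting pipelines without loading them twice. Just use the DiffusionPipeline.components property, as described in the section on converting between different pipelines.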

from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil, load_image, make_image_grid
import torch

# download image
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png"
original_image = load_image(url)

# download mask
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png"
mask_image = load_image(url)

# stage 1
stage_1 = IFInpaintingPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2
stage_2 = IFInpaintingSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = "blue sunglasses"
generator = torch.manual_seed(1)

# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
stage_1_output = stage_1(
    image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png")

# stage 2
stage_2_output = stage_2(
    image=stage_1_output,
    original_image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png")

# stage 3
stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images
#stage_3_output[0].save("./if_stage_III.png")
make_image_grid([original_image, mask_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=5)

Converting between different pipelines

In addition to being loaded with from_pretrained, pipelines can also be created directly from each other.

from diffusers import IFPipeline, IFSuperResolutionPipeline

pipe_1 = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0")
pipe_2 = IFSuperResolutionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0")


from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline

pipe_1 = IFImg2ImgPipeline(**pipe_1.components)
pipe_2 = IFImg2ImgSuperResolutionPipeline(**pipe_2.components)


from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline

pipe_1 = IFInpaintingPipeline(**pipe_1.components)
pipe_2 = IFInpaintingSuperResolutionPipeline(**pipe_2.components)
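No weights are reloaded in this conversion because components exposes the already-instantiated modules, which the new pipeline then holds by reference. The toy classes below are stand-ins, not diffusers classes; they only illustrate the pattern under that assumption.

```python
class ToyPipeline:
    """Stand-in for a pipeline that holds its modules as attributes."""

    def __init__(self, unet, text_encoder):
        self.unet = unet
        self.text_encoder = text_encoder

    @property
    def components(self):
        # Return references to the existing modules, not copies.
        return {"unet": self.unet, "text_encoder": self.text_encoder}


class ToyImg2ImgPipeline(ToyPipeline):
    """Stand-in for a second pipeline with the same constructor arguments."""


text_to_image = ToyPipeline(unet=object(), text_encoder=object())
image_to_image = ToyImg2ImgPipeline(**text_to_image.components)

# Both pipelines point at the very same module objects.
print(image_to_image.unet is text_to_image.unet)
```

Because the modules are shared, changing one of them through either pipeline affects both.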

Optimizing for speed

The simplest optimization to run IF faster is to move all model components to the GPU.

from diffusers import DiffusionPipeline
import torch
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

You can also run the diffusion process with fewer timesteps.

This can be done with the num_inference_steps argument:

pipe("<prompt>", num_inference_steps=30)

Or with the timesteps argument:

from diffusers.pipelines.deepfloyd_if import fast27_timesteps

pipe("<prompt>", timesteps=fast27_timesteps)
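If you want a step count other than a predefined schedule, the timesteps argument can also be given your own list of integers in descending order. The sketch below builds an evenly spaced list; the helper name is hypothetical, and the default of 1000 training timesteps is an assumption that should be checked against the scheduler configuration of the checkpoint. Predefined schedules such as fast27_timesteps are not necessarily evenly spaced, so results may differ.

```python
def evenly_spaced_timesteps(num_steps, num_train_timesteps=1000):
    """Return `num_steps` evenly spaced timesteps in descending order."""
    if not 1 <= num_steps <= num_train_timesteps:
        raise ValueError("num_steps must be between 1 and num_train_timesteps")
    # Integer arithmetic keeps the values distinct and within range.
    ascending = [(i * num_train_timesteps) // num_steps for i in range(num_steps)]
    return ascending[::-1]


print(evenly_spaced_timesteps(4))  # [750, 500, 250, 0]
```

The resulting list could then be passed as pipe("<prompt>", timesteps=evenly_spaced_timesteps(27)).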

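When doing image variation or inpainting, you can also reduce the number of timesteps with the strength argument. The strength argument is the amount of noise added to the input image, which also determines how many steps are run in the denoising process. A smaller value changes the image less but runs faster.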

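from diffusers import IFImg2ImgPipeline
import torch

pipe = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

# `image` is the input PIL image, loaded beforehand (for example with load_image)
image = pipe(image=image, prompt="<prompt>", strength=0.3).images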
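To see why a lower strength runs faster, here is a rough sketch of how the number of executed denoising steps can be derived from strength. It follows the calculation commonly used in diffusers image-to-image pipelines, but the function name is hypothetical and the exact rounding inside the IF pipelines is an assumption.

```python
def steps_actually_run(num_inference_steps, strength):
    """Estimate how many denoising steps run for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Only the last `strength` fraction of the schedule is executed.
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (0.25, 0.5, 1.0):
    print(strength, steps_actually_run(100, strength))
```

With strength=1.0 the full schedule runs and the input image is effectively replaced by noise; smaller values skip the early, high-noise steps.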

You can also use torch.compile. Note that we have not exhaustively tested torch.compile with IF, and it might not give the expected results.

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

pipe.text_encoder = torch.compile(pipe.text_encoder, mode="reduce-overhead", fullgraph=True)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

Optimizing for memory

When optimizing for GPU memory, we can use the standard diffusers CPU offloading APIs.

Either model-based CPU offloading,

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

or the more aggressive layer-based CPU offloading.

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()

Additionally, T5 can be loaded in 8-bit precision:

from transformers import T5EncoderModel

text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
)

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=text_encoder,  # pass the previously instantiated 8bit text encoder
    unet=None,
    device_map="auto",
)

prompt_embeds, negative_embeds = pipe.encode_prompt("<prompt>")

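On machines with limited CPU RAM, such as the Google Colab free tier, we cannot load all model components into CPU memory at once. Instead, we can manually load the pipeline with only the text encoder or only the UNet at the point where the respective component is needed.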

from diffusers import DiffusionPipeline, IFPipeline, IFSuperResolutionPipeline
import torch
import gc
from transformers import T5EncoderModel
from diffusers.utils import pt_to_pil, make_image_grid

text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
)

# text to image
pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=text_encoder,  # pass the previously instantiated 8bit text encoder
    unet=None,
    device_map="auto",
)

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

# Remove the pipeline so we can re-load the pipeline with the unet
del text_encoder
del pipe
gc.collect()
torch.cuda.empty_cache()

pipe = IFPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
)

generator = torch.Generator().manual_seed(0)
stage_1_output = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images

#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png")

# Remove the pipeline so we can load the super-resolution pipeline
del pipe
gc.collect()
torch.cuda.empty_cache()

# First super resolution

pipe = IFSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
)

generator = torch.Generator().manual_seed(0)
stage_2_output = pipe(
    image=stage_1_output,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images

#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png")
make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0]], rows=1, cols=2)
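The gc.collect() and torch.cuda.empty_cache() calls are repeated between stages in the example above, so it may be convenient to wrap them in a helper. The function name free_memory is hypothetical. Note that references such as pipe still have to be deleted by the caller, because a helper cannot remove names from the caller's scope.

```python
import gc


def free_memory():
    """Run garbage collection and release cached CUDA memory if possible."""
    gc.collect()
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no CUDA cache to clear.
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

A typical call site would be del pipe followed by free_memory() before loading the next stage.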

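Available pipelines:

Pipeline	Task	Colab
pipeline_if.py	Text-to-image generation	-
pipeline_if_superresolution.py	Text-to-image generation	-
pipeline_if_img2img.py	Image-to-image generation	-
pipeline_if_img2img_superresolution.py	Image-to-image generation	-
pipeline_if_inpainting.py	Image-to-image generation	-
pipeline_if_inpainting_superresolution.py	Image-to-image generation	-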
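The pipeline classes used in the examples in this document pair up with the first two IF stages as sketched below. The dictionary and the helper pipeline_class_names are an illustrative lookup, not a diffusers API; the class names themselves are the ones imported in the examples.

```python
# Maps a task to its (stage 1, stage 2) pipeline class names.
IF_PIPELINES = {
    "text-to-image": ("IFPipeline", "IFSuperResolutionPipeline"),
    "image-to-image": ("IFImg2ImgPipeline", "IFImg2ImgSuperResolutionPipeline"),
    "inpainting": ("IFInpaintingPipeline", "IFInpaintingSuperResolutionPipeline"),
}


def pipeline_class_names(task):
    """Return the stage 1 and stage 2 class names for a task."""
    try:
        return IF_PIPELINES[task]
    except KeyError:
        known = ", ".join(sorted(IF_PIPELINES))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None


stage_1_name, stage_2_name = pipeline_class_names("inpainting")
print(stage_1_name, stage_2_name)
```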

IFPipeline

class diffusers.IFPipeline

__call__

encode_prompt

IFSuperResolutionPipeline

class diffusers.IFSuperResolutionPipeline

__call__

encode_prompt

IFImg2ImgPipeline

class diffusers.IFImg2ImgPipeline

__call__

encode_prompt

IFImg2ImgSuperResolutionPipeline

class diffusers.IFImg2ImgSuperResolutionPipeline

__call__

encode_prompt

IFInpaintingPipeline

class diffusers.IFInpaintingPipeline

__call__

encode_prompt

IFInpaintingSuperResolutionPipeline

class diffusers.IFInpaintingSuperResolutionPipeline

__call__

encode_prompt
