GLIGEN (Grounded Language to Image Generation)

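The GLIGEN model was created by researchers and engineers from the University of Wisconsin-Madison, Columbia University, and Microsoft. StableDiffusionGLIGENPipeline and StableDiffusionGLIGENTextImagePipeline can generate photorealistic images conditioned on grounding inputs. StableDiffusionGLIGENPipeline takes text and bounding boxes; StableDiffusionGLIGENTextImagePipeline additionally accepts images. If an input image is given, the pipeline inserts the objects described by text at the regions defined by the bounding boxes. Otherwise, it generates an image described by the caption/prompt and inserts the objects described by text at the regions defined by the bounding boxes. The model is trained on the COCO2014D and COCO2014CD datasets and uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs.

The abstract from the paper is:

Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.

Make sure to check out the Stable Diffusion Tips section to learn how to explore the tradeoff between scheduler speed and quality and how to reuse pipeline components efficiently!

If you want to use one of the official checkpoints for a task, explore the gligen Hub organizations!

StableDiffusionGLIGENPipeline was contributed by Nikhil Gajendrakumar and StableDiffusionGLIGENTextImagePipeline was contributed by Nguyễn Công Tú Anh.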

StableDiffusionGLIGENPipeline

class diffusers.StableDiffusionGLIGENPipeline

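( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor requires_safety_checker: bool = True )

Parameters

vae (AutoencoderKL): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel): Frozen text encoder (clip-vit-large-patch14).
tokenizer (CLIPTokenizer): A CLIPTokenizer to tokenize text.
unet (UNet2DConditionModel): A UNet2DConditionModel to denoise the encoded image latents.
scheduler (SchedulerMixin): A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
safety_checker (StableDiffusionSafetyChecker): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for more details about a model's potential harms.
feature_extractor (CLIPImageProcessor): A CLIPImageProcessor to extract features from generated images; used as inputs to the safety_checker.

Pipeline for text-to-image generation using Stable Diffusion with Grounded-Language-to-Image Generation (GLIGEN).

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).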

__call__

( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 guidance_scale: float = 7.5 gligen_scheduled_sampling_beta: float = 0.3 gligen_phrases: typing.List[str] = None gligen_boxes: typing.List[typing.List[float]] = None gligen_inpaint_image: typing.Optional[PIL.Image.Image] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None ) → StableDiffusionPipelineOutput or tuple

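Parameters

prompt (str or List[str], optional): The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor): The height in pixels of the generated image.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor): The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 7.5): A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
gligen_phrases (List[str]): The phrases to guide what to include in each of the regions defined by the corresponding gligen_boxes. There should be exactly one phrase per bounding box.
gligen_boxes (List[List[float]]): The bounding boxes that identify the rectangular regions of the image to be filled with the content described by the corresponding gligen_phrases. Each box is a List[float] of 4 elements [xmin, ymin, xmax, ymax], where each value is in the range [0, 1].
gligen_inpaint_image (PIL.Image.Image, optional): The input image. If provided, it is inpainted with the objects described by gligen_boxes and gligen_phrases. Otherwise, the call is treated as a generation task on a blank input image.
gligen_scheduled_sampling_beta (float, defaults to 0.3): Scheduled sampling factor from GLIGEN: Open-Set Grounded Text-to-Image Generation. The factor is only varied for scheduled sampling during inference, for improved quality and controllability.
negative_prompt (str or List[str], optional): The prompt or prompts to guide what not to include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
eta (float, optional, defaults to 0.0): Corresponds to the parameter eta (η) from the DDIM paper. Only applies to DDIMScheduler and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional): A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional): Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True): Whether or not to return a StableDiffusionPipelineOutput instead of a plain tuple.
callback (Callable, optional): A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1): The frequency at which the callback function is called. If not specified, the callback is called at every step.
cross_attention_kwargs (dict, optional): A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
guidance_rescale (float, optional, defaults to 0.0): Guidance rescale factor from Common Diffusion Noise Schedules and Sample Steps are Flawed. The guidance rescale factor should fix overexposure when using zero terminal SNR.
clip_skip (int, optional): Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.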
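
Because gligen_boxes expects coordinates in the range [0, 1], boxes measured in pixels need to be rescaled first. The following is a minimal sketch of that conversion; the helper names normalize_boxes and check_grounding_inputs are hypothetical and not part of Diffusers.

```python
from typing import List, Sequence


def normalize_boxes(
    pixel_boxes: Sequence[Sequence[float]], width: int, height: int
) -> List[List[float]]:
    """Rescale pixel-space [xmin, ymin, xmax, ymax] boxes to the [0, 1] range.

    Hypothetical helper for illustration; not a Diffusers API.
    """
    normalized = []
    for xmin, ymin, xmax, ymax in pixel_boxes:
        box = [xmin / width, ymin / height, xmax / width, ymax / height]
        if not all(0.0 <= value <= 1.0 for value in box):
            raise ValueError(f"box {box} falls outside the image")
        normalized.append(box)
    return normalized


def check_grounding_inputs(phrases: Sequence[str], boxes: Sequence[Sequence[float]]) -> None:
    """Check the documented pairing rule: one phrase per bounding box."""
    if len(phrases) != len(boxes):
        raise ValueError(f"got {len(phrases)} phrases for {len(boxes)} boxes")


boxes = normalize_boxes([[128, 256, 384, 448]], width=512, height=512)
phrases = ["a birthday cake"]
check_grounding_inputs(phrases, boxes)
print(boxes)  # [[0.25, 0.5, 0.75, 0.875]]
```

The resulting lists can then be passed as gligen_boxes and gligen_phrases.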

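StableDiffusionPipelineOutput or tuple

If return_dict is True, a StableDiffusionPipelineOutput is returned. Otherwise, a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content.

The call function to the pipeline for generation.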

Examples

>>> import torch
>>> from diffusers import StableDiffusionGLIGENPipeline
>>> from diffusers.utils import load_image

>>> # Insert objects described by text at the region defined by bounding boxes
>>> pipe = StableDiffusionGLIGENPipeline.from_pretrained(
...     "masterful/gligen-1-4-inpainting-text-box", variant="fp16", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")

>>> input_image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/livingroom_modern.png"
... )
>>> prompt = "a birthday cake"
>>> boxes = [[0.2676, 0.6088, 0.4773, 0.7183]]
>>> phrases = ["a birthday cake"]

>>> images = pipe(
...     prompt=prompt,
...     gligen_phrases=phrases,
...     gligen_inpaint_image=input_image,
...     gligen_boxes=boxes,
...     gligen_scheduled_sampling_beta=1,
...     output_type="pil",
...     num_inference_steps=50,
... ).images

>>> images[0].save("./gligen-1-4-inpainting-text-box.jpg")

>>> # Generate an image described by the prompt and
>>> # insert objects described by text at the region defined by bounding boxes
>>> pipe = StableDiffusionGLIGENPipeline.from_pretrained(
...     "masterful/gligen-1-4-generation-text-box", variant="fp16", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")

>>> prompt = "a waterfall and a modern high speed train running through the tunnel in a beautiful forest with fall foliage"
>>> boxes = [[0.1387, 0.2051, 0.4277, 0.7090], [0.4980, 0.4355, 0.8516, 0.7266]]
>>> phrases = ["a waterfall", "a modern high speed train running through the tunnel"]

>>> images = pipe(
...     prompt=prompt,
...     gligen_phrases=phrases,
...     gligen_boxes=boxes,
...     gligen_scheduled_sampling_beta=1,
...     output_type="pil",
...     num_inference_steps=50,
... ).images

>>> images[0].save("./gligen-1-4-generation-text-box.jpg")

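enable_vae_slicing

( )

Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices and computes decoding in several steps. This is useful to save some memory and allow larger batch sizes.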
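
The idea behind sliced decoding can be illustrated with a toy example. This is only a conceptual sketch, not the library's implementation: decode_batch is a hypothetical stand-in for a VAE decoder, and slicing along the batch dimension is an assumption made for illustration.

```python
from typing import Callable, List

Batch = List[List[float]]


def decode_batch(latents: Batch) -> Batch:
    """Toy stand-in for a VAE decoder: doubles every value."""
    return [[2.0 * value for value in sample] for sample in latents]


def decode_sliced(latents: Batch, decode: Callable[[Batch], Batch] = decode_batch) -> Batch:
    """Decode one sample at a time and stitch the results back together.

    Only a single-sample batch is passed to the decoder at any moment,
    which is where the memory saving would come from.
    """
    outputs: Batch = []
    for sample in latents:
        outputs.extend(decode([sample]))  # batch of size 1
    return outputs


latents = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Slicing changes how the work is scheduled, not the result.
assert decode_sliced(latents) == decode_batch(latents)
```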

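disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method goes back to computing decoding in one step.

enable_vae_tiling

( )

Enable tiled VAE decoding. When this option is enabled, the VAE splits the input tensor into tiles and computes encoding and decoding in several steps. This is useful for saving a large amount of memory and for processing larger images.

disable_vae_tiling

( )

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method goes back to computing decoding in one step.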

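enable_model_cpu_offload

( gpu_id: typing.Optional[int] = None device: typing.Union[torch.device, str] = None )

Parameters

gpu_id (int, optional): The ID of the accelerator that should be used in inference. If not specified, it defaults to 0.
device (torch.device or str, optional, defaults to None): The PyTorch device type of the accelerator that should be used in inference. If not specified, the available accelerator is detected automatically and used.

Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared to enable_sequential_cpu_offload, this method moves one whole model at a time to the accelerator when its forward method is called, and the model remains on the accelerator until the next model runs. Memory savings are lower than with enable_sequential_cpu_offload, but performance is much better due to the iterative execution of the unet.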

prepare_latents

( batch_size num_channels_latents height width dtype device generator latents = None )

enable_fuser

( enabled = True )
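
The page gives no description for enable_fuser, but it appears to relate to the gligen_scheduled_sampling_beta argument of __call__. The sketch below assumes that the factor is read as the fraction of denoising steps during which the grounding layers stay active, after which they are switched off. That reading, and the helper name grounding_schedule, are assumptions for illustration rather than documented behavior.

```python
from typing import List


def grounding_schedule(num_inference_steps: int, beta: float) -> List[bool]:
    """Return, for each denoising step, whether grounding is assumed active.

    Assumption: grounding is applied for the first int(beta * steps) steps.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is expected to lie in [0, 1]")
    num_grounding_steps = int(beta * num_inference_steps)
    return [step < num_grounding_steps for step in range(num_inference_steps)]


schedule = grounding_schedule(num_inference_steps=50, beta=0.3)
print(sum(schedule))  # 15 grounded steps, followed by 35 ungrounded ones
```

Under this reading, the value of 1 used in the examples on this page would keep grounding active for every step.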

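encode_prompt

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

prompt (str or List[str], optional): The prompt to be encoded.
device (torch.device): The torch device.
num_images_per_prompt (int): The number of images that should be generated per prompt.
do_classifier_free_guidance (bool): Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, you have to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
lora_scale (float, optional): A LoRA scale that is applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional): Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.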

StableDiffusionGLIGENTextImagePipeline

class diffusers.StableDiffusionGLIGENTextImagePipeline

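( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer processor: CLIPProcessor image_encoder: CLIPVisionModelWithProjection image_project: CLIPImageProjection unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor requires_safety_checker: bool = True )

Parameters

vae (AutoencoderKL): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel): Frozen text encoder (clip-vit-large-patch14).
tokenizer (CLIPTokenizer): A CLIPTokenizer to tokenize text.
processor (CLIPProcessor): A CLIPProcessor to process the reference image.
image_encoder (CLIPVisionModelWithProjection): Frozen image encoder (clip-vit-large-patch14).
image_project (CLIPImageProjection): A CLIPImageProjection to project the image embedding into the phrase embedding space.
unet (UNet2DConditionModel): A UNet2DConditionModel to denoise the encoded image latents.
scheduler (SchedulerMixin): A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
safety_checker (StableDiffusionSafetyChecker): Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for more details about a model's potential harms.
feature_extractor (CLIPImageProcessor): A CLIPImageProcessor to extract features from generated images; used as inputs to the safety_checker.

Pipeline for text-to-image generation using Stable Diffusion with Grounded-Language-to-Image Generation (GLIGEN).

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).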

__call__

( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 guidance_scale: float = 7.5 gligen_scheduled_sampling_beta: float = 0.3 gligen_phrases: typing.List[str] = None gligen_images: typing.List[PIL.Image.Image] = None input_phrases_mask: typing.Union[int, typing.List[int]] = None input_images_mask: typing.Union[int, typing.List[int]] = None gligen_boxes: typing.List[typing.List[float]] = None gligen_inpaint_image: typing.Optional[PIL.Image.Image] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None gligen_normalize_constant: float = 28.7 clip_skip: int = None ) → StableDiffusionPipelineOutput or tuple

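Parameters

prompt (str or List[str], optional): The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor): The height in pixels of the generated image.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor): The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 7.5): A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
gligen_phrases (List[str]): The phrases to guide what to include in each of the regions defined by the corresponding gligen_boxes. There should be exactly one phrase per bounding box.
gligen_images (List[PIL.Image.Image]): The images to guide what to include in each of the regions defined by the corresponding gligen_boxes. There should be exactly one image per bounding box.
input_phrases_mask (int or List[int]): Mask for the phrase inputs, holding a 0 or 1 per entry of gligen_phrases. In the style-transfer example on this page, 0 is set for a placeholder phrase.
input_images_mask (int or List[int]): Mask for the image inputs, holding a 0 or 1 per entry of gligen_images. In the style-transfer example on this page, 0 is set for a placeholder image.
gligen_boxes (List[List[float]]): The bounding boxes that identify the rectangular regions of the image to be filled with the content described by the corresponding gligen_phrases. Each box is a List[float] of 4 elements [xmin, ymin, xmax, ymax], where each value is in the range [0, 1].
gligen_inpaint_image (PIL.Image.Image, optional): The input image. If provided, it is inpainted with the objects described by gligen_boxes and gligen_phrases. Otherwise, the call is treated as a generation task on a blank input image.
gligen_scheduled_sampling_beta (float, defaults to 0.3): Scheduled sampling factor from GLIGEN: Open-Set Grounded Text-to-Image Generation. The factor is only varied for scheduled sampling during inference, for improved quality and controllability.
negative_prompt (str or List[str], optional): The prompt or prompts to guide what not to include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
eta (float, optional, defaults to 0.0): Corresponds to the parameter eta (η) from the DDIM paper. Only applies to DDIMScheduler and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional): A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional): Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True): Whether or not to return a StableDiffusionPipelineOutput instead of a plain tuple.
callback (Callable, optional): A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1): The frequency at which the callback function is called. If not specified, the callback is called at every step.
cross_attention_kwargs (dict, optional): A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
gligen_normalize_constant (float, optional, defaults to 28.7): The normalization value for the image embedding.
clip_skip (int, optional): Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.

StableDiffusionPipelineOutput or tuple

The call function to the pipeline for generation.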

Examples

>>> import torch
>>> from diffusers import StableDiffusionGLIGENTextImagePipeline
>>> from diffusers.utils import load_image

>>> # Insert objects described by image at the region defined by bounding boxes
>>> pipe = StableDiffusionGLIGENTextImagePipeline.from_pretrained(
...     "anhnct/Gligen_Inpainting_Text_Image", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")

>>> input_image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/livingroom_modern.png"
... )
>>> prompt = "a backpack"
>>> boxes = [[0.2676, 0.4088, 0.4773, 0.7183]]
>>> phrases = None
>>> gligen_image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/backpack.jpeg"
... )

>>> images = pipe(
...     prompt=prompt,
...     gligen_phrases=phrases,
...     gligen_inpaint_image=input_image,
...     gligen_boxes=boxes,
...     gligen_images=[gligen_image],
...     gligen_scheduled_sampling_beta=1,
...     output_type="pil",
...     num_inference_steps=50,
... ).images

>>> images[0].save("./gligen-inpainting-text-image-box.jpg")

>>> # Generate an image described by the prompt and
>>> # insert objects described by text and image at the region defined by bounding boxes
>>> pipe = StableDiffusionGLIGENTextImagePipeline.from_pretrained(
...     "anhnct/Gligen_Text_Image", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")

>>> prompt = "a flower sitting on the beach"
>>> boxes = [[0.0, 0.09, 0.53, 0.76]]
>>> phrases = ["flower"]
>>> gligen_image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/pexels-pixabay-60597.jpg"
... )

>>> images = pipe(
...     prompt=prompt,
...     gligen_phrases=phrases,
...     gligen_images=[gligen_image],
...     gligen_boxes=boxes,
...     gligen_scheduled_sampling_beta=1,
...     output_type="pil",
...     num_inference_steps=50,
... ).images

>>> images[0].save("./gligen-generation-text-image-box.jpg")

>>> # Generate an image described by the prompt and
>>> # transfer style described by image at the region defined by bounding boxes
>>> pipe = StableDiffusionGLIGENTextImagePipeline.from_pretrained(
...     "anhnct/Gligen_Text_Image", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")

>>> prompt = "a dragon flying on the sky"
>>> boxes = [[0.4, 0.2, 1.0, 0.8], [0.0, 1.0, 0.0, 1.0]]  # Set `[0.0, 1.0, 0.0, 1.0]` for the style

>>> gligen_image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
... )

>>> gligen_placeholder = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
... )

>>> images = pipe(
...     prompt=prompt,
...     gligen_phrases=[
...         "dragon",
...         "placeholder",
...     ],  # Can use any text instead of `placeholder` token, because we will use mask here
...     gligen_images=[
...         gligen_placeholder,
...         gligen_image,
...     ],  # Can use any image in gligen_placeholder, because we will use mask here
...     input_phrases_mask=[1, 0],  # Set 0 for the placeholder token
...     input_images_mask=[0, 1],  # Set 0 for the placeholder image
...     gligen_boxes=boxes,
...     gligen_scheduled_sampling_beta=1,
...     output_type="pil",
...     num_inference_steps=50,
... ).images

>>> images[0].save("./gligen-generation-text-image-box-style-transfer.jpg")

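enable_vae_slicing

( )

Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices and computes decoding in several steps. This is useful to save some memory and allow larger batch sizes.

disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method goes back to computing decoding in one step.

enable_vae_tiling

( )

Enable tiled VAE decoding. When this option is enabled, the VAE splits the input tensor into tiles and computes encoding and decoding in several steps. This is useful for saving a large amount of memory and for processing larger images.

disable_vae_tiling

( )

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method goes back to computing decoding in one step.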

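enable_model_cpu_offload

( gpu_id: typing.Optional[int] = None device: typing.Union[torch.device, str] = None )

Parameters

gpu_id (int, optional): The ID of the accelerator that should be used in inference. If not specified, it defaults to 0.
device (torch.device or str, optional, defaults to None): The PyTorch device type of the accelerator that should be used in inference. If not specified, the available accelerator is detected automatically and used.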

prepare_latents

( batch_size num_channels_latents height width dtype device generator latents = None )

enable_fuser

( enabled = True )

complete_mask

( has_mask max_objs device )

Masks the features corresponding to phrases and images, based on the input mask value (0 or 1) given for each phrase and image.
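
One plausible reading of this behavior is sketched below with plain Python lists instead of tensors. The name complete_mask_sketch, the padding of unused object slots with 1, and the handling of None and int inputs are assumptions for illustration, not a copy of the library code.

```python
from typing import List, Optional, Union


def complete_mask_sketch(
    has_mask: Optional[Union[int, List[int]]], max_objs: int
) -> List[float]:
    """Expand a user-supplied mask to one entry per object slot.

    Assumptions: None keeps every slot, an int applies to all slots,
    and a list overrides only the leading slots.
    """
    mask = [1.0] * max_objs
    if has_mask is None:
        return mask
    if isinstance(has_mask, int):
        return [float(has_mask)] * max_objs
    if len(has_mask) > max_objs:
        raise ValueError("more mask entries than object slots")
    for idx, value in enumerate(has_mask):
        mask[idx] = float(value)
    return mask


# Masks from the style-transfer example: the second phrase is a placeholder.
phrases_mask = complete_mask_sketch([1, 0], max_objs=4)
print(phrases_mask)  # [1.0, 0.0, 1.0, 1.0]

# Multiplying per-object features by the mask zeroes out the placeholder.
features = [0.7, 0.9, 0.0, 0.0]
masked = [m * f for m, f in zip(phrases_mask, features)]
print(masked)  # [0.7, 0.0, 0.0, 0.0]
```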

crop

( im new_width new_height )

Crops the input image to the specified dimensions.
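
The page does not say which part of the image is kept. The sketch below assumes a centered crop and only computes the crop box as a (left, upper, right, lower) tuple, the form accepted by Pillow's Image.crop. The helper name center_crop_box is hypothetical.

```python
from typing import Tuple


def center_crop_box(
    width: int, height: int, new_width: int, new_height: int
) -> Tuple[int, int, int, int]:
    """Return a (left, upper, right, lower) box centered in the image.

    Assumption: the crop is centered; the source does not state this.
    """
    if new_width > width or new_height > height:
        raise ValueError("crop size exceeds image size")
    left = (width - new_width) // 2
    upper = (height - new_height) // 2
    return (left, upper, left + new_width, upper + new_height)


box = center_crop_box(width=640, height=480, new_width=400, new_height=400)
print(box)  # (120, 40, 520, 440)
```

With Pillow installed, the box could be applied as im.crop(box).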

draw_inpaint_mask_from_boxes

( boxes size )

Creates an inpainting mask from the given boxes. The function uses the provided boxes to mark the regions that need to be inpainted.
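
A minimal sketch of how such a mask could be built from normalized boxes, using nested lists instead of tensors. The convention that 0 marks a region to inpaint and 1 marks a region to keep is an assumption, as is the helper name inpaint_mask_from_boxes.

```python
from typing import List, Sequence


def inpaint_mask_from_boxes(
    boxes: Sequence[Sequence[float]], width: int, height: int
) -> List[List[int]]:
    """Build a height x width mask with 0 inside every box and 1 elsewhere.

    Boxes are [xmin, ymin, xmax, ymax] with values in [0, 1].
    """
    mask = [[1] * width for _ in range(height)]
    for xmin, ymin, xmax, ymax in boxes:
        x0, x1 = int(xmin * width), int(xmax * width)
        y0, y1 = int(ymin * height), int(ymax * height)
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = 0  # assumed convention: 0 = inpaint here
    return mask


mask = inpaint_mask_from_boxes([[0.25, 0.5, 0.75, 1.0]], width=8, height=4)
for row in mask:
    print("".join(str(value) for value in row))
# 11111111
# 11111111
# 11000011
# 11000011
```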

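encode_prompt

Parameters

prompt (str or List[str], optional): The prompt to be encoded.
device (torch.device): The torch device.
num_images_per_prompt (int): The number of images that should be generated per prompt.
do_classifier_free_guidance (bool): Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, you have to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
lora_scale (float, optional): A LoRA scale that is applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional): Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.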

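get_clip_feature

( input normalize_constant device is_image = False )

Gets the image and phrase embeddings using the pretrained CLIP model. The image embedding is transformed into the phrase embedding space through a projection.

get_cross_attention_kwargs_with_grounded

( hidden_size gligen_phrases gligen_images gligen_boxes input_phrases_mask input_images_mask repeat_batch normalize_constant max_objs device )

Prepares the cross-attention kwargs containing information about the grounded input (boxes, mask, image embedding, phrase embedding).

get_cross_attention_kwargs_without_grounded

( hidden_size repeat_batch max_objs device )

Prepares the cross-attention kwargs without information about the grounded input (boxes, mask, image embedding, phrase embedding). All of them are zero tensors.

target_size_center_crop

( im new_hw )

Crops and resizes the image to the target size while keeping the center.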

StableDiffusionPipelineOutput

class diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] nsfw_content_detected: typing.Optional[typing.List[bool]] )

Parameters

images (List[PIL.Image.Image] or np.ndarray): List of denoised PIL images of length batch_size, or a NumPy array of shape (batch_size, height, width, num_channels).
nsfw_content_detected (List[bool]): List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content, or None if safety checking could not be performed.

Output class for Stable Diffusion pipelines.
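
To show how the two fields relate, the sketch below filters out flagged images. OutputSketch is a hypothetical stand-in that mirrors only the two documented fields; it is not the Diffusers class. Keeping every image when nsfw_content_detected is None is a policy chosen here for illustration.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class OutputSketch:
    """Stand-in with the two documented fields (not the library class)."""

    images: List[Any]
    nsfw_content_detected: Optional[List[bool]]


def keep_unflagged(output: OutputSketch) -> List[Any]:
    """Return the images whose nsfw flag is False.

    If no safety check was performed (flags are None), all images are kept.
    """
    if output.nsfw_content_detected is None:
        return list(output.images)
    return [
        image
        for image, flagged in zip(output.images, output.nsfw_content_detected)
        if not flagged
    ]


# Strings stand in for PIL images to keep the sketch dependency-free.
output = OutputSketch(images=["img0", "img1", "img2"], nsfw_content_detected=[False, True, False])
print(keep_unflagged(output))  # ['img0', 'img2']
```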
