VisualCloze
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning is an innovative in-context-learning-based universal image generation framework that offers key capabilities:
- Support for various in-domain tasks
- Generalization to unseen tasks through in-context learning
- Unification of multiple tasks into one step, generating both the target image and intermediate results
- Support for reverse-engineering conditions from target images
Overview
The abstract from the paper is:
Recent progress in diffusion models has significantly advanced image generation. However, the mainstream approach still focuses on building task-specific models, which are inefficient when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instructions, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework that supports a wide range of in-domain tasks, generalization to unseen tasks, unseen unification of multiple tasks, and reverse generation. Unlike existing methods that rely on language-based task instructions, which lead to task ambiguity and weak generalization, we integrate visual in-context learning, allowing the model to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we find that our unified image generation formulation shares a consistent objective with image infilling, which enables us to leverage the strong generative priors of pre-trained infilling models without modifying the architecture. The code, dataset, and models are available at https://visualcloze.github.io.
Inference
Model loading
VisualCloze is a two-stage cascade pipeline consisting of `VisualClozeGenerationPipeline` and `VisualClozeUpsamplingPipeline`.
- In `VisualClozeGenerationPipeline`, each image is downsampled before being concatenated into a grid layout, to avoid an excessively high resolution. VisualCloze releases two models suitable for diffusers, VisualClozePipeline-384 and VisualClozePipeline-512, which downsample images to a resolution of 384 and 512, respectively.
- `VisualClozeUpsamplingPipeline` uses SDEdit to achieve high-resolution image synthesis.
`VisualClozePipeline` integrates both stages to support convenient end-to-end sampling, while still allowing users to run each pipeline independently as needed.
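For example, the first stage can be run at the higher 512 resolution by loading the corresponding checkpoint instead of the 384 one used in the examples below. This is a minimal sketch; the repository id "VisualCloze/VisualClozePipeline-512" is assumed here by analogy with the 384 checkpoint.
import torch
from diffusers import VisualClozePipeline
# Minimal sketch: load the 512-resolution first-stage model instead of the 384 one.
# The repo id "VisualCloze/VisualClozePipeline-512" is assumed by analogy with the 384 checkpoint.
pipe = VisualClozePipeline.from_pretrained(
    "VisualCloze/VisualClozePipeline-512", resolution=512, torch_dtype=torch.bfloat16
)
pipe.to("cuda")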
Input specifications
Task and content prompts
- Task prompt: required; describes the intention of the generation task
- Content prompt: optional description or caption of the target image
- Pass `None` when no content prompt is needed
- For batch inference, pass `List[str|None]`
Image input format
- Format: `List[List[Image|None]]`
- Structure:
  - All rows except the last represent in-context examples
  - The last row represents the current query (with the target image set to `None`)
- For batch inference, pass `List[List[List[Image|None]]]` (see the batched sketch below)
Resolution control
- Default behavior:
  - First-stage initial generation: an area of `${pipe.resolution}^2`
  - Second-stage upsampling: a factor of 3
- Custom resolution: adjust with the `upsampling_height` and `upsampling_width` parameters
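The nested structures above are easier to see in code. The following is a minimal sketch of a batched call with two queries; the placeholder prompts and PIL images are assumptions for illustration only, not real assets.
from PIL import Image

# Minimal sketch of the batched input layout (placeholder prompts/images, assumed for illustration).
task_prompts = ["<task prompt for query 1>", "<task prompt for query 2>"]  # List[str]
content_prompts = ["<caption for query 1>", None]                          # List[str|None]; None means no caption

placeholder = Image.new("RGB", (512, 512))  # stands in for a real condition/target image

images = [
    # query 1: one in-context example row followed by the query row
    [
        [placeholder, placeholder],  # in-context example: condition image, target image
        [placeholder, None],         # query row: condition image, target left as None
    ],
    # query 2: same structure
    [
        [placeholder, placeholder],
        [placeholder, None],
    ],
]  # List[List[List[Image|None]]]

# result = pipe(
#     task_prompt=task_prompts,
#     content_prompt=content_prompts,
#     image=images,
#     upsampling_width=1024,   # overrides the default 3x upsampling
#     upsampling_height=1024,
# )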
Examples
For comprehensive examples covering a wide range of tasks, refer to the online demo and the GitHub repository. Below are simple examples for three cases: mask-to-image conversion, edge detection, and subject-driven generation.
Mask-to-image example
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image
pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16)
pipe.to("cuda")
# Load in-context images (make sure the paths are correct and accessible)
image_paths = [
# in-context examples
[
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_mask.jpg'),
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_image.jpg'),
],
# query with the target image
[
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_query_mask.jpg'),
None, # No image needed for the target image
],
]
# Task and content prompt
task_prompt = "In each row, a logical task is demonstrated to achieve [IMAGE2] an aesthetically pleasing photograph based on [IMAGE1] sam 2-generated masks with rich color coding."
content_prompt = """Majestic photo of a golden eagle perched on a rocky outcrop in a mountainous landscape.
The eagle is positioned in the right foreground, facing left, with its sharp beak and keen eyes prominently visible.
Its plumage is a mix of dark brown and golden hues, with intricate feather details.
The background features a soft-focus view of snow-capped mountains under a cloudy sky, creating a serene and grandiose atmosphere.
The foreground includes rugged rocks and patches of green moss. Photorealistic, medium depth of field,
soft natural lighting, cool color palette, high contrast, sharp focus on the eagle, blurred background,
tranquil, majestic, wildlife photography."""
# Run the pipeline
image_result = pipe(
task_prompt=task_prompt,
content_prompt=content_prompt,
image=image_paths,
upsampling_width=1344,
upsampling_height=768,
upsampling_strength=0.4,
guidance_scale=30,
num_inference_steps=30,
max_sequence_length=512,
generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]
# Save the resulting image
image_result.save("visualcloze.png")
Edge detection example
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image
pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16)
pipe.to("cuda")
# Load in-context images (make sure the paths are correct and accessible)
image_paths = [
# in-context examples
[
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-1_image.jpg'),
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-1_edge.jpg'),
],
[
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-2_image.jpg'),
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-2_edge.jpg'),
],
# query with the target image
[
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_query_image.jpg'),
None, # No image needed for the target image
],
]
# Task and content prompt
task_prompt = "Each row illustrates a pathway from [IMAGE1] a sharp and beautifully composed photograph to [IMAGE2] edge map with natural well-connected outlines using a clear logical task."
content_prompt = ""
# Run the pipeline
image_result = pipe(
task_prompt=task_prompt,
content_prompt=content_prompt,
image=image_paths,
upsampling_width=864,
upsampling_height=1152,
upsampling_strength=0.4,
guidance_scale=30,
num_inference_steps=30,
max_sequence_length=512,
generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]
# Save the resulting image
image_result.save("visualcloze.png")
Subject-driven generation example
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image
pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16)
pipe.to("cuda")
# Load in-context images (make sure the paths are correct and accessible)
image_paths = [
# in-context examples
[
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-1_reference.jpg'),
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-1_depth.jpg'),
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-1_image.jpg'),
],
[
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-2_reference.jpg'),
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-2_depth.jpg'),
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-2_image.jpg'),
],
# query with the target image
[
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_query_reference.jpg'),
load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_query_depth.jpg'),
None, # No image needed for the target image
],
]
# Task and content prompt
task_prompt = """Each row describes a process that begins with [IMAGE1] an image containing the key object,
[IMAGE2] depth map revealing gray-toned spatial layers and results in
[IMAGE3] an image with artistic quality, a high-quality image with exceptional detail."""
content_prompt = """A vintage porcelain collector's item. Beneath a blossoming cherry tree in early spring,
this treasure is photographed up close, with soft pink petals drifting through the air and vibrant blossoms framing the scene."""
# Run the pipeline
image_result = pipe(
task_prompt=task_prompt,
content_prompt=content_prompt,
image=image_paths,
upsampling_width=1024,
upsampling_height=1024,
upsampling_strength=0.2,
guidance_scale=30,
num_inference_steps=30,
max_sequence_length=512,
generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]
# Save the resulting image
image_result.save("visualcloze.png")
Using each pipeline independently
import torch
from diffusers import VisualClozeGenerationPipeline, FluxFillPipeline as VisualClozeUpsamplingPipeline
from diffusers.utils import load_image
from PIL import Image
pipe = VisualClozeGenerationPipeline.from_pretrained(
"VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16
)
pipe.to("cuda")
image_paths = [
# in-context examples
[
load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_mask.jpg"
),
load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_image.jpg"
),
],
# query with the target image
[
load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_query_mask.jpg"
),
None, # No image needed for the target image
],
]
task_prompt = "In each row, a logical task is demonstrated to achieve [IMAGE2] an aesthetically pleasing photograph based on [IMAGE1] sam 2-generated masks with rich color coding."
content_prompt = "Majestic photo of a golden eagle perched on a rocky outcrop in a mountainous landscape. The eagle is positioned in the right foreground, facing left, with its sharp beak and keen eyes prominently visible. Its plumage is a mix of dark brown and golden hues, with intricate feather details. The background features a soft-focus view of snow-capped mountains under a cloudy sky, creating a serene and grandiose atmosphere. The foreground includes rugged rocks and patches of green moss. Photorealistic, medium depth of field, soft natural lighting, cool color palette, high contrast, sharp focus on the eagle, blurred background, tranquil, majestic, wildlife photography."
# Stage 1: Generate initial image
image = pipe(
task_prompt=task_prompt,
content_prompt=content_prompt,
image=image_paths,
guidance_scale=30,
num_inference_steps=30,
max_sequence_length=512,
generator=torch.Generator("cpu").manual_seed(0),
).images[0][0]
# Stage 2 (optional): Upsample the generated image
pipe_upsample = VisualClozeUpsamplingPipeline.from_pipe(pipe)
pipe_upsample.to("cuda")
mask_image = Image.new("RGB", image.size, (255, 255, 255))
image = pipe_upsample(
image=image,
mask_image=mask_image,
prompt=content_prompt,
width=1344,
height=768,
strength=0.4,
guidance_scale=30,
num_inference_steps=30,
max_sequence_length=512,
generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("visualcloze.png")
VisualClozePipeline
class diffusers.VisualClozePipeline
< source >( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer text_encoder_2: T5EncoderModel tokenizer_2: T5TokenizerFast transformer: FluxTransformer2DModel resolution: int = 384 )
Parameters
- transformer (FluxTransformer2DModel) — Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
- scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
- vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
- text_encoder (`CLIPTextModel`) — CLIP, specifically the clip-vit-large-patch14 variant.
- text_encoder_2 (`T5EncoderModel`) — The second text encoder, T5, specifically the google/t5-v1_1-xxl variant.
- tokenizer (`CLIPTokenizer`) — Tokenizer of class `CLIPTokenizer`.
- tokenizer_2 (`T5TokenizerFast`) — Second tokenizer of class `T5TokenizerFast`.
- resolution (`int`, optional, defaults to 384) — The resolution of each image when concatenating the images from the query and the in-context examples.
The VisualCloze pipeline for image generation with visual context. Reference: https://github.com/lzyhha/VisualCloze/tree/main. This pipeline is designed to generate an image from visual in-context examples.
__call__
< source >( task_prompt: typing.Union[str, typing.List[str]] = None content_prompt: typing.Union[str, typing.List[str]] = None image: typing.Optional[torch.FloatTensor] = None upsampling_height: typing.Optional[int] = None upsampling_width: typing.Optional[int] = None num_inference_steps: int = 50 sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 30.0 num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True joint_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 upsampling_strength: float = 1.0 ) → ~pipelines.flux.FluxPipelineOutput or tuple
Parameters
- task_prompt (`str` or `List[str]`, optional) — The prompt or prompts defining the task intention.
- content_prompt (`str` or `List[str]`, optional) — The prompt or prompts defining the content or caption of the target image to be generated.
- image (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`) — `Image`, numpy array, or tensor representing an image batch to be used as the starting point. For both numpy arrays and pytorch tensors, the expected value range is `[0, 1]`. If it is a tensor or a list of tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)`.
- upsampling_height (`int`, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image (i.e. the output image) after upsampling via SDEdit. By default the image is upsampled by a factor of three, and the base resolution is determined by the pipeline's resolution parameter. When only one of `upsampling_height` or `upsampling_width` is specified, the other is set automatically according to the aspect ratio.
- upsampling_width (`int`, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image (i.e. the output image) after upsampling via SDEdit. By default the image is upsampled by a factor of three, and the base resolution is determined by the pipeline's resolution parameter. When only one of `upsampling_height` or `upsampling_width` is specified, the other is set automatically according to the aspect ratio.
- num_inference_steps (`int`, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
- sigmas (`List[float]`, optional) — Custom sigmas to use for the denoising process with schedulers that support a `sigmas` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed is used.
- guidance_scale (`float`, optional, defaults to 30.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. `guidance_scale` is defined as `w` of equation 2 of the Imagen Paper. Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages generating images closely linked to the text `prompt`, usually at the expense of lower image quality.
- num_images_per_prompt (`int`, optional, defaults to 1) — The number of images to generate per prompt.
- generator (`torch.Generator` or `List[torch.Generator]`, optional) — One or a list of torch generator(s) to make generation deterministic.
- latents (`torch.FloatTensor`, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random `generator`.
- prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the `prompt` input argument.
- pooled_prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the `prompt` input argument.
- output_type (`str`, optional, defaults to `"pil"`) — The output format of the generated image. Choose between PIL: `PIL.Image.Image` or `np.array`.
- return_dict (`bool`, optional, defaults to `True`) — Whether or not to return a `~pipelines.flux.FluxPipelineOutput` instead of a plain tuple.
- joint_attention_kwargs (`dict`, optional) — A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined under `self.processor` in diffusers.models.attention_processor.
- callback_on_step_end (`Callable`, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
- callback_on_step_end_tensor_inputs (`List`, optional) — The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You can only include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
- max_sequence_length (`int`, defaults to 512) — Maximum sequence length to use with the `prompt`.
- upsampling_strength (`float`, optional, defaults to 1.0) — Indicates the extent to which to transform the reference `image` when upsampling the results. Must be between 0 and 1. The generated image is used as a starting point, and more noise is added the higher the `upsampling_strength`. The number of denoising steps depends on the amount of noise initially added. When `upsampling_strength` is 1, the added noise is maximum and the denoising process runs for the full number of iterations specified in `num_inference_steps`. A value of 0 skips the upsampling step and outputs the result at the resolution of `self.resolution`.
Returns
`~pipelines.flux.FluxPipelineOutput` or `tuple`
`~pipelines.flux.FluxPipelineOutput` if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.
Function invoked when calling the VisualCloze pipeline for generation.
Examples
>>> import torch
>>> from diffusers import VisualClozePipeline
>>> from diffusers.utils import load_image
>>> image_paths = [
... # in-context examples
... [
... load_image(
... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_mask.jpg"
... ),
... load_image(
... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_image.jpg"
... ),
... ],
... # query with the target image
... [
... load_image(
... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_query_mask.jpg"
... ),
... None, # No image needed for the target image
... ],
... ]
>>> task_prompt = "In each row, a logical task is demonstrated to achieve [IMAGE2] an aesthetically pleasing photograph based on [IMAGE1] sam 2-generated masks with rich color coding."
>>> content_prompt = "Majestic photo of a golden eagle perched on a rocky outcrop in a mountainous landscape. The eagle is positioned in the right foreground, facing left, with its sharp beak and keen eyes prominently visible. Its plumage is a mix of dark brown and golden hues, with intricate feather details. The background features a soft-focus view of snow-capped mountains under a cloudy sky, creating a serene and grandiose atmosphere. The foreground includes rugged rocks and patches of green moss. Photorealistic, medium depth of field, soft natural lighting, cool color palette, high contrast, sharp focus on the eagle, blurred background, tranquil, majestic, wildlife photography."
>>> pipe = VisualClozePipeline.from_pretrained(
... "VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16
... )
>>> pipe.to("cuda")
>>> image = pipe(
... task_prompt=task_prompt,
... content_prompt=content_prompt,
... image=image_paths,
... upsampling_width=1344,
... upsampling_height=768,
... upsampling_strength=0.4,
... guidance_scale=30,
... num_inference_steps=30,
... max_sequence_length=512,
... generator=torch.Generator("cpu").manual_seed(0),
... ).images[0][0]
>>> image.save("visualcloze.png")
VisualClozeGenerationPipeline
class diffusers.VisualClozeGenerationPipeline
< source >( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer text_encoder_2: T5EncoderModel tokenizer_2: T5TokenizerFast transformer: FluxTransformer2DModel resolution: int = 384 )
Parameters
- transformer (FluxTransformer2DModel) — Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
- scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
- vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
- text_encoder (`CLIPTextModel`) — CLIP, specifically the clip-vit-large-patch14 variant.
- text_encoder_2 (`T5EncoderModel`) — T5, specifically the google/t5-v1_1-xxl variant.
- tokenizer (`CLIPTokenizer`) — Tokenizer of class `CLIPTokenizer`.
- tokenizer_2 (`T5TokenizerFast`) — Second tokenizer of class `T5TokenizerFast`.
- resolution (`int`, optional, defaults to 384) — The resolution of each image when concatenating the images from the query and the in-context examples.
The VisualCloze pipeline for image generation with visual context. Reference: https://github.com/lzyhha/VisualCloze/tree/main. This pipeline is designed to generate an image from visual in-context examples.
__call__
< source >( task_prompt: typing.Union[str, typing.List[str]] = None content_prompt: typing.Union[str, typing.List[str]] = None image: typing.Optional[torch.FloatTensor] = None num_inference_steps: int = 50 sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 30.0 num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True joint_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~pipelines.flux.FluxPipelineOutput
or tuple
Parameters
- task_prompt (`str` or `List[str]`, optional) — The prompt or prompts defining the task intention.
- content_prompt (`str` or `List[str]`, optional) — The prompt or prompts defining the content or caption of the target image to be generated.
- image (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`) — `Image`, numpy array, or tensor representing an image batch to be used as the starting point. For both numpy arrays and pytorch tensors, the expected value range is `[0, 1]`. If it is a tensor or a list of tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)`.
- num_inference_steps (`int`, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
- sigmas (`List[float]`, optional) — Custom sigmas to use for the denoising process with schedulers that support a `sigmas` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed is used.
- guidance_scale (`float`, optional, defaults to 30.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. `guidance_scale` is defined as `w` of equation 2 of the Imagen Paper. Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages generating images closely linked to the text `prompt`, usually at the expense of lower image quality.
- num_images_per_prompt (`int`, optional, defaults to 1) — The number of images to generate per prompt.
- generator (`torch.Generator` or `List[torch.Generator]`, optional) — One or a list of torch generator(s) to make generation deterministic.
- latents (`torch.FloatTensor`, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random `generator`.
- prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the `prompt` input argument.
- pooled_prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the `prompt` input argument.
- output_type (`str`, optional, defaults to `"pil"`) — The output format of the generated image. Choose between PIL: `PIL.Image.Image` or `np.array`.
- return_dict (`bool`, optional, defaults to `True`) — Whether or not to return a `~pipelines.flux.FluxPipelineOutput` instead of a plain tuple.
- joint_attention_kwargs (`dict`, optional) — A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined under `self.processor` in diffusers.models.attention_processor.
- callback_on_step_end (`Callable`, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
- callback_on_step_end_tensor_inputs (`List`, optional) — The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You can only include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
- max_sequence_length (`int`, defaults to 512) — Maximum sequence length to use with the `prompt`.
Returns
`~pipelines.flux.FluxPipelineOutput` or `tuple`
`~pipelines.flux.FluxPipelineOutput` if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.
Function invoked when calling the VisualCloze pipeline for generation.
Examples
>>> import torch
>>> from diffusers import VisualClozeGenerationPipeline, FluxFillPipeline as VisualClozeUpsamplingPipeline
>>> from diffusers.utils import load_image
>>> from PIL import Image
>>> image_paths = [
... # in-context examples
... [
... load_image(
... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_mask.jpg"
... ),
... load_image(
... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_image.jpg"
... ),
... ],
... # query with the target image
... [
... load_image(
... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_query_mask.jpg"
... ),
... None, # No image needed for the target image
... ],
... ]
>>> task_prompt = "In each row, a logical task is demonstrated to achieve [IMAGE2] an aesthetically pleasing photograph based on [IMAGE1] sam 2-generated masks with rich color coding."
>>> content_prompt = "Majestic photo of a golden eagle perched on a rocky outcrop in a mountainous landscape. The eagle is positioned in the right foreground, facing left, with its sharp beak and keen eyes prominently visible. Its plumage is a mix of dark brown and golden hues, with intricate feather details. The background features a soft-focus view of snow-capped mountains under a cloudy sky, creating a serene and grandiose atmosphere. The foreground includes rugged rocks and patches of green moss. Photorealistic, medium depth of field, soft natural lighting, cool color palette, high contrast, sharp focus on the eagle, blurred background, tranquil, majestic, wildlife photography."
>>> pipe = VisualClozeGenerationPipeline.from_pretrained(
... "VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16
... )
>>> pipe.to("cuda")
>>> image = pipe(
... task_prompt=task_prompt,
... content_prompt=content_prompt,
... image=image_paths,
... guidance_scale=30,
... num_inference_steps=30,
... max_sequence_length=512,
... generator=torch.Generator("cpu").manual_seed(0),
... ).images[0][0]
>>> # optional, upsampling the generated image
>>> pipe_upsample = VisualClozeUpsamplingPipeline.from_pipe(pipe)
>>> pipe_upsample.to("cuda")
>>> mask_image = Image.new("RGB", image.size, (255, 255, 255))
>>> image = pipe_upsample(
... image=image,
... mask_image=mask_image,
... prompt=content_prompt,
... width=1344,
... height=768,
... strength=0.4,
... guidance_scale=30,
... num_inference_steps=30,
... max_sequence_length=512,
... generator=torch.Generator("cpu").manual_seed(0),
... ).images[0]
>>> image.save("visualcloze.png")
disable_vae_slicing
Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method goes back to computing decoding in one step.
disable_vae_tiling
Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method goes back to computing decoding in one step.
enable_vae_slicing
Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
enable_vae_tiling
Enable tiled VAE decoding. When this option is enabled, the VAE splits the input tensor into tiles to compute encoding and decoding in several steps. This is useful to save a large amount of memory and to allow processing larger images.
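As a minimal sketch, assuming a pipeline instance `pipe` created as in the examples above, these memory-saving toggles would be used like this:
# Minimal sketch: turn the memory-saving VAE modes on before sampling, and off afterwards.
pipe.enable_vae_slicing()   # decode the batch slice by slice
pipe.enable_vae_tiling()    # decode/encode large images tile by tile

# ... run pipe(...) as in the examples above ...

pipe.disable_vae_slicing()  # go back to one-step decoding
pipe.disable_vae_tiling()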
encode_prompt
< source >( layout_prompt: typing.Union[str, typing.List[str]] task_prompt: typing.Union[str, typing.List[str]] content_prompt: typing.Union[str, typing.List[str]] device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None max_sequence_length: int = 512 lora_scale: typing.Optional[float] = None )
Parameters
- layout_prompt (`str` or `List[str]`, optional) — The prompt or prompts defining the number of in-context examples and the number of images involved in the task.
- task_prompt (`str` or `List[str]`, optional) — The prompt or prompts defining the task intention.
- content_prompt (`str` or `List[str]`, optional) — The prompt or prompts defining the content or caption of the target image to be generated.
- device (`torch.device`) — torch device.
- num_images_per_prompt (`int`) — The number of images that should be generated per prompt.
- prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the `prompt` input argument.
- pooled_prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the `prompt` input argument.
- lora_scale (`float`, optional) — A LoRA scale that is applied to all LoRA layers of the text encoder if LoRA layers are loaded.