Diffusers

Shap-E

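The Shap-E model was proposed in Shap-E: Generating Conditional 3D Implicit Functions by Alex Nichol and Heewoo Jun from OpenAI.

The abstract from the paper is:

We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.

The original codebase can be found at openai/shap-e.

See the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.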

ShapEPipeline

class diffusers.ShapEPipeline

( prior: PriorTransformer text_encoder: CLIPTextModelWithProjection tokenizer: CLIPTokenizer scheduler: HeunDiscreteScheduler shap_e_renderer: ShapERenderer )

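Parameters

prior (PriorTransformer) — The canonical unCLIP prior to approximate the image embedding from the text embedding.
text_encoder (CLIPTextModelWithProjection) — Frozen text encoder.
tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize text.
scheduler (HeunDiscreteScheduler) — A scheduler to be used in combination with the prior model to generate image embeddings.
shap_e_renderer (ShapERenderer) — The Shap-E renderer projects the generated latents into the parameters of an MLP to create 3D objects with the NeRF rendering method.

Pipeline for generating a latent representation of a 3D asset and rendering it with the NeRF method.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).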

__call__

( prompt: str num_images_per_prompt: int = 1 num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None guidance_scale: float = 4.0 frame_size: int = 64 output_type: typing.Optional[str] = 'pil' return_dict: bool = True ) → ShapEPipelineOutput or tuple

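Parameters

prompt (str or List[str]) — The prompt or prompts to guide the image generation.
num_images_per_prompt (int, *optional*, defaults to 1) — The number of images to generate per prompt.
num_inference_steps (int, *optional*, defaults to 25) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], *optional*) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, *optional*) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
guidance_scale (float, *optional*, defaults to 4.0) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
frame_size (int, *optional*, defaults to 64) — The width and height of each image frame of the generated 3D output.
output_type (str, *optional*, defaults to "pil") — The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array), "latent" (torch.Tensor), or "mesh" (MeshDecoderOutput).
return_dict (bool, *optional*, defaults to True) — Whether or not to return a ShapEPipelineOutput instead of a plain tuple.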
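The defaults and allowed values listed above can be gathered into one place and checked before a long generation run starts. The following is a minimal sketch: `ShapECallConfig` and `VALID_OUTPUT_TYPES` are hypothetical names introduced here for illustration and are not part of diffusers; only the parameter names, defaults, and output formats come from the list above.

```python
from dataclasses import asdict, dataclass

# Output formats named in the parameter description above.
VALID_OUTPUT_TYPES = {"pil", "np", "latent", "mesh"}


@dataclass
class ShapECallConfig:
    """Hypothetical container for ShapEPipeline.__call__ keyword arguments."""

    num_images_per_prompt: int = 1
    num_inference_steps: int = 25
    guidance_scale: float = 4.0
    frame_size: int = 64
    output_type: str = "pil"
    return_dict: bool = True

    def __post_init__(self):
        # Fail early on values the documented parameters do not allow.
        if self.output_type not in VALID_OUTPUT_TYPES:
            raise ValueError(f"output_type must be one of {sorted(VALID_OUTPUT_TYPES)}")
        if self.num_inference_steps < 1:
            raise ValueError("num_inference_steps must be at least 1")
        if self.frame_size < 1:
            raise ValueError("frame_size must be positive")

    def to_kwargs(self) -> dict:
        """Return the settings as a dict suitable for `pipe(prompt, **kwargs)`."""
        return asdict(self)


config = ShapECallConfig(guidance_scale=15.0, num_inference_steps=64, frame_size=256)
call_kwargs = config.to_kwargs()
# With a loaded pipeline: images = pipe("a shark", **call_kwargs).images
```

Keeping the settings in a dataclass makes it easy to log or reuse the exact arguments of a run; the validation rules are assumptions chosen for this sketch, not checks the pipeline is known to perform.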

Returns

ShapEPipelineOutput or tuple

If return_dict is True, a ShapEPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images.
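Because the return type depends on return_dict, calling code that should work with both forms can normalize the result first. The sketch below assumes only what the description above states: the output object exposes the generated images (as `.images`, the attribute used in the examples on this page) and the tuple carries them as its first element. `extract_images` is a hypothetical helper, and the stand-in objects replace a real pipeline call.

```python
from types import SimpleNamespace


def extract_images(result):
    """Return the generated images from an output object or a plain tuple."""
    if isinstance(result, tuple):
        # return_dict=False: the first element holds the generated images.
        return result[0]
    # return_dict=True: the output object exposes them as `.images`.
    return result.images


# Stand-ins for the two return forms, so the sketch runs without a model.
as_object = SimpleNamespace(images=["frames"])
as_tuple = (["frames"],)

print(extract_images(as_object) == extract_images(as_tuple))
```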

The call function to the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from diffusers.utils import export_to_gif

>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

>>> repo = "openai/shap-e"
>>> pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
>>> pipe = pipe.to(device)

>>> guidance_scale = 15.0
>>> prompt = "a shark"

>>> images = pipe(
...     prompt,
...     guidance_scale=guidance_scale,
...     num_inference_steps=64,
...     frame_size=256,
... ).images

>>> gif_path = export_to_gif(images[0], "shark_3d.gif")
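The example above renders image frames and saves them as a GIF. To obtain geometry instead, output_type can be set to "mesh" and the result written to disk. The following is a sketch under a few assumptions: it uses `export_to_ply` from `diffusers.utils`, `generate_ply` and `MESH_CALL_KWARGS` are hypothetical names, and the guidance and step values simply mirror the example above rather than being tuned for meshes.

```python
# Call arguments for mesh output; values mirror the GIF example and are not tuned.
MESH_CALL_KWARGS = {
    "guidance_scale": 15.0,
    "num_inference_steps": 64,
    "output_type": "mesh",
}


def generate_ply(prompt, out_path="shape.ply", repo="openai/shap-e"):
    """Generate one mesh for `prompt` and write it to `out_path` as a PLY file."""
    # Imported lazily so this module loads without the heavy dependencies.
    import torch
    from diffusers import ShapEPipeline
    from diffusers.utils import export_to_ply

    device = "cuda" if torch.cuda.is_available() else "cpu"
    pipe = ShapEPipeline.from_pretrained(repo).to(device)

    # With output_type="mesh", each entry of `.images` is a mesh, not a frame list.
    meshes = pipe(prompt, **MESH_CALL_KWARGS).images
    return export_to_ply(meshes[0], out_path)
```

Calling `generate_ply("a shark")` downloads the checkpoint on first use and should leave a `shape.ply` file that can be opened in a 3D viewer.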

ShapEImg2ImgPipeline

class diffusers.ShapEImg2ImgPipeline

( prior: PriorTransformer image_encoder: CLIPVisionModel image_processor: CLIPImageProcessor scheduler: HeunDiscreteScheduler shap_e_renderer: ShapERenderer )

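Parameters

prior (PriorTransformer) — The canonical unCLIP prior to approximate the image embedding from the text embedding.
image_encoder (CLIPVisionModel) — Frozen image encoder.
image_processor (CLIPImageProcessor) — A CLIPImageProcessor to process images.
scheduler (HeunDiscreteScheduler) — A scheduler to be used in combination with the prior model to generate image embeddings.
shap_e_renderer (ShapERenderer) — The Shap-E renderer projects the generated latents into the parameters of an MLP to create 3D objects with the NeRF rendering method.

Pipeline for generating a latent representation of a 3D asset from an image and rendering it with the NeRF method.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).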

__call__

( image: typing.Union[PIL.Image.Image, typing.List[PIL.Image.Image]] num_images_per_prompt: int = 1 num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None guidance_scale: float = 4.0 frame_size: int = 64 output_type: typing.Optional[str] = 'pil' return_dict: bool = True ) → ShapEPipelineOutput or tuple

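Parameters

image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]) — Image or tensor representing an image batch to be used as the starting point. Image latents can also be passed as image; latents passed directly are not encoded again.
num_images_per_prompt (int, *optional*, defaults to 1) — The number of images to generate per prompt.
num_inference_steps (int, *optional*, defaults to 25) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], *optional*) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, *optional*) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
guidance_scale (float, *optional*, defaults to 4.0) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
frame_size (int, *optional*, defaults to 64) — The width and height of each image frame of the generated 3D output.
output_type (str, *optional*, defaults to "pil") — The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array), "latent" (torch.Tensor), or "mesh" (MeshDecoderOutput).
return_dict (bool, *optional*, defaults to True) — Whether or not to return a ShapEPipelineOutput instead of a plain tuple.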
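The rule that guidance only takes effect when guidance_scale > 1 follows from how classifier-free guidance usually combines two predictions. The sketch below shows that common formula on plain Python lists; it illustrates the general technique and is an assumption about how guidance is applied, not a transcription of this pipeline's source. `apply_guidance` is a hypothetical name.

```python
def apply_guidance(uncond, cond, guidance_scale):
    """Combine unconditional and conditional predictions element by element.

    Common classifier-free guidance form: uncond + scale * (cond - uncond).
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # prediction without conditioning
cond = [1.0, 3.0]    # prediction with conditioning

print(apply_guidance(uncond, cond, 1.0))  # [1.0, 3.0], identical to cond
print(apply_guidance(uncond, cond, 4.0))  # [4.0, 9.0], pushed past cond
```

At a scale of 1 the result equals the conditional prediction, so no extra guidance is applied; larger values extrapolate further away from the unconditional prediction, which is where the trade-off with image quality comes from.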

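Returns

ShapEPipelineOutput or tuple

If return_dict is True, a ShapEPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images.

The call function to the pipeline for generation.

Examples: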

>>> from PIL import Image
>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from diffusers.utils import export_to_gif, load_image

>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

>>> repo = "openai/shap-e-img2img"
>>> pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
>>> pipe = pipe.to(device)

>>> guidance_scale = 3.0
>>> image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/shap-e/corgi.png"
>>> image = load_image(image_url).convert("RGB")

>>> images = pipe(
...     image,
...     guidance_scale=guidance_scale,
...     num_inference_steps=64,
...     frame_size=256,
... ).images

>>> gif_path = export_to_gif(images[0], "corgi_3d.gif")
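The example above requests torch.float16 even when the device falls back to CPU. Half precision on CPU can be slow or unsupported for some operators, depending on the PyTorch version, so one option is to choose the dtype together with the device. This is a sketch of that idea; `choose_device_and_dtype` and `load_img2img_pipeline` are hypothetical helpers, not diffusers functions.

```python
def choose_device_and_dtype(cuda_available):
    """Return (device, dtype name): half precision on GPU, full precision on CPU."""
    if cuda_available:
        return "cuda", "float16"
    return "cpu", "float32"


def load_img2img_pipeline(repo="openai/shap-e-img2img"):
    """Load the image-conditioned pipeline with a dtype that matches the device."""
    # Imported lazily so the selection logic can be used without torch installed.
    import torch
    from diffusers import DiffusionPipeline

    device, dtype_name = choose_device_and_dtype(torch.cuda.is_available())
    pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=getattr(torch, dtype_name))
    return pipe.to(device)
```

The returned pipeline can be called exactly as in the example above; only the loading step changes.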