Prompt Depth Anything

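Overview

The Prompt Depth Anything model was introduced in Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation by Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, and Bingyi Kang.

The abstract from the paper is as follows:

Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. Our approach sets a new state-of-the-art on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.

Prompt Depth Anything overview. Taken from the original paper.

Usage example

The Transformers library allows you to use the model with just a few lines of code: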

>>> import torch
>>> import requests
>>> import numpy as np

>>> from PIL import Image
>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation

>>> url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/image.jpg?raw=true"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")
>>> model = AutoModelForDepthEstimation.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")

>>> prompt_depth_url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/arkit_depth.png?raw=true"
>>> prompt_depth = Image.open(requests.get(prompt_depth_url, stream=True).raw)
>>> # the prompt depth can be None, and the model will output a monocular relative depth.

>>> # prepare image for the model
>>> inputs = image_processor(images=image, return_tensors="pt", prompt_depth=prompt_depth)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # interpolate to original size
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     target_sizes=[(image.height, image.width)],
... )

>>> # visualize the prediction
>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = predicted_depth * 1000 
>>> depth = depth.detach().cpu().numpy()
>>> depth = Image.fromarray(depth.astype("uint16")) # mm
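The last lines of the example scale the prediction by 1000 and cast it to uint16, which can only hold values up to 65535 (about 65.5 m if the prediction is in meters, as the `# mm` comment implies). The helper below is a minimal sketch of that conversion with explicit clamping, written in plain Python so it runs without dependencies. The name `meters_to_uint16_mm` is hypothetical and not part of Transformers.

```python
def meters_to_uint16_mm(depth_m):
    """Convert depth values in meters to integer millimeters that fit in uint16.

    Hypothetical helper for illustration; not part of the Transformers API.
    """
    uint16_max = 65535
    result = []
    for value in depth_m:
        millimeters = round(value * 1000)
        # Clamp so out-of-range depths saturate instead of wrapping around.
        result.append(max(0, min(uint16_max, millimeters)))
    return result


print(meters_to_uint16_mm([0.5, 2.25, 70.0, -0.1]))
```

With array data, the same idea could be expressed by clipping the array to the range 0 to 65535 before the cast.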

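Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Prompt Depth Anything.

If you are interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. The resource should ideally demonstrate something new instead of duplicating an existing resource.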

PromptDepthAnythingConfig

class transformers.PromptDepthAnythingConfig

( backbone_config = None backbone = None use_pretrained_backbone = False use_timm_backbone = False backbone_kwargs = None patch_size = 14 initializer_range = 0.02 reassemble_hidden_size = 384 reassemble_factors = [4, 2, 1, 0.5] neck_hidden_sizes = [48, 96, 192, 384] fusion_hidden_size = 64 head_in_index = -1 head_hidden_size = 32 depth_estimation_type = 'relative' max_depth = None **kwargs )

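Parameters

backbone_config (Union[Dict[str, Any], PretrainedConfig], optional) — The configuration of the backbone model. Only used if is_hybrid is True or if you want to leverage the AutoBackbone API.
backbone (str, optional) — Name of the backbone to use when backbone_config is None. If use_pretrained_backbone is True, this loads the corresponding pretrained weights from the timm or transformers library. If use_pretrained_backbone is False, this loads the backbone's config and uses it to initialize the backbone with random weights.
use_pretrained_backbone (bool, optional, defaults to False) — Whether to use pretrained weights for the backbone.
use_timm_backbone (bool, optional, defaults to False) — Whether to use the timm library for the backbone. If set to False, the AutoBackbone API is used.
backbone_kwargs (dict, optional) — Keyword arguments to pass to AutoBackbone when loading from a checkpoint, e.g. {'out_indices': (0, 1, 2, 3)}. Cannot be specified if backbone_config is set.
patch_size (int, optional, defaults to 14) — The size of the patches to extract from the backbone features.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer used to initialize all weight matrices.
reassemble_hidden_size (int, optional, defaults to 384) — The number of input channels of the reassemble layers.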
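reassemble_factors (List[float], optional, defaults to [4, 2, 1, 0.5]) — The up/downsampling factors of the reassemble layers.
neck_hidden_sizes (List[int], optional, defaults to [48, 96, 192, 384]) — The hidden sizes to project the backbone feature maps to.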
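fusion_hidden_size (int, optional, defaults to 64) — The number of channels before fusion.
head_in_index (int, optional, defaults to -1) — The index of the features to use in the depth estimation head.
head_hidden_size (int, optional, defaults to 32) — The number of output channels in the second convolution of the depth estimation head.
depth_estimation_type (str, optional, defaults to "relative") — The type of depth estimation to use. Can be one of ["relative", "metric"].
max_depth (float, optional) — The maximum depth to use for the "metric" depth estimation head. Use 20 for indoor models and 80 for outdoor models. This value is ignored for "relative" depth estimation.

This is the configuration class to store the configuration of a PromptDepthAnythingModel. It is used to instantiate a PromptDepthAnything model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the PromptDepthAnything LiheYoung/depth-anything-small-hf architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example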

>>> from transformers import PromptDepthAnythingConfig, PromptDepthAnythingForDepthEstimation

>>> # Initializing a PromptDepthAnything small style configuration
>>> configuration = PromptDepthAnythingConfig()

>>> # Initializing a model from the PromptDepthAnything small style configuration
>>> model = PromptDepthAnythingForDepthEstimation(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
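To give a feel for how patch_size and reassemble_factors interact, the sketch below computes the patch grid for an input resolution and the approximate spatial size of each reassembled feature map. It assumes the input height and width are multiples of patch_size and that fractional sizes round up. The helper name `feature_map_sizes` is hypothetical, and the actual layers may round differently.

```python
import math


def feature_map_sizes(height, width, patch_size=14, reassemble_factors=(4, 2, 1, 0.5)):
    """Return the patch grid and the approximate size of each reassembled map.

    Hypothetical helper for illustration; the rounding rule is an assumption.
    """
    # Each patch covers patch_size x patch_size pixels of the input image.
    grid = (height // patch_size, width // patch_size)
    # Factors above 1 upsample the grid, factors below 1 downsample it.
    sizes = [
        (math.ceil(grid[0] * factor), math.ceil(grid[1] * factor))
        for factor in reassemble_factors
    ]
    return grid, sizes


grid, sizes = feature_map_sizes(518, 518)
print(grid)
print(sizes)
```

Under these assumptions, the four factors produce a feature pyramid that ranges from four times the patch grid resolution down to roughly half of it.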

to_dict

( )

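Serializes this instance to a Python dictionary. Overrides the default to_dict(). Returns: Dict[str, any]: Dictionary of all the attributes that make up this configuration instance.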

PromptDepthAnythingForDepthEstimation

class transformers.PromptDepthAnythingForDepthEstimation

( config )

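Parameters

config (PromptDepthAnythingConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Prompt Depth Anything model with a depth estimation head on top (consisting of 3 convolutional layers), e.g. for KITTI, NYUv2.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.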

forward

( pixel_values: FloatTensor prompt_depth: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.DepthEstimatorOutput or tuple(torch.FloatTensor)

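Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See DPTImageProcessor.__call__() for details.
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
prompt_depth (torch.FloatTensor of shape (batch_size, 1, height, width), optional) — Prompt depth is the sparse or low-resolution depth obtained from multi-view geometry or a low-resolution depth sensor. It generally has shape (height, width), where height and width can be smaller than those of the images. It is optional and can be None, which means no prompt depth is used. If it is None, the output is a monocular relative depth. Values are recommended to be in meters, but this is not required.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns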

transformers.modeling_outputs.DepthEstimatorOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.DepthEstimatorOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PromptDepthAnythingConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
predicted_depth (torch.FloatTensor of shape (batch_size, height, width)) — Predicted depth for each pixel.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, num_channels, height, width).

Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The PromptDepthAnythingForDepthEstimation forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples

>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation
>>> import torch
>>> import numpy as np
>>> from PIL import Image
>>> import requests

>>> url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/image.jpg?raw=true"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")
>>> model = AutoModelForDepthEstimation.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")

>>> prompt_depth_url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/arkit_depth.png?raw=true"
>>> prompt_depth = Image.open(requests.get(prompt_depth_url, stream=True).raw)

>>> # prepare image for the model
>>> inputs = image_processor(images=image, return_tensors="pt", prompt_depth=prompt_depth)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # interpolate to original size
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     target_sizes=[(image.height, image.width)],
... )

>>> # visualize the prediction
>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = predicted_depth * 1000.
>>> depth = depth.detach().cpu().numpy()
>>> depth = Image.fromarray(depth.astype("uint16")) # mm

PromptDepthAnythingImageProcessor

class transformers.PromptDepthAnythingImageProcessor

( do_resize: bool = True size: typing.Dict[str, int] = None resample: Resampling = <Resampling.BICUBIC: 3> keep_aspect_ratio: bool = False ensure_multiple_of: int = 1 do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_pad: bool = False size_divisor: int = None prompt_scale_to_meter: float = 0.001 **kwargs )

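Parameters

do_resize (bool, optional, defaults to True) — Whether to resize the image's (height, width) dimensions. Can be overridden by do_resize in preprocess.
size (Dict[str, int], optional, defaults to {"height": 384, "width": 384}) — Size of the image after resizing. Can be overridden by size in preprocess.
resample (PILImageResampling, optional, defaults to Resampling.BICUBIC) — Defines the resampling filter to use if resizing the image. Can be overridden by resample in preprocess.
keep_aspect_ratio (bool, optional, defaults to False) — If True, the image is resized to the largest possible size such that the aspect ratio is preserved. Can be overridden by keep_aspect_ratio in preprocess.
ensure_multiple_of (int, optional, defaults to 1) — If do_resize is True, the image is resized to a size that is a multiple of this value. Can be overridden by ensure_multiple_of in preprocess.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in preprocess.
rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by rescale_factor in preprocess.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or List[float], optional, defaults to IMAGENET_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or a list of floats whose length equals the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or List[float], optional, defaults to IMAGENET_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or a list of floats whose length equals the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_pad (bool, optional, defaults to False) — Whether to apply center padding. This was introduced in the DINOv2 paper, which uses the model in combination with DPT.
size_divisor (int, optional) — If do_pad is True, pads the image dimensions to be divisible by this value. This was introduced in the DINOv2 paper, which uses the model in combination with DPT.
prompt_scale_to_meter (float, optional, defaults to 0.001) — Scale factor to convert the prompt depth to meters.

Constructs a PromptDepthAnything image processor.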
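The prompt_scale_to_meter parameter is a plain multiplicative factor. The default of 0.001 suits prompt depth stored in millimeters, which is common for 16-bit depth PNGs. The following is a minimal sketch of the arithmetic in plain Python; the helper name `prompt_depth_to_meters` is hypothetical, and the image processor is assumed to apply the same scaling to the whole prompt depth array.

```python
def prompt_depth_to_meters(raw_values, prompt_scale_to_meter=0.001):
    """Scale raw prompt depth readings to meters.

    Hypothetical helper for illustration; not part of the Transformers API.
    """
    return [value * prompt_scale_to_meter for value in raw_values]


# Raw sensor readings in millimeters use the default factor.
print(prompt_depth_to_meters([500, 1250, 4000]))

# Readings stored in centimeters would need a factor of 0.01 instead.
print(prompt_depth_to_meters([50, 125, 400], prompt_scale_to_meter=0.01))
```

If the prompt depth is already in meters, a factor of 1.0 leaves it unchanged.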

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] prompt_depth: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None do_resize: typing.Optional[bool] = None size: typing.Optional[int] = None keep_aspect_ratio: typing.Optional[bool] = None ensure_multiple_of: typing.Optional[int] = None resample: typing.Optional[PIL.Image.Resampling] = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_pad: typing.Optional[bool] = None size_divisor: typing.Optional[int] = None prompt_scale_to_meter: typing.Optional[float] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

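Parameters

images (ImageInput) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
prompt_depth (ImageInput, optional) — Prompt depth to preprocess, which can be sparse depth obtained from multi-view geometry or low-resolution depth from a depth sensor. It generally has shape (height, width), where height and width can be smaller than those of the images. It is optional and can be None, which means no prompt depth is used. If it is None, the output depth is a monocular relative depth. It is recommended to provide a prompt_scale_to_meter value, which is the scale factor to convert the prompt depth to meters. This is useful when the prompt depth is not in meters.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (Dict[str, int], optional, defaults to self.size) — Size of the image after resizing. If keep_aspect_ratio is True, the image is resized to the largest possible size such that the aspect ratio is preserved. If ensure_multiple_of is set, the image is resized to a size that is a multiple of this value.
keep_aspect_ratio (bool, optional, defaults to self.keep_aspect_ratio) — Whether to keep the aspect ratio of the image. If False, the image is resized to (size, size). If True, the image is resized to keep the aspect ratio, and the size is the maximum possible.
ensure_multiple_of (int, optional, defaults to self.ensure_multiple_of) — Ensure that the image size is a multiple of this value.
resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image values to the range [0, 1].
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or List[float], optional, defaults to self.image_mean) — Image mean.
image_std (float or List[float], optional, defaults to self.image_std) — Image standard deviation.
prompt_scale_to_meter (float, optional, defaults to self.prompt_scale_to_meter) — Scale factor to convert the prompt depth to meters.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- ChannelDimension.LAST: image in (height, width, num_channels) format.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess an image or batch of images.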
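The interaction between size, keep_aspect_ratio, and ensure_multiple_of can be illustrated with a small calculation. The sketch below is one plausible reading of the parameter descriptions above, loosely following the approach used by DPT-style processors; the function names are hypothetical, and the library's exact rounding may differ.

```python
import math


def constrain_to_multiple_of(value, multiple):
    """Round value to the nearest multiple, avoiding a zero size for positive inputs."""
    rounded = round(value / multiple) * multiple
    if rounded < multiple:
        rounded = math.ceil(value / multiple) * multiple
    return int(rounded)


def resize_output_size(input_size, target_size, keep_aspect_ratio=False, ensure_multiple_of=1):
    """Compute an output (height, width) for a resize request.

    Hypothetical helper for illustration; not the library's implementation.
    """
    in_h, in_w = input_size
    out_h, out_w = target_size
    scale_h = out_h / in_h
    scale_w = out_w / in_w
    if keep_aspect_ratio:
        # Use a single scale for both axes: the one that changes the image least.
        if abs(1 - scale_w) < abs(1 - scale_h):
            scale_h = scale_w
        else:
            scale_w = scale_h
    return (
        constrain_to_multiple_of(scale_h * in_h, ensure_multiple_of),
        constrain_to_multiple_of(scale_w * in_w, ensure_multiple_of),
    )


# A 1440 x 1920 (height x width) image with a 756 x 756 target.
print(resize_output_size((1440, 1920), (756, 756), keep_aspect_ratio=True, ensure_multiple_of=14))
print(resize_output_size((1440, 1920), (756, 756), keep_aspect_ratio=False, ensure_multiple_of=14))
```

Setting ensure_multiple_of to the backbone's patch_size keeps both output dimensions divisible into whole patches.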

post_process_depth_estimation

( outputs: DepthEstimatorOutput target_sizes: typing.Union[transformers.utils.generic.TensorType, typing.List[typing.Tuple[int, int]], NoneType] = None ) → List[Dict[str, TensorType]]

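Parameters

outputs (DepthEstimatorOutput) — Raw outputs of the model.
target_sizes (TensorType or List[Tuple[int, int]], optional) — Tensor of shape (batch_size, 2) or list of tuples (Tuple[int, int]) containing the target size (height, width) of each image in the batch. If left unset, predictions are not resized.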

Returns

List[Dict[str, TensorType]]

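A list of dictionaries of tensors representing the processed depth predictions.

Converts the raw output of DepthEstimatorOutput into final depth predictions and depth PIL images. Only supports PyTorch.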
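Since target_sizes expects (height, width) pairs while PIL's Image.size attribute returns (width, height), the order needs to be swapped when building the list from PIL images. A minimal sketch, with the hypothetical helper name `target_sizes_from_pil_sizes`:

```python
def target_sizes_from_pil_sizes(pil_sizes):
    """Turn PIL-style (width, height) pairs into (height, width) pairs.

    Hypothetical helper for illustration; not part of the Transformers API.
    """
    return [(height, width) for (width, height) in pil_sizes]


# Sizes as reported by image.size for two images in a batch.
print(target_sizes_from_pil_sizes([(1920, 1440), (640, 480)]))
```

The examples on this page avoid the swap by reading image.height and image.width directly.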
