DepthPro

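Overview

The DepthPro model was proposed in Depth Pro: Sharp Monocular Metric Depth in Less Than a Second by Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun.

DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.

The abstract from the paper is the following:

We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions.

DepthPro Outputs. Taken from the official code.

This model was contributed by geetu040. The original code can be found in the official repository (see Resources).

Usage Tips

The DepthPro model processes an input image by first downsampling it at multiple scales and splitting each scaled version into patches. These patches are then encoded using a shared Vision Transformer (ViT)-based Dinov2 patch encoder, while the full image is processed by a separate image encoder. The extracted patch features are merged into feature maps, upsampled, and fused using a DPT-like decoder to generate the final depth estimation. If enabled, an additional Field of View (FOV) encoder processes the image to estimate the camera's field of view, which helps improve depth accuracy.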

>>> import requests
>>> from PIL import Image
>>> import torch
>>> from transformers import DepthProImageProcessorFast, DepthProForDepthEstimation

>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = DepthProImageProcessorFast.from_pretrained("apple/DepthPro-hf")
>>> model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf").to(device)

>>> inputs = image_processor(images=image, return_tensors="pt").to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs, target_sizes=[(image.height, image.width)],
... )

>>> field_of_view = post_processed_output[0]["field_of_view"]
>>> focal_length = post_processed_output[0]["focal_length"]
>>> depth = post_processed_output[0]["predicted_depth"]
>>> depth = (depth - depth.min()) / (depth.max() - depth.min())
>>> depth = depth * 255.
>>> depth = depth.detach().cpu().numpy()
>>> depth = Image.fromarray(depth.astype("uint8"))
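
The normalization at the end of the example above can be sketched without any tensor library. This is a minimal illustration under stated assumptions, not the library's code: the helper name `to_uint8_depth` is hypothetical and the depth values are made up.

```python
def to_uint8_depth(depth_values):
    """Min-max normalize a flat list of depth values to integers in 0..255."""
    lo, hi = min(depth_values), max(depth_values)
    span = hi - lo
    if span == 0:
        # A constant depth map has no contrast; map every value to 0.
        return [0 for _ in depth_values]
    # int() truncates toward zero, as astype("uint8") does for in-range values.
    return [int((d - lo) / span * 255) for d in depth_values]


# Hypothetical depths in meters: nearest maps to 0, farthest to 255.
print(to_uint8_depth([1.0, 2.0, 3.0]))
```

Dividing by the range rather than by the maximum alone makes the farthest pixel land on exactly 255.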

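Architecture and Configuration

DepthPro architecture. Taken from the original paper.

The DepthProForDepthEstimation model uses a DepthProEncoder to encode the input image and a FeatureFusionStage to fuse the output features from the encoder.

The DepthProEncoder further uses two encoders:

patch_encoder
- The input image is scaled with multiple ratios, as specified in the scaled_images_ratios configuration.
- Each scaled image is split into smaller patches of size patch_size, with the overlap determined by scaled_images_overlap_ratios.
- These patches are processed by the patch_encoder.
image_encoder
- The input image is also rescaled to patch_size and processed by the image_encoder.

Both encoders can be configured via patch_model_config and image_model_config respectively; each defaults to a separate Dinov2Model.

Outputs from both encoders (last_hidden_state) and selected intermediate states (hidden_states) from patch_encoder are fused by a DPT-based FeatureFusionStage for depth estimation.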
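
The multi-scale patching described above can be checked with simple sliding-window arithmetic. The sketch below assumes a square 1536-pixel input (the image processor's default size) and assumes that patches are laid out with a stride of `patch_size * (1 - overlap_ratio)`; the function names are hypothetical and the library's actual tiling code may differ.

```python
def patches_per_side(image_size, patch_size, overlap_ratio):
    """Count sliding-window positions along one side of a square image."""
    if image_size <= patch_size:
        return 1
    stride = int(patch_size * (1 - overlap_ratio))
    # Ceiling division so the last window still reaches the image border.
    return -(-(image_size - patch_size) // stride) + 1


def total_patches(base_size, patch_size, ratios, overlaps):
    """Sum the patch counts over every scaled copy of the image."""
    total = 0
    for ratio, overlap in zip(ratios, overlaps):
        scaled_size = int(base_size * ratio)
        per_side = patches_per_side(scaled_size, patch_size, overlap)
        total += per_side * per_side
    return total


# Defaults from DepthProConfig: 1 + 9 + 25 patches across the three scales.
print(total_patches(1536, 384, [0.25, 0.5, 1], [0.0, 0.5, 0.25]))
```

Under these assumptions the total is 35, which agrees with the second dimension of the `last_hidden_state` shape printed in the DepthProModel example.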

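Field-of-View (FOV) Prediction

The network is supplemented with a focal length estimation head. A small convolutional head ingests frozen features from the depth estimation network and task-specific features from a separate ViT image encoder to predict the horizontal angular field of view.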
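
A horizontal field of view and a focal length in pixels are related through the standard pinhole camera model. The sketch below shows that relation only; whether the post-processor computes focal length in exactly this way is an assumption, and `focal_length_px` is a hypothetical helper name.

```python
import math


def focal_length_px(fov_degrees, image_width):
    """Pinhole-model focal length (in pixels) from a horizontal FOV in degrees."""
    half_fov = 0.5 * math.radians(fov_degrees)
    return 0.5 * image_width / math.tan(half_fov)


# A 90-degree horizontal FOV on a 640-pixel-wide image gives roughly 320 px.
print(focal_length_px(90.0, 640))
```

A narrower field of view yields a longer focal length for the same image width.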

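The use_fov_model parameter in DepthProConfig controls whether FOV prediction is enabled. By default, it is set to False to conserve memory and computation. When enabled, the FOV encoder is instantiated based on the fov_model_config parameter, which defaults to a Dinov2Model. The use_fov_model parameter can also be passed when initializing the DepthProForDepthEstimation model.

The pretrained model at checkpoint apple/DepthPro-hf uses the FOV encoder. To use the pretrained model without the FOV encoder, set use_fov_model=False when loading the model, which saves computation.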

>>> from transformers import DepthProForDepthEstimation
>>> model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf", use_fov_model=False)

To instantiate a new model with the FOV encoder, set use_fov_model=True in the config.

>>> from transformers import DepthProConfig, DepthProForDepthEstimation
>>> config = DepthProConfig(use_fov_model=True)
>>> model = DepthProForDepthEstimation(config)

Alternatively, set use_fov_model=True when initializing the model, which overrides the value in the config.

>>> from transformers import DepthProConfig, DepthProForDepthEstimation
>>> config = DepthProConfig()
>>> model = DepthProForDepthEstimation(config, use_fov_model=True)

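Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.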
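
For reference, the quantity that scaled dot-product attention computes is softmax(QK^T / sqrt(d)) V. The pure-Python sketch below illustrates that formula on small lists; it is not how PyTorch implements the operator, which dispatches to fused kernels, and it omits masking and dropout.

```python
import math


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d)) V for lists of float vectors."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# A zero query attends equally to both keys, so the output averages the values.
print(scaled_dot_product_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```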

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA.

import torch
from transformers import DepthProForDepthEstimation
model = DepthProForDepthEstimation.from_pretrained("apple/DepthPro-hf", attn_implementation="sdpa", torch_dtype=torch.float16)

For the best speedups, we recommend loading the model in half precision (e.g. torch.float16 or torch.bfloat16).

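On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with float32 and the google/vit-base-patch16-224 model, we saw the following speedups during inference.

Batch size	Average inference time (ms), eager mode	Average inference time (ms), sdpa mode	Speedup, eager time / sdpa time (x)
1	7	6	1.17
2	8	6	1.33
4	8	6	1.33
8	8	6	1.33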
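
The speedup column is the eager-mode time divided by the SDPA time, rounded to two decimals. A short check using only the timings from the table above (the variable and function names are illustrative):

```python
# Average inference times in milliseconds, keyed by batch size (from the table).
eager_ms = {1: 7, 2: 8, 4: 8, 8: 8}
sdpa_ms = {1: 6, 2: 6, 4: 6, 8: 6}


def speedup(batch_size):
    """Ratio of eager time to SDPA time; values above 1 mean SDPA is faster."""
    return round(eager_ms[batch_size] / sdpa_ms[batch_size], 2)


for batch_size in eager_ms:
    print(batch_size, speedup(batch_size))
```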

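Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DepthPro.

Research Paper: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Official Implementation: apple/ml-depth-pro
DepthPro Inference Notebook: DepthPro Inference
DepthPro for Super Resolution and Image Segmentation
- Read the blog on Medium: Depth Pro: Beyond Depth
- Code on GitHub: geetu040/depthpro-beyond-depth

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.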

DepthProConfig

class transformers.DepthProConfig

( fusion_hidden_size = 256 patch_size = 384 initializer_range = 0.02 intermediate_hook_ids = [11, 5] intermediate_feature_dims = [256, 256] scaled_images_ratios = [0.25, 0.5, 1] scaled_images_overlap_ratios = [0.0, 0.5, 0.25] scaled_images_feature_dims = [1024, 1024, 512] merge_padding_value = 3 use_batch_norm_in_fusion_residual = False use_bias_in_fusion_residual = True use_fov_model = False num_fov_head_layers = 2 image_model_config = None patch_model_config = None fov_model_config = None **kwargs )

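Parameters

fusion_hidden_size (int, optional, defaults to 256) — The number of channels before fusion.
patch_size (int, optional, defaults to 384) — The size (resolution) of each patch. This is also the image_size for the backbone model.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
intermediate_hook_ids (List[int], optional, defaults to [11, 5]) — Indices of the intermediate hidden states from the patch encoder to use for fusion.
intermediate_feature_dims (List[int], optional, defaults to [256, 256]) — Hidden state dimensions during upsampling for each intermediate hidden state in intermediate_hook_ids.
scaled_images_ratios (List[float], optional, defaults to [0.25, 0.5, 1]) — Ratios of scaled images to be used by the patch encoder.
scaled_images_overlap_ratios (List[float], optional, defaults to [0.0, 0.5, 0.25]) — Overlap ratios between patches for each scaled image in scaled_images_ratios.
scaled_images_feature_dims (List[int], optional, defaults to [1024, 1024, 512]) — Hidden state dimensions during upsampling for each scaled image in scaled_images_ratios.
merge_padding_value (int, optional, defaults to 3) — When merging smaller patches back to the image size, overlapping sections of this size are removed.
use_batch_norm_in_fusion_residual (bool, optional, defaults to False) — Whether to use batch normalization in the pre-activate residual units of the fusion blocks.
use_bias_in_fusion_residual (bool, optional, defaults to True) — Whether to use bias in the pre-activate residual units of the fusion blocks.
use_fov_model (bool, optional, defaults to False) — Whether to use DepthProFovModel to generate the field of view.
num_fov_head_layers (int, optional, defaults to 2) — Number of convolution layers in the head of DepthProFovModel.
image_model_config (Union[Dict[str, Any], PretrainedConfig], optional) — The configuration of the image encoder model, which is loaded using the AutoModel API. By default, a Dinov2 model is used as the backbone.
patch_model_config (Union[Dict[str, Any], PretrainedConfig], optional) — The configuration of the patch encoder model, which is loaded using the AutoModel API. By default, a Dinov2 model is used as the backbone.
fov_model_config (Union[Dict[str, Any], PretrainedConfig], optional) — The configuration of the FOV encoder model, which is loaded using the AutoModel API. By default, a Dinov2 model is used as the backbone.

This is the configuration class to store the configuration of a DepthProModel. It is used to instantiate a DepthPro model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the DepthPro apple/DepthPro architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example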

>>> from transformers import DepthProConfig, DepthProModel

>>> # Initializing a DepthPro apple/DepthPro style configuration
>>> configuration = DepthProConfig()

>>> # Initializing a model (with random weights) from the apple/DepthPro style configuration
>>> model = DepthProModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

DepthProImageProcessor

class transformers.DepthProImageProcessor

( do_resize: bool = True size: typing.Optional[typing.Dict[str, int]] = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None **kwargs )

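Parameters

do_resize (bool, optional, defaults to True) — Whether to resize the image's (height, width) dimensions to the specified (size["height"], size["width"]). Can be overridden by the do_resize parameter in the preprocess method.
size (dict, optional, defaults to {"height": 1536, "width": 1536}) — Size of the output image after resizing. Can be overridden by the size parameter in the preprocess method.
resample (PILImageResampling, optional, defaults to Resampling.BILINEAR) — Resampling filter to use if resizing the image. Can be overridden by the resample parameter in the preprocess method.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by the rescale_factor parameter in the preprocess method.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or List[float], optional, defaults to IMAGENET_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or List[float], optional, defaults to IMAGENET_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.

Constructs a DepthPro image processor.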
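
The rescale and normalize steps listed above amount to simple per-channel arithmetic. The sketch below applies them to a single RGB pixel. It assumes a mean and standard deviation of 0.5 per channel, which is what IMAGENET_STANDARD_MEAN and IMAGENET_STANDARD_STD are understood to be in Transformers; verify the values in your installed version. The helper name is hypothetical.

```python
RESCALE_FACTOR = 1 / 255          # default rescale_factor from the signature
IMAGE_MEAN = [0.5, 0.5, 0.5]      # assumed value of IMAGENET_STANDARD_MEAN
IMAGE_STD = [0.5, 0.5, 0.5]       # assumed value of IMAGENET_STANDARD_STD


def preprocess_pixel(rgb):
    """Rescale 0..255 channel values to 0..1, then normalize per channel."""
    rescaled = [channel * RESCALE_FACTOR for channel in rgb]
    return [(c - mean) / std for c, mean, std in zip(rescaled, IMAGE_MEAN, IMAGE_STD)]


# With these assumed statistics, 0 maps to about -1 and 255 to about +1.
print(preprocess_pixel([0, 128, 255]))
```

If the input is already in the 0 to 1 range, the rescale step should be skipped (do_rescale=False), otherwise the values are scaled down twice.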

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_resize: typing.Optional[bool] = None size: typing.Optional[typing.Dict[str, int]] = None resample: typing.Optional[PIL.Image.Resampling] = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Union[str, transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

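Parameters

images (ImageInput) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (Dict[str, int], optional, defaults to self.size) — Dictionary in the format {"height": h, "width": w} specifying the size of the output image after resizing.
resample (PILImageResampling filter, optional, defaults to self.resample) — PILImageResampling filter to use if resizing the image, e.g. PILImageResampling.BILINEAR. Only has an effect if do_resize is set to True.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image values to the range [0, 1].
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or List[float], optional, defaults to self.image_mean) — Image mean to use if do_normalize is set to True.
image_std (float or List[float], optional, defaults to self.image_std) — Image standard deviation to use if do_normalize is set to True.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocesses an image or a batch of images.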
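
The two channel layouts named in data_format and input_data_format differ only in axis order. The following sketch converts a tiny nested-list image from channels-last to channels-first; it illustrates the layouts only, and the function name is hypothetical rather than part of the library.

```python
def channels_last_to_first(image):
    """Convert a nested list shaped (height, width, channels) to (channels, height, width)."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# A 1x2 RGB image: two pixels, each with three channel values.
hwc_image = [[[1, 2, 3], [4, 5, 6]]]
print(channels_last_to_first(hwc_image))
```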

post_process_depth_estimation

( outputs: DepthProDepthEstimatorOutput target_sizes: typing.Union[transformers.utils.generic.TensorType, typing.List[typing.Tuple[int, int]], NoneType] = None ) → List[Dict[str, TensorType]]

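Parameters

outputs (DepthProDepthEstimatorOutput) — Raw outputs of the model.
target_sizes (Optional[Union[TensorType, List[Tuple[int, int]], None]], optional, defaults to None) — Target sizes to resize the depth predictions to. Can be a tensor of shape (batch_size, 2) or a list of (height, width) tuples, one per image in the batch. If None, no resizing is performed.

Returns

List[Dict[str, TensorType]]

A list of dictionaries of tensors representing the processed depth predictions and, if field_of_view is given in outputs, the field of view (in degrees) and the focal length (in pixels).

Raises

ValueError — If the lengths of predicted_depths, fovs, or target_sizes are mismatched.

Post-processes the raw depth predictions from the model to generate final depth predictions, which are calibrated using the field of view if provided and resized to the specified target sizes if provided.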

DepthProImageProcessorFast

class transformers.DepthProImageProcessorFast

( **kwargs: typing_extensions.Unpack[transformers.image_processing_utils_fast.DefaultFastImageProcessorKwargs] )

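Parameters

do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by the do_resize parameter in the preprocess method.
size (dict, optional, defaults to self.size) — Size of the output image after resizing. Can be overridden by the size parameter in the preprocess method.
default_to_square (bool, optional, defaults to self.default_to_square) — Whether to default to a square image when resizing, if size is an int.
resample (PILImageResampling, optional, defaults to self.resample) — Resampling filter to use if resizing the image. Only has an effect if do_resize is set to True. Can be overridden by the resample parameter in the preprocess method.
do_center_crop (bool, optional, defaults to self.do_center_crop) — Whether to center crop the image to the specified crop_size. Can be overridden by do_center_crop in the preprocess method.
crop_size (Dict[str, int], optional, defaults to self.crop_size) — Size of the output image after applying center_crop. Can be overridden by crop_size in the preprocess method.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.
rescale_factor (int or float, optional, defaults to self.rescale_factor) — Scale factor to use if rescaling the image. Only has an effect if do_rescale is set to True. Can be overridden by the rescale_factor parameter in the preprocess method.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or List[float], optional, defaults to self.image_mean) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or List[float], optional, defaults to self.image_std) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the image to RGB.
return_tensors (str or TensorType, optional, defaults to self.return_tensors) — Returns stacked tensors if set to `pt`, otherwise returns a list of tensors.
data_format (ChannelDimension or str, optional, defaults to self.data_format) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
input_data_format (ChannelDimension or str, optional, defaults to self.input_data_format) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.
device (torch.device, optional, defaults to self.device) — The device to process the images on. If unset, the device is inferred from the input images.

Constructs a fast DepthPro image processor.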

preprocess

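Parameters

images (ImageInput) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (Dict[str, int], optional, defaults to self.size) — Describes the maximum input dimensions to the model.
resample (PILImageResampling or InterpolationMode, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_center_crop (bool, optional, defaults to self.do_center_crop) — Whether to center crop the image.
crop_size (Dict[str, int], optional, defaults to self.crop_size) — Size of the output image after applying center_crop.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image.
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or List[float], optional, defaults to self.image_mean) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (float or List[float], optional, defaults to self.image_std) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the image to RGB.
return_tensors (str or TensorType, optional, defaults to self.return_tensors) — Returns stacked tensors if set to `pt`, otherwise returns a list of tensors.
data_format (ChannelDimension or str, optional, defaults to self.data_format) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
input_data_format (ChannelDimension or str, optional, defaults to self.input_data_format) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.
device (torch.device, optional, defaults to self.device) — The device to process the images on. If unset, the device is inferred from the input images.

Preprocesses an image or a batch of images.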

post_process_depth_estimation

( outputs: DepthProDepthEstimatorOutput target_sizes: typing.Union[transformers.utils.generic.TensorType, typing.List[typing.Tuple[int, int]], NoneType] = None ) → List[Dict[str, TensorType]]

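Parameters

outputs (DepthProDepthEstimatorOutput) — Raw outputs of the model.
target_sizes (Optional[Union[TensorType, List[Tuple[int, int]], None]], optional, defaults to None) — Target sizes to resize the depth predictions to. Can be a tensor of shape (batch_size, 2) or a list of (height, width) tuples, one per image in the batch. If None, no resizing is performed.

Returns

List[Dict[str, TensorType]]

A list of dictionaries of tensors representing the processed depth predictions and, if field_of_view is given in outputs, the field of view (in degrees) and the focal length (in pixels).

Raises

ValueError — If the lengths of predicted_depths, fovs, or target_sizes are mismatched.

Post-processes the raw depth predictions from the model to generate final depth predictions, which are calibrated using the field of view if provided and resized to the specified target sizes if provided.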

DepthProModel

class transformers.DepthProModel

( config )

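Parameters

config (DepthProConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare DepthPro Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.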

forward

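( pixel_values: FloatTensor head_mask: typing.Optional[torch.FloatTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See DPTImageProcessor.__call__() for details.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (DepthProConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The DepthProModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example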

>>> import torch
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, DepthProModel

>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> checkpoint = "apple/DepthPro-hf"
>>> processor = AutoProcessor.from_pretrained(checkpoint)
>>> model = DepthProModel.from_pretrained(checkpoint)

>>> # prepare image for the model
>>> inputs = processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
...     output = model(**inputs)

>>> output.last_hidden_state.shape
torch.Size([1, 35, 577, 1024])
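
The third dimension of this shape, 577, is consistent with each 384x384 patch being split into 16x16-pixel tokens plus one extra class token. Both the token size of 16 and the single prefix token are assumptions about the backbone configuration, not something this page states. A quick arithmetic check with a hypothetical helper:

```python
def tokens_per_patch(patch_size, vit_patch_size, num_prefix_tokens=1):
    """Sequence length a ViT produces for one square input patch."""
    tokens_per_side = patch_size // vit_patch_size
    return tokens_per_side * tokens_per_side + num_prefix_tokens


# 384 / 16 = 24 tokens per side, so 24 * 24 + 1 tokens in total.
print(tokens_per_patch(384, 16))
```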

DepthProForDepthEstimation

class transformers.DepthProForDepthEstimation

( config use_fov_model = None )

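Parameters

config (DepthProConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
use_fov_model (bool, optional, defaults to None) — Whether to use DepthProFovModel to generate the field of view. When passed, it overrides the use_fov_model value in the config.

DepthPro Model with a depth estimation head on top (consisting of 3 convolutional layers).

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.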

forward

( pixel_values: FloatTensor head_mask: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.models.depth_pro.modeling_depth_pro.DepthProDepthEstimatorOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See DPTImageProcessor.__call__() for details.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, height, width), optional) — Ground truth depth estimation maps for computing the loss.

Returns

transformers.models.depth_pro.modeling_depth_pro.DepthProDepthEstimatorOutput or tuple(torch.FloatTensor)

A transformers.models.depth_pro.modeling_depth_pro.DepthProDepthEstimatorOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (DepthProConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
predicted_depth (torch.FloatTensor of shape (batch_size, height, width)) — Predicted depth for each pixel.
field_of_view (torch.FloatTensor of shape (batch_size,), optional, returned when use_fov_model is provided) — Field of view scaler.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, n_patches_per_batch, sequence_length, hidden_size). Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, n_patches_per_batch, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The DepthProForDepthEstimation forward method overrides the __call__ special method.

Example

>>> from transformers import AutoImageProcessor, DepthProForDepthEstimation
>>> import torch
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> checkpoint = "apple/DepthPro-hf"
>>> processor = AutoImageProcessor.from_pretrained(checkpoint)
>>> model = DepthProForDepthEstimation.from_pretrained(checkpoint)

>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> model.to(device)

>>> # prepare image for the model
>>> inputs = processor(images=image, return_tensors="pt").to(device)

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # interpolate to original size
>>> post_processed_output = processor.post_process_depth_estimation(
...     outputs, target_sizes=[(image.height, image.width)],
... )

>>> # get the field of view (fov) predictions
>>> field_of_view = post_processed_output[0]["field_of_view"]
>>> focal_length = post_processed_output[0]["focal_length"]

>>> # visualize the prediction
>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = predicted_depth * 255 / predicted_depth.max()
>>> depth = depth.detach().cpu().numpy()
>>> depth = Image.fromarray(depth.astype("uint8"))