YOLOS

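Overview

The YOLOS model was proposed in You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu. YOLOS proposes to use just the plain Vision Transformer (ViT) for object detection, inspired by DETR. It turns out that a base-sized encoder-only Transformer can also reach 42 AP on COCO, similar to DETR and to more complex frameworks such as Faster R-CNN.

The abstract from the paper is the following:

Can Transformer perform 2D object- and region-level recognition from a pure sequence-to-sequence perspective with minimal knowledge about the 2D spatial structure? To answer this question, we present You Only Look at One Sequence (YOLOS), a series of object detection models based on the vanilla Vision Transformer with the fewest possible modifications, region priors, as well as inductive biases of the target task. We find that YOLOS pre-trained on the mid-sized ImageNet-1k dataset only can already achieve quite competitive performance on the challenging COCO object detection benchmark, e.g., YOLOS-Base directly adopted from BERT-Base architecture can obtain 42.0 box AP on COCO val. We also discuss the impacts as well as limitations of current pre-train schemes and model scaling strategies for Transformer in vision through YOLOS.

YOLOS architecture. Taken from the original paper.

This model was contributed by nielsr. The original code can be found here.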

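Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.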
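
As a rough illustration of the quantity this operator computes, the sketch below implements single-head scaled dot-product attention, softmax(Q K^T / sqrt(d)) V, in plain Python. It is not PyTorch's implementation, which dispatches to fused kernels; the helper name scaled_dot_product_attention_reference and the list-of-lists layout are assumptions chosen for readability, and masking and dropout are left out.

```python
import math


def scaled_dot_product_attention_reference(query, key, value):
    """Single-head attention on lists of vectors: softmax(Q K^T / sqrt(d)) V.

    Hypothetical reference helper for illustration only (no mask, no dropout).
    """
    d = len(query[0])
    output = []
    for q in query:
        # Scaled similarity between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in key]
        # Numerically stable softmax over the keys.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, value)) for j in range(len(value[0]))]
        )
    return output


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
# A zero query scores every key equally, so the result is the mean of the values.
print(scaled_dot_product_attention_reference([[0.0, 0.0]], keys, values))
```

The eager and SDPA code paths are expected to agree on this result up to numerical precision; they differ in how efficiently it is computed.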

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA.

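import torch
from transformers import AutoModelForObjectDetection

# Request the SDPA attention implementation explicitly and load the weights in half precision.
model = AutoModelForObjectDetection.from_pretrained("hustvl/yolos-base", attn_implementation="sdpa", torch_dtype=torch.float16)
...

For the best speedups, we recommend loading the model in half precision (e.g. torch.float16 or torch.bfloat16).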

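On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with float32 and the hustvl/yolos-base model, we saw the following speedups during inference.

Batch size	Average inference time (ms), eager mode	Average inference time (ms), sdpa mode	Speedup, eager / sdpa (x)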
1	106	76	1.39
2	154	90	1.71
4	222	116	1.91
8	368	168	2.19
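
The speedup column can be reproduced from the two timing columns as the eager time divided by the SDPA time. A small check, with the rows copied from the table above (the variable and function names are only illustrative):

```python
# (batch size, eager ms, sdpa ms) rows copied from the benchmark table.
rows = [(1, 106, 76), (2, 154, 90), (4, 222, 116), (8, 368, 168)]


def speedups(rows):
    """Return eager_time / sdpa_time for each row, rounded to two decimals."""
    return [round(eager / sdpa, 2) for _, eager, sdpa in rows]


for (batch, _, _), ratio in zip(rows, speedups(rows)):
    print(f"batch size {batch}: {ratio:.2f}x")
```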

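Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with YOLOS.

Object Detection

All example notebooks illustrating inference + fine-tuning YolosForObjectDetection on a custom dataset can be found here.
Scripts for fine-tuning YolosForObjectDetection with Trainer or Accelerate can be found here.
See also: Object detection task guide

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

Use YolosImageProcessor to prepare images (and optional targets) for the model. Contrary to DETR, YOLOS doesn't require a pixel_mask to be created.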

YolosConfig

class transformers.YolosConfig

( hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.0 attention_probs_dropout_prob = 0.0 initializer_range = 0.02 layer_norm_eps = 1e-12 image_size = [512, 864] patch_size = 16 num_channels = 3 qkv_bias = True num_detection_tokens = 100 use_mid_position_embeddings = True auxiliary_loss = False class_cost = 1 bbox_cost = 5 giou_cost = 2 bbox_loss_coefficient = 5 giou_loss_coefficient = 2 eos_coefficient = 0.1 **kwargs )

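Parameters

hidden_size (int, optional, defaults to 768): Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12): Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072): Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (str or function, optional, defaults to "gelu"): The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "selu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.0): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.0): The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-12): The epsilon used by the layer normalization layers.
image_size (List[int], optional, defaults to [512, 864]): The size (resolution) of each image.
patch_size (int, optional, defaults to 16): The size (resolution) of each patch.
num_channels (int, optional, defaults to 3): The number of input channels.
qkv_bias (bool, optional, defaults to True): Whether to add a bias to the queries, keys and values.
num_detection_tokens (int, optional, defaults to 100): The number of detection tokens.
use_mid_position_embeddings (bool, optional, defaults to True): Whether to use the mid-layer position encodings.
auxiliary_loss (bool, optional, defaults to False): Whether auxiliary decoding losses (a loss at each decoder layer) are to be used.
class_cost (float, optional, defaults to 1): Relative weight of the classification error in the Hungarian matching cost.
bbox_cost (float, optional, defaults to 5): Relative weight of the L1 error of the bounding box coordinates in the Hungarian matching cost.
giou_cost (float, optional, defaults to 2): Relative weight of the generalized IoU loss of the bounding box in the Hungarian matching cost.
bbox_loss_coefficient (float, optional, defaults to 5): Relative weight of the L1 bounding box loss in the object detection loss.
giou_loss_coefficient (float, optional, defaults to 2): Relative weight of the generalized IoU loss in the object detection loss.
eos_coefficient (float, optional, defaults to 0.1): Relative classification weight of the 'no-object' class in the object detection loss.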
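
To make the roles of class_cost, bbox_cost and giou_cost concrete, here is a minimal sketch of how a DETR-style Hungarian matching cost could combine them for a single (prediction, ground-truth) pair. It assumes boxes in normalized (center_x, center_y, width, height) format with positive area and uses the negative predicted probability of the target class as the classification term. The function names are hypothetical; the library's matcher computes this for all pairs at once on tensors.

```python
def cxcywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) to corner format."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def generalized_iou(a, b):
    """Generalized IoU of two corner-format boxes (assumes positive areas)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest box enclosing both inputs.
    enclose = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (enclose - union) / enclose


def matching_cost(target_class_prob, pred_box, target_box,
                  class_cost=1.0, bbox_cost=5.0, giou_cost=2.0):
    """Weighted matching cost for one prediction/target pair (lower is better)."""
    l1 = sum(abs(p - t) for p, t in zip(pred_box, target_box))
    giou = generalized_iou(cxcywh_to_xyxy(pred_box), cxcywh_to_xyxy(target_box))
    return class_cost * (-target_class_prob) + bbox_cost * l1 + giou_cost * (-giou)


target = (0.5, 0.5, 0.2, 0.2)
print(matching_cost(0.9, (0.52, 0.5, 0.2, 0.2), target))  # close prediction
print(matching_cost(0.9, (0.2, 0.2, 0.1, 0.1), target))   # distant prediction, higher cost
```

With the default weights, a perfect prediction (probability 1, identical box) gets the minimum cost of -3 in this sketch.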

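This is the configuration class to store the configuration of a YolosModel. It is used to instantiate a YOLOS model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the YOLOS hustvl/yolos-base architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.

Example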

>>> from transformers import YolosConfig, YolosModel

>>> # Initializing a YOLOS hustvl/yolos-base style configuration
>>> configuration = YolosConfig()

>>> # Initializing a model (with random weights) from the hustvl/yolos-base style configuration
>>> model = YolosModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

YolosImageProcessor

class transformers.YolosImageProcessor

( format: typing.Union[str, transformers.image_utils.AnnotationFormat] = <AnnotationFormat.COCO_DETECTION: 'coco_detection'> do_resize: bool = True size: typing.Dict[str, int] = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float]] = None image_std: typing.Union[float, typing.List[float]] = None do_convert_annotations: typing.Optional[bool] = None do_pad: bool = True pad_size: typing.Optional[typing.Dict[str, int]] = None **kwargs )

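Parameters

format (str, optional, defaults to "coco_detection"): Data format of the annotations. One of "coco_detection" or "coco_panoptic".
do_resize (bool, optional, defaults to True): Controls whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by the do_resize parameter in the preprocess method.
size (Dict[str, int], optional, defaults to {"shortest_edge": 800, "longest_edge": 1333}): Size of the image's (height, width) dimensions after resizing. Can be overridden by the size parameter in the preprocess method. Available options are:
- {"height": int, "width": int}: The image will be resized to the exact size (height, width). The aspect ratio is not preserved.
- {"shortest_edge": int, "longest_edge": int}: The image will be resized to the maximum size that preserves the aspect ratio while keeping the shortest edge less than or equal to shortest_edge and the longest edge less than or equal to longest_edge.
- {"max_height": int, "max_width": int}: The image will be resized to the maximum size that preserves the aspect ratio while keeping the height less than or equal to max_height and the width less than or equal to max_width.
resample (PILImageResampling, optional, defaults to PILImageResampling.BILINEAR): Resampling filter to use if resizing the image.
do_rescale (bool, optional, defaults to True): Controls whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255): Scale factor to use if rescaling the image. Can be overridden by the rescale_factor parameter in the preprocess method.
do_normalize (bool, optional, defaults to True): Controls whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or List[float], optional, defaults to IMAGENET_DEFAULT_MEAN): Mean values to use when normalizing the image. Can be a single value or a list of values, one for each channel. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or List[float], optional, defaults to IMAGENET_DEFAULT_STD): Standard deviation values to use when normalizing the image. Can be a single value or a list of values, one for each channel. Can be overridden by the image_std parameter in the preprocess method.
do_pad (bool, optional, defaults to True): Controls whether to pad the image. Can be overridden by the do_pad parameter in the preprocess method. If True, padding will be applied to the bottom and right of the image with zeros. If pad_size is provided, the image will be padded to the specified dimensions. Otherwise, the image will be padded to the maximum height and width of the batch.
pad_size (Dict[str, int], optional): The size {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch.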
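
The {"shortest_edge": int, "longest_edge": int} option can be sketched as follows. This is a simplified illustration with a hypothetical helper name: the floor rounding is an assumption, and the library may round differently or snap the result to a multiple of the patch size.

```python
def target_shape(height, width, shortest_edge=800, longest_edge=1333):
    """Aspect-preserving target (height, width) for the shortest/longest-edge option.

    Hypothetical helper: integer floor rounding is an assumption of this sketch.
    """
    short, long = min(height, width), max(height, width)
    new_short = shortest_edge
    new_long = long * shortest_edge // short
    if new_long > longest_edge:
        # The long side would overshoot, so cap it and shrink the short side instead.
        new_long = longest_edge
        new_short = short * longest_edge // long
    return (new_short, new_long) if height <= width else (new_long, new_short)


print(target_shape(480, 640))   # short side reaches shortest_edge
print(target_shape(500, 2000))  # long side is capped at longest_edge
```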

Constructs a YOLOS image processor.

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] annotations: typing.Union[dict[str, typing.Union[int, str, list[dict]]], typing.List[dict[str, typing.Union[int, str, list[dict]]]], NoneType] = None return_segmentation_masks: bool = None masks_path: typing.Union[str, pathlib.Path, NoneType] = None do_resize: typing.Optional[bool] = None size: typing.Optional[typing.Dict[str, int]] = None resample = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Union[int, float, NoneType] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_convert_annotations: typing.Optional[bool] = None do_pad: typing.Optional[bool] = None format: typing.Union[str, transformers.image_utils.AnnotationFormat, NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Union[str, transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None pad_size: typing.Optional[typing.Dict[str, int]] = None **kwargs )

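Parameters

images (ImageInput): Image or batch of images to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
annotations (AnnotationType or List[AnnotationType], optional): List of annotations associated with the image or batch of images.
If the annotation is for object detection, it should be a dictionary with the following keys:
- "image_id" (int): The image id.
- "annotations" (List[Dict]): List of annotations for an image. Each annotation should be a dictionary. An image can have no annotations, in which case the list should be empty.
If the annotation is for segmentation, it should be a dictionary with the following keys:
- "image_id" (int): The image id.
- "segments_info" (List[Dict]): List of segments for an image. Each segment should be a dictionary. An image can have no segments, in which case the list should be empty.
- "file_name" (str): The file name of the image.
return_segmentation_masks (bool, optional, defaults to self.return_segmentation_masks): Whether to return segmentation masks.
masks_path (str or pathlib.Path, optional): Path to the directory containing the segmentation masks.
do_resize (bool, optional, defaults to self.do_resize): Whether to resize the image.
size (Dict[str, int], optional, defaults to self.size): Size of the image's (height, width) dimensions after resizing. Available options are:
- {"height": int, "width": int}: The image will be resized to the exact size (height, width). The aspect ratio is not preserved.
- {"shortest_edge": int, "longest_edge": int}: The image will be resized to the maximum size that preserves the aspect ratio while keeping the shortest edge less than or equal to shortest_edge and the longest edge less than or equal to longest_edge.
- {"max_height": int, "max_width": int}: The image will be resized to the maximum size that preserves the aspect ratio while keeping the height less than or equal to max_height and the width less than or equal to max_width.
resample (PILImageResampling, optional, defaults to self.resample): Resampling filter to use when resizing the image.
do_rescale (bool, optional, defaults to self.do_rescale): Whether to rescale the image.
rescale_factor (float, optional, defaults to self.rescale_factor): Rescale factor to use when rescaling the image.
do_normalize (bool, optional, defaults to self.do_normalize): Whether to normalize the image.
image_mean (float or List[float], optional, defaults to self.image_mean): Mean to use when normalizing the image.
image_std (float or List[float], optional, defaults to self.image_std): Standard deviation to use when normalizing the image.
do_convert_annotations (bool, optional, defaults to self.do_convert_annotations): Whether to convert the annotations to the format expected by the model. Converts the bounding boxes from the (top_left_x, top_left_y, width, height) format to the (center_x, center_y, width, height) format, in relative coordinates.
do_pad (bool, optional, defaults to self.do_pad): Whether to pad the image. If True, padding will be applied to the bottom and right of the image with zeros. If pad_size is provided, the image will be padded to the specified dimensions. Otherwise, the image will be padded to the maximum height and width of the batch.
format (str or AnnotationFormat, optional, defaults to self.format): Format of the annotations.
return_tensors (str or TensorType, optional, defaults to self.return_tensors): Type of tensors to return. If None, a list of images is returned.
data_format (str or ChannelDimension, optional, defaults to self.data_format): The channel dimension format of the image. If not provided, it will be the same as that of the input image.
input_data_format (ChannelDimension or str, optional): The channel dimension format of the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.
pad_size (Dict[str, int], optional): The size {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch.

Preprocess an image or a batch of images so that it can be used by the model.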
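
As an illustration of the detection annotation layout and of the box conversion described for do_convert_annotations, here is a small pure-Python sketch. The helper name and the sample values are made up, and the keys inside each annotation ("bbox", "category_id", "area", "iscrowd") follow the COCO detection convention, which is an assumption here.

```python
def coco_box_to_relative_cxcywh(bbox, image_height, image_width):
    """Convert (top_left_x, top_left_y, width, height) in pixels to
    (center_x, center_y, width, height) relative to the image size."""
    x, y, w, h = bbox
    return (
        (x + w / 2) / image_width,
        (y + h / 2) / image_height,
        w / image_width,
        h / image_height,
    )


# Illustrative annotation for one image; all values are invented.
annotation = {
    "image_id": 1,
    "annotations": [
        {"bbox": [100, 50, 200, 100], "category_id": 3, "area": 20000, "iscrowd": 0},
    ],
}

for obj in annotation["annotations"]:
    print(coco_box_to_relative_cxcywh(obj["bbox"], image_height=400, image_width=800))
```

In practice the image processor performs this conversion itself when do_convert_annotations is enabled; the sketch only shows the arithmetic.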

pad

( images: typing.List[numpy.ndarray] annotations: typing.Optional[typing.List[typing.Dict[str, typing.Any]]] = None constant_values: typing.Union[float, typing.Iterable[float]] = 0 return_pixel_mask: bool = False return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Optional[transformers.image_utils.ChannelDimension] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None update_bboxes: bool = True pad_size: typing.Optional[typing.Dict[str, int]] = None )

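Parameters

images (List[np.ndarray]): Images to pad.
annotations (List[Dict[str, any]], optional): Annotations to pad along with the images. If provided, the bounding boxes will be updated to match the padded images.
constant_values (float or Iterable[float], optional): The value to use for the padding if mode is "constant".
return_pixel_mask (bool, optional, defaults to False): Whether to return a pixel mask.
return_tensors (str or TensorType, optional): The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (str or ChannelDimension, optional): The channel dimension format of the image. If not provided, it will be the same as that of the input image.
input_data_format (ChannelDimension or str, optional): The channel dimension format of the input image. If not provided, it will be inferred.
update_bboxes (bool, optional, defaults to True): Whether to update the bounding boxes in the annotations to match the padded images. Bounding boxes that have not been converted to relative coordinates and (centre_x, centre_y, width, height) format are not updated.
pad_size (Dict[str, int], optional): The size {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch.

Pads a batch of images at the bottom and right with zeros to the size of the largest height and width in the batch, and optionally returns the corresponding pixel masks.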
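
The padding behavior can be pictured with the following sketch, which pads single-channel images stored as nested lists. The helper name pad_batch is hypothetical; the real method works on multi-channel arrays and can also update annotations.

```python
def pad_batch(images, constant_value=0):
    """Pad 2D images (lists of rows) at the bottom and right to the batch maximum.

    Returns the padded images and, for each one, a mask holding 1 for real
    pixels and 0 for padding. Hypothetical single-channel helper.
    """
    max_h = max(len(img) for img in images)
    max_w = max(len(img[0]) for img in images)
    padded, masks = [], []
    for img in images:
        h, w = len(img), len(img[0])
        # Extend each row to the right, then append full padding rows at the bottom.
        rows = [row + [constant_value] * (max_w - w) for row in img]
        rows += [[constant_value] * max_w for _ in range(max_h - h)]
        mask = [[1 if (r < h and c < w) else 0 for c in range(max_w)] for r in range(max_h)]
        padded.append(rows)
        masks.append(mask)
    return padded, masks


wide = [[1, 2, 3], [4, 5, 6]]       # 2 x 3 image
tall = [[7, 8], [9, 10], [11, 12]]  # 3 x 2 image
padded, masks = pad_batch([wide, tall])
print(padded[0])
print(masks[1])
```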

post_process_object_detection

( outputs threshold: float = 0.5 target_sizes: typing.Union[transformers.utils.generic.TensorType, typing.List[typing.Tuple]] = None ) → List[Dict]

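Parameters

outputs (YolosObjectDetectionOutput): Raw outputs of the model.
threshold (float, optional, defaults to 0.5): Score threshold to keep object detection predictions.
target_sizes (torch.Tensor or List[Tuple[int, int]], optional): Tensor of shape (batch_size, 2) or list of tuples (Tuple[int, int]) containing the target size (height, width) of each image in the batch. If unset, predictions will not be resized.

Returns

List[Dict]

A list of dictionaries, each containing the scores, labels and boxes for an image in the batch as predicted by the model.

Converts the raw output of YolosForObjectDetection into final bounding boxes in (top_left_x, top_left_y, bottom_right_x, bottom_right_y) format. Only supports PyTorch.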
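
The box handling in this step can be sketched in plain Python: keep the predictions above the threshold and convert their boxes from normalized (center_x, center_y, width, height) to absolute corner coordinates. The helper below is hypothetical and takes per-query scores and labels as given, whereas the real method derives them from the logits and operates on tensors.

```python
def post_process_sketch(scores, labels, boxes, target_size, threshold=0.5):
    """Filter detections by score and rescale their boxes to pixel corners.

    Hypothetical helper: boxes are normalized (center_x, center_y, width, height)
    and target_size is (height, width).
    """
    height, width = target_size
    kept = {"scores": [], "labels": [], "boxes": []}
    for score, label, (cx, cy, w, h) in zip(scores, labels, boxes):
        if score <= threshold:
            continue
        kept["scores"].append(score)
        kept["labels"].append(label)
        kept["boxes"].append(
            [
                (cx - w / 2) * width,   # top_left_x
                (cy - h / 2) * height,  # top_left_y
                (cx + w / 2) * width,   # bottom_right_x
                (cy + h / 2) * height,  # bottom_right_y
            ]
        )
    return kept


result = post_process_sketch(
    scores=[0.95, 0.30],
    labels=[17, 75],
    boxes=[(0.5, 0.5, 0.5, 0.5), (0.2, 0.2, 0.1, 0.1)],
    target_size=(480, 640),
    threshold=0.9,
)
print(result)
```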

YolosFeatureExtractor

class transformers.YolosFeatureExtractor

( *args **kwargs )

__call__

( images **kwargs )

Preprocess an image or a batch of images.

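pad

Parameters

images (List[np.ndarray]): Images to pad.
annotations (List[Dict[str, any]], optional): Annotations to pad along with the images. If provided, the bounding boxes will be updated to match the padded images.
constant_values (float or Iterable[float], optional): The value to use for the padding if mode is "constant".
return_pixel_mask (bool, optional, defaults to False): Whether to return a pixel mask.
return_tensors (str or TensorType, optional): The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (str or ChannelDimension, optional): The channel dimension format of the image. If not provided, it will be the same as that of the input image.
input_data_format (ChannelDimension or str, optional): The channel dimension format of the input image. If not provided, it will be inferred.
update_bboxes (bool, optional, defaults to True): Whether to update the bounding boxes in the annotations to match the padded images. Bounding boxes that have not been converted to relative coordinates and (centre_x, centre_y, width, height) format are not updated.
pad_size (Dict[str, int], optional): The size {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch.

Pads a batch of images at the bottom and right with zeros to the size of the largest height and width in the batch, and optionally returns the corresponding pixel masks.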

post_process_object_detection

( outputs threshold: float = 0.5 target_sizes: typing.Union[transformers.utils.generic.TensorType, typing.List[typing.Tuple]] = None ) → List[Dict]

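Parameters

outputs (YolosObjectDetectionOutput): Raw outputs of the model.
threshold (float, optional, defaults to 0.5): Score threshold to keep object detection predictions.
target_sizes (torch.Tensor or List[Tuple[int, int]], optional): Tensor of shape (batch_size, 2) or list of tuples (Tuple[int, int]) containing the target size (height, width) of each image in the batch. If unset, predictions will not be resized.

Returns

List[Dict]

A list of dictionaries, each containing the scores, labels and boxes for an image in the batch as predicted by the model.

Converts the raw output of YolosForObjectDetection into final bounding boxes in (top_left_x, top_left_y, bottom_right_x, bottom_right_y) format. Only supports PyTorch.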

YolosModel

class transformers.YolosModel

( config: YolosConfig add_pooling_layer: bool = True )

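Parameters

config (YolosConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare YOLOS Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.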

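forward

( pixel_values: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Pixel values. Pixel values can be obtained using AutoImageProcessor. See YolosImageProcessor.__call__() for details.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional): Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (YolosConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)): Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. For example, for BERT-family models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The YolosModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example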

>>> from transformers import AutoImageProcessor, YolosModel
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("hustvl/yolos-small")
>>> model = YolosModel.from_pretrained("hustvl/yolos-small")

>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 3401, 384]
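
The sequence length of 3401 in this output can be accounted for with a short calculation. The sketch below assumes that the sequence consists of one [CLS] token, the image patches and the detection tokens, that hustvl/yolos-small uses a patch size of 16 and 100 detection tokens, and that the image processor resized the image to roughly 800 x 1066 pixels. None of these values are stated in the example itself, and the helper name is hypothetical.

```python
def yolos_sequence_length(height, width, patch_size=16, num_detection_tokens=100):
    """Number of tokens: one [CLS] token + image patches + detection tokens."""
    num_patches = (height // patch_size) * (width // patch_size)
    return 1 + num_patches + num_detection_tokens


# Assumed resized input of about 800 x 1066 pixels (50 x 66 patches).
print(yolos_sequence_length(800, 1066))
# Default YolosConfig image_size of [512, 864] (32 x 54 patches).
print(yolos_sequence_length(512, 864))
```

The last dimension of 384 would then be the hidden size of the small checkpoint.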

YolosForObjectDetection

class transformers.YolosForObjectDetection

( config: YolosConfig )

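Parameters

config (YolosConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

YOLOS Model (consisting of a ViT encoder) with object detection heads on top, for tasks such as COCO detection.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.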

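forward

( pixel_values: FloatTensor labels: typing.Optional[typing.List[typing.Dict]] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.models.yolos.modeling_yolos.YolosObjectDetectionOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Pixel values. Pixel values can be obtained using AutoImageProcessor. See YolosImageProcessor.__call__() for details.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional): Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.
labels (List[Dict] of len (batch_size,), optional): Labels for computing the bipartite matching loss. List of dicts, each containing at least the following 2 keys: 'class_labels' and 'boxes' (the class labels and bounding boxes of an image in the batch, respectively). The class labels themselves should be a torch.LongTensor of len (number of bounding boxes in the image,) and the boxes a torch.FloatTensor of shape (number of bounding boxes in the image, 4).

Returns

transformers.models.yolos.modeling_yolos.YolosObjectDetectionOutput or tuple(torch.FloatTensor)

A transformers.models.yolos.modeling_yolos.YolosObjectDetectionOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (YolosConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels are provided): Total loss as a linear combination of a negative log-likelihood (cross-entropy) for class prediction and a bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized scale-invariant IoU loss.
loss_dict (Dict, optional): A dictionary containing the individual losses. Useful for logging.
logits (torch.FloatTensor of shape (batch_size, num_queries, num_classes + 1)): Classification logits (including no-object) for all queries.
pred_boxes (torch.FloatTensor of shape (batch_size, num_queries, 4)): Normalized box coordinates for all queries, represented as (center_x, center_y, width, height). These values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding possible padding). You can use post_process_object_detection() to retrieve the unnormalized bounding boxes.
auxiliary_outputs (list[Dict], optional): Only returned when auxiliary losses are activated (i.e. config.auxiliary_loss is set to True) and labels are provided. It is a list of dictionaries containing the two above keys (logits and pred_boxes) for each decoder layer.
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Sequence of hidden-states at the output of the last layer of the decoder of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.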
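
The "linear combination" behind the total loss can be sketched as a weighted sum over the entries of loss_dict, with bbox_loss_coefficient and giou_loss_coefficient from YolosConfig as weights. The key names used below ("loss_ce", "loss_bbox", "loss_giou", "cardinality_error") and the sample values are assumptions for this sketch, and the function name is hypothetical.

```python
def total_detection_loss(loss_dict, bbox_loss_coefficient=5.0, giou_loss_coefficient=2.0):
    """Weighted sum of the individual loss terms.

    The key names are assumptions for this sketch; entries without a weight
    are kept for logging but do not contribute to the total.
    """
    weights = {
        "loss_ce": 1.0,
        "loss_bbox": bbox_loss_coefficient,
        "loss_giou": giou_loss_coefficient,
    }
    return sum(value * weights[name] for name, value in loss_dict.items() if name in weights)


example = {"loss_ce": 0.5, "loss_bbox": 0.1, "loss_giou": 0.2, "cardinality_error": 3.0}
print(total_detection_loss(example))
```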

The YolosForObjectDetection forward method overrides the __call__ special method.

Example

>>> from transformers import AutoImageProcessor, AutoModelForObjectDetection
>>> import torch
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("hustvl/yolos-tiny")
>>> model = AutoModelForObjectDetection.from_pretrained("hustvl/yolos-tiny")

>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)

>>> # convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
>>> target_sizes = torch.tensor([image.size[::-1]])
>>> results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[
...     0
... ]

>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
...     box = [round(i, 2) for i in box.tolist()]
...     print(
...         f"Detected {model.config.id2label[label.item()]} with confidence "
...         f"{round(score.item(), 3)} at location {box}"
...     )
Detected remote with confidence 0.991 at location [46.48, 72.78, 178.98, 119.3]
Detected remote with confidence 0.908 at location [336.48, 79.27, 368.23, 192.36]
Detected cat with confidence 0.934 at location [337.18, 18.06, 638.14, 373.09]
Detected cat with confidence 0.979 at location [10.93, 53.74, 313.41, 470.67]
Detected remote with confidence 0.974 at location [41.63, 72.23, 178.09, 119.99]