OWL-ViT

Overview

OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in Simple Open-Vocabulary Object Detection with Vision Transformers by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. OWL-ViT is an open-vocabulary object detection network trained on a variety of (image, text) pairs. It can be used to query an image with one or more text queries to search for and detect the target objects described in the text.

The abstract from the paper is the following:

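Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.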

OWL-ViT architecture. Taken from the original paper.

This model was contributed by adirik. The original code can be found here.

Usage tips

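OWL-ViT is a zero-shot text-conditioned object detection model. It uses CLIP as its multi-modal backbone, with a ViT-like Transformer to obtain visual features and a causal language model to obtain text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification head and box head to each Transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and then fine-tune it end-to-end, together with the classification and box heads, on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection.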
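The idea of swapping fixed classifier weights for text embeddings can be illustrated with a small, dependency-free sketch. It is not the library's implementation: the helper names `l2_normalize` and `open_vocab_logits` are hypothetical, the embeddings are toy 2-D vectors, and the sketch assumes the score between an output token and a text query is a plain cosine similarity, leaving out any learned projection or scaling the real classification head may apply.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def open_vocab_logits(token_embeds, query_embeds):
    """Return logits[i][j]: cosine similarity of output token i and text query j.

    The text query embeddings play the role that fixed class weights
    would play in a closed-vocabulary classifier (an assumption of this sketch).
    """
    tokens = [l2_normalize(t) for t in token_embeds]
    queries = [l2_normalize(q) for q in query_embeds]
    return [[sum(a * b for a, b in zip(t, q)) for q in queries] for t in tokens]


# Toy 2-D embeddings: two image output tokens and two text queries.
token_embeds = [[1.0, 0.0], [0.0, 2.0]]
query_embeds = [[3.0, 0.0], [0.0, 1.0]]

logits = open_vocab_logits(token_embeds, query_embeds)
# For each token, pick the index of the best-matching text query.
best_query = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
print(best_query)  # → [0, 1]
```

Because the "class weights" are just embeddings of whatever text is supplied at inference time, changing the set of detectable classes only requires encoding new queries, not retraining the head.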

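OwlViTImageProcessor can be used to resize (or rescale) and normalize images for the model, and CLIPTokenizer is used to encode the text. OwlViTProcessor wraps OwlViTImageProcessor and CLIPTokenizer into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using OwlViTProcessor and OwlViTForObjectDetection.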

>>> import requests
>>> from PIL import Image
>>> import torch

>>> from transformers import OwlViTProcessor, OwlViTForObjectDetection

>>> processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
>>> model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> text_labels = [["a photo of a cat", "a photo of a dog"]]
>>> inputs = processor(text=text_labels, images=image, return_tensors="pt")
>>> outputs = model(**inputs)

>>> # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
>>> target_sizes = torch.tensor([(image.height, image.width)])
>>> # Convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
>>> results = processor.post_process_grounded_object_detection(
...     outputs=outputs, target_sizes=target_sizes, threshold=0.1, text_labels=text_labels
... )
>>> # Retrieve predictions for the first image for the corresponding text queries
>>> result = results[0]
>>> boxes, scores, text_labels = result["boxes"], result["scores"], result["text_labels"]
>>> for box, score, text_label in zip(boxes, scores, text_labels):
...     box = [round(i, 2) for i in box.tolist()]
...     print(f"Detected {text_label} with confidence {round(score.item(), 3)} at location {box}")
Detected a photo of a cat with confidence 0.707 at location [324.97, 20.44, 640.58, 373.29]
Detected a photo of a cat with confidence 0.717 at location [1.46, 55.26, 315.55, 472.17]
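To make the post-processing step in the example above more concrete, here is a minimal pure-Python sketch of what such a step might do. It rests on assumptions that are not stated in this page: that raw boxes are normalized (cx, cy, w, h), that each output token is scored by applying a sigmoid to its highest query logit, and that predictions at or below the threshold are dropped. The names `center_to_corners` and `post_process_sketch` are hypothetical and are not part of the Transformers API.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def center_to_corners(box, image_height, image_width):
    """Convert a normalized (cx, cy, w, h) box to absolute (xmin, ymin, xmax, ymax)."""
    cx, cy, w, h = box
    return [
        (cx - w / 2) * image_width,
        (cy - h / 2) * image_height,
        (cx + w / 2) * image_width,
        (cy + h / 2) * image_height,
    ]


def post_process_sketch(logits, boxes, image_height, image_width, text_labels, threshold=0.1):
    """Keep predictions whose best query score exceeds the threshold."""
    detections = []
    for token_logits, box in zip(logits, boxes):
        # Pick the best-matching text query for this output token.
        best = max(range(len(token_logits)), key=lambda j: token_logits[j])
        score = sigmoid(token_logits[best])
        if score > threshold:
            detections.append(
                {
                    "box": center_to_corners(box, image_height, image_width),
                    "score": score,
                    "text_label": text_labels[best],
                }
            )
    return detections


# Toy inputs: two output tokens scored against two text queries.
detections = post_process_sketch(
    logits=[[2.0, -3.0], [-5.0, -4.0]],
    boxes=[[0.5, 0.5, 0.5, 0.5], [0.1, 0.1, 0.1, 0.1]],
    image_height=480,
    image_width=640,
    text_labels=["a photo of a cat", "a photo of a dog"],
)
for det in detections:
    print(det["text_label"], det["box"])
```

In this toy run only the first token survives the threshold; the second token's best score is far below 0.1 and is discarded, which mirrors how raising or lowering `threshold` trades recall for precision.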

Resources

A demo notebook on using OWL-ViT for zero-shot and one-shot (image-guided) object detection can be found here.

OwlViTConfig

class transformers.OwlViTConfig

from_text_vision_configs

OwlViTTextConfig

class transformers.OwlViTTextConfig

OwlViTVisionConfig

class transformers.OwlViTVisionConfig

OwlViTImageProcessor

class transformers.OwlViTImageProcessor

preprocess

post_process_object_detection

post_process_image_guided_detection

OwlViTProcessor

class transformers.OwlViTProcessor

__call__

post_process_grounded_object_detection

post_process_image_guided_detection

OwlViTModel

class transformers.OwlViTModel

forward

get_text_features

get_image_features

OwlViTTextModel

class transformers.OwlViTTextModel

forward

OwlViTVisionModel

class transformers.OwlViTVisionModel

forward

OwlViTForObjectDetection

class transformers.OwlViTForObjectDetection

forward

image_guided_detection

call