Zero-shot image classification

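Zero-shot image classification is a task in which a model assigns images to classes even though it was never explicitly trained on labeled examples from those classes.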

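Traditionally, image classification requires training a model on a specific set of labeled images, and the model learns to "map" certain image features to labels. When such a model is needed for a classification task that introduces a new set of labels, it must be fine-tuned to "recalibrate" it.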

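In contrast, zero-shot or open-vocabulary image classification models are typically multimodal models trained on a large dataset of images paired with text descriptions. These models learn aligned vision-language representations that can be used for many downstream tasks, including zero-shot image classification.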

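This is a more flexible approach to image classification: the model can generalize to new, unseen categories without additional training data, and users can query images with free-form text descriptions of the target objects.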
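The idea of an aligned image-text space can be illustrated with a small, self-contained sketch. The vectors below are made-up toy embeddings rather than outputs of a real model, and the scale factor of 100 is an assumed temperature chosen for illustration. Each label is scored by cosine similarity against the image embedding, and a softmax turns the scores into probabilities.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-dimensional embeddings (hypothetical values, not from a real model).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "owl": [0.8, 0.2, 0.1],
    "fox": [0.1, 0.9, 0.3],
    "bear": [0.2, 0.3, 0.9],
}

scale = 100.0  # assumed temperature; real models learn their own value
labels = list(text_embeddings)
logits = [
    scale * cosine_similarity(image_embedding, text_embeddings[name])
    for name in labels
]
probs = softmax(logits)

# The label whose text embedding points in the most similar direction wins.
best_label = labels[probs.index(max(probs))]
print(best_label)
```

Because the labels are only text, changing the candidate set requires no retraining: new labels are simply embedded and compared in the same way.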

In this guide you'll learn how to:

create a zero-shot image classification pipeline
run zero-shot image classification inference by hand

Before you begin, make sure you have all the necessary libraries installed:

pip install -q "transformers[torch]" pillow

Zero-shot image classification pipeline

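The simplest way to try out inference with a model that supports zero-shot image classification is to use the corresponding pipeline(). Instantiate a pipeline from a checkpoint on the Hugging Face Hub: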

>>> from transformers import pipeline

>>> checkpoint = "openai/clip-vit-large-patch14"
>>> detector = pipeline(model=checkpoint, task="zero-shot-image-classification")

Next, choose an image you'd like to classify.

>>> from PIL import Image
>>> import requests

>>> url = "https://unsplash.com/photos/g8oS8-82DxI/download?ixid=MnwxMjA3fDB8MXx0b3BpY3x8SnBnNktpZGwtSGt8fHx8fDJ8fDE2NzgxMDYwODc&force=true&w=640"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image

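Pass the image and the candidate object labels to the pipeline. Here we pass the image directly; other suitable options include a local path to an image or an image URL. The candidate labels can be simple words, as in this example, or more descriptive phrases.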

>>> predictions = detector(image, candidate_labels=["fox", "bear", "seagull", "owl"])
>>> predictions
[{'score': 0.9996670484542847, 'label': 'owl'},
 {'score': 0.000199399160919711, 'label': 'seagull'},
 {'score': 7.392891711788252e-05, 'label': 'fox'},
 {'score': 5.96074532950297e-05, 'label': 'bear'}]
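The pipeline returns a list of dictionaries with `score` and `label` keys. As a minimal sketch of how such output might be consumed, the helper below picks the best label and rejects it when the score is too low. The name `top_prediction` and the `min_score` parameter are hypothetical and not part of Transformers; the threshold of 0.5 is an arbitrary assumption.

```python
def top_prediction(predictions, min_score=0.5):
    """Return the highest-scoring label, or None if no score reaches min_score."""
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= min_score else None


# Output copied from the pipeline call above.
predictions = [
    {"score": 0.9996670484542847, "label": "owl"},
    {"score": 0.000199399160919711, "label": "seagull"},
    {"score": 7.392891711788252e-05, "label": "fox"},
    {"score": 5.96074532950297e-05, "label": "bear"},
]

print(top_prediction(predictions))  # owl
```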

Zero-shot image classification by hand

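Now that you've seen how to use the zero-shot image classification pipeline, let's look at how to run zero-shot image classification manually.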

Start by loading the model and its associated processor from a checkpoint on the Hugging Face Hub. Here we'll use the same checkpoint as before:

>>> from transformers import AutoProcessor, AutoModelForZeroShotImageClassification

>>> model = AutoModelForZeroShotImageClassification.from_pretrained(checkpoint)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

Let's take a different image to switch things up.

>>> from PIL import Image
>>> import requests

>>> url = "https://unsplash.com/photos/xBRQfR2bqNI/download?ixid=MnwxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNjc4Mzg4ODEx&force=true&w=640"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image

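Use the processor to prepare the inputs for the model. The processor combines an image processor, which prepares the image for the model by resizing and normalizing it, and a tokenizer, which takes care of the text inputs.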

>>> candidate_labels = ["tree", "car", "bike", "cat"]
>>> # Follow the pipeline's prompt template to get the same results.
>>> # The prompts go in their own list so candidate_labels keeps the short names.
>>> prompts = [f"This is a photo of {label}." for label in candidate_labels]
>>> inputs = processor(images=image, text=prompts, return_tensors="pt", padding=True)
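`padding=True` matters because the prompts can tokenize to different lengths, while a batch tensor needs rows of equal length. The following is a minimal sketch of that idea only. The token ids are made up, the pad id of 0 is an assumption, and `pad_batch` is a hypothetical helper; a real tokenizer supplies its own ids and builds the attention mask for you.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Hypothetical token ids for two prompts of different lengths.
batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```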

Pass the inputs through the model and post-process the results:

>>> import torch

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> logits = outputs.logits_per_image[0]
>>> probs = logits.softmax(dim=-1).numpy()
>>> scores = probs.tolist()

>>> result = [
...     {"score": score, "label": candidate_label}
...     for score, candidate_label in sorted(zip(probs, candidate_labels), key=lambda x: -x[0])
... ]

>>> result
[{'score': 0.998572, 'label': 'car'},
 {'score': 0.0010570387, 'label': 'bike'},
 {'score': 0.0003393686, 'label': 'tree'},
 {'score': 3.1572064e-05, 'label': 'cat'}]
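One point worth keeping in mind when reading these scores: the softmax runs over the candidate labels only, so each score is relative to the other candidates rather than an absolute confidence. The sketch below uses made-up logits (not values produced by the model) to show how dropping a label changes the remaining scores.

```python
import math


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for illustration only.
logits = {"car": 4.0, "bike": 3.0, "tree": 1.0}

# Scores with all three candidates.
all_scores = dict(zip(logits, softmax(list(logits.values()))))

# Scores after removing "car" from the candidate set.
without_car = {name: value for name, value in logits.items() if name != "car"}
reduced_scores = dict(zip(without_car, softmax(list(without_car.values()))))

print(round(all_scores["bike"], 3))      # 0.259
print(round(reduced_scores["bike"], 3))  # 0.881
```

The logit for "bike" is the same in both cases; only the set of competing labels changed. A high score therefore means "the best match among the candidates", which is why the choice of candidate labels deserves some care.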
