CLIP

Overview

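The CLIP model was proposed in Learning Transferable Visual Models From Natural Language Supervision by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given image, without being directly optimized for the task, similar to the zero-shot capabilities of GPT-2 and GPT-3.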

The abstract from the paper is the following:

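State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.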

This model was contributed by valhalla. The original code can be found here.

Usage tips and example

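CLIP is a multimodal vision and language model. It can be used for image-text similarity and for zero-shot image classification. CLIP uses a ViT-like Transformer to extract visual features and a causal language model to extract text features. Both the text and visual features are then projected into a latent space with the same dimension. The dot product between the projected image and text features is then used as the similarity score.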
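The scoring step described above can be sketched without loading a model. The snippet below is a minimal, dependency-free illustration, not the library's code: the three-dimensional embeddings are made-up values standing in for already-projected features, and the `logit_scale` of 100.0 is an assumed placeholder for the learned temperature. It follows the Transformers implementation in L2-normalizing both sets of embeddings before taking the dot product.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_embeds, text_embeds, logit_scale=100.0):
    """Return a [num_images][num_texts] matrix of scaled dot products.

    logit_scale is an illustrative assumption, not a value read from a checkpoint.
    """
    image_embeds = [l2_normalize(v) for v in image_embeds]
    text_embeds = [l2_normalize(v) for v in text_embeds]
    return [
        [logit_scale * sum(i * t for i, t in zip(img, txt)) for txt in text_embeds]
        for img in image_embeds
    ]


# Hypothetical embeddings, assumed to be already projected into the shared space.
image_embeds = [[0.9, 0.1, 0.0]]
text_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

logits_per_image = similarity_logits(image_embeds, text_embeds)
```

Each row of `logits_per_image` corresponds to one image and each column to one text, which matches the layout of `outputs.logits_per_image` returned by CLIPModel.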

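To feed images to the Transformer encoder, each image is split into a sequence of fixed-size, non-overlapping patches, which are then linearly embedded. A [CLS] token is added to serve as a representation of the entire image. The authors also add absolute position embeddings and feed the resulting sequence of vectors to a standard Transformer encoder. CLIPImageProcessor can be used to resize (or rescale) and normalize images for the model.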
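To make the patch arithmetic concrete, here is a small sketch in plain Python. The helper names are hypothetical and the code only illustrates the splitting described above, not how the library implements it. The 224-pixel input size paired with a patch size of 32 is assumed for a checkpoint such as "openai/clip-vit-base-patch32".

```python
def patch_sequence_length(image_size, patch_size):
    """Number of tokens the vision encoder sees for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + 1  # +1 for the [CLS] token


def split_into_patches(image, patch_size):
    """Split a 2D list (H x W) into row-major, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size] for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# Assumed configuration: 224x224 input, 32x32 patches -> 7 * 7 patches + [CLS].
seq_len = patch_sequence_length(224, 32)

# A toy 4x4 single-channel "image" split into 2x2 patches.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = split_into_patches(toy_image, 2)
```

In the real model each patch is flattened and passed through a learned linear embedding before the position embeddings are added; the sketch stops at the splitting step.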

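CLIPTokenizer is used to encode the text. CLIPProcessor wraps CLIPImageProcessor and CLIPTokenizer into a single instance that both encodes the text and prepares the images. The following example shows how to get the image-text similarity scores using CLIPProcessor and CLIPModel.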

>>> from PIL import Image
>>> import requests

>>> from transformers import CLIPProcessor, CLIPModel

>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
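The final softmax step can be illustrated on its own. The sketch below uses made-up logits for a single image (they are not real model output) and shows how the probabilities map back to the candidate labels passed to the processor.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["a photo of a cat", "a photo of a dog"]
logits = [24.5, 19.3]  # hypothetical values for one image, not real model output

probs = softmax(logits)
best_label = labels[probs.index(max(probs))]
```

Because the softmax is taken across the text labels for each image, the resulting probabilities are relative to the candidate labels supplied and sum to one per image.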

Combining CLIP and Flash Attention 2

First, make sure to install the latest version of Flash Attention 2.

pip install -U flash-attn --no-build-isolation

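Also make sure that your hardware is compatible with Flash Attention 2. Read the official documentation of the flash-attn repository to learn more. In addition, make sure to load your model in half precision (e.g. torch.float16).

For small batch sizes, you might notice a slowdown in your model when using Flash Attention. Refer to the section Expected speedups with Flash Attention and SDPA below and select an appropriate attention implementation.

To load and run a model using Flash Attention 2, refer to the snippet below: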

>>> import torch
>>> import requests
>>> from PIL import Image

>>> from transformers import CLIPProcessor, CLIPModel

>>> device = "cuda"
>>> torch_dtype = torch.float16

>>> model = CLIPModel.from_pretrained(
...     "openai/clip-vit-base-patch32",
...     attn_implementation="flash_attention_2",
...     device_map=device,
...     torch_dtype=torch_dtype,
... )
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
>>> inputs.to(device)

>>> with torch.no_grad():
...     with torch.autocast(device):
...         outputs = model(**inputs)

>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
>>> print(probs)
tensor([[0.9946, 0.0052]], device='cuda:0', dtype=torch.float16)

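Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA.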

import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=torch.float16, attn_implementation="sdpa")

For the best speedups, we recommend loading the model in half precision (e.g. torch.float16 or torch.bfloat16).
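For reference, the operation that SDPA computes is softmax(Q K^T / sqrt(d)) V. The following dependency-free sketch spells this out for small 2D lists. It is only an illustration of the math under simplifying assumptions (single head, no attention mask, no dropout), not the fused kernels PyTorch dispatches to, and the input values are made up.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(query, key, value):
    """Reference attention for 2D lists: query [L, d], key [S, d], value [S, d_v]."""
    scale = 1.0 / math.sqrt(len(query[0]))
    output = []
    for q in query:
        # Similarity of this query to every key, scaled by 1 / sqrt(d).
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in key]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append([
            sum(w * v[j] for w, v in zip(weights, value))
            for j in range(len(value[0]))
        ])
    return output


# Made-up inputs: one query attending over two key/value pairs.
query = [[1.0, 0.0]]
key = [[1.0, 0.0], [0.0, 1.0]]
value = [[10.0, 0.0], [0.0, 10.0]]

attended = scaled_dot_product_attention(query, key, value)
```

The query is closer to the first key, so the output leans toward the first value vector. Note that the CLIP text encoder additionally applies a causal mask, which this sketch omits.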

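Expected speedups with Flash Attention and SDPA

On a local benchmark (NVIDIA A10G, PyTorch 2.3.1+cu121) with float16, we saw the following speedups during inference for the "openai/clip-vit-large-patch14" checkpoint (code):

CLIPTextModel

Num text labels	Eager (s/iter)	FA2 (s/iter)	FA2 speedup	SDPA (s/iter)	SDPA speedup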
4	0.009	0.012	0.737	0.007	1.269
16	0.009	0.014	0.659	0.008	1.187
32	0.018	0.021	0.862	0.016	1.142
64	0.034	0.034	1.001	0.03	1.163
128	0.063	0.058	1.09	0.054	1.174

CLIPVisionModel

Image batch size	Eager (s/iter)	FA2 (s/iter)	FA2 speedup	SDPA (s/iter)	SDPA speedup
1	0.016	0.013	1.247	0.012	1.318
4	0.025	0.021	1.198	0.021	1.202
16	0.093	0.075	1.234	0.075	1.24
32	0.181	0.147	1.237	0.146	1.241

CLIPModel

Image batch size	Num text labels	Eager (s/iter)	FA2 (s/iter)	FA2 speedup	SDPA (s/iter)	SDPA speedup
1	4	0.025	0.026	0.954	0.02	1.217
1	16	0.026	0.028	0.918	0.02	1.287
1	64	0.042	0.046	0.906	0.036	1.167
4	4	0.028	0.033	0.849	0.024	1.189
4	16	0.034	0.035	0.955	0.029	1.169
4	64	0.059	0.055	1.072	0.05	1.179
16	4	0.096	0.088	1.091	0.078	1.234
16	16	0.102	0.09	1.129	0.083	1.224
16	64	0.127	0.11	1.157	0.105	1.218
32	4	0.185	0.159	1.157	0.149	1.238
32	16	0.19	0.162	1.177	0.154	1.233
32	64	0.216	0.181	1.19	0.176	1.228
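One way to act on these numbers is to pick the attention implementation with the lowest time per iteration for a workload close to yours. The sketch below does this for three rows copied from the CLIPModel table. The helper names are hypothetical, and the result only reflects this particular benchmark setup, so treat it as a starting point, not a general rule.

```python
# Rows copied from the CLIPModel table:
# (image batch size, num text labels, eager s/iter, FA2 s/iter, SDPA s/iter)
ROWS = [
    (1, 4, 0.025, 0.026, 0.020),
    (4, 64, 0.059, 0.055, 0.050),
    (32, 64, 0.216, 0.181, 0.176),
]


def fastest_implementation(eager, fa2, sdpa):
    """Return the attn_implementation string with the lowest time per iteration."""
    timings = {"eager": eager, "flash_attention_2": fa2, "sdpa": sdpa}
    return min(timings, key=timings.get)


def speedup(baseline, candidate):
    """Speedup of candidate over baseline; values below 1.0 mean a slowdown."""
    return baseline / candidate


choices = {(b, t): fastest_implementation(e, f, s) for b, t, e, f, s in ROWS}
fa2_small = speedup(0.025, 0.026)  # FA2 vs. eager at batch size 1 with 4 labels
```

Ratios recomputed from the rounded timings differ slightly from the speedup columns, which were presumably derived from unrounded measurements. For the rows sampled here, SDPA is the fastest option, and FA2 is slower than eager at the smallest batch size.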

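Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIP.

Fine-tuning CLIP with remote sensing (satellite) images and captions, a blog post about how to fine-tune CLIP on the RSICD dataset, with a comparison of the performance changes caused by data augmentation.
This example script shows how to train a CLIP-like vision-text dual encoder model from a pretrained vision encoder and a pretrained text encoder, using the COCO dataset.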

Image-to-text

A notebook on how to use a pretrained CLIP for inference with beam search for image captioning. 🌎

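Image retrieval

A notebook on image retrieval using a pretrained CLIP and computing the MRR (Mean Reciprocal Rank) score. 🌎
A notebook on image retrieval that also shows the similarity score. 🌎
A notebook on how to map images and texts to the same vector space using Multilingual CLIP. 🌎
A notebook on how to run CLIP on semantic image search using the Unsplash and TMDB datasets. 🌎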

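Explainability

A notebook on how to visualize the similarity between input tokens and image segments. 🌎

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it. The resource should ideally demonstrate something new instead of duplicating an existing resource.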
