LeViT

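Overview

The LeViT model was proposed in LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference by Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze. LeViT improves the performance and efficiency of the Vision Transformer (ViT) through a few architectural differences, such as activation maps with decreasing resolutions in Transformers and an attention bias that integrates positional information.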

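The abstract from the paper is the following:

We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU.

LeViT architecture. Taken from the original paper.

This model was contributed by anugunj. The original code can be found here.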

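Usage tips

Compared to ViT, LeViT models use an additional distillation head to learn effectively from a teacher (which, in the LeViT paper, is a ResNet-like model). The distillation head is learned through backpropagation under the supervision of the ResNet-like model. LeViT also draws inspiration from convolutional neural networks and uses activation maps with decreasing resolutions to increase efficiency.
There are 2 ways to fine-tune distilled models: (1) the classic way, placing only a prediction head on top of the final hidden state and not using the distillation head, or (2) placing both a prediction head and a distillation head on top of the final hidden state. In the second case, the prediction head is trained with regular cross-entropy between its prediction and the ground-truth label, while the distillation head is trained with hard distillation (cross-entropy between the distillation head's prediction and the label predicted by the teacher). At inference time, the average of both heads' predictions is taken as the final prediction. (2) is also called "fine-tuning with distillation", because it relies on a teacher that has already been fine-tuned on the downstream dataset. In terms of models, (1) corresponds to LevitForImageClassification and (2) corresponds to LevitForImageClassificationWithTeacher.
All released checkpoints were pre-trained and fine-tuned on ImageNet-1k only (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes). No external data was used. This is in contrast with the original ViT model, which used external data such as the JFT-300M dataset or ImageNet-21k for pre-training.
The authors of LeViT released 5 trained LeViT models, which you can plug directly into LevitModel or LevitForImageClassification. Techniques such as data augmentation, optimization, and regularization were used to simulate training on a much larger dataset (while only using ImageNet-1k for pre-training). The 5 available variants (all trained on images of size 224x224) are: facebook/levit-128S, facebook/levit-128, facebook/levit-192, facebook/levit-256 and facebook/levit-384. Note that LevitImageProcessor should be used to prepare images for the model.
LevitForImageClassificationWithTeacher currently supports only inference, not training or fine-tuning.
You can check out demo notebooks on inference and on fine-tuning with custom data here (replace ViTFeatureExtractor with LevitImageProcessor, and ViTForImageClassification with LevitForImageClassification or LevitForImageClassificationWithTeacher).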
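
The two training signals described for fine-tuning with distillation can be sketched in plain Python. This is a minimal illustration, not the library's training code: the function names, the toy logits, and the equal weighting of the two loss terms are all assumptions made for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def cross_entropy(logits, target):
    """Cross-entropy between one logit vector and an integer class index."""
    return -log_softmax(logits)[target]


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    # Prediction head: regular cross-entropy against the ground-truth label.
    loss_cls = cross_entropy(cls_logits, label)
    # Distillation head: cross-entropy against the teacher's hard (argmax) label.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_dist = cross_entropy(dist_logits, teacher_label)
    # Equal weighting of the two terms is an assumption of this sketch.
    return 0.5 * loss_cls + 0.5 * loss_dist


# Toy 3-class example with made-up logits.
loss = hard_distillation_loss(
    cls_logits=[2.0, 0.5, -1.0],
    dist_logits=[0.2, 1.5, 0.1],
    teacher_logits=[0.1, 3.0, 0.2],
    label=0,
)
print(f"combined loss: {loss:.4f}")
```

Note that the teacher only contributes its argmax label here, which is what makes the distillation "hard" rather than "soft".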

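Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LeViT.

Image Classification

LevitForImageClassification is supported by this example script and notebook.
See also: Image classification task guide

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.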

LevitConfig

class transformers.LevitConfig

< source >

( image_size = 224 num_channels = 3 kernel_size = 3 stride = 2 padding = 1 patch_size = 16 hidden_sizes = [128, 256, 384] num_attention_heads = [4, 8, 12] depths = [4, 4, 4] key_dim = [16, 16, 16] drop_path_rate = 0 mlp_ratio = [2, 2, 2] attention_ratio = [2, 2, 2] initializer_range = 0.02 **kwargs )

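Parameters

image_size (int, optional, defaults to 224): The size of the input image.
num_channels (int, optional, defaults to 3): Number of channels in the input image.
kernel_size (int, optional, defaults to 3): The kernel size of the initial convolution layers of the patch embedding.
stride (int, optional, defaults to 2): The stride of the initial convolution layers of the patch embedding.
padding (int, optional, defaults to 1): The padding of the initial convolution layers of the patch embedding.
patch_size (int, optional, defaults to 16): The patch size used for the embeddings.
hidden_sizes (List[int], optional, defaults to [128, 256, 384]): Dimension of each encoder block.
num_attention_heads (List[int], optional, defaults to [4, 8, 12]): Number of attention heads for each attention layer in each block of the Transformer encoder.
depths (List[int], optional, defaults to [4, 4, 4]): The number of layers in each encoder block.
key_dim (List[int], optional, defaults to [16, 16, 16]): The size of the key in each encoder block.
drop_path_rate (int, optional, defaults to 0): The dropout probability for stochastic depth, used in the blocks of the Transformer encoder.
mlp_ratio (List[int], optional, defaults to [2, 2, 2]): Ratio of the size of the hidden layer to the size of the input layer of the Mix FFNs in the encoder blocks.
attention_ratio (List[int], optional, defaults to [2, 2, 2]): Ratio of the output dimension to the input dimension of the attention layers.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.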

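This is the configuration class to store the configuration of a LevitModel. It is used to instantiate a LeViT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the LeViT facebook/levit-128S architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.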
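
To see how kernel_size, stride, padding, and patch_size relate, the sketch below applies the standard convolution output-size formula with the default values. It assumes four stride-2 convolution layers in the patch embedding (chosen because 2**4 equals the default patch_size of 16); the helper names are hypothetical and not part of the library.

```python
def conv_output_size(size, kernel_size=3, stride=2, padding=1):
    """Side length produced by a square convolution (standard formula)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def patch_embedding_resolution(image_size=224, num_convs=4, **conv_kwargs):
    # num_convs=4 is an assumption: four stride-2 layers reduce by 2**4 = 16.
    size = image_size
    sizes = [size]
    for _ in range(num_convs):
        size = conv_output_size(size, **conv_kwargs)
        sizes.append(size)
    return sizes


sizes = patch_embedding_resolution()
print(sizes)  # [224, 112, 56, 28, 14]
```

Under these assumptions, a 224x224 input ends up as a 14x14 grid, the same as image_size // patch_size.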

Example

>>> from transformers import LevitConfig, LevitModel

>>> # Initializing a LeViT levit-128S style configuration
>>> configuration = LevitConfig()

>>> # Initializing a model (with random weights) from the levit-128S style configuration
>>> model = LevitModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

LevitFeatureExtractor

class transformers.LevitFeatureExtractor

< source >

( *args **kwargs )

__call__

< source >

( images **kwargs )

Preprocess an image or a batch of images.

LevitImageProcessor

class transformers.LevitImageProcessor

< source >

( do_resize: bool = True size: typing.Dict[str, int] = None resample: Resampling = <Resampling.BICUBIC: 3> do_center_crop: bool = True crop_size: typing.Dict[str, int] = None do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.Iterable[float], NoneType] = [0.485, 0.456, 0.406] image_std: typing.Union[float, typing.Iterable[float], NoneType] = [0.229, 0.224, 0.225] **kwargs )

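Parameters

do_resize (bool, optional, defaults to True): Whether to resize the shortest edge of the input to int(256/224 * size). Can be overridden by the do_resize parameter in the preprocess method.
size (Dict[str, int], optional, defaults to {"shortest_edge": 224}): Size of the output image after resizing. If size is a dict with keys "width" and "height", the image will be resized to (size["height"], size["width"]). If size is a dict with key "shortest_edge", the shortest edge value c is rescaled to int(c * (256/224)). The smaller edge of the image will be matched to this value, i.e. if height > width, the image will be rescaled to (size["shortest_edge"] * height / width, size["shortest_edge"]). Can be overridden by the size parameter in the preprocess method.
resample (PILImageResampling, optional, defaults to Resampling.BICUBIC): Resampling filter to use if resizing the image. Can be overridden by the resample parameter in the preprocess method.
do_center_crop (bool, optional, defaults to True): Whether to center crop the input to (crop_size["height"], crop_size["width"]). Can be overridden by the do_center_crop parameter in the preprocess method.
crop_size (Dict, optional, defaults to {"height": 224, "width": 224}): Desired image size after center_crop. Can be overridden by the crop_size parameter in the preprocess method.
do_rescale (bool, optional, defaults to True): Controls whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255): Scale factor to use if rescaling the image. Can be overridden by the rescale_factor parameter in the preprocess method.
do_normalize (bool, optional, defaults to True): Controls whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or List[float], optional, defaults to [0.485, 0.456, 0.406]): Mean to use if normalizing the image. This is a float or a list of floats whose length is the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or List[float], optional, defaults to [0.229, 0.224, 0.225]): Standard deviation to use if normalizing the image. This is a float or a list of floats whose length is the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.

Constructs a LeViT image processor.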
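
The resize and center-crop geometry described by size and crop_size can be worked through with a few lines of arithmetic. This is only a sketch of the shape computation under the default values; the helper names are hypothetical, and truncating with int() for the longer side is an assumption about rounding.

```python
def resized_shape(height, width, shortest_edge=224):
    """Return (new_height, new_width) for the shortest-edge resize described above."""
    target = int(shortest_edge * 256 / 224)  # 224 -> 256
    if height > width:
        # Width is the shorter side and is matched to the target.
        return int(target * height / width), target
    # Height is the shorter (or equal) side.
    return target, int(target * width / height)


def center_crop_box(height, width, crop_height=224, crop_width=224):
    """Return (top, left, bottom, right) of a centered crop window."""
    top = (height - crop_height) // 2
    left = (width - crop_width) // 2
    return top, left, top + crop_height, left + crop_width


h, w = resized_shape(480, 640)
print((h, w))                 # (256, 341)
print(center_crop_box(h, w))  # (16, 58, 240, 282)
```

Resizing to 256 before cropping to 224 keeps roughly the central 87.5% of the shorter side, which is the usual ImageNet evaluation recipe.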

preprocess

< source >

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_resize: typing.Optional[bool] = None size: typing.Optional[typing.Dict[str, int]] = None resample: Resampling = None do_center_crop: typing.Optional[bool] = None crop_size: typing.Optional[typing.Dict[str, int]] = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, typing.Iterable[float], NoneType] = None image_std: typing.Union[float, typing.Iterable[float], NoneType] = None return_tensors: typing.Optional[transformers.utils.generic.TensorType] = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

Parameters

images (ImageInput): Image or batch of images to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize): Whether to resize the image.
size (Dict[str, int], optional, defaults to self.size): Size of the output image after resizing. If size is a dict with keys "width" and "height", the image will be resized to (height, width). If size is a dict with key "shortest_edge", the shortest edge value c is rescaled to int(c * (256/224)). The smaller edge of the image will be matched to this value, i.e. if height > width, the image will be rescaled to (size * height / width, size).
resample (PILImageResampling, optional, defaults to PILImageResampling.BICUBIC): Resampling filter to use when resizing the image.
do_center_crop (bool, optional, defaults to self.do_center_crop): Whether to center crop the image.
crop_size (Dict[str, int], optional, defaults to self.crop_size): Size of the output image after center cropping. Crops the image to (crop_size["height"], crop_size["width"]).
do_rescale (bool, optional, defaults to self.do_rescale): Whether to rescale the image pixel values by rescale_factor, typically to values between 0 and 1.
rescale_factor (float, optional, defaults to self.rescale_factor): Factor by which to rescale the image pixel values.
do_normalize (bool, optional, defaults to self.do_normalize): Whether to normalize the image pixel values using image_mean and image_std.
image_mean (float or List[float], optional, defaults to self.image_mean): Mean used to normalize the image pixel values.
image_std (float or List[float], optional, defaults to self.image_std): Standard deviation used to normalize the image pixel values.
return_tensors (str or TensorType, optional): The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (str or ChannelDimension, optional, defaults to ChannelDimension.FIRST): The channel dimension format of the output image. If unset, the channel dimension format of the input image is used. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
input_data_format (ChannelDimension or str, optional): The channel dimension format of the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess an image or a batch of images to be used as input to a LeViT model.
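
The effect of do_rescale followed by do_normalize on a single pixel can be illustrated as follows. The constants are the documented defaults; the function name is hypothetical, and the sketch ignores resizing, cropping, and batching.

```python
RESCALE_FACTOR = 1 / 255
IMAGE_MEAN = [0.485, 0.456, 0.406]
IMAGE_STD = [0.229, 0.224, 0.225]


def rescale_and_normalize(pixel, rescale_factor=RESCALE_FACTOR,
                          mean=IMAGE_MEAN, std=IMAGE_STD):
    """Apply rescaling, then per-channel normalization, to one (R, G, B) pixel in 0..255."""
    rescaled = [channel * rescale_factor for channel in pixel]
    return [(value - m) / s for value, m, s in zip(rescaled, mean, std)]


white = rescale_and_normalize((255, 255, 255))
black = rescale_and_normalize((0, 0, 0))
print([round(v, 3) for v in white])
print([round(v, 3) for v in black])
```

This also shows why do_rescale=False is needed for inputs that are already in the 0 to 1 range: rescaling them a second time would push every value close to -mean / std.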

LevitModel

class transformers.LevitModel

< source >

( config )

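Parameters

config (LevitConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Levit model outputting raw features without any specific head on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.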

forward

< source >

( pixel_values: FloatTensor = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Pixel values. Pixel values can be obtained using AutoImageProcessor. See LevitImageProcessor.__call__() for details.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LevitConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden states at the output of the last layer of the model.
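pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)): Last layer hidden state after a pooling operation on the spatial dimensions.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, num_channels, height, width).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.

The LevitModel forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example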

>>> from transformers import AutoImageProcessor, LevitModel
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("facebook/levit-128S")
>>> model = LevitModel.from_pretrained("facebook/levit-128S")

>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 16, 384]
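
One way to read the [1, 16, 384] shape is as (batch, tokens, channels) after the resolution has been reduced stage by stage. The sketch below reproduces that count for the default levit-128S-style values. It assumes a 14x14 grid after the patch embedding and one stride-2 subsampling step between consecutive stages; both are assumptions of this illustration, in line with the decreasing-resolution activation maps described in the overview.

```python
def subsample(resolution, stride=2):
    """Side length after keeping every `stride`-th position (ceil division)."""
    return (resolution - 1) // stride + 1


image_size, patch_size = 224, 16
hidden_sizes = [128, 256, 384]

resolution = image_size // patch_size  # 14x14 grid after the patch embedding
stages = []
for index, width in enumerate(hidden_sizes):
    if index > 0:
        # Assumption: one stride-2 subsampling step between consecutive stages.
        resolution = subsample(resolution)
    stages.append((resolution * resolution, width))

print(stages)  # [(196, 128), (49, 256), (16, 384)]
```

The last entry, 16 tokens of width 384, matches the trailing dimensions of the shape printed above.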

LevitForImageClassification

class transformers.LevitForImageClassification

< source >

( config )

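Parameters

config (LevitConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Levit model with an image classification head on top (a linear layer on top of the pooled features), e.g. for ImageNet.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.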

forward

< source >

( pixel_values: FloatTensor = None labels: typing.Optional[torch.LongTensor] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Pixel values. Pixel values can be obtained using AutoImageProcessor. See LevitImageProcessor.__call__() for details.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size,), optional): Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (mean-square loss); if config.num_labels > 1, a classification loss is computed (cross-entropy).

Returns

transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or tuple(torch.FloatTensor)

A transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LevitConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Classification loss (or regression loss if config.num_labels==1).
logits (torch.FloatTensor of shape (batch_size, config.num_labels)): Classification scores (or regression scores if config.num_labels==1), before SoftMax.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the model at the output of each stage.

The LevitForImageClassification forward method, overrides the __call__ special method.

Example

>>> from transformers import AutoImageProcessor, LevitForImageClassification
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("facebook/levit-128S")
>>> model = LevitForImageClassification.from_pretrained("facebook/levit-128S")

>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
tabby, tabby cat
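
Because the returned logits are scores before SoftMax, turning them into probabilities or a top-k list takes one more step. The following is a minimal pure-Python sketch; the logits and the three-entry label map are made up for illustration and do not come from the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in ranked]


# Made-up logits and a hypothetical three-class label map, for illustration only.
toy_logits = [4.0, 1.0, 0.5]
toy_id2label = {0: "tabby, tabby cat", 1: "tiger cat", 2: "Egyptian cat"}
for label, prob in top_k(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With a real model, the same idea applies to one row of the logits tensor together with model.config.id2label.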

LevitForImageClassificationWithTeacher

class transformers.LevitForImageClassificationWithTeacher

< source >

( config )

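Parameters

config (LevitConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

LeViT model transformer with image classification heads on top (a linear layer on top of the final hidden state and a linear layer on top of the final hidden state of the distillation token), e.g. for ImageNet. Warning: this model supports inference only. Fine-tuning with distillation (i.e. with a teacher) is not yet supported.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.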

forward

< source >

( pixel_values: FloatTensor = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.models.levit.modeling_levit.LevitForImageClassificationWithTeacherOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Pixel values. Pixel values can be obtained using AutoImageProcessor. See LevitImageProcessor.__call__() for details.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.

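Returns

transformers.models.levit.modeling_levit.LevitForImageClassificationWithTeacherOutput or tuple(torch.FloatTensor)

A transformers.models.levit.modeling_levit.LevitForImageClassificationWithTeacherOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LevitConfig) and inputs.

logits (torch.FloatTensor of shape (batch_size, config.num_labels)): Prediction scores, computed as the average of cls_logits and distillation_logits.
cls_logits (torch.FloatTensor of shape (batch_size, config.num_labels)): Prediction scores of the classification head (i.e. the linear layer on top of the final hidden state of the class token).
distillation_logits (torch.FloatTensor of shape (batch_size, config.num_labels)): Prediction scores of the distillation head (i.e. the linear layer on top of the final hidden state of the distillation token).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the model at the output of each layer plus the initial embedding outputs.

The LevitForImageClassificationWithTeacher forward method, overrides the __call__ special method.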
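
The relationship between the three logits fields can be shown with a toy calculation. This is a sketch of the averaging step only, using made-up values for a single image and a hypothetical helper name; it is not the model's forward code.

```python
def average_heads(cls_logits, distillation_logits):
    """Element-wise mean of the two heads' logits for one image."""
    if len(cls_logits) != len(distillation_logits):
        raise ValueError("both heads must score the same number of classes")
    return [(c + d) / 2 for c, d in zip(cls_logits, distillation_logits)]


# Made-up scores: the classification head prefers class 0,
# the distillation head prefers class 1.
cls_logits = [2.0, 1.5, -1.0]
distillation_logits = [0.0, 2.5, -1.0]

logits = average_heads(cls_logits, distillation_logits)
predicted = max(range(len(logits)), key=logits.__getitem__)
print(logits, predicted)  # [1.0, 2.0, -1.0] 1
```

In this toy case the averaged prediction follows the distillation head, which shows that the final label can differ from what the classification head alone would pick.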

Example

>>> from transformers import AutoImageProcessor, LevitForImageClassificationWithTeacher
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("facebook/levit-128S")
>>> model = LevitForImageClassificationWithTeacher.from_pretrained("facebook/levit-128S")

>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
tabby, tabby cat
