Vision Transformer (ViT)

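Vision Transformer (ViT) is a Transformer adapted for computer vision tasks. An image is split into smaller fixed-size patches, which are treated as a sequence of tokens, similar to words in NLP tasks. ViT requires fewer resources to pretrain than convolutional architectures, and its performance on large datasets can be transferred to smaller downstream tasks.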
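
To make the patch-to-token idea concrete, here is a minimal pure-Python sketch that cuts a tiny single-channel image into non-overlapping square patches and flattens each one into a token vector. The helper name `split_into_patches` and the toy 4x4 image are illustrative assumptions, not part of the library, and a real ViT additionally applies a learned linear projection to every patch.

```python
def split_into_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 "image" whose pixel values are 0..15.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = split_into_patches(toy_image, patch_size=2)
print(len(tokens))  # 4 patches, i.e. a sequence of 4 tokens
print(tokens[0])    # [0, 1, 4, 5]
```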

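You can find all the original ViT checkpoints under the Google organization.

Click on the ViT models in the right sidebar for more examples of how to apply ViT to different computer vision tasks.

The examples below demonstrate how to classify an image with the Pipeline or the AutoModel class.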

Pipeline
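
The original code for this tab did not survive extraction. The following is a minimal sketch of image classification with the `pipeline` API, assuming the `google/vit-base-patch16-224` checkpoint mentioned elsewhere on this page; the wrapper name `classify_with_pipeline` is a hypothetical helper introduced only for illustration.

```python
from transformers import pipeline


def classify_with_pipeline(image, checkpoint="google/vit-base-patch16-224"):
    """Classify an image (file path, URL, or PIL image) with a ViT checkpoint.

    Creating the pipeline downloads the checkpoint on first use.
    """
    classifier = pipeline(task="image-classification", model=checkpoint)
    # The result is a list of {"label": ..., "score": ...} dictionaries.
    return classifier(image)
```

Calling `classify_with_pipeline` with a local image path or an image URL should return the top predicted labels together with their scores.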

AutoModel
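
The original code for this tab is also missing. As a rough sketch, the same classification can be done with the Auto classes, again assuming the `google/vit-base-patch16-224` checkpoint; the function name `classify_with_automodel` is hypothetical.

```python
import torch
from transformers import AutoImageProcessor, AutoModelForImageClassification


def classify_with_automodel(image, checkpoint="google/vit-base-patch16-224"):
    """Return the predicted label for a PIL image using a ViT checkpoint."""
    image_processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModelForImageClassification.from_pretrained(checkpoint)
    model.eval()

    # Resize, rescale and normalize the image into a batch of pixel values.
    inputs = image_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Pick the highest-scoring class and map its index to a readable label.
    predicted_index = logits.argmax(dim=-1).item()
    return model.config.id2label[predicted_index]
```

Compared with the pipeline, this form exposes the preprocessing and the raw logits, which is convenient when the scores are needed for something other than a top-1 label.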

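Notes

The best results are obtained with supervised pretraining. During fine-tuning, it may be better to use images with a resolution higher than 224x224.
Use ViTImageProcessorFast to resize (or rescale) and normalize images to the expected size.
The patch and image resolution are reflected in the checkpoint name. For example, google/vit-base-patch16-224 is a **base-sized** architecture with a patch resolution of 16x16 and a fine-tuning resolution of 224x224.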
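
As a small illustration of the naming convention, the sketch below reads the patch and image resolution out of a checkpoint name and derives the resulting token count. The helper `describe_checkpoint` is hypothetical and assumes names that follow the `patch<P>-<R>` pattern.

```python
import re


def describe_checkpoint(name):
    """Extract patch and image resolution from a ViT checkpoint name."""
    match = re.search(r"patch(\d+)-(\d+)", name)
    if match is None:
        raise ValueError(f"no patch/resolution suffix found in {name!r}")
    patch_size, image_size = int(match.group(1)), int(match.group(2))
    num_patches = (image_size // patch_size) ** 2
    return {
        "patch_size": patch_size,
        "image_size": image_size,
        "num_patches": num_patches,
        # One extra position for the [CLS] token prepended to the patches.
        "sequence_length": num_patches + 1,
    }


info = describe_checkpoint("google/vit-base-patch16-224")
print(info)
# {'patch_size': 16, 'image_size': 224, 'num_patches': 196, 'sequence_length': 197}
```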

ViTConfig

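class transformers.ViTConfig

< source >

( hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.0 attention_probs_dropout_prob = 0.0 initializer_range = 0.02 layer_norm_eps = 1e-12 image_size = 224 patch_size = 16 num_channels = 3 qkv_bias = True encoder_stride = 16 pooler_output_size = None pooler_act = 'tanh' **kwargs )

Parameters

hidden_size (int, optional, defaults to 768): Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12): Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072): Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (str or function, optional, defaults to "gelu"): The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "selu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.0): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.0): The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer used to initialize all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-12): The epsilon used by the layer normalization layers.
image_size (int, optional, defaults to 224): The size (resolution) of each image.
patch_size (int, optional, defaults to 16): The size (resolution) of each patch.
num_channels (int, optional, defaults to 3): The number of input channels.
qkv_bias (bool, optional, defaults to True): Whether to add a bias to the queries, keys and values.
encoder_stride (int, optional, defaults to 16): Factor by which the decoder head increases the spatial resolution for masked image modeling.
pooler_output_size (int, optional): Dimensionality of the pooler layer. If None, defaults to hidden_size.
pooler_act (str, optional, defaults to "tanh"): The activation function used by the pooler. Keys of ACT2FN are supported for Flax and PyTorch, and elements of https://tensorflowcn.cn/api_docs/python/tf/keras/activations are supported for TensorFlow.

This is the configuration class to store the configuration of a ViTModel. It is used to instantiate a ViT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the ViT google/vit-base-patch16-224 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.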
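
To show how these defaults translate into model size, the following back-of-the-envelope sketch estimates the parameter count of the bare encoder. It assumes the standard ViT layout (a patch projection, a [CLS] token, learned position embeddings, two layer norms per block, a final layer norm, and a linear pooler); the function `estimate_vit_parameters` is hypothetical and computes an estimate rather than reading anything from the library.

```python
def estimate_vit_parameters(
    hidden_size=768,
    num_hidden_layers=12,
    intermediate_size=3072,
    image_size=224,
    patch_size=16,
    num_channels=3,
    with_pooler=True,
):
    """Estimate the number of trainable parameters of a bare ViT encoder."""

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weights + bias

    num_patches = (image_size // patch_size) ** 2
    embeddings = (
        linear(patch_size * patch_size * num_channels, hidden_size)  # patch projection
        + hidden_size                                                # [CLS] token
        + (num_patches + 1) * hidden_size                            # position embeddings
    )
    per_layer = (
        3 * linear(hidden_size, hidden_size)      # query, key, value (with bias)
        + linear(hidden_size, hidden_size)        # attention output projection
        + linear(hidden_size, intermediate_size)  # feed-forward up-projection
        + linear(intermediate_size, hidden_size)  # feed-forward down-projection
        + 2 * 2 * hidden_size                     # two layer norms (scale + shift)
    )
    final_norm = 2 * hidden_size
    pooler = linear(hidden_size, hidden_size) if with_pooler else 0
    return embeddings + num_hidden_layers * per_layer + final_norm + pooler


print(estimate_vit_parameters())  # 86389248
```

With the default values this comes to roughly 86 million parameters, almost all of them in the twelve encoder layers.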

Example

>>> from transformers import ViTConfig, ViTModel

>>> # Initializing a ViT vit-base-patch16-224 style configuration
>>> configuration = ViTConfig()

>>> # Initializing a model (with random weights) from the vit-base-patch16-224 style configuration
>>> model = ViTModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

ViTFeatureExtractor

class transformers.ViTFeatureExtractor

< source >

( *args **kwargs )

__call__

< source >

( images **kwargs )

Preprocess an image or a batch of images.

ViTImageProcessor

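class transformers.ViTImageProcessor

< source >

( do_resize: bool = True size: typing.Optional[dict[str, int]] = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None do_convert_rgb: typing.Optional[bool] = None **kwargs )

Parameters

do_resize (bool, optional, defaults to True): Whether to resize the image's (height, width) dimensions to the specified (size["height"], size["width"]). Can be overridden by the do_resize parameter in the preprocess method.
size (dict, optional, defaults to {"height": 224, "width": 224}): Size of the output image after resizing. Can be overridden by the size parameter in the preprocess method.
resample (PILImageResampling, optional, defaults to Resampling.BILINEAR): Resampling filter to use if resizing the image. Can be overridden by the resample parameter in the preprocess method.
do_rescale (bool, optional, defaults to True): Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255): Scale factor to use if rescaling the image. Can be overridden by the rescale_factor parameter in the preprocess method.
do_normalize (bool, optional, defaults to True): Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or list[float], optional, defaults to IMAGENET_STANDARD_MEAN): Mean to use if normalizing the image. This is a float or a list of floats whose length is the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or list[float], optional, defaults to IMAGENET_STANDARD_STD): Standard deviation to use if normalizing the image. This is a float or a list of floats whose length is the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_convert_rgb (bool, optional): Whether to convert the image to RGB.

Constructs a ViT image processor.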
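
The rescale and normalize steps amount to simple per-channel arithmetic. The sketch below applies them to a single RGB pixel, using the 1/255 rescale factor from the signature and assuming a mean and standard deviation of 0.5 per channel for illustration; the constants and the helper `rescale_and_normalize` are not taken from the library.

```python
RESCALE_FACTOR = 1 / 255
# Assumed per-channel statistics, used here only for illustration.
IMAGE_MEAN = [0.5, 0.5, 0.5]
IMAGE_STD = [0.5, 0.5, 0.5]


def rescale_and_normalize(pixel, mean=IMAGE_MEAN, std=IMAGE_STD):
    """Map one RGB pixel from [0, 255] integers to normalized floats."""
    # Step 1: rescale from [0, 255] to [0, 1].
    rescaled = [value * RESCALE_FACTOR for value in pixel]
    # Step 2: subtract the channel mean and divide by the channel std.
    return [(v - m) / s for v, m, s in zip(rescaled, mean, std)]


normalized = rescale_and_normalize([0, 255, 255])
print(normalized)  # approximately [-1.0, 1.0, 1.0]
```

Under these assumed statistics, the darkest possible value maps to about -1 and the brightest to about 1.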

preprocess

< source >

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_resize: typing.Optional[bool] = None size: typing.Optional[dict[str, int]] = None resample: Resampling = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Union[str, transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None do_convert_rgb: typing.Optional[bool] = None )

Parameters

images (ImageInput): Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize): Whether to resize the image.
size (dict[str, int], optional, defaults to self.size): Dictionary in the format {"height": h, "width": w} specifying the size of the output image after resizing.
resample (PILImageResampling filter, optional, defaults to self.resample): PILImageResampling filter to use if resizing the image, e.g. PILImageResampling.BILINEAR. Only has an effect if do_resize is set to True.
do_rescale (bool, optional, defaults to self.do_rescale): Whether to rescale the image values to the range [0, 1].
rescale_factor (float, optional, defaults to self.rescale_factor): Rescale factor to apply to the image if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize): Whether to normalize the image.
image_mean (float or list[float], optional, defaults to self.image_mean): Image mean to use if do_normalize is set to True.
image_std (float or list[float], optional, defaults to self.image_std): Image standard deviation to use if do_normalize is set to True.
return_tensors (str or TensorType, optional): The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST): The channel dimension format of the output image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (ChannelDimension or str, optional): The channel dimension format of the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.
do_convert_rgb (bool, optional, defaults to self.do_convert_rgb): Whether to convert the image to RGB.

Preprocess an image or a batch of images.
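
The two channel layouts differ only in axis order. As an illustration, this pure-Python sketch converts a tiny channels-last image to channels-first using nested lists; the helpers `to_channels_first` and `shape_of` are hypothetical and exist only to make the layouts visible.

```python
def to_channels_first(image_hwc):
    """Convert (height, width, num_channels) nested lists to (num_channels, height, width)."""
    height, width = len(image_hwc), len(image_hwc[0])
    num_channels = len(image_hwc[0][0])
    return [
        [[image_hwc[row][col][channel] for col in range(width)] for row in range(height)]
        for channel in range(num_channels)
    ]


def shape_of(nested):
    """Return the shape of a regular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Toy 2x3 RGB image: the red channel varies, green and blue are constant.
image_hwc = [[[10 * row + col, 100, 200] for col in range(3)] for row in range(2)]
image_chw = to_channels_first(image_hwc)
print(shape_of(image_hwc))  # (2, 3, 3)
print(shape_of(image_chw))  # (3, 2, 3)
print(image_chw[0])         # [[0, 1, 2], [10, 11, 12]]
```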

ViTImageProcessorFast

class transformers.ViTImageProcessorFast

< source >

( **kwargs: typing_extensions.Unpack[transformers.image_processing_utils_fast.DefaultFastImageProcessorKwargs] )

Constructs a fast ViT image processor.

preprocess

< source >

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] *args **kwargs: typing_extensions.Unpack[transformers.image_processing_utils_fast.DefaultFastImageProcessorKwargs] ) → <class 'transformers.image_processing_base.BatchFeature'>

Parameters

images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]): Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional): Whether to resize the image.
size (dict[str, int], optional): Describes the maximum input dimensions of the model.
default_to_square (bool, optional): Whether to default to a square image when resizing, if size is an int.
resample (Union[PILImageResampling, F.InterpolationMode, NoneType]): Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_center_crop (bool, optional): Whether to center crop the image.
crop_size (dict[str, int], optional): Size of the output image after applying center_crop.
do_rescale (bool, optional): Whether to rescale the image.
rescale_factor (Union[int, float, NoneType]): Rescale factor to apply to the image if do_rescale is set to True.
do_normalize (bool, optional): Whether to normalize the image.
image_mean (Union[float, list[float], NoneType]): Image mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (Union[float, list[float], NoneType]): Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_convert_rgb (bool, optional): Whether to convert the image to RGB.
return_tensors (Union[str, ~utils.generic.TensorType, NoneType]): Returns stacked tensors if set to "pt", otherwise returns a list of tensors.
data_format (~image_utils.ChannelDimension, optional): Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
input_data_format (Union[str, ~image_utils.ChannelDimension, NoneType]): The channel dimension format of the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.
device (torch.device, optional): The device on which to process the images. If unset, the device is inferred from the input images.
disable_grouping (bool, optional): Whether to disable grouping of images by size, so that they are processed individually rather than in batches. If None, it is set to True when the images are on CPU and to False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157

<class 'transformers.image_processing_base.BatchFeature'>

data (dict): Dictionary of lists/arrays/tensors returned by the __call__ method ("pixel_values", etc.).
tensor_type (Union[None, str, TensorType], optional): You can give a tensor_type here to convert the lists of integers into PyTorch/TensorFlow/NumPy tensors at initialization.
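
The idea behind grouping by size is that images sharing the same height and width can be stacked and processed as one batch. The sketch below illustrates only that bookkeeping; `group_by_shape` is a hypothetical helper, and the library's actual grouping logic may differ.

```python
from collections import defaultdict


def group_by_shape(shapes):
    """Group image indices by (height, width) so same-sized images can be batched."""
    groups = defaultdict(list)
    for index, shape in enumerate(shapes):
        groups[shape].append(index)
    return dict(groups)


shapes = [(224, 224), (384, 384), (224, 224), (512, 256)]
groups = group_by_shape(shapes)
print(groups)
# {(224, 224): [0, 2], (384, 384): [1], (512, 256): [3]}
```

Keeping the original indices makes it possible to restore the input order after each group has been processed.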

ViTModel

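class transformers.ViTModel

< source >

( config: ViTConfig add_pooling_layer: bool = True use_mask_token: bool = False )

Parameters

config (ViTConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
add_pooling_layer (bool, optional, defaults to True): Whether to add a pooling layer.
use_mask_token (bool, optional, defaults to False): Whether to use a mask token for masked image modeling.

The bare ViT Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.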

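forward

< source >

( pixel_values: typing.Optional[torch.Tensor] = None bool_masked_pos: typing.Optional[torch.BoolTensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None interpolate_pos_encoding: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size), optional): The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.
bool_masked_pos (torch.BoolTensor of shape (batch_size, num_patches), optional): Boolean masked positions. Indicates which patches are masked (1) and which are not (0).
head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional): Whether to return the attention tensors of all attention layers. See attentions under the returned tensors for more detail.
output_hidden_states (bool, optional): Whether to return the hidden states of all layers. See hidden_states under the returned tensors for more detail.
interpolate_pos_encoding (bool, optional): Whether to interpolate the pre-trained position encodings.
return_dict (bool, optional): Whether to return a ModelOutput instead of a plain tuple.

transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViTConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)): Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. For example, for the BERT family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The ViTModel forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example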
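
The example for this method did not survive extraction. As a stand-in, here is a small sketch that runs a forward pass through a tiny, randomly initialized ViTModel so that no checkpoint has to be downloaded. The configuration values and the random input are illustrative assumptions; a real use case would load pretrained weights with from_pretrained() and preprocess a real image.

```python
import torch
from transformers import ViTConfig, ViTModel

# Tiny configuration with random weights (illustrative values only).
config = ViTConfig(
    image_size=32,
    patch_size=8,
    num_channels=3,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = ViTModel(config)
model.eval()

# Random tensor standing in for a preprocessed image batch.
pixel_values = torch.randn(1, 3, 32, 32)
with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# (32 // 8) ** 2 = 16 patches plus one [CLS] token gives 17 positions.
print(list(outputs.last_hidden_state.shape))  # [1, 17, 32]
print(list(outputs.pooler_output.shape))      # [1, 32]
```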

ViTForMaskedImageModeling

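class transformers.ViTForMaskedImageModeling

< source >

( config: ViTConfig )

Parameters

config (ViTConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

ViT Model with a decoder on top for masked image modeling, as proposed in SimMIM.

Note that a script to pre-train this model on custom data is provided in the examples directory.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.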

forward

< source >

( pixel_values: typing.Optional[torch.Tensor] = None bool_masked_pos: typing.Optional[torch.BoolTensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None interpolate_pos_encoding: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.MaskedImageModelingOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size), optional): The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.
bool_masked_pos (torch.BoolTensor of shape (batch_size, num_patches)): Boolean masked positions. Indicates which patches are masked (1) and which are not (0).
head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional): Whether to return the attention tensors of all attention layers. See attentions under the returned tensors for more detail.
output_hidden_states (bool, optional): Whether to return the hidden states of all layers. See hidden_states under the returned tensors for more detail.
interpolate_pos_encoding (bool, optional): Whether to interpolate the pre-trained position encodings.
return_dict (bool, optional): Whether to return a ModelOutput instead of a plain tuple.

transformers.modeling_outputs.MaskedImageModelingOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.MaskedImageModelingOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViTConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when bool_masked_pos is provided): Reconstruction loss.
reconstruction (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Reconstructed / completed images.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The ViTForMaskedImageModeling forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example

>>> from transformers import AutoImageProcessor, ViTForMaskedImageModeling
>>> import torch
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = ViTForMaskedImageModeling.from_pretrained("google/vit-base-patch16-224-in21k")

>>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
>>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
>>> # create random boolean mask of shape (batch_size, num_patches)
>>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

>>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
>>> loss, reconstructed_pixel_values = outputs.loss, outputs.reconstruction
>>> list(reconstructed_pixel_values.shape)
[1, 3, 224, 224]

ViTForImageClassification

class transformers.ViTForImageClassification

< source >

( config: ViTConfig )

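Parameters

config (ViTConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

ViT Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token), e.g. for ImageNet.

Note that it is possible to fine-tune ViT on images with a higher resolution than the ones it was trained on, by setting interpolate_pos_encoding to True in the forward pass of the model. This interpolates the pre-trained position embeddings to the higher resolution.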
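
To give an intuition for what interpolating position embeddings means, the sketch below resizes a 14x14 grid of values (one per patch position at 224x224 with 16x16 patches) to the 24x24 grid needed at an assumed 384x384 resolution. It is a simplified illustration: it uses bilinear interpolation with aligned corners on a single scalar per position, whereas the library operates on full embedding vectors and may use a different interpolation mode. The helper `resize_grid_bilinear` is hypothetical.

```python
def resize_grid_bilinear(grid, new_size):
    """Bilinearly resize a square 2D grid (list of rows) to new_size x new_size.

    Corner values are preserved (aligned corners); both sizes must be >= 2.
    """
    old_size = len(grid)
    scale = (old_size - 1) / (new_size - 1)
    resized = []
    for i in range(new_size):
        y = i * scale
        y0 = min(int(y), old_size - 2)  # upper neighbour row
        wy = y - y0                     # vertical blend weight
        row = []
        for j in range(new_size):
            x = j * scale
            x0 = min(int(x), old_size - 2)  # left neighbour column
            wx = x - x0                     # horizontal blend weight
            top = grid[y0][x0] * (1 - wx) + grid[y0][x0 + 1] * wx
            bottom = grid[y0 + 1][x0] * (1 - wx) + grid[y0 + 1][x0 + 1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        resized.append(row)
    return resized


old_grid_size = 224 // 16  # 14 x 14 patch positions at the original resolution
new_grid_size = 384 // 16  # 24 x 24 patch positions at the assumed higher resolution
# One scalar per position stands in for one dimension of the position embedding.
position_grid = [[float(r + c) for c in range(old_grid_size)] for r in range(old_grid_size)]
resized = resize_grid_bilinear(position_grid, new_grid_size)
print(len(resized), len(resized[0]))  # 24 24
```

The number of patch positions grows from 196 to 576, while the interpolated values still vary smoothly between the same corner values.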

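This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.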

forward

< source >

( pixel_values: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None interpolate_pos_encoding: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size), optional): The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.
head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
labels (torch.LongTensor of shape (batch_size,), optional): Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (mean-square loss). If config.num_labels > 1, a classification loss is computed (cross-entropy).
output_attentions (bool, optional): Whether to return the attention tensors of all attention layers. See attentions under the returned tensors for more detail.
output_hidden_states (bool, optional): Whether to return the hidden states of all layers. See hidden_states under the returned tensors for more detail.
interpolate_pos_encoding (bool, optional): Whether to interpolate the pre-trained position encodings.
return_dict (bool, optional): Whether to return a ModelOutput instead of a plain tuple.
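
The rule for labels described above (mean-square loss when there is a single label, cross-entropy otherwise) can be sketched in pure Python as follows. The function `classification_or_regression_loss` is a hypothetical illustration of that rule, not the library's implementation.

```python
import math


def classification_or_regression_loss(logits, labels, num_labels):
    """Mean loss over a batch: MSE if num_labels == 1, cross-entropy otherwise."""
    losses = []
    for row, label in zip(logits, labels):
        if num_labels == 1:
            # Regression: squared error between the single output and the target.
            losses.append((row[0] - label) ** 2)
        else:
            # Classification: negative log-probability of the correct class,
            # computed with a numerically stable log-sum-exp.
            peak = max(row)
            log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
            losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)


ce = classification_or_regression_loss([[0.0, 0.0]], [0], num_labels=2)
mse = classification_or_regression_loss([[2.0]], [1.0], num_labels=1)
print(ce)   # ln(2), about 0.693
print(mse)  # 1.0
```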

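transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.ImageClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViTConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)): Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The ViTForImageClassification forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example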

>>> from transformers import AutoImageProcessor, ViTForImageClassification
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
...

TFViTModel

class transformers.TFViTModel

< source >

( config: ViTConfig *inputs add_pooling_layer = True **kwargs )

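Parameters

config (ViTConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare ViT Model transformer outputting raw hidden-states without any specific head on top.

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

having all inputs as keyword arguments (like PyTorch models), or
having all inputs as a list, tuple or dict in the first positional argument.

The second format is supported because Keras methods prefer it when passing inputs to models and layers. Because of this support, things should "just work" when using methods like model.fit(): simply pass your inputs and labels in any format that model.fit() supports. If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities for gathering all the input tensors in the first positional argument:

a single tensor with pixel_values only and nothing else: model(pixel_values)
a list of varying length with one or several input tensors, in the order given in the docstring: model([pixel_values, attention_mask]) or model([pixel_values, attention_mask, token_type_ids])
a dictionary with one or several input tensors associated with the input names given in the docstring: model({"pixel_values": pixel_values, "token_type_ids": token_type_ids})

Note that when creating models and layers with subclassing, you do not need to worry about any of this, as you can pass inputs just as you would to any other Python function.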

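call

< source >

( pixel_values: TFModelInputType | None = None head_mask: np.ndarray | tf.Tensor | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None interpolate_pos_encoding: Optional[bool] = None return_dict: Optional[bool] = None training: bool = False ) → transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or tuple(tf.Tensor)

Parameters

pixel_values (np.ndarray, tf.Tensor, list[tf.Tensor], dict[str, tf.Tensor] or dict[str, np.ndarray], and each example must have the shape (batch_size, num_channels, height, width)): Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.
head_mask (np.ndarray or tf.Tensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional): Whether to return the attention tensors of all attention layers. See attentions under the returned tensors for more detail. This argument can be used only in eager mode; in graph mode the value from the config is used instead.
output_hidden_states (bool, optional): Whether to return the hidden states of all layers. See hidden_states under the returned tensors for more detail. This argument can be used only in eager mode; in graph mode the value from the config is used instead.
interpolate_pos_encoding (bool, optional): Whether to interpolate the pre-trained position encodings.
return_dict (bool, optional): Whether to return a ModelOutput instead of a plain tuple. This argument can be used in eager mode; in graph mode the value is always set to True.
training (bool, optional, defaults to False): Whether to use the model in training mode (some modules, such as dropout modules, behave differently between training and evaluation).

transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or tuple(tf.Tensor)

A transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViTConfig) and inputs.

last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.
pooler_output (tf.Tensor of shape (batch_size, hidden_size)): Last layer hidden-state of the first token of the sequence (classification token), further processed by a linear layer and a Tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

This output is usually not a good summary of the semantic content of the input. It is often better to average or pool the sequence of hidden-states over the whole input sequence.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.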
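
The advice to average the hidden states rather than rely on pooler_output can be illustrated with a minimal mean-pooling sketch. It works on plain nested lists for one example of shape (sequence_length, hidden_size); the helper `mean_pool` and the toy values are assumptions made for illustration.

```python
def mean_pool(hidden_states):
    """Average a (sequence_length, hidden_size) nested list over the sequence axis."""
    sequence_length = len(hidden_states)
    hidden_size = len(hidden_states[0])
    return [
        sum(token[dim] for token in hidden_states) / sequence_length
        for dim in range(hidden_size)
    ]


# Toy sequence of 3 tokens with hidden_size 2.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(hidden_states)
print(pooled)  # [3.0, 4.0]
```

With framework tensors the same reduction is a mean over the sequence axis of last_hidden_state.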

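The TFViTModel forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example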

>>> from transformers import AutoImageProcessor, TFViTModel
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = TFViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

>>> inputs = image_processor(image, return_tensors="tf")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 197, 768]

TFViTForImageClassification

class transformers.TFViTForImageClassification

< source >

( config: ViTConfig *inputs **kwargs )

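Parameters

config (ViTConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

ViT Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token), e.g. for ImageNet.

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

having all inputs as keyword arguments (like PyTorch models), or
having all inputs as a list, tuple or dict in the first positional argument.

With the second format, there are three possibilities for gathering all the input tensors in the first positional argument:

a single tensor with pixel_values only and nothing else: model(pixel_values)
a list of varying length with one or several input tensors, in the order given in the docstring: model([pixel_values, attention_mask]) or model([pixel_values, attention_mask, token_type_ids])
a dictionary with one or several input tensors associated with the input names given in the docstring: model({"pixel_values": pixel_values, "token_type_ids": token_type_ids})

Note that when creating models and layers with subclassing, you do not need to worry about any of this, as you can pass inputs just as you would to any other Python function.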

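call

< source >

( pixel_values: TFModelInputType | None = None head_mask: np.ndarray | tf.Tensor | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None interpolate_pos_encoding: Optional[bool] = None return_dict: Optional[bool] = None labels: np.ndarray | tf.Tensor | None = None training: Optional[bool] = False ) → transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)

Parameters

pixel_values (np.ndarray, tf.Tensor, list[tf.Tensor], dict[str, tf.Tensor] or dict[str, np.ndarray], and each example must have the shape (batch_size, num_channels, height, width)): Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.
head_mask (np.ndarray or tf.Tensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional): Whether to return the attention tensors of all attention layers. See attentions under the returned tensors for more detail. This argument can be used only in eager mode; in graph mode the value from the config is used instead.
output_hidden_states (bool, optional): Whether to return the hidden states of all layers. See hidden_states under the returned tensors for more detail. This argument can be used only in eager mode; in graph mode the value from the config is used instead.
interpolate_pos_encoding (bool, optional): Whether to interpolate the pre-trained position encodings.
return_dict (bool, optional): Whether to return a ModelOutput instead of a plain tuple. This argument can be used in eager mode; in graph mode the value is always set to True.
training (bool, optional, defaults to False): Whether to use the model in training mode (some modules, such as dropout modules, behave differently between training and evaluation).
labels (tf.Tensor or np.ndarray of shape (batch_size,), optional): Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (mean-square loss). If config.num_labels > 1, a classification loss is computed (cross-entropy).

transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)

A transformers.modeling_tf_outputs.TFSequenceClassifierOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViTConfig) and inputs.

loss (tf.Tensor of shape (batch_size, ), optional, returned when labels is provided): Classification (or regression if config.num_labels==1) loss.
logits (tf.Tensor of shape (batch_size, config.num_labels)): Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of tf.Tensor (one for the output of the embeddings plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The TFViTForImageClassification forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example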

>>> from transformers import AutoImageProcessor, TFViTForImageClassification
>>> import tensorflow as tf
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = TFViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(image, return_tensors="tf")
>>> logits = model(**inputs).logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = int(tf.math.argmax(logits, axis=-1))
>>> print(model.config.id2label[predicted_label])
Egyptian cat

FlaxViTModel

class transformers.FlaxViTModel

< source >

( config: ViTConfig input_shape = None seed: int = 0 dtype: dtype = <class 'jax.numpy.float32'> _do_init: bool = True **kwargs )

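Parameters

config (ViTConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
dtype (jax.numpy.dtype, optional, defaults to jax.numpy.float32): The data type of the computation. Can be one of jax.numpy.float32, jax.numpy.float16 (on GPUs) and jax.numpy.bfloat16 (on TPUs).

This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified, all the computation will be performed with the given dtype.

Note that this only specifies the dtype of the computation and does not influence the dtype of the model parameters.

If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

The bare ViT Model transformer outputting raw hidden-states without any specific head on top.

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading, saving and converting weights from PyTorch models).

This model is also a flax.linen.Module subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as just-in-time (JIT) compilation, automatic differentiation, vectorization, and parallelization.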

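__call__

< source >

( pixel_values params: typing.Optional[dict] = None dropout_rng: jax.random.PRNGKey = None train: bool = False output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or tuple(torch.FloatTensor)

transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or tuple(torch.FloatTensor)

A transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViTConfig) and inputs.

last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.
pooler_output (jnp.ndarray of shape (batch_size, hidden_size)): Last layer hidden-state of the first token of the sequence (classification token), further processed by a linear layer and a Tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of jnp.ndarray (one for the output of the embeddings plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The FlaxViTPreTrainedModel forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example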

>>> from transformers import AutoImageProcessor, FlaxViTModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = FlaxViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

>>> inputs = image_processor(images=image, return_tensors="np")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state

FlaxViTForImageClassification

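class transformers.FlaxViTForImageClassification

< source >

( config: ViTConfig input_shape = None seed: int = 0 dtype: dtype = <class 'jax.numpy.float32'> _do_init: bool = True **kwargs )

Parameters

config (ViTConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
dtype (jax.numpy.dtype, optional, defaults to jax.numpy.float32): The data type of the computation. Can be one of jax.numpy.float32, jax.numpy.float16 (on GPUs) and jax.numpy.bfloat16 (on TPUs).

This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified, all the computation will be performed with the given dtype.

Note that this only specifies the dtype of the computation and does not influence the dtype of the model parameters.

If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

ViT Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token), e.g. for ImageNet.

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading, saving and converting weights from PyTorch models).

This model is also a flax.linen.Module subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as just-in-time (JIT) compilation, automatic differentiation, vectorization, and parallelization.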

__call__

< source >

( pixel_values params: typing.Optional[dict] = None dropout_rng: jax.random.PRNGKey = None train: bool = False output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or tuple(torch.FloatTensor)

transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViTConfig) and inputs.

logits (jnp.ndarray of shape (batch_size, config.num_labels)): Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of jnp.ndarray (one for the output of the embeddings plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The FlaxViTPreTrainedModel forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example

>>> from transformers import AutoImageProcessor, FlaxViTForImageClassification
>>> from PIL import Image
>>> import jax
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = FlaxViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(images=image, return_tensors="np")
>>> outputs = model(**inputs)
>>> logits = outputs.logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_class_idx = jax.numpy.argmax(logits, axis=-1)
>>> print("Predicted class:", model.config.id2label[predicted_class_idx.item()])

