VideoMAE

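Overview

The VideoMAE model was proposed in VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training by Zhan Tong, Yibing Song, Jue Wang, Limin Wang. VideoMAE extends masked autoencoders (MAE) to video and, according to the paper, achieves state-of-the-art performance on several video classification benchmarks.

The abstract from the paper is the following: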

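Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking and reconstruction. These simple designs turn out to be effective for overcoming information leakage caused by the temporal correlation during video reconstruction. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. This is partially ascribed to the challenging task of video reconstruction to enforce high-level structure learning. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue in SSVP. Notably, our VideoMAE with the vanilla ViT backbone can achieve 83.9% on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51 without using any extra data.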

VideoMAE pre-training. Taken from the original paper.

This model was contributed by nielsr. The original code can be found here.
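The tube masking mentioned in the abstract can be illustrated with a small standard-library sketch. It assumes that a single spatial mask is sampled and then shared by every temporal slice, and that tokens are ordered slice by slice; the function name `tube_mask` and its arguments are hypothetical and are not part of the Transformers API.

```python
import random


def tube_mask(num_temporal_slices, patches_per_slice, mask_ratio, seed=None):
    """Return a flat list of booleans (True = masked) of length
    num_temporal_slices * patches_per_slice, reusing one spatial mask
    for every temporal slice."""
    rng = random.Random(seed)
    num_masked = int(mask_ratio * patches_per_slice)
    masked_positions = set(rng.sample(range(patches_per_slice), num_masked))
    # One spatial mask shared by all temporal slices forms "tubes" through time.
    spatial_mask = [pos in masked_positions for pos in range(patches_per_slice)]
    return spatial_mask * num_temporal_slices


# 16 frames with tubelet size 2 give 8 temporal slices;
# 224 / 16 = 14 patches per side gives 196 patches per slice.
mask = tube_mask(num_temporal_slices=8, patches_per_slice=196, mask_ratio=0.9, seed=0)
print(len(mask), sum(mask))
```

Because the same spatial positions are hidden in every slice, a masked patch cannot be trivially copied from a neighbouring frame, which is the information leakage the abstract refers to.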

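Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA to be used.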

import torch
from transformers import VideoMAEForVideoClassification

# torch must be imported because torch.float16 is referenced below
model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics", attn_implementation="sdpa", torch_dtype=torch.float16)
...

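For the best speedups, we recommend loading the model in half-precision (e.g. torch.float16 or torch.bfloat16).

On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with float32 and the MCG-NJU/videomae-base-finetuned-kinetics model, we saw the following speedups during inference.

Batch size	Average inference time (ms), eager mode	Average inference time (ms), sdpa mode	Speedup, eager / sdpa (x)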
1	37	10	3.7
2	24	18	1.33
4	43	32	1.34
8	84	60	1.4
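As a quick sanity check, the speedup column can be recomputed from the two timing columns. The sketch below only uses the numbers in the table; it shows that the reported factor is the eager time divided by the SDPA time.

```python
# (batch size, eager ms, sdpa ms), copied from the table above
rows = [(1, 37, 10), (2, 24, 18), (4, 43, 32), (8, 84, 60)]

# Speedup = eager time / sdpa time, rounded to two decimals as in the table
speedups = {batch: round(eager_ms / sdpa_ms, 2) for batch, eager_ms, sdpa_ms in rows}
for batch, speedup in speedups.items():
    print(f"batch size {batch}: {speedup}x")
```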

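Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with VideoMAE. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

Video classification

A notebook that shows how to fine-tune a VideoMAE model on a custom dataset.
Video classification task guide
A 🤗 Space showing how to perform inference with a video classification model.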

VideoMAEConfig

class transformers.VideoMAEConfig

( image_size = 224 patch_size = 16 num_channels = 3 num_frames = 16 tubelet_size = 2 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.0 attention_probs_dropout_prob = 0.0 initializer_range = 0.02 layer_norm_eps = 1e-12 qkv_bias = True use_mean_pooling = True decoder_num_attention_heads = 6 decoder_hidden_size = 384 decoder_num_hidden_layers = 4 decoder_intermediate_size = 1536 norm_pix_loss = True **kwargs )

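Parameters

image_size (int, optional, defaults to 224) — The size (resolution) of each image.
patch_size (int, optional, defaults to 16) — The size (resolution) of each patch.
num_channels (int, optional, defaults to 3) — The number of input channels.
num_frames (int, optional, defaults to 16) — The number of frames in each video.
tubelet_size (int, optional, defaults to 2) — The temporal size of each tubelet, i.e. the number of consecutive frames grouped into one tubelet.
hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072) — Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers.
qkv_bias (bool, optional, defaults to True) — Whether to add a bias to the queries, keys and values.
use_mean_pooling (bool, optional, defaults to True) — Whether to mean pool the final hidden states instead of using the final hidden state of the [CLS] token.
decoder_num_attention_heads (int, optional, defaults to 6) — Number of attention heads for each attention layer in the decoder.
decoder_hidden_size (int, optional, defaults to 384) — Dimensionality of the decoder.
decoder_num_hidden_layers (int, optional, defaults to 4) — Number of hidden layers in the decoder.
decoder_intermediate_size (int, optional, defaults to 1536) — Dimensionality of the "intermediate" (i.e., feed-forward) layer in the decoder.
norm_pix_loss (bool, optional, defaults to True) — Whether to normalize the target patch pixels.

This is the configuration class to store the configuration of a VideoMAEModel. It is used to instantiate a VideoMAE model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the VideoMAE MCG-NJU/videomae-base architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.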

Example

>>> from transformers import VideoMAEConfig, VideoMAEModel

>>> # Initializing a VideoMAE videomae-base style configuration
>>> configuration = VideoMAEConfig()

>>> # Randomly initializing a model from the configuration
>>> model = VideoMAEModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
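
To make the default values more concrete, the following sketch derives a few quantities from them with plain arithmetic. It assumes the usual multi-head attention convention that the hidden size is split evenly across heads; the dictionary `defaults` and the derived variable names are illustrative only and are not attributes of VideoMAEConfig.

```python
# Default values copied from the VideoMAEConfig signature above.
defaults = {
    "patch_size": 16,
    "num_channels": 3,
    "tubelet_size": 2,
    "hidden_size": 768,
    "num_attention_heads": 12,
    "decoder_hidden_size": 384,
    "decoder_num_attention_heads": 6,
}

# Assumption: each attention head receives an equal slice of the hidden size.
encoder_head_dim = defaults["hidden_size"] // defaults["num_attention_heads"]
decoder_head_dim = defaults["decoder_hidden_size"] // defaults["decoder_num_attention_heads"]

# Raw pixel values covered by one tubelet: frames x height x width x channels.
values_per_tubelet = (
    defaults["tubelet_size"] * defaults["patch_size"] ** 2 * defaults["num_channels"]
)
print(encoder_head_dim, decoder_head_dim, values_per_tubelet)
```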

VideoMAEFeatureExtractor

class transformers.VideoMAEFeatureExtractor

( *args **kwargs )

__call__

( images **kwargs )

Preprocess an image or a batch of images.

VideoMAEImageProcessor

class transformers.VideoMAEImageProcessor

( do_resize: bool = True size: typing.Dict[str, int] = None resample: Resampling = <Resampling.BILINEAR: 2> do_center_crop: bool = True crop_size: typing.Dict[str, int] = None do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None **kwargs )

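Parameters

do_resize (bool, optional, defaults to True) — Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by the do_resize parameter in the preprocess method.
size (Dict[str, int], optional, defaults to {"shortest_edge": 224}) — Size of the output image after resizing. The shortest edge of the image will be resized to size["shortest_edge"] while maintaining the aspect ratio of the original image. Can be overridden by size in the preprocess method.
resample (PILImageResampling, optional, defaults to Resampling.BILINEAR) — Resampling filter to use if resizing the image. Can be overridden by the resample parameter in the preprocess method.
do_center_crop (bool, optional, defaults to True) — Whether to center crop the image to the specified crop_size. Can be overridden by the do_center_crop parameter in the preprocess method.
crop_size (Dict[str, int], optional, defaults to {"height": 224, "width": 224}) — Size of the image after applying the center crop. Can be overridden by the crop_size parameter in the preprocess method.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255) — Defines the scale factor to use if rescaling the image. Can be overridden by the rescale_factor parameter in the preprocess method.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or List[float], optional, defaults to IMAGENET_STANDARD_MEAN) — The mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or List[float], optional, defaults to IMAGENET_STANDARD_STD) — The standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.

Constructs a VideoMAE image processor.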
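
The default pipeline described by these parameters (resize the shortest edge to 224, center crop to 224 x 224, rescale by 1/255, then normalize) can be followed with plain arithmetic. The sketch below is not the library's implementation: the helper names are hypothetical, the rounding of the resized long edge is an assumption, and it assumes IMAGENET_STANDARD_MEAN and IMAGENET_STANDARD_STD are both 0.5 per channel.

```python
def shortest_edge_size(height, width, shortest_edge=224):
    """Scale (height, width) so the shorter side equals shortest_edge."""
    short, long = min(height, width), max(height, width)
    new_long = int(shortest_edge * long / short)  # assumed truncation, not rounding
    return (shortest_edge, new_long) if height <= width else (new_long, shortest_edge)


def center_crop_box(height, width, crop_height=224, crop_width=224):
    """Return (top, left, bottom, right) of a centered crop."""
    top = (height - crop_height) // 2
    left = (width - crop_width) // 2
    return top, left, top + crop_height, left + crop_width


def rescale_and_normalize(pixel, rescale_factor=1 / 255, mean=0.5, std=0.5):
    """Map a raw 0-255 value to the normalized range."""
    return (pixel * rescale_factor - mean) / std


resized = shortest_edge_size(480, 640)  # a 480 x 640 (height x width) frame
box = center_crop_box(*resized)
print(resized, box, rescale_and_normalize(0), rescale_and_normalize(255))
```

Under these assumptions, raw pixel values in [0, 255] end up roughly in [-1, 1] after normalization.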

preprocess

( videos: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_resize: bool = None size: typing.Dict[str, int] = None resample: Resampling = None do_center_crop: bool = None crop_size: typing.Dict[str, int] = None do_rescale: bool = None rescale_factor: float = None do_normalize: bool = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

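Parameters

videos (ImageInput) — Video frames to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in frames with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (Dict[str, int], optional, defaults to self.size) — Size of the image after applying resize.
resample (PILImageResampling, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_center_crop (bool, optional, defaults to self.do_center_crop) — Whether to center crop the image.
crop_size (Dict[str, int], optional, defaults to self.crop_size) — Size of the image after applying the center crop.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image values to the range [0, 1].
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or List[float], optional, defaults to self.image_mean) — Image mean.
image_std (float or List[float], optional, defaults to self.image_std) — Image standard deviation.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the inferred channel dimension format of the input image.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess an image or a batch of images.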
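
The difference between the channels_first and channels_last layouts, and what inferring the layout from the input could look like, can be sketched on shapes alone. This is a simplified heuristic written for illustration; the function names are hypothetical and the library's own inference logic may differ.

```python
def infer_channel_format(shape, num_channels=(1, 3)):
    """Guess "channels_first" or "channels_last" for a 3-D image shape."""
    if len(shape) != 3:
        raise ValueError(f"expected a 3-D shape, got {shape}")
    if shape[0] in num_channels:
        return "channels_first"
    if shape[-1] in num_channels:
        return "channels_last"
    raise ValueError(f"cannot infer channel dimension from {shape}")


def to_channels_first(shape):
    """Return the shape an image would have after conversion to channels_first."""
    if infer_channel_format(shape) == "channels_first":
        return tuple(shape)
    height, width, channels = shape
    return (channels, height, width)


print(infer_channel_format((360, 640, 3)), to_channels_first((360, 640, 3)))
```

A heuristic like this is ambiguous for very small images (for example a 3 x 3 x 3 array), which is one reason to pass input_data_format explicitly when the layout is known.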

VideoMAEModel

class transformers.VideoMAEModel

( config )

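Parameters

config (VideoMAEConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare VideoMAE Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.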

forward

( pixel_values: FloatTensor bool_masked_pos: typing.Optional[torch.BoolTensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)

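Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_frames, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See VideoMAEImageProcessor.__call__() for details.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
bool_masked_pos (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Boolean masked positions. Indicates which patches are masked (1) and which aren't (0). Each video in the batch must have the same number of masked patches. If None, then all patches are considered. Sequence length is (num_frames // tubelet_size) * (image_size // patch_size) ** 2.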
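
The sequence length formula given for bool_masked_pos can be evaluated directly. The helper below is a hypothetical convenience function, not part of the library; its default arguments are the VideoMAEConfig defaults listed earlier on this page.

```python
def videomae_sequence_length(num_frames=16, tubelet_size=2, image_size=224, patch_size=16):
    """Evaluate (num_frames // tubelet_size) * (image_size // patch_size) ** 2."""
    temporal_slices = num_frames // tubelet_size
    patches_per_slice = (image_size // patch_size) ** 2
    return temporal_slices * patches_per_slice


sequence_length = videomae_sequence_length()
print(sequence_length)
```

With the defaults this gives 8 * 196 = 1568, which matches the second dimension of last_hidden_state in the VideoMAEModel example on this page.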

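Returns

transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VideoMAEConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The VideoMAEModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example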

>>> import av
>>> import numpy as np

>>> from transformers import AutoImageProcessor, VideoMAEModel
>>> from huggingface_hub import hf_hub_download

>>> np.random.seed(0)


>>> def read_video_pyav(container, indices):
...     '''
...     Decode the video with PyAV decoder.
...     Args:
...         container (`av.container.input.InputContainer`): PyAV container.
...         indices (`List[int]`): List of frame indices to decode.
...     Returns:
...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
...     '''
...     frames = []
...     container.seek(0)
...     start_index = indices[0]
...     end_index = indices[-1]
...     for i, frame in enumerate(container.decode(video=0)):
...         if i > end_index:
...             break
...         if i >= start_index and i in indices:
...             frames.append(frame)
...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])


>>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
...     '''
...     Sample a given number of frame indices from the video.
...     Args:
...         clip_len (`int`): Total number of frames to sample.
...         frame_sample_rate (`int`): Sample every n-th frame.
...         seg_len (`int`): Maximum allowed index of sample's last frame.
...     Returns:
...         indices (`List[int]`): List of sampled frame indices
...     '''
...     converted_len = int(clip_len * frame_sample_rate)
...     end_idx = np.random.randint(converted_len, seg_len)
...     start_idx = end_idx - converted_len
...     indices = np.linspace(start_idx, end_idx, num=clip_len)
...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
...     return indices


>>> # video clip consists of 300 frames (10 seconds at 30 FPS)
>>> file_path = hf_hub_download(
...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
... )
>>> container = av.open(file_path)

>>> # sample 16 frames
>>> indices = sample_frame_indices(clip_len=16, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
>>> video = read_video_pyav(container, indices)

>>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
>>> model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

>>> # prepare video for the model
>>> inputs = image_processor(list(video), return_tensors="pt")

>>> # forward pass
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 1568, 768]

VideoMAEForPreTraining

VideoMAEForPreTraining includes the decoder on top for self-supervised pre-training.

class transformers.VideoMAEForPreTraining

( config )

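Parameters

config (VideoMAEConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The VideoMAE Model transformer with the decoder on top for self-supervised pre-training. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.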

forward

( pixel_values: FloatTensor bool_masked_pos: BoolTensor head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.models.videomae.modeling_videomae.VideoMAEForPreTrainingOutput or tuple(torch.FloatTensor)

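Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_frames, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See VideoMAEImageProcessor.__call__() for details.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
bool_masked_pos (torch.BoolTensor of shape (batch_size, sequence_length)) — Boolean masked positions. Indicates which patches are masked (1) and which aren't (0). Each video in the batch must have the same number of masked patches. Sequence length is (num_frames // tubelet_size) * (image_size // patch_size) ** 2.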
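
The requirement that every video in a batch has the same number of masked patches can be checked before calling the model. The sketch below works on nested Python lists of booleans; `visible_token_count` is a hypothetical helper. The motivation given in the comment, that visible tokens need to fit into one rectangular batch, is an assumption about the reason for the constraint rather than something stated on this page.

```python
def visible_token_count(bool_masked_pos):
    """Return the number of visible (unmasked) tokens shared by every video.

    Raises ValueError when the videos disagree, since (by assumption) the
    visible tokens of a batch must form one rectangular
    (batch_size, num_visible, hidden_size) block.
    """
    counts = {sum(1 for masked in row if not masked) for row in bool_masked_pos}
    if len(counts) != 1:
        raise ValueError(f"videos have different numbers of visible patches: {sorted(counts)}")
    return counts.pop()


batch_ok = [[True, True, False, False], [False, True, True, False]]
batch_bad = [[True, True, False, False], [True, False, False, False]]
print(visible_token_count(batch_ok))
```

Note that a purely random mask drawn independently per video generally does not satisfy this constraint once the batch size is larger than 1.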

Returns

transformers.models.videomae.modeling_videomae.VideoMAEForPreTrainingOutput or tuple(torch.FloatTensor)

A transformers.models.videomae.modeling_videomae.VideoMAEForPreTrainingOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VideoMAEConfig) and inputs.

loss (torch.FloatTensor of shape (1,)) — Pixel reconstruction loss.
logits (torch.FloatTensor of shape (batch_size, patch_size ** 2 * num_channels)) — Pixel reconstruction logits.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The VideoMAEForPreTraining forward method overrides the __call__ special method.

Example

>>> from transformers import AutoImageProcessor, VideoMAEForPreTraining
>>> import numpy as np
>>> import torch

>>> num_frames = 16
>>> video = list(np.random.randint(0, 256, (num_frames, 3, 224, 224)))

>>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
>>> model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")

>>> pixel_values = image_processor(video, return_tensors="pt").pixel_values

>>> num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
>>> seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame
>>> bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()

>>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
>>> loss = outputs.loss
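
The loss in this example is a pixel reconstruction loss, and norm_pix_loss controls whether the target patch pixels are normalized first. The sketch below shows one common formulation from the masked autoencoder family: each target patch is standardized to zero mean and unit variance, and the mean squared error is taken over the masked patches. The epsilon value, the use of population variance, and the per-patch statistics are assumptions; the library's exact computation may differ.

```python
from statistics import fmean, pvariance


def normalize_patch(pixels, eps=1e-6):
    """Shift and scale one patch's pixel values to zero mean and unit variance."""
    mean = fmean(pixels)
    std = (pvariance(pixels, mu=mean) + eps) ** 0.5
    return [(p - mean) / std for p in pixels]


def masked_mse(predictions, targets):
    """Mean squared error averaged over all values of all masked patches."""
    errors = [(p - t) ** 2 for pred, tgt in zip(predictions, targets) for p, t in zip(pred, tgt)]
    return fmean(errors)


# Two tiny hypothetical "patches" of four pixel values each.
target_patches = [normalize_patch([0.0, 0.5, 1.0, 0.5]), normalize_patch([0.2, 0.2, 0.8, 0.8])]
perfect = masked_mse(target_patches, target_patches)
zeros = masked_mse([[0.0] * 4, [0.0] * 4], target_patches)
print(perfect, round(zeros, 3))
```

With normalized targets, a decoder that always predicts zero scores a loss close to 1, so values well below 1 indicate that local structure is actually being reconstructed.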

VideoMAEForVideoClassification

class transformers.VideoMAEForVideoClassification

( config )

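Parameters

config (VideoMAEConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

VideoMAE Model transformer with a video classification head on top (a linear layer on top of the average pooled hidden states of all tokens), e.g. for Kinetics-400. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.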

forward

( pixel_values: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

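Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_frames, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See VideoMAEImageProcessor.__call__() for details.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (mean squared error); if config.num_labels > 1 a classification loss is computed (cross-entropy).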
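
The rule for labels (mean squared error when num_labels == 1, cross-entropy otherwise) can be written out with the standard library. This is a minimal sketch of the two loss formulas on plain Python lists, not the model's own code; `classification_or_regression_loss` is a hypothetical name.

```python
import math


def classification_or_regression_loss(logits, labels, num_labels):
    """Mean squared error when num_labels == 1, mean cross-entropy otherwise."""
    if num_labels == 1:
        return sum((row[0] - y) ** 2 for row, y in zip(logits, labels)) / len(labels)
    total = 0.0
    for row, y in zip(logits, labels):
        # log-sum-exp with the maximum subtracted for numerical stability
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[y]
    return total / len(labels)


regression_loss = classification_or_regression_loss([[1.5]], [1.0], num_labels=1)
classification_loss = classification_or_regression_loss([[2.0, 0.0]], [0], num_labels=2)
print(regression_loss, round(classification_loss, 4))
```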

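Returns

transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.ImageClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VideoMAEConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The VideoMAEForVideoClassification forward method overrides the __call__ special method.

Example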

>>> import av
>>> import torch
>>> import numpy as np

>>> from transformers import AutoImageProcessor, VideoMAEForVideoClassification
>>> from huggingface_hub import hf_hub_download

>>> np.random.seed(0)


>>> def read_video_pyav(container, indices):
...     '''
...     Decode the video with PyAV decoder.
...     Args:
...         container (`av.container.input.InputContainer`): PyAV container.
...         indices (`List[int]`): List of frame indices to decode.
...     Returns:
...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
...     '''
...     frames = []
...     container.seek(0)
...     start_index = indices[0]
...     end_index = indices[-1]
...     for i, frame in enumerate(container.decode(video=0)):
...         if i > end_index:
...             break
...         if i >= start_index and i in indices:
...             frames.append(frame)
...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])


>>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
...     '''
...     Sample a given number of frame indices from the video.
...     Args:
...         clip_len (`int`): Total number of frames to sample.
...         frame_sample_rate (`int`): Sample every n-th frame.
...         seg_len (`int`): Maximum allowed index of sample's last frame.
...     Returns:
...         indices (`List[int]`): List of sampled frame indices
...     '''
...     converted_len = int(clip_len * frame_sample_rate)
...     end_idx = np.random.randint(converted_len, seg_len)
...     start_idx = end_idx - converted_len
...     indices = np.linspace(start_idx, end_idx, num=clip_len)
...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
...     return indices


>>> # video clip consists of 300 frames (10 seconds at 30 FPS)
>>> file_path = hf_hub_download(
...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
... )
>>> container = av.open(file_path)

>>> # sample 16 frames
>>> indices = sample_frame_indices(clip_len=16, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
>>> video = read_video_pyav(container, indices)

>>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
>>> model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")

>>> inputs = image_processor(list(video), return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)
...     logits = outputs.logits

>>> # model predicts one of the 400 Kinetics-400 classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
eating spaghetti
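
The example above reports only the arg-max class. When a ranked list with probabilities is more useful, the logits can be passed through a softmax and sorted. The sketch below does this on plain Python lists; the logits and the label map are hypothetical values chosen for illustration, not model output.

```python
import math


def top_k_predictions(logits, id2label, k=3):
    """Return the k most likely (label, probability) pairs for one video."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in ranked]


# Hypothetical logits and label map, for illustration only.
toy_id2label = {0: "eating spaghetti", 1: "eating pizza", 2: "cooking egg", 3: "juggling balls"}
toy_logits = [4.0, 2.0, 1.0, -1.0]
top_two = top_k_predictions(toy_logits, toy_id2label, k=2)
for label, prob in top_two:
    print(f"{label}: {prob:.3f}")
```

With a real model, `logits[0].tolist()` and `model.config.id2label` from the example above could take the place of the toy inputs.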
