MobileViT

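Overview

The MobileViT model was proposed in MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer by Sachin Mehta and Mohammad Rastegari. MobileViT introduces a new layer that replaces the local processing of convolutions with global processing using transformers.

The abstract from the paper is the following:

Light-weight convolutional neural networks (CNNs) are the de facto standard for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low-latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters.

This model was contributed by matthijs. The TensorFlow version of the model was contributed by sayakpaul. The original code and weights can be found here.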

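Usage tips

MobileViT is more like a CNN than a Transformer model. It does not work on sequence data but on batches of images. Unlike ViT, it has no embeddings. The backbone model outputs a feature map. You can follow this tutorial for a lightweight introduction.
You can use MobileViTImageProcessor to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images in BGR pixel order (not RGB).
The available image classification checkpoints are pretrained on ImageNet-1k (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
The segmentation model uses a DeepLabV3 head. The available semantic segmentation checkpoints are pretrained on PASCAL VOC.
As the name suggests, MobileViT was designed to be performant and efficient on mobile phones. The TensorFlow versions of the MobileViT models are fully compatible with TensorFlow Lite.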
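If you preprocess images yourself instead of going through MobileViTImageProcessor, the BGR requirement in the tips above is easy to overlook. The following is a minimal NumPy sketch of that step, not the library's implementation. The helper name `preprocess_rgb_image` is hypothetical, the scaling to [0, 1] and the channels-first batch layout are assumptions about what the checkpoints expect, and resizing and cropping are left out entirely.

```python
import numpy as np


def preprocess_rgb_image(image_rgb: np.ndarray) -> np.ndarray:
    """Turn an (H, W, 3) uint8 RGB image into a (1, 3, H, W) float32 BGR batch.

    Hypothetical helper for illustration only. Resizing and center cropping,
    which an image processor would normally also perform, are omitted.
    """
    if image_rgb.ndim != 3 or image_rgb.shape[-1] != 3:
        raise ValueError("expected an array of shape (height, width, 3)")

    # Assumption: pixel values are rescaled from [0, 255] to [0, 1].
    scaled = image_rgb.astype(np.float32) / 255.0

    # Reverse the last axis so the channel order goes from RGB to BGR.
    bgr = scaled[..., ::-1]

    # Assumption: the model takes channels-first input with a batch axis.
    chw = np.transpose(bgr, (2, 0, 1))
    return np.ascontiguousarray(chw[np.newaxis, ...])


# Example with a synthetic 4x6 image standing in for a real photo.
dummy_image = np.random.randint(0, 256, size=(4, 6, 3), dtype=np.uint8)
pixel_values = preprocess_rgb_image(dummy_image)
print(pixel_values.shape)  # (1, 3, 4, 6)
```

In practice, letting MobileViTImageProcessor handle this is the safer choice, since it applies the settings stored with the checkpoint.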

You can use the following code to convert a MobileViT checkpoint (be it image classification or semantic segmentation) into a TensorFlow Lite model:

from transformers import TFMobileViTForImageClassification
import tensorflow as tf

# Load a pretrained TensorFlow MobileViT checkpoint.
model_ckpt = "apple/mobilevit-xx-small"
model = TFMobileViTForImageClassification.from_pretrained(model_ckpt)

# Build a TFLite converter from the Keras model and enable the default optimizations.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Allow TFLite builtin ops, with selected TensorFlow ops as a fallback.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()

# Write the serialized model to "<checkpoint name>.tflite".
tflite_filename = model_ckpt.split("/")[-1] + ".tflite"
with open(tflite_filename, "wb") as f:
    f.write(tflite_model)

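The resulting model is only **about 1 MB**, which makes it a good fit for mobile applications where resources and network bandwidth are constrained.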
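To check the size of the converted file and try it out, something along the following lines could be used. This is a sketch under stated assumptions rather than a documented recipe: `file_size_mb` and `run_tflite_model` are hypothetical helper names, the model is assumed to have a single image input, and the first output tensor is assumed to hold the logits. Inspect `get_input_details()` on your own file for the actual shape and dtype.

```python
import os


def file_size_mb(path):
    """Return the size of a file in mebibytes (hypothetical helper)."""
    return os.path.getsize(path) / (1024 * 1024)


def run_tflite_model(path, pixel_values):
    """Run one forward pass of a .tflite file on a NumPy batch.

    Hypothetical helper. TensorFlow is imported lazily so that the size
    check above can be used without it.
    """
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path=path)
    input_detail = interpreter.get_input_details()[0]

    # The converted model may have dynamic dimensions, so pin the input
    # shape to the batch we actually pass before allocating tensors.
    interpreter.resize_tensor_input(input_detail["index"], pixel_values.shape)
    interpreter.allocate_tensors()

    # Cast to the dtype the model reports for its input.
    interpreter.set_tensor(
        input_detail["index"], pixel_values.astype(input_detail["dtype"])
    )
    interpreter.invoke()

    # Assumption: the first output tensor contains the logits.
    output_detail = interpreter.get_output_details()[0]
    return interpreter.get_tensor(output_detail["index"])


# Possible usage, assuming the conversion step has already written the file
# and `batch` is a float32 array in the layout the model expects:
#   print(file_size_mb("mobilevit-xx-small.tflite"))
#   logits = run_tflite_model("mobilevit-xx-small.tflite", batch)
```

Because the conversion enables `SELECT_TF_OPS`, the interpreter needs a runtime that includes those TensorFlow ops; the full `tensorflow` Python package is assumed here.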

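Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with MobileViT.

Image classification

MobileViTForImageClassification is supported by this example script and notebook.
See also: Image classification task guide

Semantic segmentation

Semantic segmentation task guide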
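For the segmentation checkpoints, the model output is a grid of per-class scores that still has to be turned into a label map. The image processor's `post_process_semantic_segmentation` method takes care of this; the NumPy sketch below only illustrates the core idea of taking an argmax over the class axis. The helper name `logits_to_label_map` is hypothetical, and the count of 21 classes (20 PASCAL VOC object classes plus background) is an assumption about the checkpoint.

```python
import numpy as np

# Assumption: 20 PASCAL VOC object classes plus one background class.
NUM_VOC_CLASSES = 21


def logits_to_label_map(logits: np.ndarray) -> np.ndarray:
    """Convert (num_labels, H, W) class scores into an (H, W) label map.

    Hypothetical helper. Each pixel receives the index of its highest
    scoring class. Resizing back to the input resolution is not shown.
    """
    if logits.ndim != 3:
        raise ValueError("expected an array of shape (num_labels, height, width)")
    return logits.argmax(axis=0)


# Example with random scores standing in for real model output.
fake_logits = np.random.randn(NUM_VOC_CLASSES, 32, 32).astype(np.float32)
label_map = logits_to_label_map(fake_logits)
print(label_map.shape)  # (32, 32)
```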

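If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.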
