Searching for Better (Full) ImageNet ViT Baselines

Community article, published August 26, 2024

timm 1.0.9 was just released. It includes several new ImageNet-12k and ImageNet-12k -> ImageNet-1k weights from my Searching for Better ViT Baselines series.
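
As a minimal sketch, one of these checkpoints could be loaded for inference as shown below, assuming timm 1.0.9 or newer is installed and the weights can be downloaded. `split_model_name` and `load_for_inference` are hypothetical helper names introduced for illustration; `timm.create_model`, `timm.data.resolve_model_data_config`, and `timm.data.create_transform` are existing timm calls.

```python
def split_model_name(name: str) -> tuple[str, str]:
    """Split a timm model name into (architecture, pretrained_tag)."""
    arch, _, tag = name.partition(".")
    return arch, tag


def load_for_inference(name: str):
    """Create the pretrained model plus the eval transform matching its config."""
    import timm  # imported lazily so the helper above works without timm installed

    model = timm.create_model(name, pretrained=True).eval()
    # Resolution, mean/std, crop settings, etc. come from the pretrained config.
    data_config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**data_config, is_training=False)
    return model, transform


NAME = "vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k"
arch, tag = split_model_name(NAME)
print(arch)  # architecture part of the name
print(tag)   # pretrained tag: recipe, pretraining data, fine-tune data

# Requires timm and a network connection, so it is left commented out here:
# model, transform = load_for_inference(NAME)
```

The part after the dot is the pretrained tag; here it reads as the sbb2 recipe, 200 epochs, ImageNet-12k pretraining, fine-tuned on ImageNet-1k.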

model	top1	top5	param_count	img_size
vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k	87.438	98.256	64.11	384
vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k	86.608	97.934	64.11	256
vit_betwixt_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k	86.594	98.02	60.4	384
vit_betwixt_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k	85.734	97.61	60.4	256

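I'd like to highlight these models because they sit on the Pareto front among ImageNet-12k / ImageNet-22k models. It is interesting to compare models with similar ImageNet-22k fine-tunes to see how competitive (near) vanilla ViTs are with other architectures. With optimized attention kernels enabled (the default in timm), they are well ahead of Swin and hold up well against ConvNeXt and others.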

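It is also worth pointing out that the deit3 model weights are a remarkable and underappreciated set of weights. The upper end of my sbb weights matches deit3 at equivalent compute, and the deit3 recipe is a great one too. However, one of my goals with the sbb recipes was to make fine-tuning easier. By opting for a less exotic augmentation scheme, sticking with AdamW, and sacrificing some top-1 accuracy (through higher weight decay), I feel that goal was achieved. Across several fine-tuning trials, I've found the sbb ViT weights easier to fit to other datasets, especially smaller ones (Oxford Pets, RESISC, etc.), with short runs.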

NOTE: all throughput measurements were done on an RTX 4090 with AMP and torch.compile() enabled, PyTorch 2.4, CUDA 12.4.
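
For readers who want to take a comparable measurement, here is a rough sketch of a samples-per-second timing loop. It is not the script used for the numbers in this post; `samples_per_sec` and `make_timm_step` are hypothetical names, and the batch size, iteration counts, and float16 autocast dtype are assumptions. The GPU part needs PyTorch, timm, and a CUDA device, so its imports are kept inside the function.

```python
import time


def samples_per_sec(step, batch_size, warmup=10, iters=50, sync=None):
    """Time `iters` calls of `step` after `warmup` untimed calls."""
    for _ in range(warmup):  # warmup also absorbs torch.compile compilation time
        step()
    if sync is not None:
        sync()  # wait for queued GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        step()
    if sync is not None:
        sync()  # wait for the timed GPU work to finish
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


def make_timm_step(name, batch_size, img_size):
    """Build a compiled, autocast inference step for a timm model (needs a GPU)."""
    import torch
    import timm

    model = timm.create_model(name, pretrained=False).cuda().eval()
    model = torch.compile(model)
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")

    def step():
        # float16 autocast is an assumption; the post only says "AMP".
        with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
            model(x)

    return step, torch.cuda.synchronize


# Example usage on a CUDA machine (commented out so the block runs anywhere):
# step, sync = make_timm_step("vit_betwixt_patch16_reg4_gap_256", 256, 256)
# print(samples_per_sec(step, batch_size=256, sync=sync))
```

Synchronizing before reading the clock matters because CUDA kernels launch asynchronously; without it the loop would mostly measure launch overhead.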

Bold rows: Pareto frontier models

model	img_size	samples_per_sec	top1	top5	param_count
deit3_base_patch16_224.fb_in22k_ft_in1k	224	3326.85	85.73	97.75	86.59
vit_betwixt_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k	256	3302.28	85.73	97.61	60.40
vit_base_patch16_224.augreg2_in21k_ft_in1k	224	3278.15	85.11	97.54	86.57
vit_base_patch16_224.augreg_in21k_ft_in1k	224	3274.99	84.53	97.30	86.57
vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k	256	2761.64	86.60	97.94	64.11
caformer_m36.sail_in22k_ft_in1k	224	2345.11	86.61	98.04	56.20
convformer_m36.sail_in22k_ft_in1k	224	2319.68	86.15	97.85	57.05
swin_base_patch4_window7_224.ms_in22k_ft_in1k	224	2176.48	85.27	97.57	87.77
regnety_160.sw_in12k_ft_in1k	224	2098.25	85.59	97.67	83.59
coatnet_2_rw_224.sw_in12k_ft_in1k	224	1753.63	86.58	97.90	73.87
vit_betwixt_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k	384	1467.64	86.60	98.02	60.60
convnext_large.fb_in22k_ft_in1k	224	1457.60	86.61	98.04	197.77
convnext_small.in12k_ft_in1k_384	384	1350.43	86.19	97.92	50.22
seresnextaa101d_32x8d.sw_in12k_ft_in1k_288	288	1297.79	86.54	98.09	93.59
regnety_160.sw_in12k_ft_in1k	288	1260.01	86.03	97.83	83.59
swin_large_patch4_window7_224.ms_in22k_ft_in1k	224	1243.73	86.33	97.88	196.53
vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k	384	1214.59	87.44	98.26	64.27
deit3_base_patch16_384.fb_in22k_ft_in1k	384	1098.30	86.74	98.11	86.88
deit3_large_patch16_224.fb_in22k_ft_in1k	224	1042.41	86.99	98.24	304.37
vit_large_patch16_224.augreg_in21k_ft_in1k	224	1041.47	85.85	97.83	304.33
seresnextaa101d_32x8d.sw_in12k_ft_in1k_288	320	1035.83	86.72	98.18	93.59
convnext_xlarge.fb_in22k_ft_in1k	224	921.30	86.97	98.20	350.20
convnext_large.fb_in22k_ft_in1k	288	881.61	87.01	98.21	197.77
caformer_m36.sail_in22k_ft_in1k_384	384	794.45	87.47	98.31	56.20
efficientnet_b5.sw_in12k_ft_in1k	448	729.86	85.89	97.74	30.39
convnext_xlarge.fb_in22k_ft_in1k	288	559.14	87.37	98.33	350.20
swin_base_patch4_window12_384.ms_in22k_ft_in1k	384	522.86	86.44	98.07	87.90
convnext_large.fb_in22k_ft_in1k_384	384	500.83	87.46	98.38	197.77
maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k	384	456.17	87.48	98.37	116.09
coatnet_rmlp_2_rw_384.sw_in12k_ft_in1k	384	404.42	87.40	98.31	73.88
seresnextaa201d_32x8d.sw_in12k_ft_in1k_384	384	365.65	87.31	98.33	149.39
deit3_large_patch16_384.fb_in22k_ft_in1k	384	342.41	87.73	98.51	304.76
vit_large_patch16_384.augreg_in21k_ft_in1k	384	338.21	87.09	98.31	304.72
swin_large_patch4_window12_384.ms_in22k_ft_in1k	384	315.38	87.14	98.23	196.74
swinv2_base_window12to24_192to384.ms_in22k_ft_in1k	384	297.03	87.14	98.23	87.92
swinv2_large_window12to24_192to384.ms_in22k_ft_in1k	384	186.30	87.47	98.26	196.74
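
The idea of a Pareto front over throughput and accuracy can be sketched in a few lines: a model is on the front if no other model is at least as fast and at least as accurate while being strictly better on one of the two. The rows below are a small subset copied from the table above, and `Row` and `pareto_front` are illustrative names, not part of timm. Because the table values are rounded, ties may resolve differently here than in the original bolding.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    img_size: int
    samples_per_sec: float
    top1: float


def pareto_front(rows):
    """Return rows not dominated in (samples_per_sec, top1); higher is better."""
    front = []
    best_top1 = float("-inf")
    # Walk from fastest to slowest: a row survives only if it is strictly more
    # accurate than every faster row seen so far (ties are dropped).
    for row in sorted(rows, key=lambda r: (-r.samples_per_sec, -r.top1)):
        if row.top1 > best_top1:
            front.append(row)
            best_top1 = row.top1
    return front


# Subset of the table above (throughput and top-1 only).
ROWS = [
    Row("deit3_base_patch16_224.fb_in22k_ft_in1k", 224, 3326.85, 85.73),
    Row("vit_base_patch16_224.augreg2_in21k_ft_in1k", 224, 3278.15, 85.11),
    Row("vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k", 256, 2761.64, 86.60),
    Row("caformer_m36.sail_in22k_ft_in1k", 224, 2345.11, 86.61),
    Row("swin_base_patch4_window7_224.ms_in22k_ft_in1k", 224, 2176.48, 85.27),
    Row("vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k", 384, 1214.59, 87.44),
    Row("deit3_large_patch16_224.fb_in22k_ft_in1k", 224, 1042.41, 86.99),
    Row("deit3_large_patch16_384.fb_in22k_ft_in1k", 384, 342.41, 87.73),
]

for row in pareto_front(ROWS):
    print(f"{row.samples_per_sec:8.2f}  {row.top1:.2f}  {row.model}")
```

Sorting by throughput first turns the dominance check into a single pass with a running maximum, which is enough for a two-objective front like this one.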
