Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)

Introduction

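The VGG architecture was developed in 2014 by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group at the University of Oxford, which is where the name VGG comes from. The model showed a significant improvement over the models of its time, specifically in the 2014 ImageNet challenge, also known as ILSVRC.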

VGG Network Architecture

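The input is a 224x224 RGB image.
Convolution kernels have shape (3,3), and max pooling windows have shape (2,2).
The number of channels in successive convolutional blocks is 64 -> 128 -> 256 -> 512 -> 512.
VGG16 has 16 weight layers (13 convolutional layers and 3 fully connected layers).
VGG19 has 19 weight layers (16 convolutional layers and 3 fully connected layers).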
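
The layer counts and the shrinking feature map can be checked with a few lines of plain Python. This is a minimal sketch rather than the authors' code: the `CONFIGS` table and the helper names `count_weight_layers` and `trace_spatial_size` are illustrative assumptions, where an integer stands for the output channels of a 3x3 convolution and "M" stands for a 2x2 max pool.

```python
# Hypothetical layer tables: an int is the output channel count of a 3x3
# convolution (padding 1), "M" is a 2x2 max pool with stride 2.
CONFIGS = {
    "VGG16": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"],
    "VGG19": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
              512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}
NUM_FC_LAYERS = 3  # both variants end with three fully connected layers


def count_weight_layers(config):
    """Return (conv layers, total weight layers) for one configuration."""
    num_conv = sum(1 for item in config if item != "M")
    return num_conv, num_conv + NUM_FC_LAYERS


def trace_spatial_size(config, size=224):
    """Return the feature-map side length after each max pool."""
    sizes = []
    for item in config:
        if item == "M":
            size //= 2  # 2x2 pooling with stride 2 halves height and width
            sizes.append(size)
        # A 3x3 convolution with padding 1 leaves the size unchanged.
    return sizes


for name, config in CONFIGS.items():
    num_conv, total = count_weight_layers(config)
    print(name, num_conv, total, trace_spatial_size(config))
```

Both variants reduce a 224x224 input to a 7x7 map after five pooling steps; they differ only in how many convolutions sit in the last three blocks.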

Key Comparisons

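VGG (16 or 19 layers) was deeper than the other SOTA networks of its time. AlexNet, the winning model of ILSVRC 2012, had only 8 layers.
Multiple small (3x3) receptive field filters with ReLU activations, instead of one large (7x7 or 11x11) filter, learn complex features better. Smaller filters also mean fewer parameters for the same receptive field, and the ReLU between stacked layers adds extra nonlinearity.
Multi-scale training and inference. Each image is trained on over multiple rounds at different scales so that similar features are captured at different sizes.
The uniformity and simplicity of the VGG network make it easier to scale or modify for future improvements.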
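
The trade-off between stacked small filters and one large filter can be made concrete with some arithmetic. The sketch below assumes stride-1 convolutions whose input and output channel counts are equal and ignores biases; the helper names and the channel width of 512 are chosen only for illustration.

```python
def conv_weights(kernel_size, channels):
    """Weights in one conv layer with `channels` inputs and outputs (no bias)."""
    return kernel_size * kernel_size * channels * channels


def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


channels = 512  # assumed channel width, chosen only for illustration
three_3x3 = 3 * conv_weights(3, channels)  # three stacked 3x3 layers
one_7x7 = conv_weights(7, channels)        # one 7x7 layer

print(stacked_receptive_field(3, 3))  # 7, the same coverage as one 7x7 filter
print(three_3x3, one_7x7)             # 7077888 12845056
print(three_3x3 / one_7x7)            # 27/49, roughly 0.55
```

Under these assumptions, three 3x3 layers see the same 7x7 region as a single 7x7 layer with roughly 55% of the weights, and they apply three ReLUs instead of one.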

PyTorch Example

Below is a PyTorch implementation of VGG19.

import torch.nn as nn


class VGG19(nn.Module):
    def __init__(self, num_classes=1000):
        super(VGG19, self).__init__()

        # Feature extraction layers: Convolutional and pooling layers
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(
                3, 64, kernel_size=3, padding=1
            ),  # 3 input channels, 64 output channels, 3x3 kernel, 1 padding
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(
                kernel_size=2, stride=2
            ),  # Max pooling with 2x2 kernel and stride 2
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

        # Pooling Layer
        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(7, 7))

        # Fully connected layers for classification
        self.classifier = nn.Sequential(
            nn.Linear(
                512 * 7 * 7, 4096
            ),  # 512 channels, 7x7 spatial dimensions after adaptive average pooling
            nn.ReLU(),
            nn.Dropout(0.5),  # Dropout layer with 0.5 dropout probability
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes),  # Output layer with 'num_classes' output units
        )

    def forward(self, x):
        x = self.feature_extractor(x)  # Pass input through the feature extractor layers
        x = self.avgpool(x)  # Adaptive average pooling to a fixed 7x7 feature map
        x = x.view(x.size(0), -1)  # Flatten the output for the fully connected layers
        x = self.classifier(x)  # Pass flattened output through the classifier layers
        return x
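
As a sanity check on the layer shapes in the class above, the parameter count can be worked out by hand. This is a sketch in plain Python with no PyTorch dependency; the helper names and the `CONV_SHAPES` and `LINEAR_SHAPES` tables are illustrative and simply mirror the `Conv2d` and `Linear` arguments listed above, with `num_classes=1000`.

```python
def conv_params(in_ch, out_ch, kernel_size=3):
    """Weights plus biases of one Conv2d layer."""
    return kernel_size * kernel_size * in_ch * out_ch + out_ch


def linear_params(in_features, out_features):
    """Weights plus biases of one Linear layer."""
    return in_features * out_features + out_features


# (in_channels, out_channels) of the 16 convolutions, in order.
CONV_SHAPES = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512), (512, 512),
]
# (in_features, out_features) of the three Linear layers, num_classes=1000.
LINEAR_SHAPES = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

feature_params = sum(conv_params(i, o) for i, o in CONV_SHAPES)
classifier_params = sum(linear_params(i, o) for i, o in LINEAR_SHAPES)
total_params = feature_params + classifier_params

print(feature_params)     # 20024384
print(classifier_params)  # 123642856
print(total_params)       # 143667240
```

Most of the roughly 144 million parameters sit in the classifier, and in the first fully connected layer in particular. With PyTorch installed, `sum(p.numel() for p in VGG19().parameters())` should give the same total, since the ReLU, pooling, and dropout layers have no parameters.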
