Object detection

Object detection is the computer vision task of detecting instances (such as humans, buildings, or cars) in an image. Object detection models receive an image as input and output the coordinates of the bounding boxes and associated labels of the detected objects. An image can contain multiple objects, each with its own bounding box and label (e.g. it can have a car and a building), and each object can be present in different parts of an image (e.g. the image can have several cars). This task is commonly used in autonomous driving for detecting things like pedestrians, road signs, and traffic lights. Other applications include counting objects in images, image search, and more.
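
For a quick, optional feel of this input/output contract, the high-level pipeline API wraps a pretrained detector and returns exactly such labeled boxes. A minimal sketch, using the plain DETR checkpoint mentioned later in this guide and an arbitrary COCO sample image:

>>> from transformers import pipeline

>>> detector = pipeline("object-detection", model="facebook/detr-resnet-50")
>>> results = detector("http://images.cocodataset.org/val2017/000000039769.jpg")
>>> # each entry is a dict of the form
>>> # {"score": float, "label": str, "box": {"xmin": int, "ymin": int, "xmax": int, "ymax": int}}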

In this guide, you will learn how to:

  1. Finetune DETR, a model that combines a convolutional backbone with an encoder-decoder Transformer, on the CPPE-5 dataset.
  2. Use your finetuned model for inference.

To see all architectures and checkpoints compatible with this task, we recommend checking the task page.

Before you begin, make sure you have all the necessary libraries installed:

pip install -q datasets transformers accelerate timm
pip install -q -U "albumentations>=1.4.5" torchmetrics pycocotools

You'll use 🤗 Datasets to load a dataset from the Hugging Face Hub, 🤗 Transformers to train your model, and albumentations to augment the data.

We encourage you to share your model with the community. Log in to your Hugging Face account to upload it to the Hub. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

To get started, we'll define global constants, namely the model name and image size. For this tutorial, we'll use the Conditional DETR model, since it converges faster. Feel free to select any object detection model available in the transformers library.

>>> MODEL_NAME = "microsoft/conditional-detr-resnet-50"  # or "facebook/detr-resnet-50"
>>> IMAGE_SIZE = 480

Load the CPPE-5 dataset

The CPPE-5 dataset contains images with annotations identifying medical personal protective equipment (PPE) in the context of the COVID-19 pandemic.

Start by loading the dataset and creating a validation split from train:

>>> from datasets import load_dataset

>>> cppe5 = load_dataset("cppe-5")

>>> if "validation" not in cppe5:
...     split = cppe5["train"].train_test_split(0.15, seed=1337)
...     cppe5["train"] = split["train"]
...     cppe5["validation"] = split["test"]

>>> cppe5
DatasetDict({
    train: Dataset({
        features: ['image_id', 'image', 'width', 'height', 'objects'],
        num_rows: 850
    })
    test: Dataset({
        features: ['image_id', 'image', 'width', 'height', 'objects'],
        num_rows: 29
    })
    validation: Dataset({
        features: ['image_id', 'image', 'width', 'height', 'objects'],
        num_rows: 150
    })
})

You will see that this dataset has 1000 images split between the train and validation sets, and a test set with 29 images.

To get familiar with the data, explore what the examples look like.

>>> cppe5["train"][0]
{
  'image_id': 366,
  'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=500x290>,
  'width': 500,
  'height': 290,
  'objects': {
    'id': [1932, 1933, 1934],
    'area': [27063, 34200, 32431],
    'bbox': [[29.0, 11.0, 97.0, 279.0],
      [201.0, 1.0, 120.0, 285.0],
      [382.0, 0.0, 113.0, 287.0]],
    'category': [0, 0, 0]
  }
}

The examples in the dataset have the following fields:

  • image_id: the example image id
  • image: a PIL.Image.Image object containing the image
  • width: width of the image
  • height: height of the image
  • objects: a dictionary containing bounding box metadata for the objects in the image:
    • id: the annotation id
    • area: the area of the bounding box
    • bbox: the object's bounding box (in the COCO format)
    • category: the object's category, with possible values including Coverall (0), Face_Shield (1), Gloves (2), Goggles (3) and Mask (4)

You may notice that the bbox field follows the COCO format, which is the format the DETR model expects. However, the grouping of the fields inside objects differs from the annotation format DETR requires. You will need to apply some preprocessing transformations before using this data for training.

To get an even better understanding of the data, visualize an example in the dataset.

>>> import numpy as np
>>> import os
>>> from PIL import Image, ImageDraw

>>> image = cppe5["train"][2]["image"]
>>> annotations = cppe5["train"][2]["objects"]
>>> draw = ImageDraw.Draw(image)
>>> width, height = image.size

>>> categories = cppe5["train"].features["objects"].feature["category"].names

>>> id2label = {index: x for index, x in enumerate(categories, start=0)}
>>> label2id = {v: k for k, v in id2label.items()}

>>> for i in range(len(annotations["id"])):
...     box = annotations["bbox"][i]
...     class_idx = annotations["category"][i]
...     x, y, w, h = tuple(box)
...     # Check if coordinates are normalized or not
...     if max(box) > 1.0:
...         # Coordinates are un-normalized, no need to re-scale them
...         x1, y1 = int(x), int(y)
...         x2, y2 = int(x + w), int(y + h)
...     else:
...         # Coordinates are normalized, re-scale them
...         x1 = int(x * width)
...         y1 = int(y * height)
...         x2 = int((x + w) * width)
...         y2 = int((y + h) * height)
...     draw.rectangle((x1, y1, x2, y2), outline="red", width=1)
...     draw.text((x1, y1), id2label[class_idx], fill="white")

>>> image
CPPE-5 Image Example

To visualize the bounding boxes with associated labels, you can get the labels from the dataset's metadata, specifically the category field. You'll also want to create dictionaries that map a label id to a label class (id2label) and the other way around (label2id). You can use them later when setting up the model. Including these maps will make your model reusable by others if you share it on the Hugging Face Hub. Please note that the part of the code above that draws the bounding boxes assumes they are in the COCO format (x_min, y_min, width, height). It would have to be adjusted to work for other formats like (x_min, y_min, x_max, y_max).

As a final step of getting familiar with the data, explore it for potential issues. One common problem with datasets for object detection is bounding boxes that "stretch" beyond the edge of the image. Such "runaway" bounding boxes can raise errors during training and should be addressed. There are a few examples with this issue in this dataset. To keep things simple in this guide, we will set clip=True for BboxParams in the transformations below.
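
If you want to quantify the issue before relying on clipping, a small optional check like the one below (the helper is ours, not part of the dataset API) counts boxes that extend past the image borders reported in the metadata:

>>> def count_out_of_bounds_boxes(dataset):
...     """Count COCO-format boxes that stretch beyond the image borders."""
...     n_bad = 0
...     # note: iterating decodes every image, so this takes a moment
...     for example in dataset:
...         width, height = example["width"], example["height"]
...         for x, y, w, h in example["objects"]["bbox"]:
...             if x < 0 or y < 0 or x + w > width or y + h > height:
...                 n_bad += 1
...     return n_bad

>>> count_out_of_bounds_boxes(cppe5["train"])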

Preprocess the data

To finetune a model, you must preprocess the data you plan to use to match precisely the approach used for the pre-trained model. AutoImageProcessor takes care of processing image data to create pixel_values, pixel_mask, and labels that a DETR model can train with. The image processor has some attributes that you won't have to worry about:

  • image_mean = [0.485, 0.456, 0.406]
  • image_std = [0.229, 0.224, 0.225]

These are the mean and standard deviation used to normalize the images during model pre-training. These values are crucial to replicate when doing inference or finetuning a pre-trained image model.

Instantiate the image processor from the same checkpoint as the model you want to finetune.

>>> from transformers import AutoImageProcessor

>>> MAX_SIZE = IMAGE_SIZE

>>> image_processor = AutoImageProcessor.from_pretrained(
...     MODEL_NAME,
...     do_resize=True,
...     size={"max_height": MAX_SIZE, "max_width": MAX_SIZE},
...     do_pad=True,
...     pad_size={"height": MAX_SIZE, "width": MAX_SIZE},
... )
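
For intuition, the normalization the processor applies is the standard ImageNet recipe: scale pixel values to [0, 1], subtract image_mean, and divide by image_std. A rough NumPy equivalent, for illustration only (the processor additionally resizes and pads):

>>> import numpy as np

>>> img = np.asarray(cppe5["train"][0]["image"].convert("RGB"), dtype=np.float32) / 255.0
>>> normalized = (img - np.array(image_processor.image_mean)) / np.array(image_processor.image_std)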

Before passing the images to the image_processor, apply two preprocessing transformations to the dataset:

  • Augmenting images
  • Reformatting annotations to meet DETR expectations

First of all, to make sure the model does not overfit on the training data, you can apply image augmentation with any data augmentation library. Here we use Albumentations. This library ensures that transformations affect the image and update the bounding boxes accordingly. The 🤗 Datasets library documentation has a detailed guide on how to augment images for object detection, and it uses the exact same dataset as an example. Apply some geometric and color transformations to the image. For additional augmentation options, explore the Albumentations Demo Space.

>>> import albumentations as A

>>> train_augment_and_transform = A.Compose(
...     [
...         A.Perspective(p=0.1),
...         A.HorizontalFlip(p=0.5),
...         A.RandomBrightnessContrast(p=0.5),
...         A.HueSaturationValue(p=0.1),
...     ],
...     bbox_params=A.BboxParams(format="coco", label_fields=["category"], clip=True, min_area=25),
... )

>>> validation_transform = A.Compose(
...     [A.NoOp()],
...     bbox_params=A.BboxParams(format="coco", label_fields=["category"], clip=True),
... )

The image_processor expects the annotations to be in the following format: {'image_id': int, 'annotations': List[Dict]}, where each dictionary is a COCO object annotation. Let's add a function to reformat annotations for a single example:

>>> def format_image_annotations_as_coco(image_id, categories, areas, bboxes):
...     """Format one set of image annotations to the COCO format

...     Args:
...         image_id (str): image id. e.g. "0001"
...         categories (List[int]): list of categories/class labels corresponding to provided bounding boxes
...         areas (List[float]): list of corresponding areas to provided bounding boxes
...         bboxes (List[Tuple[float]]): list of bounding boxes provided in COCO format
...             ([top_left_x, top_left_y, width, height] in absolute coordinates)

...     Returns:
...         dict: {
...             "image_id": image id,
...             "annotations": list of formatted annotations
...         }
...     """
...     annotations = []
...     for category, area, bbox in zip(categories, areas, bboxes):
...         formatted_annotation = {
...             "image_id": image_id,
...             "category_id": category,
...             "iscrowd": 0,
...             "area": area,
...             "bbox": list(bbox),
...         }
...         annotations.append(formatted_annotation)

...     return {
...         "image_id": image_id,
...         "annotations": annotations,
...     }
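
As a quick check, running it on the raw fields of the first training example shown earlier yields the nested structure the processor expects:

>>> objects = cppe5["train"][0]["objects"]
>>> coco_annotations = format_image_annotations_as_coco(
...     cppe5["train"][0]["image_id"], objects["category"], objects["area"], objects["bbox"]
... )
>>> coco_annotations["annotations"][0]["bbox"]
[29.0, 11.0, 97.0, 279.0]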

Now you can combine the image and annotation transformations to use on a batch of examples:

>>> def augment_and_transform_batch(examples, transform, image_processor, return_pixel_mask=False):
...     """Apply augmentations and format annotations in COCO format for object detection task"""

...     images = []
...     annotations = []
...     for image_id, image, objects in zip(examples["image_id"], examples["image"], examples["objects"]):
...         image = np.array(image.convert("RGB"))

...         # apply augmentations
...         output = transform(image=image, bboxes=objects["bbox"], category=objects["category"])
...         images.append(output["image"])

...         # format annotations in COCO format
...         formatted_annotations = format_image_annotations_as_coco(
...             image_id, output["category"], objects["area"], output["bboxes"]
...         )
...         annotations.append(formatted_annotations)

...     # Apply the image processor transformations: resizing, rescaling, normalization
...     result = image_processor(images=images, annotations=annotations, return_tensors="pt")

...     if not return_pixel_mask:
...         result.pop("pixel_mask", None)

...     return result

Apply this preprocessing function to the entire dataset using the 🤗 Datasets with_transform method. This method applies transformations on the fly when you load an element of the dataset.

At this point, you can check what an example from the dataset looks like after the transformations. You should see a tensor with pixel_values and the associated labels (the pixel_mask was popped in the transform above).

>>> from functools import partial

>>> # Make transform functions for batch and apply for dataset splits
>>> train_transform_batch = partial(
...     augment_and_transform_batch, transform=train_augment_and_transform, image_processor=image_processor
... )
>>> validation_transform_batch = partial(
...     augment_and_transform_batch, transform=validation_transform, image_processor=image_processor
... )

>>> cppe5["train"] = cppe5["train"].with_transform(train_transform_batch)
>>> cppe5["validation"] = cppe5["validation"].with_transform(validation_transform_batch)
>>> cppe5["test"] = cppe5["test"].with_transform(validation_transform_batch)

>>> cppe5["train"][15]
{'pixel_values': tensor([[[ 1.9235,  1.9407,  1.9749,  ..., -0.7822, -0.7479, -0.6965],
          [ 1.9578,  1.9749,  1.9920,  ..., -0.7993, -0.7650, -0.7308],
          [ 2.0092,  2.0092,  2.0263,  ..., -0.8507, -0.8164, -0.7822],
          ...,
          [ 0.0741,  0.0741,  0.0741,  ...,  0.0741,  0.0741,  0.0741],
          [ 0.0741,  0.0741,  0.0741,  ...,  0.0741,  0.0741,  0.0741],
          [ 0.0741,  0.0741,  0.0741,  ...,  0.0741,  0.0741,  0.0741]],

          [[ 1.6232,  1.6408,  1.6583,  ...,  0.8704,  1.0105,  1.1331],
          [ 1.6408,  1.6583,  1.6758,  ...,  0.8529,  0.9930,  1.0980],
          [ 1.6933,  1.6933,  1.7108,  ...,  0.8179,  0.9580,  1.0630],
          ...,
          [ 0.2052,  0.2052,  0.2052,  ...,  0.2052,  0.2052,  0.2052],
          [ 0.2052,  0.2052,  0.2052,  ...,  0.2052,  0.2052,  0.2052],
          [ 0.2052,  0.2052,  0.2052,  ...,  0.2052,  0.2052,  0.2052]],

          [[ 1.8905,  1.9080,  1.9428,  ..., -0.1487, -0.0964, -0.0615],
          [ 1.9254,  1.9428,  1.9603,  ..., -0.1661, -0.1138, -0.0790],
          [ 1.9777,  1.9777,  1.9951,  ..., -0.2010, -0.1138, -0.0790],
          ...,
          [ 0.4265,  0.4265,  0.4265,  ...,  0.4265,  0.4265,  0.4265],
          [ 0.4265,  0.4265,  0.4265,  ...,  0.4265,  0.4265,  0.4265],
          [ 0.4265,  0.4265,  0.4265,  ...,  0.4265,  0.4265,  0.4265]]]),
  'labels': {'image_id': tensor([688]), 'class_labels': tensor([3, 4, 2, 0, 0]), 'boxes': tensor([[0.4700, 0.1933, 0.1467, 0.0767],
          [0.4858, 0.2600, 0.1150, 0.1000],
          [0.4042, 0.4517, 0.1217, 0.1300],
          [0.4242, 0.3217, 0.3617, 0.5567],
          [0.6617, 0.4033, 0.5400, 0.4533]]), 'area': tensor([ 4048.,  4140.,  5694., 72478., 88128.]), 'iscrowd': tensor([0, 0, 0, 0, 0]), 'orig_size': tensor([480, 480])}}

You have successfully augmented the individual images and prepared their annotations. However, preprocessing isn't complete yet. In the final step, create a custom collate_fn to batch images together. Since the image processor already padded the images (now pixel_values) to a fixed size, the collator simply stacks them and, when present, the corresponding pixel_mask that indicates which pixels are real (1) and which are padding (0).

>>> import torch

>>> def collate_fn(batch):
...     data = {}
...     data["pixel_values"] = torch.stack([x["pixel_values"] for x in batch])
...     data["labels"] = [x["labels"] for x in batch]
...     if "pixel_mask" in batch[0]:
...         data["pixel_mask"] = torch.stack([x["pixel_mask"] for x in batch])
...     return data
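
A quick, optional sanity check: because the processor pads every image to the fixed pad_size, the collated pixel_values stack into a batch of uniform shape:

>>> batch = collate_fn([cppe5["train"][0], cppe5["train"][1]])
>>> batch["pixel_values"].shape
torch.Size([2, 3, 480, 480])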

Preparing a function to compute mAP

Object detection models are commonly evaluated with a set of COCO-style metrics. We are going to use torchmetrics to compute mAP (mean average precision) and mAR (mean average recall) and will wrap them in a compute_metrics function so Trainer can use it for evaluation.

The intermediate format of boxes used for training is YOLO (normalized), but we will compute metrics for boxes in Pascal VOC (absolute) format in order to correctly handle box areas. Let's define a function that converts bounding boxes to Pascal VOC format:

>>> from transformers.image_transforms import center_to_corners_format

>>> def convert_bbox_yolo_to_pascal(boxes, image_size):
...     """
...     Convert bounding boxes from YOLO format (x_center, y_center, width, height) in range [0, 1]
...     to Pascal VOC format (x_min, y_min, x_max, y_max) in absolute coordinates.

...     Args:
...         boxes (torch.Tensor): Bounding boxes in YOLO format
...         image_size (Tuple[int, int]): Image size in format (height, width)

...     Returns:
...         torch.Tensor: Bounding boxes in Pascal VOC format (x_min, y_min, x_max, y_max)
...     """
...     # convert center to corners format
...     boxes = center_to_corners_format(boxes)

...     # convert to absolute coordinates
...     height, width = image_size
...     boxes = boxes * torch.tensor([[width, height, width, height]])

...     return boxes
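
As a quick sanity check, a box centered in a 480x480 image, with half the image's width and height, maps to the expected corner coordinates:

>>> convert_bbox_yolo_to_pascal(torch.tensor([[0.5, 0.5, 0.5, 0.5]]), (480, 480))
tensor([[120., 120., 360., 360.]])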

Then, in the compute_metrics function, we collect the predicted and target bounding boxes, scores, and labels from the evaluation loop results and pass them to the scoring function.

>>> import numpy as np
>>> from dataclasses import dataclass
>>> from torchmetrics.detection.mean_ap import MeanAveragePrecision


>>> @dataclass
... class ModelOutput:
...     logits: torch.Tensor
...     pred_boxes: torch.Tensor


>>> @torch.no_grad()
... def compute_metrics(evaluation_results, image_processor, threshold=0.0, id2label=None):
...     """
...     Compute mAP, mAR and their variants for the object detection task.

...     Args:
...         evaluation_results (EvalPrediction): Predictions and targets from evaluation.
...         threshold (float, optional): Threshold to filter predicted boxes by confidence. Defaults to 0.0.
...         id2label (Optional[dict], optional): Mapping from class id to class name. Defaults to None.

...     Returns:
...         Mapping[str, float]: Metrics in a form of dictionary {<metric_name>: <metric_value>}
...     """

...     predictions, targets = evaluation_results.predictions, evaluation_results.label_ids

...     # For metric computation we need to provide:
...     #  - targets in a form of list of dictionaries with keys "boxes", "labels"
...     #  - predictions in a form of list of dictionaries with keys "boxes", "scores", "labels"

...     image_sizes = []
...     post_processed_targets = []
...     post_processed_predictions = []

...     # Collect targets in the required format for metric computation
...     for batch in targets:
...         # collect image sizes, we will need them for predictions post processing
...         batch_image_sizes = torch.tensor(np.array([x["orig_size"] for x in batch]))
...         image_sizes.append(batch_image_sizes)
...         # collect targets in the required format for metric computation
...         # boxes were converted to YOLO format needed for model training
...         # here we will convert them to Pascal VOC format (x_min, y_min, x_max, y_max)
...         for image_target in batch:
...             boxes = torch.tensor(image_target["boxes"])
...             boxes = convert_bbox_yolo_to_pascal(boxes, image_target["orig_size"])
...             labels = torch.tensor(image_target["class_labels"])
...             post_processed_targets.append({"boxes": boxes, "labels": labels})

...     # Collect predictions in the required format for metric computation,
...     # the model produces boxes in YOLO format, then image_processor converts them to Pascal VOC format
...     for batch, target_sizes in zip(predictions, image_sizes):
...         batch_logits, batch_boxes = batch[1], batch[2]
...         output = ModelOutput(logits=torch.tensor(batch_logits), pred_boxes=torch.tensor(batch_boxes))
...         post_processed_output = image_processor.post_process_object_detection(
...             output, threshold=threshold, target_sizes=target_sizes
...         )
...         post_processed_predictions.extend(post_processed_output)

...     # Compute metrics
...     metric = MeanAveragePrecision(box_format="xyxy", class_metrics=True)
...     metric.update(post_processed_predictions, post_processed_targets)
...     metrics = metric.compute()

...     # Replace list of per class metrics with separate metric for each class
...     classes = metrics.pop("classes")
...     map_per_class = metrics.pop("map_per_class")
...     mar_100_per_class = metrics.pop("mar_100_per_class")
...     for class_id, class_map, class_mar in zip(classes, map_per_class, mar_100_per_class):
...         class_name = id2label[class_id.item()] if id2label is not None else class_id.item()
...         metrics[f"map_{class_name}"] = class_map
...         metrics[f"mar_100_{class_name}"] = class_mar

...     metrics = {k: round(v.item(), 4) for k, v in metrics.items()}

...     return metrics


>>> eval_compute_metrics_fn = partial(
...     compute_metrics, image_processor=image_processor, id2label=id2label, threshold=0.0
... )

Training the detection model

You have done most of the heavy lifting in the previous sections, so now you are ready to train your model! The images in this dataset are still quite large, even after resizing. This means that finetuning this model will require at least one GPU.
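
Before launching training, you can confirm that an accelerator is actually visible (a CUDA check is shown here; adapt it to your backend):

>>> import torch

>>> torch.cuda.is_available()  # should return True on a GPU runtime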

Training involves the following steps:

  1. Load the model with AutoModelForObjectDetection using the same checkpoint as in the preprocessing.
  2. Define your training hyperparameters in TrainingArguments.
  3. Pass the training arguments to Trainer along with the model, dataset, image processor, and data collator.
  4. Call train() to finetune your model.

When loading the model from the same checkpoint that you used for preprocessing, remember to pass the label2id and id2label maps that you created earlier from the dataset's metadata. Additionally, we specify ignore_mismatched_sizes=True to replace the existing classification head with a new one.

>>> from transformers import AutoModelForObjectDetection

>>> model = AutoModelForObjectDetection.from_pretrained(
...     MODEL_NAME,
...     id2label=id2label,
...     label2id=label2id,
...     ignore_mismatched_sizes=True,
... )

TrainingArguments 中,使用 output_dir 指定保存模型的位置,然后根据您的需要配置超参数。对于 num_train_epochs=30,在 Google Colab T4 GPU 中训练大约需要 35 分钟,增加 epoch 的数量可以获得更好的结果。

Important notes:

  • Do not remove unused columns, because this would drop the image column. Without the image column, you can't create pixel_values. For this reason, set remove_unused_columns to False.
  • Set eval_do_concat_batches=False to get proper evaluation results. Images have different numbers of target boxes; if batches were concatenated, we would not be able to determine which boxes belong to which image.

If you wish to share your model by pushing it to the Hub, set push_to_hub to True (you must be signed in to Hugging Face to upload your model).

>>> from transformers import TrainingArguments

>>> training_args = TrainingArguments(
...     output_dir="detr_finetuned_cppe5",
...     num_train_epochs=30,
...     fp16=False,
...     per_device_train_batch_size=8,
...     dataloader_num_workers=4,
...     learning_rate=5e-5,
...     lr_scheduler_type="cosine",
...     weight_decay=1e-4,
...     max_grad_norm=0.01,
...     metric_for_best_model="eval_map",
...     greater_is_better=True,
...     load_best_model_at_end=True,
...     eval_strategy="epoch",
...     save_strategy="epoch",
...     save_total_limit=2,
...     remove_unused_columns=False,
...     eval_do_concat_batches=False,
...     push_to_hub=True,
... )

Finally, bring everything together, and call train():

>>> from transformers import Trainer

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=cppe5["train"],
...     eval_dataset=cppe5["validation"],
...     processing_class=image_processor,
...     data_collator=collate_fn,
...     compute_metrics=eval_compute_metrics_fn,
... )

>>> trainer.train()
[3210/3210 26:07, Epoch 30/30]
Epoch Training Loss Validation Loss Map Map 50 Map 75 Map Small Map Medium Map Large Mar 1 Mar 10 Mar 100 Mar Small Mar Medium Mar Large Map Coverall Mar 100 Coverall Map Face Shield Mar 100 Face Shield Map Gloves Mar 100 Gloves Map Goggles Mar 100 Goggles Map Mask Mar 100 Mask
1 No log 2.629903 0.008900 0.023200 0.006500 0.001300 0.002800 0.020500 0.021500 0.070400 0.101400 0.007600 0.106200 0.096100 0.036700 0.232000 0.000300 0.019000 0.003900 0.125400 0.000100 0.003100 0.003500 0.127600
2 No log 3.479864 0.014800 0.034600 0.010800 0.008600 0.011700 0.012500 0.041100 0.098700 0.130000 0.056000 0.062200 0.111900 0.053500 0.447300 0.010600 0.100000 0.000200 0.022800 0.000100 0.015400 0.009700 0.064400
3 No log 2.107622 0.041700 0.094000 0.034300 0.024100 0.026400 0.047400 0.091500 0.182800 0.225800 0.087200 0.199400 0.210600 0.150900 0.571200 0.017300 0.101300 0.007300 0.180400 0.002100 0.026200 0.031000 0.250200
4 No log 2.031242 0.055900 0.120600 0.046900 0.013800 0.038100 0.090300 0.105900 0.225600 0.266100 0.130200 0.228100 0.330000 0.191000 0.572100 0.010600 0.157000 0.014600 0.235300 0.001700 0.052300 0.061800 0.313800
5 3.889400 1.883433 0.089700 0.201800 0.067300 0.022800 0.065300 0.129500 0.136000 0.272200 0.303700 0.112900 0.312500 0.424600 0.300200 0.585100 0.032700 0.202500 0.031300 0.271000 0.008700 0.126200 0.075500 0.333800
6 3.889400 1.807503 0.118500 0.270900 0.090200 0.034900 0.076700 0.152500 0.146100 0.297800 0.325400 0.171700 0.283700 0.545900 0.396900 0.554500 0.043000 0.262000 0.054500 0.271900 0.020300 0.230800 0.077600 0.308000
7 3.889400 1.716169 0.143500 0.307700 0.123200 0.045800 0.097800 0.258300 0.165300 0.327700 0.352600 0.140900 0.336700 0.599400 0.442900 0.620700 0.069400 0.301300 0.081600 0.292000 0.011000 0.230800 0.112700 0.318200
8 3.889400 1.679014 0.153000 0.355800 0.127900 0.038700 0.115600 0.291600 0.176000 0.322500 0.349700 0.135600 0.326100 0.643700 0.431700 0.582900 0.069800 0.265800 0.088600 0.274600 0.028300 0.280000 0.146700 0.345300
9 3.889400 1.618239 0.172100 0.375300 0.137600 0.046100 0.141700 0.308500 0.194000 0.356200 0.386200 0.162400 0.359200 0.677700 0.469800 0.623900 0.102100 0.317700 0.099100 0.290200 0.029300 0.335400 0.160200 0.364000
10 1.599700 1.572512 0.179500 0.400400 0.147200 0.056500 0.141700 0.316700 0.213100 0.357600 0.381300 0.197900 0.344300 0.638500 0.466900 0.623900 0.101300 0.311400 0.104700 0.279500 0.051600 0.338500 0.173000 0.353300
11 1.599700 1.528889 0.192200 0.415000 0.160800 0.053700 0.150500 0.378000 0.211500 0.371700 0.397800 0.204900 0.374600 0.684800 0.491900 0.632400 0.131200 0.346800 0.122000 0.300900 0.038400 0.344600 0.177500 0.364400
12 1.599700 1.517532 0.198300 0.429800 0.159800 0.066400 0.162900 0.383300 0.220700 0.382100 0.405400 0.214800 0.383200 0.672900 0.469000 0.610400 0.167800 0.379700 0.119700 0.307100 0.038100 0.335400 0.196800 0.394200
13 1.599700 1.488849 0.209800 0.452300 0.172300 0.094900 0.171100 0.437800 0.222000 0.379800 0.411500 0.203800 0.397300 0.707500 0.470700 0.620700 0.186900 0.407600 0.124200 0.306700 0.059300 0.355400 0.207700 0.367100
14 1.599700 1.482210 0.228900 0.482600 0.187800 0.083600 0.191800 0.444100 0.225900 0.376900 0.407400 0.182500 0.384800 0.700600 0.512100 0.640100 0.175000 0.363300 0.144300 0.300000 0.083100 0.363100 0.229900 0.370700
15 1.326800 1.475198 0.216300 0.455600 0.174900 0.088500 0.183500 0.424400 0.226900 0.373400 0.404300 0.199200 0.396400 0.677800 0.496300 0.633800 0.166300 0.392400 0.128900 0.312900 0.085200 0.312300 0.205000 0.370200
16 1.326800 1.459697 0.233200 0.504200 0.192200 0.096000 0.202000 0.430800 0.239100 0.382400 0.412600 0.219500 0.403100 0.670400 0.485200 0.625200 0.196500 0.410100 0.135700 0.299600 0.123100 0.356900 0.225300 0.371100
17 1.326800 1.407340 0.243400 0.511900 0.204500 0.121000 0.215700 0.468000 0.246200 0.394600 0.424200 0.225900 0.416100 0.705200 0.494900 0.638300 0.224900 0.430400 0.157200 0.317900 0.115700 0.369200 0.224200 0.365300
18 1.326800 1.419522 0.245100 0.521500 0.210000 0.116100 0.211500 0.489900 0.255400 0.391600 0.419700 0.198800 0.421200 0.701400 0.501800 0.634200 0.226700 0.410100 0.154400 0.321400 0.105900 0.352300 0.236700 0.380400
19 1.158600 1.398764 0.253600 0.519200 0.213600 0.135200 0.207700 0.491900 0.257300 0.397300 0.428000 0.241400 0.401800 0.703500 0.509700 0.631100 0.236700 0.441800 0.155900 0.330800 0.128100 0.352300 0.237500 0.384000
20 1.158600 1.390591 0.248800 0.520200 0.216600 0.127500 0.211400 0.471900 0.258300 0.407000 0.429100 0.240300 0.407600 0.708500 0.505800 0.623400 0.235500 0.431600 0.150000 0.325000 0.125700 0.375400 0.227200 0.390200
21 1.158600 1.360608 0.262700 0.544800 0.222100 0.134700 0.230000 0.487500 0.269500 0.413300 0.436300 0.236200 0.419100 0.709300 0.514100 0.637400 0.257200 0.450600 0.165100 0.338400 0.139400 0.372300 0.237700 0.382700
22 1.158600 1.368296 0.262800 0.542400 0.236400 0.137400 0.228100 0.498500 0.266500 0.409000 0.433000 0.239900 0.418500 0.697500 0.520500 0.641000 0.257500 0.455700 0.162600 0.334800 0.140200 0.353800 0.233200 0.379600
23 1.158600 1.368176 0.264800 0.541100 0.233100 0.138200 0.223900 0.498700 0.272300 0.407400 0.434400 0.233100 0.418300 0.702000 0.524400 0.642300 0.262300 0.444300 0.159700 0.335300 0.140500 0.366200 0.236900 0.384000
24 1.049700 1.355271 0.269700 0.549200 0.239100 0.134700 0.229900 0.519200 0.274800 0.412700 0.437600 0.245400 0.417200 0.711200 0.523200 0.644100 0.272100 0.440500 0.166700 0.341500 0.137700 0.373800 0.249000 0.388000
25 1.049700 1.355180 0.272500 0.547900 0.243800 0.149700 0.229900 0.523100 0.272500 0.415700 0.442200 0.256200 0.420200 0.705800 0.523900 0.639600 0.271700 0.451900 0.166300 0.346900 0.153700 0.383100 0.247000 0.389300
26 1.049700 1.349337 0.275600 0.556300 0.246400 0.146700 0.234800 0.516300 0.274200 0.418300 0.440900 0.248700 0.418900 0.705800 0.523200 0.636500 0.274700 0.440500 0.172400 0.349100 0.155600 0.384600 0.252300 0.393800
27 1.049700 1.350782 0.275200 0.548700 0.246800 0.147300 0.236400 0.527200 0.280100 0.416200 0.442600 0.253400 0.424000 0.710300 0.526600 0.640100 0.273200 0.445600 0.167000 0.346900 0.160100 0.387700 0.249200 0.392900
28 1.049700 1.346533 0.277000 0.552800 0.252900 0.147400 0.240000 0.527600 0.280900 0.420900 0.444100 0.255500 0.424500 0.711200 0.530200 0.646800 0.277400 0.441800 0.170900 0.346900 0.156600 0.389200 0.249600 0.396000
29 0.993700 1.346575 0.277100 0.554800 0.252900 0.148400 0.239700 0.523600 0.278400 0.420000 0.443300 0.256300 0.424000 0.705600 0.529600 0.647300 0.273900 0.439200 0.174300 0.348700 0.157600 0.386200 0.250100 0.395100
30 0.993700 1.346446 0.277400 0.554700 0.252700 0.147900 0.240800 0.523600 0.278800 0.420400 0.443300 0.256100 0.424200 0.705500 0.530100 0.646800 0.275600 0.440500 0.174500 0.348700 0.157300 0.386200 0.249200 0.394200

If you have set push_to_hub to True in the training_args, the training checkpoints are pushed to the Hugging Face Hub. Upon training completion, push the final model to the Hub as well by calling the push_to_hub() method.

>>> trainer.push_to_hub()

Evaluate

>>> from pprint import pprint

>>> metrics = trainer.evaluate(eval_dataset=cppe5["test"], metric_key_prefix="test")
>>> pprint(metrics)
{'epoch': 30.0,
  'test_loss': 1.0877351760864258,
  'test_map': 0.4116,
  'test_map_50': 0.741,
  'test_map_75': 0.3663,
  'test_map_Coverall': 0.5937,
  'test_map_Face_Shield': 0.5863,
  'test_map_Gloves': 0.3416,
  'test_map_Goggles': 0.1468,
  'test_map_Mask': 0.3894,
  'test_map_large': 0.5637,
  'test_map_medium': 0.3257,
  'test_map_small': 0.3589,
  'test_mar_1': 0.323,
  'test_mar_10': 0.5237,
  'test_mar_100': 0.5587,
  'test_mar_100_Coverall': 0.6756,
  'test_mar_100_Face_Shield': 0.7294,
  'test_mar_100_Gloves': 0.4721,
  'test_mar_100_Goggles': 0.4125,
  'test_mar_100_Mask': 0.5038,
  'test_mar_large': 0.7283,
  'test_mar_medium': 0.4901,
  'test_mar_small': 0.4469,
  'test_runtime': 1.6526,
  'test_samples_per_second': 17.548,
  'test_steps_per_second': 2.42}

These results can be further improved by adjusting the hyperparameters in TrainingArguments. Give it a go!

Inference

Now that you have finetuned a model, evaluated it, and uploaded it to the Hugging Face Hub, you can use it for inference.

>>> import torch
>>> import requests

>>> from PIL import Image, ImageDraw
>>> from transformers import AutoImageProcessor, AutoModelForObjectDetection

>>> url = "https://images.pexels.com/photos/8413299/pexels-photo-8413299.jpeg?auto=compress&cs=tinysrgb&w=630&h=375&dpr=2"
>>> image = Image.open(requests.get(url, stream=True).raw)

Load the model and image processor from the Hugging Face Hub (skip this step to use the model already trained in this session):

>>> from accelerate.test_utils.testing import get_backend
>>> # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
>>> device, _, _ = get_backend()
>>> model_repo = "qubvel-hf/detr_finetuned_cppe5"

>>> image_processor = AutoImageProcessor.from_pretrained(model_repo)
>>> model = AutoModelForObjectDetection.from_pretrained(model_repo)
>>> model = model.to(device)

And detect bounding boxes:

>>> with torch.no_grad():
...     inputs = image_processor(images=[image], return_tensors="pt")
...     outputs = model(**inputs.to(device))
...     target_sizes = torch.tensor([[image.size[1], image.size[0]]])
...     results = image_processor.post_process_object_detection(outputs, threshold=0.3, target_sizes=target_sizes)[0]

>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
...     box = [round(i, 2) for i in box.tolist()]
...     print(
...         f"Detected {model.config.id2label[label.item()]} with confidence "
...         f"{round(score.item(), 3)} at location {box}"
...     )
Detected Gloves with confidence 0.683 at location [244.58, 124.33, 300.35, 185.13]
Detected Mask with confidence 0.517 at location [143.73, 64.58, 219.57, 125.89]
Detected Gloves with confidence 0.425 at location [179.15, 155.57, 262.4, 226.35]
Detected Coverall with confidence 0.407 at location [307.13, -1.18, 477.82, 318.06]
Detected Coverall with confidence 0.391 at location [68.61, 126.66, 309.03, 318.89]

Let's plot the result:

>>> draw = ImageDraw.Draw(image)

>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
...     box = [round(i, 2) for i in box.tolist()]
...     x, y, x2, y2 = tuple(box)
...     draw.rectangle((x, y, x2, y2), outline="red", width=1)
...     draw.text((x, y), model.config.id2label[label.item()], fill="white")

>>> image
Object detection result on a new image