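Fine-Tuning a Vision Transformer Model with a Custom Biomedical Dataset

This guide outlines the process of fine-tuning a Vision Transformer (ViT) model on a custom biomedical dataset. It covers loading and preparing the dataset, setting up image transforms for the different data splits, configuring and initializing the ViT model, and defining the training process along with evaluation and visualization tools.

Dataset Info

The custom dataset is hand-crafted and contains 780 images in 3 classes (benign, malignant, normal).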

attachment:datasetinfo.png

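Model Info

The model we fine-tune will be Google's "vit-large-patch16-224". It was pre-trained on ImageNet-21k (14 million images, 21,843 classes) and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at a resolution of 224x224. Google also offers several other ViT models with different image sizes and patch sizes.

Let's get started.

Getting Started

First, let's install the libraries.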

!pip install datasets transformers accelerate torch torchvision scikit-learn matplotlib wandb

(Optional) We will push our model to the Hugging Face Hub, so we must log in.

# from huggingface_hub import notebook_login
# notebook_login()

Dataset Preparation

The Datasets library automatically pulls the images and classes from the dataset. For details, you can visit this link.

from datasets import load_dataset

dataset = load_dataset("emre570/breastcancer-ultrasound-images")
dataset

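We now have the dataset, but it has no validation set. To create one, we compute the validation set size as a fraction of the training set, based on the size of the test set. We then split the training data into new training and validation subsets.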

# Get the numbers of each set
test_num = len(dataset["test"])
train_num = len(dataset["train"])

val_size = test_num / train_num

train_val_split = dataset["train"].train_test_split(test_size=val_size)
train_val_split
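To make the split arithmetic concrete, here is a minimal sketch with illustrative counts. The numbers 624 and 156 are assumptions chosen to add up to the 780 images mentioned earlier, not values read from the dataset. Rounding the validation count up follows scikit-learn's convention and is assumed here rather than confirmed for `train_test_split`. The helper name `validation_split_sizes` is hypothetical.

```python
import math


def validation_split_sizes(train_num, test_num):
    """Return (val_fraction, new_train_count, val_count) for a validation
    set carved out of the training set and sized to match the test set."""
    if not 0 < test_num < train_num:
        raise ValueError("the test set must be smaller than the training set")
    val_fraction = test_num / train_num
    # Assumption: the validation count is rounded up.
    val_count = math.ceil(val_fraction * train_num)
    return val_fraction, train_num - val_count, val_count


# Hypothetical counts, for illustration only.
fraction, new_train, val = validation_split_sizes(train_num=624, test_num=156)
print(fraction, new_train, val)
```

Because the split is random, passing a fixed `seed` to `train_test_split` keeps the validation subset the same across runs.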

We now have separate training and validation splits. Let's combine them with the test set.

from datasets import DatasetDict

dataset = DatasetDict(
    {"train": train_val_split["train"], "validation": train_val_split["test"], "test": dataset["test"]}
)
dataset

Perfect! Our dataset is ready. Let's assign the subsets to separate variables so they are easy to reference later.

train_ds = dataset["train"]
val_ds = dataset["validation"]
test_ds = dataset["test"]

We can see that each image is a PIL.Image with an associated label.

train_ds[0]

We can also inspect the features of the training set.

train_ds.features

Let's display one image from each class in the dataset.

>>> import matplotlib.pyplot as plt

>>> # Initialize a set to keep track of shown labels
>>> shown_labels = set()

>>> # Initialize the figure for plotting
>>> plt.figure(figsize=(10, 10))

>>> # Loop through the dataset and plot the first image of each label
>>> for i, sample in enumerate(train_ds):
...     label = train_ds.features["label"].names[sample["label"]]
...     if label not in shown_labels:
...         plt.subplot(1, len(train_ds.features["label"].names), len(shown_labels) + 1)
...         plt.imshow(sample["image"])
...         plt.title(label)
...         plt.axis("off")
...         shown_labels.add(label)
...         if len(shown_labels) == len(train_ds.features["label"].names):
...             break

>>> plt.show()

Data Processing

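The dataset is ready, but we are not ready to fine-tune yet. We will go through the following steps in order:

Label Mapping: We convert between label IDs and their corresponding names, which is useful for model training and evaluation.
Image Processing: We then use ViTImageProcessor to standardize the input image size and apply the normalization specific to the pretrained model. We also define separate transforms for training, validation, and testing with torchvision to improve model generalization.
Transform Functions: We implement functions that apply the transforms to the dataset, converting the images into the format and dimensions the ViT model requires.
Data Loading: We set up a custom collate function to batch images and labels correctly, and create a DataLoader for efficient loading and batching during model training.
Batch Preparation: We retrieve a sample batch and display the shapes of its data to verify correct processing and readiness for model input.

Label Mapping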

id2label = {id: label for id, label in enumerate(train_ds.features["label"].names)}
label2id = {label: id for id, label in id2label.items()}
id2label, id2label[train_ds[0]["label"]]
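The two dictionaries are inverses of each other. The sketch below reproduces them without loading the dataset, assuming the class names appear in the order benign, malignant, normal (the order in which the recall scores are printed later in this guide).

```python
# Assumed class names, in the order of the dataset's label feature.
names = ["benign", "malignant", "normal"]

id2label = {idx: name for idx, name in enumerate(names)}
label2id = {name: idx for idx, name in id2label.items()}

# Round trip: id -> name -> id returns the original id.
for idx in id2label:
    assert label2id[id2label[idx]] == idx

print(id2label)
print(label2id["malignant"])
```

Passing both mappings to the model means its predictions can be reported as class names rather than bare integers.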

Image Processing

from transformers import ViTImageProcessor

model_name = "google/vit-large-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_name)

from torchvision.transforms import (
    CenterCrop,
    Compose,
    Normalize,
    RandomHorizontalFlip,
    RandomResizedCrop,
    ToTensor,
    Resize,
)

image_mean, image_std = processor.image_mean, processor.image_std
size = processor.size["height"]

normalize = Normalize(mean=image_mean, std=image_std)

train_transforms = Compose(
    [
        RandomResizedCrop(size),
        RandomHorizontalFlip(),
        ToTensor(),
        normalize,
    ]
)
val_transforms = Compose(
    [
        Resize(size),
        CenterCrop(size),
        ToTensor(),
        normalize,
    ]
)
test_transforms = Compose(
    [
        Resize(size),
        CenterCrop(size),
        ToTensor(),
        normalize,
    ]
)
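The last step of each pipeline, Normalize, applies (value - mean) / std to every channel after ToTensor has scaled the pixels to the range [0, 1]. The sketch below works through that arithmetic for a single value. The mean and standard deviation of 0.5 are assumptions for illustration; the actual values come from `processor.image_mean` and `processor.image_std`. The helper `normalize_pixel` is hypothetical.

```python
def normalize_pixel(value, mean, std):
    """Apply the per-channel formula used by Normalize: (value - mean) / std."""
    return (value - mean) / std


# Assumed statistics; the real ones come from the image processor.
mean, std = 0.5, 0.5

# ToTensor scales 8-bit pixel intensities from [0, 255] to [0.0, 1.0].
for raw in (0, 128, 255):
    scaled = raw / 255
    print(raw, round(normalize_pixel(scaled, mean, std), 3))
```

With these assumed statistics, the darkest pixel maps to -1 and the brightest to 1, so the inputs are centered around zero.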

Create transform functions

def apply_train_transforms(examples):
    examples["pixel_values"] = [train_transforms(image.convert("RGB")) for image in examples["image"]]
    return examples


def apply_val_transforms(examples):
    examples["pixel_values"] = [val_transforms(image.convert("RGB")) for image in examples["image"]]
    return examples


def apply_test_transforms(examples):
    examples["pixel_values"] = [test_transforms(image.convert("RGB")) for image in examples["image"]]
    return examples

Apply the transform functions to each set

train_ds.set_transform(apply_train_transforms)
val_ds.set_transform(apply_val_transforms)
test_ds.set_transform(apply_test_transforms)

train_ds.features

train_ds[0]

It looks like we have converted our pixel values into tensors.

Data Loading

import torch
from torch.utils.data import DataLoader


def collate_fn(examples):
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}


train_dl = DataLoader(train_ds, collate_fn=collate_fn, batch_size=4)

Batch Preparation

>>> batch = next(iter(train_dl))
>>> for k, v in batch.items():
...     if isinstance(v, torch.Tensor):
...         print(k, v.shape)

pixel_values torch.Size([4, 3, 224, 224])
labels torch.Size([4])

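Perfect! Now we are ready for the fine-tuning process.

Fine-tuning the Model

Now we will configure and fine-tune the model. We start by initializing the model with our label mappings and the pretrained weights, allowing for size mismatches. The training arguments define the learning process, including the save strategy, batch sizes, and number of training epochs, with results logged to Weights & Biases. A Hugging Face Trainer is then instantiated to manage training and evaluation, using the custom data collator and the model's processor. Finally, after training, the model is evaluated on the test dataset and the metrics are printed to assess its accuracy.

First, we load our model.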

from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    model_name, id2label=id2label, label2id=label2id, ignore_mismatched_sizes=True
)

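There is a subtle detail here: the ignore_mismatched_sizes parameter.

When you fine-tune a pretrained model on a new dataset, the input size of your images or details of the model architecture (such as the number of labels in the classification layer) may not exactly match what the model was originally trained on. This can happen for many reasons, for example when a model trained on one type of image data (such as natural images from ImageNet) is used on a completely different type (such as medical images or images from specialized cameras).

Setting ignore_mismatched_sizes to True lets the model load despite such size differences without raising an error. Weights whose shapes do not match the checkpoint are newly initialized instead of being loaded.

For example, this checkpoint was trained with 1,000 classes, so its classification layer has an output of shape torch.Size([1000]). Our dataset has 3 classes, so our classification layer needs an output of shape torch.Size([3]). Loading the checkpoint directly raises an error because the number of classes does not match.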
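The shape comparison behind this flag can be illustrated with plain dictionaries. This is a conceptual sketch only, not the library's implementation: the function `find_mismatched` is hypothetical, and the shape tables are hand-written for illustration, using 1024 as the hidden size of ViT-large.

```python
def find_mismatched(checkpoint_shapes, model_shapes):
    """List parameters present in both tables whose shapes differ."""
    return [
        name
        for name, shape in model_shapes.items()
        if name in checkpoint_shapes and checkpoint_shapes[name] != shape
    ]


# Hand-written shape tables for illustration (not read from a real checkpoint).
checkpoint_shapes = {
    "classifier.weight": (1000, 1024),
    "classifier.bias": (1000,),
    "vit.layernorm.weight": (1024,),
}
model_shapes = {
    "classifier.weight": (3, 1024),
    "classifier.bias": (3,),
    "vit.layernorm.weight": (1024,),
}

mismatched = find_mismatched(checkpoint_shapes, model_shapes)
print(mismatched)
```

Only the classification head differs, so the backbone keeps its pretrained weights while the 3-class head starts from a fresh initialization and is learned during fine-tuning.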

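Next, we define the training arguments for this model from Google.

(Optional) Note that the metrics will be saved to Weights & Biases because we set the report_to parameter to wandb. W&B will ask you for an API key, so you should create an account and an API key. If you don't want this, you can remove the report_to parameter or set it to "none".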

from transformers import TrainingArguments, Trainer
import numpy as np

train_args = TrainingArguments(
    output_dir="output-models",
    save_total_limit=2,
    report_to="wandb",
    save_strategy="epoch",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=4,
    num_train_epochs=40,
    weight_decay=0.01,
    load_best_model_at_end=True,
    logging_dir="logs",
    remove_unused_columns=False,
)
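As a rough check on how long this configuration trains, the number of optimizer steps follows from the batch size and epoch count. The sketch below assumes a single device, no gradient accumulation, and a hypothetical training set of 468 images; none of these are stated in this guide. The helper `total_optimizer_steps` is hypothetical.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Steps = batches per epoch (last partial batch included) times epochs.

    Assumes one device and no gradient accumulation.
    """
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * epochs


# Hypothetical training-set size; batch size and epochs match train_args.
steps = total_optimizer_steps(num_examples=468, batch_size=10, epochs=40)
print(steps)
```

With save_strategy="epoch" a checkpoint is written once per epoch, and save_total_limit=2 keeps only the most recent ones on disk.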

We can now start the fine-tuning process with Trainer.

trainer = Trainer(
    model,
    train_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=collate_fn,
    tokenizer=processor,
)
trainer.train()

Epoch	Training Loss	Validation Loss	Accuracy
40	0.174700	0.596288	0.903846
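The accuracy column in the log presupposes a metric function, which the Trainer call shown here does not pass. A minimal sketch of such a function follows, assuming it receives an object with `predictions` (logits) and `label_ids` attributes, as the Trainer's evaluation loop provides. The `fake_eval` object and its values are made up for the demonstration.

```python
from types import SimpleNamespace

import numpy as np


def compute_metrics(eval_pred):
    """Return accuracy from the logits and labels collected during evaluation."""
    logits = np.asarray(eval_pred.predictions)
    labels = np.asarray(eval_pred.label_ids)
    predictions = logits.argmax(axis=1)
    return {"accuracy": float((predictions == labels).mean())}


# Stand-in for the object passed in during evaluation; the values are made up.
fake_eval = SimpleNamespace(
    predictions=np.array(
        [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.2, 0.4], [0.1, 0.2, 3.0]]
    ),
    label_ids=np.array([0, 1, 2, 2]),
)
print(compute_metrics(fake_eval))
```

To use it, pass `compute_metrics=compute_metrics` when constructing the Trainer. The returned keys are prefixed with the split name, so the metric is reported as eval_accuracy during training and test_accuracy from `trainer.predict`.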

The fine-tuning process is complete. Let's go on to evaluate the model on the test set.

>>> outputs = trainer.predict(test_ds)
>>> print(outputs.metrics)

{'test_loss': 0.40843912959098816, 'test_runtime': 4.9934, 'test_samples_per_second': 31.242, 'test_steps_per_second': 7.81}

{'test_loss': 0.3219967782497406, 'test_accuracy': 0.9102564102564102, 'test_runtime': 4.0543, 'test_samples_per_second': 38.478, 'test_steps_per_second': 9.619}

(Optional) Push Model to Hub

We can push our model to the Hugging Face Hub using push_to_hub.

model.push_to_hub("your_model_name")

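Great! Let's visualize the results.

Results

We have finished fine-tuning. Let's see how our model predicts the classes using scikit-learn's confusion matrix display, and then show the recall scores.

What is a confusion matrix?

A confusion matrix is a table layout that visualizes the performance of an algorithm, typically a supervised learning model, on a set of test data for which the true values are known. It is especially useful for checking the performance of classification models because it shows how often each true label is paired with each predicted label.

Let's plot our model's confusion matrix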

>>> from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

>>> y_true = outputs.label_ids
>>> y_pred = outputs.predictions.argmax(1)

>>> labels = train_ds.features["label"].names
>>> cm = confusion_matrix(y_true, y_pred)
>>> disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
>>> disp.plot(xticks_rotation=45)
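To make the layout of the matrix concrete, the sketch below builds one by hand from made-up labels. It follows scikit-learn's convention that rows are true classes and columns are predicted classes. The helper `build_confusion_matrix` and the demo labels are hypothetical.

```python
def build_confusion_matrix(true_ids, pred_ids, num_classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_id, pred_id in zip(true_ids, pred_ids):
        matrix[true_id][pred_id] += 1
    return matrix


# Made-up labels for three classes (0, 1, 2).
demo_true = [0, 0, 0, 1, 1, 2, 2, 2]
demo_pred = [0, 0, 1, 1, 1, 2, 0, 2]

demo_matrix = build_confusion_matrix(demo_true, demo_pred, num_classes=3)
for row in demo_matrix:
    print(row)
```

Correct predictions fall on the diagonal, and every off-diagonal entry is a specific kind of mistake, for example a class 2 sample predicted as class 0.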

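What is recall?

Recall is a performance metric used in classification tasks to measure a model's ability to correctly identify all relevant instances in a dataset. Specifically, recall is the proportion of actual positives that the model correctly predicts as positive.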
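In terms of counts, the recall of a class is TP / (TP + FN): the diagonal entry of its confusion-matrix row divided by the row total. The sketch below computes it per class from a made-up matrix with true classes as rows. The helper `recall_per_class` is hypothetical and returns 0.0 for a class with no true samples.

```python
def recall_per_class(matrix):
    """Recall of class i = matrix[i][i] / sum(matrix[i]) (rows are true classes)."""
    recalls = []
    for class_id, row in enumerate(matrix):
        actual_count = sum(row)
        recalls.append(row[class_id] / actual_count if actual_count else 0.0)
    return recalls


# Made-up confusion matrix for three classes.
example_matrix = [
    [2, 1, 0],
    [0, 2, 0],
    [1, 0, 2],
]

example_recalls = recall_per_class(example_matrix)
for class_id, value in enumerate(example_recalls):
    print(f"Recall for class {class_id}: {value:.2f}")
```

In a medical setting, the recall of the malignant class is the share of malignant cases the model actually catches, which is why it is reported per class rather than as a single average.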

Let's print the recall scores using scikit-learn

>>> from sklearn.metrics import recall_score

>>> # Calculate the recall scores
>>> # 'None' calculates recall for each class separately
>>> recall = recall_score(y_true, y_pred, average=None)

>>> # Print the recall for each class
>>> for label, score in zip(labels, recall):
...     print(f"Recall for {label}: {score:.2f}")

Recall for benign: 0.90
Recall for malignant: 0.86
Recall for normal: 0.78

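The recall is 0.90 for benign, 0.86 for malignant, and 0.78 for normal.

Conclusion

In this cookbook, we covered how to train a ViT model on a medical dataset. It walks through the key steps: dataset preparation, image preprocessing, model configuration, training, evaluation, and result visualization. By leveraging Hugging Face's Transformers library, scikit-learn, and PyTorch Torchvision, it enables efficient model training and evaluation and provides valuable insight into the model's performance and its ability to classify biomedical images accurately.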
