Fine-tune ModernBERT for text classification using synthetic data

Community Article, published December 30, 2024

LLMs are great general-purpose models, but they are not always the best choice for a specific task. Smaller, more specialized models are therefore important for sustainable, efficient, and cheaper AI.

A common problem for these smaller, specialized models is the lack of a dedicated dataset: it is hard to find data that is both representative of and diverse enough for the task at hand. We solve this by generating a synthetic dataset from an LLM with the synthetic-data-generator, which is available as a Hugging Face Space or on GitHub.

In this example, we will fine-tune a ModernBERT model on a synthetic dataset generated with the synthetic-data-generator. This showcases the effectiveness of synthetic data and the performance of ModernBERT, a new and improved version of BERT with an 8192 token context length, significantly better downstream performance, and much faster processing speeds.

Install the dependencies

# Install Pytorch & other libraries
%pip install "torch==2.5.0" "torchvision==0.20.0" 
%pip install "setuptools<71.0.0" scikit-learn 
 
# Install Hugging Face libraries
%pip install --upgrade \
  "datasets==3.1.0" \
  "accelerate==1.2.1" \
  "hf-transfer==0.1.8"
 
# ModernBERT is not yet available in an official release, so we need to install it from github
%pip install "git+https://github.com/huggingface/transformers.git@6e0515e99c39444caae39472ee1b2fd76ece32f1" --upgrade

The problem

The nvidia/domain-classifier is a model that can classify the domain of a text, which is useful for data curation. The model is cool, but it is based on DeBERTa v3 Base, an outdated architecture that requires custom code to run, has a context length of 512 tokens, and is not as fast as ModernBERT. The labels for the model are:

'Adult', 'Arts_and_Entertainment', 'Autos_and_Vehicles', 'Beauty_and_Fitness', 'Books_and_Literature', 'Business_and_Industrial', 'Computers_and_Electronics', 'Finance', 'Food_and_Drink', 'Games', 'Health', 'Hobbies_and_Leisure', 'Home_and_Garden', 'Internet_and_Telecom', 'Jobs_and_Education', 'Law_and_Government', 'News', 'Online_Communities', 'People_and_Society', 'Pets_and_Animals', 'Real_Estate', 'Science', 'Sensitive_Subjects', 'Shopping', 'Sports', 'Travel_and_Transportation'

The data that was used to train this model is not available, so we cannot reuse it for our purposes. We can, however, generate synthetic data to solve this problem.

Let's generate some data

Let's go to the Hugging Face Space to generate the data. This is done in three steps: 1) we come up with a dataset description, 2) we iterate on the task configuration, and 3) we generate the data and push it to Hugging Face. A more detailed walkthrough can be found in this blog post.

For this example, we will generate 1000 samples with a temperature of 1. After a few iterations, we end up with the following system prompt:

Long texts (at least 2000 words) from various media sources like Wikipedia, Reddit, Common Crawl, websites, commercials, online forums, books, newspapers and folders that cover multiple topics. Classify the text based on its main subject matter into one of the following categories

We press the "Push to Hub" button and wait for the data to be generated. This takes a few minutes and we end up with a dataset of 1000 samples. The labels are well distributed across the categories, the lengths of the texts vary, and the texts look diverse and interesting.
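If you want to check this yourself, a quick look at the label distribution and text lengths is enough. The following is a minimal sketch using the datasets library; it only assumes the dataset id is the same one we fine-tune on below.

from collections import Counter
from datasets import load_dataset

ds = load_dataset("argilla/synthetic-domain-text-classification", split="train")

# Number of samples per category
label_names = ds.features["label"].names
for label_id, count in Counter(ds["label"]).most_common():
    print(f"{label_names[label_id]:<30} {count}")

# Rough length statistics (in words) to confirm the texts are long and varied
lengths = [len(text.split()) for text in ds["text"]]
print(f"min={min(lengths)}, max={max(lengths)}, mean={sum(lengths) / len(lengths):.0f}")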

The data is also pushed to Argilla, so we recommend inspecting and validating the labels before fine-tuning the model.

Fine-tune the ModernBERT model

We will largely rely on the blog post by Philipp Schmid. I will use basic consumer hardware, my Apple M1 Max with 32GB of shared memory. We will use the datasets library to load the data and the transformers library to fine-tune the model.

from datasets import load_dataset
from datasets.arrow_dataset import Dataset
from datasets.dataset_dict import DatasetDict, IterableDatasetDict
from datasets.iterable_dataset import IterableDataset
 
# Dataset id from huggingface.co/dataset
dataset_id = "argilla/synthetic-domain-text-classification"
 
# Load raw dataset
train_dataset = load_dataset(dataset_id, split='train')

split_dataset = train_dataset.train_test_split(test_size=0.1)
split_dataset['train'][0]
# {'text': 'Recently, there has been an increase in property values within the suburban areas of several cities due to improvements in infrastructure and lifestyle amenities such as parks, retail stores, and educational institutions nearby. Additionally, new housing developments are emerging, catering to different family needs with varying sizes and price ranges. These changes have influenced investment decisions for many looking to buy or sell properties.', 'label': 14}

First, we need to tokenize the data. We will use the AutoTokenizer class from the transformers library to load the tokenizer.

from transformers import AutoTokenizer
 
# Model id to load the tokenizer
model_id = "answerdotai/ModernBERT-base"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
 
# Tokenize helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True, return_tensors="pt")
 
# Tokenize dataset
if "label" in split_dataset["train"].features.keys():
    split_dataset =  split_dataset.rename_column("label", "labels") # to match Trainer
tokenized_dataset = split_dataset.map(tokenize, batched=True, remove_columns=["text"])
 
tokenized_dataset["train"].features.keys()
# dict_keys(['labels', 'input_ids', 'attention_mask'])

Now we need to prepare the model. We will use the AutoModelForSequenceClassification class from the transformers library to load the model.

from transformers import AutoModelForSequenceClassification
 
# Model id to load the model
model_id = "answerdotai/ModernBERT-base"
 
# Prepare model labels - useful for inference
labels = tokenized_dataset["train"].features["labels"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label
 
# Download the model from huggingface.co/models
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=num_labels, label2id=label2id, id2label=id2label,
)

We will use a simple F1 score as the evaluation metric.

import numpy as np
from sklearn.metrics import f1_score
 
# Metric helper method
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    # weighted F1 across all classes
    score = f1_score(labels, predictions, average="weighted")
    return {"f1": float(score)}

Finally, we need to define the training arguments. We will use the TrainingArguments class from the transformers library to configure the training run.

from huggingface_hub import HfFolder
from transformers import Trainer, TrainingArguments
 
# Define training args
training_args = TrainingArguments(
    output_dir="ModernBERT-domain-classifier",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=5,
    bf16=True, # bfloat16 training
    optim="adamw_torch_fused", # improved optimizer
    # logging & evaluation strategies
    logging_strategy="steps",
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    use_mps_device=True,
    metric_for_best_model="f1",
    # push to hub parameters
    push_to_hub=True,
    hub_strategy="every_save",
    hub_token=HfFolder.get_token(),
)
 
# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
# {'train_runtime': 3642.7783, 'train_samples_per_second': 1.235, 'train_steps_per_second': 0.04, 'train_loss': 0.535627057634551, 'epoch': 5.0}

We achieve an F1 score of 0.89 on the test set, which is quite good for a small dataset and the limited time spent.
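If you want to reproduce that number after training, the held-out split can be re-evaluated directly with the trainer. A minimal sketch:

# Re-evaluate the best checkpoint (load_best_model_at_end=True) on the test split
metrics = trainer.evaluate(tokenized_dataset["test"])
print(metrics)
# the "eval_f1" entry should be around the 0.89 reported above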

Run inference

We can now load the model and run inference.

from transformers import pipeline
 
# load model from huggingface.co/models using our repository id
classifier = pipeline(
    task="text-classification", 
    model="argilla/ModernBERT-domain-classifier", 
    device=0,
)
 
sample = "Smoking is bad for your health."
 
classifier(sample)
# [{'label': 'health', 'score': 0.6779336333274841}]
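The pipeline also accepts a list of texts, and top_k can be used to inspect the highest-scoring domains for each one. A small sketch with made-up example sentences:

samples = [
    "The new electric SUV offers a 500 km range and fast charging.",
    "The court ruled that the regulation violates constitutional rights.",
]

# Return the three highest-scoring labels per text
for text, predictions in zip(samples, classifier(samples, top_k=3)):
    print(text)
    for prediction in predictions:
        print(f"  {prediction['label']}: {prediction['score']:.2f}")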

Conclusion

We have shown how to generate a synthetic dataset from an LLM and how to fine-tune a ModernBERT model on it. This showcases the effectiveness of synthetic data and of the new ModernBERT model, a new and improved version of BERT with an 8192 token context length, significantly better downstream performance, and much faster processing speeds.

Twenty minutes to generate the data and an hour of fine-tuning on consumer hardware: not bad at all.

Community

Training a model with 8192 max_len requires a lot of GPU

Hi @davidberenstein1957, this code doesn't seem to work with transformers 4.49.0. Any ideas? I'm seeing an eval_f1 of 0.007867705980913528...

I get this output:

python3 train4.py
Parameter 'function'=<function tokenize at 0x7fec4c3b6b90> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 2251.77 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 2668.59 examples/s]
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.115044247787611e-05, 'epoch': 0.88}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 11.5539, 'eval_samples_per_second': 8.655, 'eval_steps_per_second': 1.125, 'epoch': 1.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.230088495575221e-05, 'epoch': 1.77}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 0.3503, 'eval_samples_per_second': 285.465, 'eval_steps_per_second': 37.11, 'epoch': 2.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.345132743362832e-05, 'epoch': 2.65}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 0.3496, 'eval_samples_per_second': 286.027, 'eval_steps_per_second': 37.184, 'epoch': 3.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.4601769911504426e-05, 'epoch': 3.54}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 0.3529, 'eval_samples_per_second': 283.348, 'eval_steps_per_second': 36.835, 'epoch': 4.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5.752212389380531e-06, 'epoch': 4.42}
{'eval_loss': nan, 'eval_f1': 0.007867705980913528, 'eval_runtime': 0.3147, 'eval_samples_per_second': 317.753, 'eval_steps_per_second': 41.308, 'epoch': 5.0}
{'train_runtime': 149.6166, 'train_samples_per_second': 30.077, 'train_steps_per_second': 3.776, 'train_loss': 0.0, 'epoch': 5.0}                  
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 565/565 [02:29<00:00,  3.78it/s]
Device set to use cuda:0
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (8)
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING]    function: 'compiled_mlp' (/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/modernbert/modeling_modernbert.py:552)
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING]    last reason: ___check_global_state()
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[2025-03-26 15:16:13,918] torch._dynamo.convert_frame: [WARNING] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
[{'label': 'business-and-industrial', 'score': nan}]

Full code

from datasets import load_dataset
from datasets.arrow_dataset import Dataset
from datasets.dataset_dict import DatasetDict, IterableDatasetDict
from datasets.iterable_dataset import IterableDataset
import os
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"
# UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
torch.set_float32_matmul_precision('high')

# Dataset id from huggingface.co/dataset
dataset_id = "argilla/synthetic-domain-text-classification"
 
# Load raw dataset
train_dataset = load_dataset(dataset_id, split='train')

split_dataset = train_dataset.train_test_split(test_size=0.1)
split_dataset['train'][0]
# {'text': 'Recently, there has been an increase in property values within the suburban areas of several cities due to improvements in infrastructure and lifestyle amenities such as parks, retail stores, and educational institutions nearby. Additionally, new housing developments are emerging, catering to different family needs with varying sizes and price ranges. These changes have influenced investment decisions for many looking to buy or sell properties.', 'label': 14}

from transformers import AutoTokenizer
 
# Model id to load the tokenizer
model_id = "answerdotai/ModernBERT-base"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
 
# Tokenize helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True, return_tensors="pt")
 
# Tokenize dataset
if "label" in split_dataset["train"].features.keys():
    split_dataset =  split_dataset.rename_column("label", "labels") # to match Trainer
tokenized_dataset = split_dataset.map(tokenize, batched=True, remove_columns=["text"])
 
tokenized_dataset["train"].features.keys()
# dict_keys(['labels', 'input_ids', 'attention_mask'])

from transformers import AutoModelForSequenceClassification
 
# Model id to load the tokenizer
model_id = "answerdotai/ModernBERT-base"
 
# Prepare model labels - useful for inference
labels = tokenized_dataset["train"].features["labels"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label
 
# Download the model from huggingface.co/models
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=num_labels, label2id=label2id, id2label=id2label,
)

import numpy as np
from sklearn.metrics import f1_score
 
# Metric helper method
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    score = f1_score(
            labels, predictions, labels=labels, pos_label=1, average="weighted"
        )
    return {"f1": float(score) if score == 1 else score}

from huggingface_hub import HfFolder
from transformers import Trainer, TrainingArguments
 
# Define training args
training_args = TrainingArguments(
    output_dir = "ModernBERT-domain-classifier",
    per_device_train_batch_size=8,#32,
    per_device_eval_batch_size=8,#16,
    learning_rate=5e-5,
    num_train_epochs=5,
    bf16=True, # bfloat16 training 
    optim="adamw_torch_fused", # improved optimizer 
    # logging & evaluation strategies
    logging_strategy="steps",
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    #use_mps_device=True,
    metric_for_best_model="f1",
    # push to hub parameters
    push_to_hub=False,
    hub_strategy="every_save",
    hub_token=HfFolder.get_token(),
)
 
# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
# {'train_runtime': 3642.7783, 'train_samples_per_second': 1.235, 'train_steps_per_second': 0.04, 'train_loss': 0.535627057634551, 'epoch': 5.0}

from transformers import pipeline
 
model_save_path = "ModernBERT-domain-classifier-save"
trainer.save_model(model_save_path)
# Save processor and create model card
tokenizer.save_pretrained(model_save_path)

# load model from huggingface.co/models using our repository id
classifier = pipeline(
    task="text-classification", 
    model=model_save_path, 
    device=0,
)
 
sample = "Smoking is bad for your health."

print(classifier(sample))
# [{'label': 'health', 'score': 0.6779336333274841}]

Where do you see the F1 score of 0.89?
