Text classification

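Text classification is a common NLP task that assigns a label or class to text. Some of the largest companies run text classification in production for a wide range of practical applications. One of the most popular forms of text classification is sentiment analysis, which assigns a label such as 🙂 positive, 🙁 negative, or 😐 neutral to a sequence of text.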

This guide will show you how to:

Fine-tune DistilBERT on the IMDb dataset to determine whether a movie review is positive or negative.
Use your fine-tuned model for inference.

To see all architectures and checkpoints compatible with this task, we recommend checking the task page.

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate accelerate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Load the IMDb dataset

Start by loading the IMDb dataset from the 🤗 Datasets library:

>>> from datasets import load_dataset

>>> imdb = load_dataset("imdb")

Then take a look at an example:

>>> imdb["test"][0]
{
    "label": 0,
    "text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to always say \"Gene Roddenberry's Earth...\" otherwise people would not continue watching. Roddenberry's ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.",
}

There are two fields in this dataset:

text: the movie review text.
label: a value that is either 0 for a negative review or 1 for a positive review.

Preprocess

The next step is to load a DistilBERT tokenizer to preprocess the text field:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

Create a preprocessing function that tokenizes text and truncates sequences so they are no longer than DistilBERT's maximum input length:

>>> def preprocess_function(examples):
...     return tokenizer(examples["text"], truncation=True)

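To apply the preprocessing function to the entire dataset, use the 🤗 Datasets map function. You can speed up map by setting batched=True, which processes multiple elements of the dataset at once: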

>>> tokenized_imdb = imdb.map(preprocess_function, batched=True)
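To make the effect of batched=True more concrete, here is a minimal, library-free sketch. It assumes a toy whitespace tokenizer; toy_tokenize, toy_preprocess, and MAX_LENGTH are hypothetical names for illustration and are not part of 🤗 Transformers or 🤗 Datasets. It only shows the shape of the data: in batched mode the function receives a dict whose values are lists covering several examples, and it returns a dict of lists of the same length.

```python
# Illustrative only: a stand-in for a real tokenizer, not the DistilBERT one.
MAX_LENGTH = 5  # hypothetical limit; a real model's limit is much larger


def toy_tokenize(text, max_length=MAX_LENGTH):
    """Split on whitespace and truncate to at most max_length tokens."""
    return text.lower().split()[:max_length]


def toy_preprocess(examples):
    """Batched-style function: examples["text"] is a list of strings."""
    return {"tokens": [toy_tokenize(t) for t in examples["text"]]}


# One "batch" as a batched map would pass it: a dict of equally long lists.
batch = {
    "text": ["A truly great movie", "Dull , slow and far too long for me"],
    "label": [1, 0],
}
processed = toy_preprocess(batch)
print([len(tokens) for tokens in processed["tokens"]])
```

The second review is longer than MAX_LENGTH, so it is cut off, which is the same idea as truncation=True in the real tokenizer call.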

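Now create a batch of examples using DataCollatorWithPadding. It is more efficient to dynamically pad the sentences to the longest length in a batch during collation than to pad the whole dataset to the maximum length.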

PyTorch

>>> from transformers import DataCollatorWithPadding

>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

TensorFlow

>>> from transformers import DataCollatorWithPadding

>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
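The saving from dynamic padding can be illustrated with a small pure-Python sketch. The helper pad_batch below is hypothetical and does not reproduce the implementation of DataCollatorWithPadding; the token IDs are made up, and 512 is used as a stand-in for a fixed maximum length.

```python
def pad_batch(sequences, pad_id=0, pad_to=None):
    """Pad every sequence to pad_to, or to the longest sequence in the batch."""
    target = pad_to if pad_to is not None else max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (target - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (target - len(s)) for s in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs for three short reviews.
batch = [[101, 2023, 102], [101, 2307, 3185, 999, 102], [101, 102]]

dynamic = pad_batch(batch)            # pad to the longest item in this batch
fixed = pad_batch(batch, pad_to=512)  # pad everything to a fixed maximum

dynamic_width = len(dynamic["input_ids"][0])
fixed_width = len(fixed["input_ids"][0])
print(dynamic_width, fixed_width)
```

With dynamic padding each row in this batch holds 5 positions instead of 512, so the model processes far fewer padding tokens.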

Evaluate

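Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):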

>>> import evaluate

>>> accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to compute to calculate the accuracy:

>>> import numpy as np


>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)

Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.
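As a rough illustration of what compute_metrics computes, the following library-free sketch takes the arg max over each row of logits and measures the fraction of matches. The names argmax and accuracy_from_logits and the logits themselves are hypothetical; the real function relies on NumPy and 🤗 Evaluate as shown above.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four reviews; columns are [NEGATIVE, POSITIVE].
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 3.0]]
labels = [0, 1, 1, 1]

result = accuracy_from_logits(logits, labels)
print(result)  # {'accuracy': 0.75}
```

Three of the four predictions match their labels here, hence the accuracy of 0.75.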

Train

Before you start training your model, create a map of the expected IDs to their labels with id2label and label2id:

>>> id2label = {0: "NEGATIVE", 1: "POSITIVE"}
>>> label2id = {"NEGATIVE": 0, "POSITIVE": 1}
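The two dictionaries must be exact inverses of each other. One way to avoid typing the mapping twice, shown here only as an optional sketch, is to derive label2id from id2label and check the round trip:

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Invert the mapping instead of writing it out a second time.
label2id = {label: idx for idx, label in id2label.items()}

# Sanity check: every ID maps to a label that maps back to the same ID.
assert all(label2id[label] == idx for idx, label in id2label.items())
print(label2id)  # {'NEGATIVE': 0, 'POSITIVE': 1}
```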

PyTorch

If you aren't familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial first!

You're ready to start training your model now! Load DistilBERT with AutoModelForSequenceClassification, along with the number of expected labels and the label mappings:

>>> from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

>>> model = AutoModelForSequenceClassification.from_pretrained(
...     "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
... )

At this point, only three steps remain:

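Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. To push this model to the Hub, set push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer evaluates the accuracy and saves the training checkpoint.
Pass the training arguments to Trainer along with the model, datasets, tokenizer, data collator, and compute_metrics function.
Call train() to fine-tune your model.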

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_model",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     num_train_epochs=2,
...     weight_decay=0.01,
...     eval_strategy="epoch",
...     save_strategy="epoch",
...     load_best_model_at_end=True,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_imdb["train"],
...     eval_dataset=tokenized_imdb["test"],
...     processing_class=tokenizer,
...     data_collator=data_collator,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()

When you pass a tokenizer to Trainer, it applies dynamic padding by default. In that case, you don't need to specify a data collator explicitly.

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

TensorFlow

If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial first!

To fine-tune a model in TensorFlow, start by setting up an optimizer function, a learning rate schedule, and some training hyperparameters:

>>> from transformers import create_optimizer
>>> import tensorflow as tf

>>> batch_size = 16
>>> num_epochs = 5
>>> batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
>>> total_train_steps = int(batches_per_epoch * num_epochs)
>>> optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
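The step arithmetic above can be checked by hand. The sketch below assumes that the IMDb train split holds 25,000 reviews and that, with num_warmup_steps=0, the schedule decays linearly from init_lr to zero over total_train_steps. That decay shape reflects our understanding of the defaults and should be verified against the create_optimizer documentation; linear_decay_lr is a hypothetical helper, not library code.

```python
num_train_examples = 25_000  # assumed size of the IMDb train split
batch_size = 16
num_epochs = 5
init_lr = 2e-5

# Same arithmetic as above: whole batches per epoch, times the epoch count.
batches_per_epoch = num_train_examples // batch_size
total_train_steps = batches_per_epoch * num_epochs


def linear_decay_lr(step, init_lr=init_lr, total_steps=total_train_steps):
    """Hypothetical helper: linear decay from init_lr to 0, no warmup."""
    remaining = max(total_steps - step, 0)
    return init_lr * remaining / total_steps


print(batches_per_epoch, total_train_steps)  # 1562 7810
for step in (0, total_train_steps // 2, total_train_steps):
    print(step, linear_decay_lr(step))
```

Because the schedule length is derived from num_epochs, training for fewer epochs than that value would stop before the learning rate has decayed all the way.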

Then you can load DistilBERT with TFAutoModelForSequenceClassification, along with the number of expected labels and the label mappings:

>>> from transformers import TFAutoModelForSequenceClassification

>>> model = TFAutoModelForSequenceClassification.from_pretrained(
...     "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
... )

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_imdb["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_validation_set = model.prepare_tf_dataset(
...     tokenized_imdb["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

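Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: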

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument!

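The last two things to set up before you start training are computing the accuracy from the predictions and providing a way to push your model to the Hub. Both are done with Keras callbacks.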

Pass your compute_metrics function to KerasMetricCallback:

>>> from transformers.keras_callbacks import KerasMetricCallback

>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

Specify where to push your model and tokenizer in the PushToHubCallback:

>>> from transformers.keras_callbacks import PushToHubCallback

>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="my_awesome_model",
...     tokenizer=tokenizer,
... )

Then bundle your callbacks together:

>>> callbacks = [metric_callback, push_to_hub_callback]

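Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callbacks to fine-tune the model: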

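>>> # Reuse num_epochs so training matches the length the learning rate schedule was built for
>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=num_epochs, callbacks=callbacks)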

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to fine-tune a model for text classification, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Grab some text you'd like to run inference on:

>>> text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

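The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Instantiate a pipeline for sentiment analysis with your model, and pass your text to it: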

>>> from transformers import pipeline

>>> classifier = pipeline("sentiment-analysis", model="stevhliu/my_awesome_model")
>>> classifier(text)
[{'label': 'POSITIVE', 'score': 0.9994940757751465}]

You can also manually replicate the results of the pipeline if you'd like:

PyTorch

Tokenize the text and return PyTorch tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
>>> inputs = tokenizer(text, return_tensors="pt")

Pass your inputs to the model and return the logits:

>>> import torch
>>> from transformers import AutoModelForSequenceClassification

>>> model = AutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits

Get the class with the highest probability, and use the model's id2label mapping to convert it to a text label:

>>> predicted_class_id = logits.argmax().item()
>>> model.config.id2label[predicted_class_id]
'POSITIVE'

TensorFlow

Tokenize the text and return TensorFlow tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
>>> inputs = tokenizer(text, return_tensors="tf")

Pass your inputs to the model and return the logits:

>>> import tensorflow as tf
>>> from transformers import TFAutoModelForSequenceClassification

>>> model = TFAutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
>>> logits = model(**inputs).logits

Get the class with the highest probability, and use the model's id2label mapping to convert it to a text label:

>>> predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
>>> model.config.id2label[predicted_class_id]
'POSITIVE'
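The manual steps above stop at the arg max of the logits. To also recover a score comparable to the one the pipeline reports, the logits can be passed through a softmax; for a two-label model like this one, we assume that is how the pipeline derives its score. The sketch below uses only the standard library, and the logits are hypothetical values, not the output of the fine-tuned model.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-3.2, 4.4]  # hypothetical logits for a single review

probs = softmax(logits)
# The arg max of the probabilities is the same as the arg max of the logits.
predicted_class_id = max(range(len(probs)), key=lambda i: probs[i])
print(id2label[predicted_class_id], round(probs[predicted_class_id], 4))
```

Softmax preserves the ordering of the logits, which is why taking the arg max directly on the logits yields the same label.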
