Text classification
Text classification is a common NLP task that assigns a label or class to text. Some of the largest companies run text classification in production for a wide range of practical applications. One of the most popular forms of text classification is sentiment analysis, which assigns a label like 🙂 positive, 🙁 negative, or 😐 neutral to a sequence of text.
This guide will show you how to:
- Finetune DistilBERT on the IMDb dataset to determine whether a movie review is positive or negative.
- Use your finetuned model for inference.
To see all architectures and checkpoints compatible with this task, we recommend checking the task page.
Before you begin, make sure you have all the necessary libraries installed:
pip install transformers datasets evaluate accelerate
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
>>> from huggingface_hub import notebook_login
>>> notebook_login()

Load IMDb dataset
Start by loading the IMDb dataset from the 🤗 Datasets library:
>>> from datasets import load_dataset
>>> imdb = load_dataset("imdb")
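As a quick look before picking an example: the IMDb dataset ships with train, test, and unsupervised splits, which you can confirm by printing the DatasetDict:

>>> imdb  # expect train (25000 rows), test (25000 rows), and unsupervised (50000 rows) splits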
>>> imdb["test"][0]
{
"label": 0,
"text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to always say \"Gene Roddenberry's Earth...\" otherwise people would not continue watching. Roddenberry's ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.",
}

There are two fields in this dataset:
- text: the movie review text.
- label: a value that is 0 for a negative review and 1 for a positive review.
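As a quick sanity check, you can confirm that the train split is balanced between the two classes:

>>> from collections import Counter
>>> Counter(imdb["train"]["label"])  # expect 12500 examples of each label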
Preprocess
The next step is to load a DistilBERT tokenizer to preprocess the text field:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
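If you're curious what the tokenizer returns, try it on a short made-up sentence:

>>> encoding = tokenizer("Movies are fun!")
>>> encoding.keys()  # dict_keys(['input_ids', 'attention_mask'])
>>> tokenizer.convert_ids_to_tokens(encoding["input_ids"])  # ['[CLS]', 'movies', 'are', 'fun', '!', '[SEP]']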
Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT's maximum input length:

>>> def preprocess_function(examples):
...     return tokenizer(examples["text"], truncation=True)

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once:
tokenized_imdb = imdb.map(preprocess_function, batched=True)

Now create a batch of examples using DataCollatorWithPadding. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
>>> from transformers import DataCollatorWithPadding
>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
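To see dynamic padding in action, here is a minimal sketch with two made-up sentences; the collator pads both to the length of the longer one rather than to the model maximum:

>>> features = [tokenizer("a short review"), tokenizer("a noticeably longer review with many more words in it")]
>>> batch = data_collator(features)
>>> batch["input_ids"].shape  # torch.Size([2, N]), where N is the length of the longest sequence in the batch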
Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):
>>> import evaluate
>>> accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to compute to calculate the accuracy:
>>> import numpy as np
>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)

Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.
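As a quick sanity check, you can call the function on some made-up logits:

>>> import numpy as np
>>> fake_logits = np.array([[-1.0, 1.0], [2.0, -2.0]])  # argmax predicts class 1, then class 0
>>> fake_labels = np.array([1, 0])
>>> compute_metrics((fake_logits, fake_labels))
{'accuracy': 1.0}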
Train
Before you start training your model, create a map of the expected ids to their labels with id2label and label2id:
>>> id2label = {0: "NEGATIVE", 1: "POSITIVE"}
>>> label2id = {"NEGATIVE": 0, "POSITIVE": 1}

You're ready to start training your model now! Load DistilBERT with AutoModelForSequenceClassification along with the number of expected labels, and the label mappings:
>>> from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
>>> model = AutoModelForSequenceClassification.from_pretrained(
... "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
... )

At this point, only three steps remain:
- Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. By setting push_to_hub=True, you'll push this model to the Hub (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer will evaluate the accuracy and save the training checkpoint.
- Pass the training arguments to Trainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
- Call train() to finetune your model.
>>> training_args = TrainingArguments(
... output_dir="my_awesome_model",
... learning_rate=2e-5,
... per_device_train_batch_size=16,
... per_device_eval_batch_size=16,
... num_train_epochs=2,
... weight_decay=0.01,
... eval_strategy="epoch",
... save_strategy="epoch",
... load_best_model_at_end=True,
... push_to_hub=True,
... )
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=tokenized_imdb["train"],
... eval_dataset=tokenized_imdb["test"],
... processing_class=tokenizer,
... data_collator=data_collator,
... compute_metrics=compute_metrics,
... )
>>> trainer.train()

Trainer applies dynamic padding by default when you pass tokenizer to it. In this case, you don't need to specify a data collator explicitly.
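If you'd like a final round of metrics on the evaluation split after training, you can call the Trainer's evaluate() method (the exact keys in the returned dict depend on your compute_metrics):

>>> trainer.evaluate()  # returns a dict with entries like eval_loss and eval_accuracy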
Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:
>>> trainer.push_to_hub()

For a more in-depth example of how to finetune a model for text classification, take a look at the corresponding PyTorch notebook.
Inference
Great, now that you've finetuned a model, you can use it for inference!
Grab some text you'd like to run inference on:
>>> text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."尝试对您的微调模型进行推理的最简单方法是将其用于 pipeline()。使用您的模型实例化一个用于情感分析的 pipeline,然后将您的文本传递给它
>>> from transformers import pipeline
>>> classifier = pipeline("sentiment-analysis", model="stevhliu/my_awesome_model")
>>> classifier(text)
[{'label': 'POSITIVE', 'score': 0.9994940757751465}]
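Note that some IMDb reviews are longer than DistilBERT's 512-token limit; since the pipeline forwards extra keyword arguments to its tokenizer, you can ask it to truncate long inputs:

>>> classifier(text, truncation=True)  # truncates any input longer than the model's maximum length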
You can also manually replicate the results of the pipeline if you'd like:

Tokenize the text and return PyTorch tensors:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
>>> inputs = tokenizer(text, return_tensors="pt")

Pass your inputs to the model and return the logits:
>>> import torch
>>> from transformers import AutoModelForSequenceClassification
>>> model = AutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits

Get the class with the highest probability, and use the model's id2label mapping to convert it to a text label:
>>> predicted_class_id = logits.argmax().item()
>>> model.config.id2label[predicted_class_id]
'POSITIVE'
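If you also want a confidence score like the pipeline's, normalize the logits with a softmax:

>>> probabilities = torch.softmax(logits, dim=-1)
>>> round(probabilities[0, predicted_class_id].item(), 4)  # should roughly match the pipeline's score above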