Multiple choice
A multiple choice task is similar to question answering, except that several candidate answers are provided along with the context, and the model is trained to select the correct answer.
This guide will show you how to:
- Fine-tune BERT on the regular configuration of the SWAG dataset to select the best answer given multiple options and some context.
- Use your fine-tuned model for inference.
Before you begin, make sure you have all the necessary libraries installed:
pip install transformers datasets evaluate
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
>>> from huggingface_hub import notebook_login
>>> notebook_login()
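If you're working from a terminal instead of a notebook, you can also log in with the Hugging Face CLI:
huggingface-cli login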
Load SWAG dataset
Start by loading the regular configuration of the SWAG dataset from the 🤗 Datasets library:
>>> from datasets import load_dataset
>>> swag = load_dataset("swag", "regular")
Then take a look at an example:
>>> swag["train"][0]
{'ending0': 'passes by walking down the street playing their instruments.',
'ending1': 'has heard approaching them.',
'ending2': "arrives and they're outside dancing and asleep.",
'ending3': 'turns the lead singer watches the performance.',
'fold-ind': '3416',
'gold-source': 'gold',
'label': 0,
'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
'sent2': 'A drum line',
'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
'video-id': 'anetv_jkn6uvmqwh4'}
While it looks like there are a lot of fields here, it is actually pretty straightforward:
- sent1 and sent2: these fields show how a sentence starts, and if you put the two together, you get the startphrase field.
- ending0, ending1, ending2, ending3: suggest a possible ending for the sentence, but only one of them is correct.
- label: identifies the correct sentence ending.
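As a quick sanity check, you can rebuild the four candidate sequences for the first training example by pairing the startphrase with each ending and confirm that label points at the correct one:
>>> example = swag["train"][0]
>>> # Pair the shared context with each of the four proposed endings.
>>> candidates = [example["startphrase"] + " " + example[f"ending{i}"] for i in range(4)]
>>> candidates[example["label"]]
'Members of the procession walk down the street holding small horn brass instruments. A drum line passes by walking down the street playing their instruments.'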
Preprocess
The next step is to load a BERT tokenizer to process the sentence starts and the four possible endings:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
The preprocessing function you want to create needs to:
- Make four copies of the sent1 field and combine each of them with sent2 to recreate how a sentence starts.
- Combine sent2 with each of the four possible sentence endings.
- Flatten these two lists so you can tokenize them, and then unflatten them afterward so each example has corresponding input_ids, attention_mask, and labels fields.
>>> ending_names = ["ending0", "ending1", "ending2", "ending3"]
>>> def preprocess_function(examples):
...     first_sentences = [[context] * 4 for context in examples["sent1"]]
...     question_headers = examples["sent2"]
...     second_sentences = [
...         [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
...     ]
...     first_sentences = sum(first_sentences, [])
...     second_sentences = sum(second_sentences, [])
...     tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
...     return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
To apply the preprocessing function over the entire dataset, use 🤗 Datasets' map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:
tokenized_swag = swag.map(preprocess_function, batched=True)
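As an optional check, each example should now contain four encoded candidate sequences, one per possible ending:
>>> len(tokenized_swag["train"][0]["input_ids"])
4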
🤗 Transformers doesn't have a data collator for multiple choice, so you'll need to adapt DataCollatorWithPadding to create a batch of examples. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. DataCollatorForMultipleChoice flattens all the model inputs, applies padding, and then unflattens the results:
>>> from dataclasses import dataclass
>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
>>> from typing import Optional, Union
>>> import torch
>>> @dataclass
... class DataCollatorForMultipleChoice:
... """
... Data collator that will dynamically pad the inputs for multiple choice received.
... """
... tokenizer: PreTrainedTokenizerBase
... padding: Union[bool, str, PaddingStrategy] = True
... max_length: Optional[int] = None
... pad_to_multiple_of: Optional[int] = None
... def __call__(self, features):
... label_name = "label" if "label" in features[0].keys() else "labels"
... labels = [feature.pop(label_name) for feature in features]
... batch_size = len(features)
... num_choices = len(features[0]["input_ids"])
... flattened_features = [
... [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
... ]
... flattened_features = sum(flattened_features, [])
... batch = self.tokenizer.pad(
... flattened_features,
... padding=self.padding,
... max_length=self.max_length,
... pad_to_multiple_of=self.pad_to_multiple_of,
... return_tensors="pt",
... )
... batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
... batch["labels"] = torch.tensor(labels, dtype=torch.int64)
... return batch
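As an optional sanity check, you can run the PyTorch collator on a couple of tokenized examples and confirm that the padded batch comes back with shape (batch_size, num_choices, sequence_length):
>>> collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
>>> # Keep only the tokenized inputs and the label; token_type_ids is dropped here just to keep the example short.
>>> features = [
...     {k: v for k, v in tokenized_swag["train"][i].items() if k in ("input_ids", "attention_mask", "label")}
...     for i in range(2)
... ]
>>> batch = collator(features)
>>> batch["input_ids"].shape[:2]  # the last dimension is the padded sequence length
torch.Size([2, 4])
>>> batch["labels"].shape
torch.Size([2])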
>>> from dataclasses import dataclass
>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
>>> from typing import Optional, Union
>>> import tensorflow as tf
>>> @dataclass
... class DataCollatorForMultipleChoice:
... """
... Data collator that will dynamically pad the inputs for multiple choice received.
... """
... tokenizer: PreTrainedTokenizerBase
... padding: Union[bool, str, PaddingStrategy] = True
... max_length: Optional[int] = None
... pad_to_multiple_of: Optional[int] = None
... def __call__(self, features):
... label_name = "label" if "label" in features[0].keys() else "labels"
... labels = [feature.pop(label_name) for feature in features]
... batch_size = len(features)
... num_choices = len(features[0]["input_ids"])
... flattened_features = [
... [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
... ]
... flattened_features = sum(flattened_features, [])
... batch = self.tokenizer.pad(
... flattened_features,
... padding=self.padding,
... max_length=self.max_length,
... pad_to_multiple_of=self.pad_to_multiple_of,
... return_tensors="tf",
... )
... batch = {k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()}
... batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64)
... return batch
Evaluate
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):
>>> import evaluate
>>> accuracy = evaluate.load("accuracy")
Then create a function that passes your predictions and labels to compute to calculate the accuracy:
>>> import numpy as np
>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     predictions = np.argmax(predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=labels)
Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.
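To see it in action before training, you can call it on a small batch of made-up logits and labels (the values below are purely illustrative):
>>> dummy_predictions = np.array([[0.1, 0.9, -0.3, 0.2], [1.2, 0.1, 0.0, -0.5]])  # scores for four endings, two examples
>>> dummy_labels = np.array([1, 0])
>>> compute_metrics((dummy_predictions, dummy_labels))
{'accuracy': 1.0}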
Train
You're ready to start training your model now! Load BERT with AutoModelForMultipleChoice:
>>> from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
>>> model = AutoModelForMultipleChoice.from_pretrained("google-bert/bert-base-uncased")
At this point, only three steps remain:
- Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer will evaluate the accuracy and save the training checkpoint.
- Pass the training arguments to Trainer, along with the model, dataset, tokenizer, data collator, and compute_metrics function.
- Call train() to fine-tune your model.
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_swag_model",
...     eval_strategy="epoch",
...     save_strategy="epoch",
...     load_best_model_at_end=True,
...     learning_rate=5e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     weight_decay=0.01,
...     push_to_hub=True,
... )
>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_swag["train"],
...     eval_dataset=tokenized_swag["validation"],
...     tokenizer=tokenizer,
...     data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
...     compute_metrics=compute_metrics,
... )
>>> trainer.train()
Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:
>>> trainer.push_to_hub()
If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial here! To fine-tune a model in TensorFlow, start by setting up an optimizer function, a learning rate schedule, and some training hyperparameters:
>>> from transformers import create_optimizer
>>> batch_size = 16
>>> num_train_epochs = 2
>>> total_train_steps = (len(tokenized_swag["train"]) // batch_size) * num_train_epochs
>>> optimizer, schedule = create_optimizer(init_lr=5e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
Then you can load BERT with TFAutoModelForMultipleChoice:
>>> from transformers import TFAutoModelForMultipleChoice
>>> model = TFAutoModelForMultipleChoice.from_pretrained("google-bert/bert-base-uncased")
Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():
>>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
>>> tf_train_set = model.prepare_tf_dataset(
... tokenized_swag["train"],
... shuffle=True,
... batch_size=batch_size,
... collate_fn=data_collator,
... )
>>> tf_validation_set = model.prepare_tf_dataset(
... tokenized_swag["validation"],
... shuffle=False,
... batch_size=batch_size,
... collate_fn=data_collator,
... )
Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:
>>> model.compile(optimizer=optimizer) # No loss argument!
The last two things to set up before you start training are to compute the accuracy from the predictions and to provide a way to push your model to the Hub. Both are done by using Keras callbacks.
Pass your compute_metrics function to KerasMetricCallback:
>>> from transformers.keras_callbacks import KerasMetricCallback
>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
Specify where to push your model and tokenizer in the PushToHubCallback:
>>> from transformers.keras_callbacks import PushToHubCallback
>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="my_awesome_model",
...     tokenizer=tokenizer,
... )
Then bundle your callbacks together:
>>> callbacks = [metric_callback, push_to_hub_callback]
Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callbacks to fine-tune the model:
>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=2, callbacks=callbacks)
Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!
For a more in-depth example of how to fine-tune a model for multiple choice, take a look at the corresponding PyTorch notebook or TensorFlow notebook.
Inference
Great, now that you've fine-tuned a model, you can use it for inference!
Come up with some text and two candidate answers:
>>> prompt = "France has a bread law, Le Décret Pain, with strict rules on what is allowed in a traditional baguette."
>>> candidate1 = "The law does not apply to croissants and brioche."
>>> candidate2 = "The law applies to baguettes."
Tokenize each prompt and candidate answer pair and return PyTorch tensors. You should also create some labels:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_swag_model")
>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="pt", padding=True)
>>> labels = torch.tensor(0).unsqueeze(0)
Pass your inputs and labels to the model and return the logits:
>>> from transformers import AutoModelForMultipleChoice
>>> model = AutoModelForMultipleChoice.from_pretrained("username/my_awesome_swag_model")
>>> outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()}, labels=labels)
>>> logits = outputs.logits
Get the class with the highest probability:
>>> predicted_class = logits.argmax().item()
>>> predicted_class
0
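Since the candidates were passed in the order [candidate1, candidate2], you can map the predicted index back to its text:
>>> [candidate1, candidate2][predicted_class]
'The law does not apply to croissants and brioche.'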
Tokenize each prompt and candidate answer pair and return TensorFlow tensors:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_swag_model")
>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="tf", padding=True)
Pass your inputs to the model and return the logits:
>>> from transformers import TFAutoModelForMultipleChoice
>>> model = TFAutoModelForMultipleChoice.from_pretrained("username/my_awesome_swag_model")
>>> inputs = {k: tf.expand_dims(v, 0) for k, v in inputs.items()}
>>> outputs = model(inputs)
>>> logits = outputs.logits
Get the class with the highest probability:
>>> predicted_class = int(tf.math.argmax(logits, axis=-1)[0])
>>> predicted_class
0