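Question answering

Question answering tasks return an answer given a question. If you've ever asked a virtual assistant like Alexa, Siri, or Google what the weather is, then you've used a question answering model before. There are two common types of question answering tasks:

Extractive: extract the answer from the given context.
Generative: generate an answer from the context that correctly answers the question.

This guide will show you how to:

Finetune DistilBERT on the SQuAD dataset for extractive question answering.
Use your finetuned model for inference.

To see all architectures and checkpoints compatible with this task, we recommend checking the task page.

Before you begin, make sure you have all the necessary libraries installed: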

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

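Load the SQuAD dataset

Start by loading a smaller subset of the SQuAD dataset from the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.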

>>> from datasets import load_dataset

>>> squad = load_dataset("squad", split="train[:5000]")

Split the dataset's train split into a train and test set with the train_test_split method:

>>> squad = squad.train_test_split(test_size=0.2)

Then take a look at an example:

>>> squad["train"][0]
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'
}

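There are several important fields here:

answers: the character offset at which the answer starts in the context (answer_start) and the answer text (text).
context: the background information from which the model needs to extract the answer.
question: the question the model should answer.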
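The relationship between answer_start and text can be checked with plain string slicing. The snippet below is a small sketch: the context is shortened to the single sentence that contains the answer (an assumption made for brevity, so the offset is not the 515 shown in the full record), and the variable names are illustrative only.

```python
# Sentence taken from the example record above; shortened for brevity,
# so the offset differs from the 515 reported for the full context.
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer_text = "Saint Bernadette Soubirous"

# In SQuAD, answer_start is a character index into the context string.
answer_start = context.index(answer_text)
answer_end = answer_start + len(answer_text)  # exclusive end offset

# Slicing the context with these offsets gives back the answer text.
recovered = context[answer_start:answer_end]
print(answer_start, answer_end, repr(recovered))
```

The preprocessing step later converts this character span into token positions, because the model predicts token indices rather than character indices.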

Preprocess

The next step is to load a DistilBERT tokenizer to process the question and context fields:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

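There are a few preprocessing steps particular to question answering tasks you should be aware of:

Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the context by setting truncation="only_second".
Next, map the start and end positions of the answer to the original context by setting return_offsets_mapping=True.
With the mapping in hand, you can find the start and end tokens of the answer. Use the sequence_ids method to find which part of the offset mapping corresponds to the question and which corresponds to the context.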
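The idea behind these steps can be sketched without a real tokenizer. In the toy example below, toy_encode is a hypothetical stand-in that splits on whitespace and returns per-token character offsets and sequence ids in the same layout the steps describe (None for special tokens, 0 for the question, 1 for the context). It is not the DistilBERT tokenizer and only illustrates how a character span is mapped to a token span.

```python
import re


def toy_encode(question, context):
    """Hypothetical whitespace 'tokenizer' laid out as
    [CLS] question [SEP] context [SEP]."""
    tokens, offsets, sequence_ids = ["[CLS]"], [(0, 0)], [None]
    for seq_id, text in enumerate((question, context)):
        for match in re.finditer(r"\S+", text):
            tokens.append(match.group())
            offsets.append((match.start(), match.end()))  # offsets into `text`
            sequence_ids.append(seq_id)
        tokens.append("[SEP]")
        offsets.append((0, 0))
        sequence_ids.append(None)
    return tokens, offsets, sequence_ids


question = "Where is the grotto?"
context = "The grotto is a replica of the one at Lourdes, France."
answer_text = "Lourdes, France"
start_char = context.index(answer_text)
end_char = start_char + len(answer_text)

tokens, offsets, sequence_ids = toy_encode(question, context)

# Context tokens are the ones whose sequence id is 1.
context_start = sequence_ids.index(1)
context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

# First context token that ends after the answer starts, and
# last context token that starts before the answer ends.
start_token = next(
    i for i in range(context_start, context_end + 1) if offsets[i][1] > start_char
)
end_token = next(
    i for i in range(context_end, context_start - 1, -1) if offsets[i][0] < end_char
)

print(start_token, end_token, tokens[start_token : end_token + 1])
```

A real subword tokenizer produces finer-grained offsets, but the search over the context portion of the offset mapping follows the same pattern.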

Here is how you can create a function to truncate the context and map the start and end tokens of the answer to it:

>>> def preprocess_function(examples):
...     questions = [q.strip() for q in examples["question"]]
...     inputs = tokenizer(
...         questions,
...         examples["context"],
...         max_length=384,
...         truncation="only_second",
...         return_offsets_mapping=True,
...         padding="max_length",
...     )

...     offset_mapping = inputs.pop("offset_mapping")
...     answers = examples["answers"]
...     start_positions = []
...     end_positions = []

...     for i, offset in enumerate(offset_mapping):
...         answer = answers[i]
...         start_char = answer["answer_start"][0]
...         end_char = answer["answer_start"][0] + len(answer["text"][0])
...         sequence_ids = inputs.sequence_ids(i)

...         # Find the start and end of the context
...         idx = 0
...         while sequence_ids[idx] != 1:
...             idx += 1
...         context_start = idx
...         while sequence_ids[idx] == 1:
...             idx += 1
...         context_end = idx - 1

...         # If the answer is not fully inside the context, label it (0, 0)
...         if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
...             start_positions.append(0)
...             end_positions.append(0)
...         else:
...             # Otherwise it's the start and end token positions
...             idx = context_start
...             while idx <= context_end and offset[idx][0] <= start_char:
...                 idx += 1
...             start_positions.append(idx - 1)

...             idx = context_end
...             while idx >= context_start and offset[idx][1] >= end_char:
...                 idx -= 1
...             end_positions.append(idx + 1)

...     inputs["start_positions"] = start_positions
...     inputs["end_positions"] = end_positions
...     return inputs

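To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once. Remove any columns you don't need: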

>>> tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

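Now create a batch of examples using DefaultDataCollator. Unlike other data collators in 🤗 Transformers, DefaultDataCollator does not apply any additional preprocessing such as padding. That is sufficient here because preprocess_function already pads every example to max_length.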

PyTorch

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()

TensorFlow

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator(return_tensors="tf")

Train

PyTorch

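If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial!

You're ready to start training your model now! Load DistilBERT with AutoModelForQuestionAnswering: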

>>> from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

>>> model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased")

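At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. Push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model).
Pass the training arguments to Trainer along with the model, datasets, tokenizer (as processing_class), and data collator.
Call train() to finetune your model.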

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_qa_model",
...     eval_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_squad["train"],
...     eval_dataset=tokenized_squad["test"],
...     processing_class=tokenizer,
...     data_collator=data_collator,
... )

>>> trainer.train()

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()

TensorFlow

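If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial!

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters: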

>>> from transformers import create_optimizer

>>> batch_size = 16
>>> num_epochs = 3
>>> total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
>>> optimizer, schedule = create_optimizer(
...     init_lr=2e-5,
...     num_warmup_steps=0,
...     num_train_steps=total_train_steps,
... )
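As a quick sanity check on the step count, the arithmetic can be worked through with the numbers used earlier in this guide (a 5000-example subset, test_size=0.2, and a batch size of 16). This is only a sketch: it assumes preprocessing keeps exactly one feature per example, which holds for the preprocessing function above, and total_steps is a hypothetical helper name.

```python
subset_size = 5000    # from split="train[:5000]"
test_fraction = 0.2   # from train_test_split(test_size=0.2)
batch_size = 16

num_test = round(subset_size * test_fraction)  # 1000 examples held out
num_train = subset_size - num_test             # 4000 examples for training
steps_per_epoch = num_train // batch_size      # 250 optimizer steps per epoch


def total_steps(num_epochs):
    """Number of training steps the learning rate schedule should span."""
    return steps_per_epoch * num_epochs


print(num_train, steps_per_epoch, total_steps(2), total_steps(3))
```

The value passed as num_train_steps should cover the same number of epochs that is later passed to fit, otherwise the learning rate schedule ends before training does.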

Then you can load DistilBERT with TFAutoModelForQuestionAnswering:

>>> from transformers import TFAutoModelForQuestionAnswering

>>> model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased")

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_squad["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_validation_set = model.prepare_tf_dataset(
...     tokenized_squad["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

Configure the model for training with compile:

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)

The last thing to set up before you start training is a way to push your model to the Hub. Do this by specifying where to push your model and tokenizer in the PushToHubCallback:

>>> from transformers.keras_callbacks import PushToHubCallback

>>> callback = PushToHubCallback(
...     output_dir="my_awesome_qa_model",
...     tokenizer=tokenizer,
... )

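Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to finetune the model: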

>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=[callback])

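Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to finetune a model for question answering, take a look at the corresponding PyTorch notebook or TensorFlow notebook.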

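Evaluate

Evaluation for question answering requires a significant amount of postprocessing. To avoid taking up too much of your time, this guide skips the evaluation step. The Trainer still calculates the evaluation loss during training, so you're not completely in the dark about your model's performance.

If you have more time and you're interested in how to evaluate your model for question answering, take a look at the Question answering chapter from the 🤗 Hugging Face Course!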
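To give a rough idea of what scoring involves once predictions have been turned back into text, here is a simplified sketch of the exact match and token-level F1 metrics commonly reported for SQuAD. It is not the official evaluation script and does not cover the postprocessing of logits into answer strings. The function names are illustrative, and the reference answer "13" is an assumption read off the BLOOM context used in the inference section of this guide.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A long predicted span that merely contains the right number scores poorly.
prediction = (
    "176 billion parameters and can generate text in 46 languages "
    "natural languages and 13"
)
reference = "13"  # assumed gold answer, for illustration only
print(exact_match(prediction, reference), f1_score(prediction, reference))
```

An overly long span that contains the correct answer gets an exact match of 0 and a low F1, which is why evaluation loss alone gives only a partial picture of answer quality.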

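Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with a question and some context you'd like the model to predict: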

>>> question = "How many programming languages does BLOOM support?"
>>> context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

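The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for question answering with your model, and pass your text to it: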

>>> from transformers import pipeline

>>> question_answerer = pipeline("question-answering", model="my_awesome_qa_model")
>>> question_answerer(question=question, context=context)
{'score': 0.2058267742395401,
 'start': 10,
 'end': 95,
 'answer': '176 billion parameters and can generate text in 46 languages natural languages and 13'}

You can also manually replicate the results of the pipeline if you'd like:

PyTorch

Tokenize the text and return PyTorch tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
>>> inputs = tokenizer(question, context, return_tensors="pt")

Pass your inputs to the model and return the logits:

>>> import torch
>>> from transformers import AutoModelForQuestionAnswering

>>> model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
>>> with torch.no_grad():
...     outputs = model(**inputs)

Take the argmax of the start and end logits to get the most likely start and end token indices:

>>> answer_start_index = outputs.start_logits.argmax()
>>> answer_end_index = outputs.end_logits.argmax()

Decode the predicted tokens to get the answer:

>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'

TensorFlow

Tokenize the text and return TensorFlow tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
>>> inputs = tokenizer(question, context, return_tensors="tf")

Pass your inputs to the model and return the logits:

>>> from transformers import TFAutoModelForQuestionAnswering

>>> model = TFAutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
>>> outputs = model(**inputs)

Take the argmax of the start and end logits to get the most likely start and end token indices:

>>> answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
>>> answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])

Decode the predicted tokens to get the answer:

>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'
