Extractive Question Answering with AutoTrain

Community article, published August 20, 2024

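Extractive question answering is a task in which a model is trained to extract the answer to a question from a given context. The model learns to predict the start and end positions of the answer within that context. The task is commonly used in question answering systems to pull relevant information out of a large text corpus.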
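To make the start/end prediction concrete, here is a minimal sketch of how an answer span could be picked once a model has produced one start score and one end score per token. The function name `best_span`, the `max_answer_len` limit, and the scores themselves are made up for illustration; this is not AutoTrain's own post-processing code.

```python
from typing import List, Tuple


def best_span(
    start_scores: List[float],
    end_scores: List[float],
    max_answer_len: int = 30,
) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest combined score.

    Only spans with start <= end and at most max_answer_len tokens are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Hypothetical tokens and scores, invented for this example.
tokens = ["the", "virgin", "mary", "appeared", "to", "saint", "bernadette", "soubirous"]
start_scores = [0.1, 0.2, 0.1, 0.0, 0.3, 4.0, 1.0, 0.5]
end_scores = [0.0, 0.1, 0.9, 0.1, 0.0, 0.5, 1.2, 3.5]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start : end + 1])
print(answer)
```

The answer is always a contiguous slice of the context, which is what separates extractive models from generative ones.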

Sometimes, you need more than a generative model ;)

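In this blog post, we will discuss how to train an extractive question answering model using AutoTrain. AutoTrain (aka AutoTrain Advanced) is an open-source, no-code solution that simplifies training state-of-the-art models across a range of domains and modalities. It lets you train models in a few clicks, with no coding or machine learning expertise required.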

The AutoTrain source code is available on GitHub.

Preparing your data

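To train an extractive question answering model, you need a dataset with the following columns:

context: The context or passage from which the answer is extracted.
question: The question whose answer is to be extracted.
answer: The start position of the answer in the context, together with the answer text.

The answer column should be a dictionary with the keys text and answer_start. The column itself can have any name (it is called answers in the example below), since it is matched through the column mapping.

For example: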

{
    "context":"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",
    "question":"To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    "answers":{"text":["Saint Bernadette Soubirous"],"answer_start":[515]}
}
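Because answer_start is a character offset into context, a quick sanity check before training can catch misaligned records. The sketch below assumes records shaped like the example above; the helper name `find_offset_errors` and the shortened sample context are hypothetical.

```python
from typing import Any, Dict, List


def find_offset_errors(record: Dict[str, Any]) -> List[str]:
    """Check that each answer text really sits at its answer_start offset."""
    context = record["context"]
    texts = record["answers"]["text"]
    starts = record["answers"]["answer_start"]
    errors = []
    if len(texts) != len(starts):
        errors.append("text and answer_start have different lengths")
    for text, start in zip(texts, starts):
        found = context[start : start + len(text)]
        if found != text:
            errors.append(f"offset {start}: expected {text!r}, found {found!r}")
    return errors


# A shortened, made-up context so the offsets are easy to verify by hand.
context = "The Grotto is a replica of the grotto at Lourdes, France."
good = {
    "context": context,
    "question": "Where is the original grotto?",
    "answers": {
        "text": ["Lourdes, France"],
        "answer_start": [context.index("Lourdes, France")],
    },
}
bad = {
    "context": context,
    "question": "Where is the original grotto?",
    "answers": {"text": ["Lourdes, France"], "answer_start": [0]},
}

print(find_offset_errors(good))
print(find_offset_errors(bad))
```

Running a check like this over the whole file is cheap, and a wrong offset would otherwise silently teach the model the wrong span.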

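AutoTrain supports training data in CSV and JSONL formats. If you use CSV, the answer column must be stringified JSON with the keys text and answer_start. JSONL is the preferred format for question answering tasks.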
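As an illustration of the two formats, the following sketch serializes the same records to JSONL and to CSV using only the Python standard library. The helper names `to_jsonl` and `to_csv` and the sample record are assumptions made for this example; the key point is that the CSV variant stores the answers dictionary as a JSON string.

```python
import csv
import io
import json
from typing import Any, Dict, List

# A made-up record in the context/question/answers layout described above.
context = "The Grotto is a replica of the grotto at Lourdes, France."
records: List[Dict[str, Any]] = [
    {
        "context": context,
        "question": "Where is the original grotto?",
        "answers": {
            "text": ["Lourdes, France"],
            "answer_start": [context.index("Lourdes, France")],
        },
    }
]


def to_jsonl(rows: List[Dict[str, Any]]) -> str:
    """One JSON object per line; nested dictionaries are kept as-is."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def to_csv(rows: List[Dict[str, Any]]) -> str:
    """CSV cells are flat strings, so the answers dict is stringified as JSON."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["context", "question", "answers"])
    writer.writeheader()
    for row in rows:
        flat = dict(row)
        flat["answers"] = json.dumps(row["answers"], ensure_ascii=False)
        writer.writerow(flat)
    return buffer.getvalue()


jsonl_text = to_jsonl(records)
csv_text = to_csv(records)
print(jsonl_text)
print(csv_text)
```

Writing `jsonl_text` or `csv_text` to a file in your data directory gives you a training file in the corresponding format. JSONL avoids the extra quoting and escaping step, which is one reason it is the more convenient choice here.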

You can also use a dataset from the Hugging Face Hub, such as lhoestq/squad.

This dataset follows the format shown above.

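Column mapping:

Column mapping is essential for AutoTrain: it is how AutoTrain understands your data. For extractive question answering, the column mapping should look like this: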

{"text": "context", "question": "question", "answer": "answers"}

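Here the keys are AutoTrain's column names and the values are the column names in your dataset. The column mapped to answer must be a dictionary with the keys text and answer_start.

As you can see, the AutoTrain columns are text, question, and answer!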
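One way to picture what a column mapping does is as a rename from your dataset's column names to the names AutoTrain expects. The sketch below is purely illustrative and is not how AutoTrain is implemented internally; `apply_column_mapping` is a hypothetical helper.

```python
from typing import Any, Dict, List


def apply_column_mapping(
    rows: List[Dict[str, Any]], column_mapping: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Rename dataset columns (mapping values) to AutoTrain names (mapping keys)."""
    if rows:
        missing = [src for src in column_mapping.values() if src not in rows[0]]
        if missing:
            raise KeyError(f"columns not found in dataset: {missing}")
    return [
        {target: row[source] for target, source in column_mapping.items()}
        for row in rows
    ]


column_mapping = {"text": "context", "question": "question", "answer": "answers"}
rows = [
    {
        "context": "The Grotto is a replica of the grotto at Lourdes, France.",
        "question": "Where is the original grotto?",
        "answers": {"text": ["Lourdes, France"], "answer_start": [41]},
    }
]

mapped = apply_column_mapping(rows, column_mapping)
print(sorted(mapped[0].keys()))
```

If your dataset already uses different names, for example passage instead of context, only the values of the mapping change.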

Training the model locally

To use AutoTrain locally, you need to install the pip package autotrain-advanced.

$ pip install -U autotrain-advanced

After installing the package, you can train a model with the following commands:

$ export HF_USERNAME=<your_hf_username>
$ export HF_TOKEN=<your_hf_write_token>
$ autotrain --config <path_to_config_file>

where the config file looks like this:

task: extractive-qa
base_model: google-bert/bert-base-uncased
project_name: autotrain-bert-ex-qa1
log: tensorboard
backend: local

data:
  path: lhoestq/squad
  train_split: train
  valid_split: validation
  column_mapping:
    text_column: context
    question_column: question
    answer_column: answers

params:
  max_seq_length: 512
  max_doc_stride: 128
  epochs: 3
  batch_size: 4
  lr: 2e-5
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 1
  mixed_precision: fp16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true

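The config above trains a BERT model on the lhoestq/squad dataset for 3 epochs with a batch size of 4 and a learning rate of 2e-5. You can find all the parameters in the documentation.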
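The max_seq_length and max_doc_stride parameters matter when a context is longer than the model can read at once. The sketch below shows the general sliding-window idea, assuming max_doc_stride plays the role of the overlap between consecutive windows, as the stride argument does in Hugging Face tokenizers. The function name `context_windows` and the budget of 500 context tokens (what might remain of 512 after the question and special tokens) are hypothetical.

```python
from typing import List, Tuple


def context_windows(
    num_context_tokens: int, max_context_len: int, overlap: int
) -> List[Tuple[int, int]]:
    """Split a context into half-open (start, end) token ranges.

    Consecutive ranges share `overlap` tokens so that an answer near a
    boundary appears whole in at least one window.
    """
    if not 0 <= overlap < max_context_len:
        raise ValueError("overlap must be >= 0 and smaller than max_context_len")
    windows = []
    start = 0
    while True:
        end = min(start + max_context_len, num_context_tokens)
        windows.append((start, end))
        if end == num_context_tokens:
            break
        # Step back by `overlap` tokens before starting the next window.
        start = end - overlap
    return windows


# A 1000-token context, a hypothetical budget of 500 context tokens per
# window, and an overlap of 128 tokens.
windows = context_windows(num_context_tokens=1000, max_context_len=500, overlap=128)
print(windows)
```

A larger overlap produces more windows, and therefore more training examples per document, in exchange for fewer answers being cut at a window boundary.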

If you are using local files, you only need to change the data section of the config file to:

data:
  path: data/ # this must be the path to the directory containing the train and valid files
  train_split: train # this must be either train.csv or train.json
  valid_split: valid # this must be either valid.csv or valid.json, can also be null

Note: You do not need to export HF_USERNAME and HF_TOKEN if you are not pushing the model to the Hub and are not using a gated or private model or dataset.

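Training the model on the Hugging Face Hub

To train a model on the Hugging Face Hub, you need to create an AutoTrain Space with suitable hardware. To create one, visit the AutoTrain page and follow the instructions.

Once the Space is running, you will see the AutoTrain UI.

Select the Extractive Question Answering task, fill in the required information (the dataset and the column mapping), change the parameters if you wish, and click "Start Training".

You can also run the UI locally with the following commands: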

$ export HF_TOKEN=<your_hf_write_token>
$ autotrain app

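That's it! You can now train your own extractive question answering model with AutoTrain, either locally or on the Hugging Face Hub.

Happy training! 🚀

If you have any questions or need help, feel free to reach out to us on GitHub.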
