AutoTrain Documentation

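You are viewing the main version, which requires installation from source. If you would like a regular pip install, check out the latest stable version (v0.8.24).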

Extractive Question Answering

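Extractive question answering is a task in which a model is trained to extract the answer to a question from a given context. The model learns to predict where the answer span starts and ends within the context. This task is commonly used in question-answering systems to extract relevant information from a large corpus of text.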
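To make the start/end prediction concrete, here is a minimal sketch of how an answer span can be chosen from per-token start and end scores. It is an illustration only, not AutoTrain's implementation: the function name `best_span`, the `max_answer_len` limit, and the toy scores are all assumptions.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest combined score.

    Only spans with start <= end and at most max_answer_len tokens are considered.
    """
    best, best_score = (0, 0), float("-inf")
    for start, s_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Toy example: one score per token for "is the answer's first token"
# and one for "is the answer's last token".
tokens = ["Mary", "appeared", "to", "Saint", "Bernadette", "Soubirous", "in", "1858"]
start_scores = [0.1, 0.0, 0.2, 4.0, 1.0, 0.5, 0.0, 0.3]
end_scores = [0.0, 0.1, 0.0, 0.5, 1.5, 5.0, 0.2, 1.0]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

Restricting the search to `start <= end` and a bounded length keeps the model from returning an inverted or implausibly long span.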

Preparing your data

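To train an extractive question answering model, you need a dataset with the following columns:

  • text: the context or passage from which the answer is extracted.
  • question: the question whose answer is to be extracted.
  • answer: the answer text together with the start position of the answer span in the context.

In the example records these columns are named context, question, and answers; the column mapping tells AutoTrain which column plays which role.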

Here is an example of how your dataset should look:

{"context":"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.","question":"To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?","answers":{"text":["Saint Bernadette Soubirous"],"answer_start":[515]}}
{"context":"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.","question":"What is in front of the Notre Dame Main Building?","answers":{"text":["a copper statue of Christ"],"answer_start":[188]}}
{"context":"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.","question":"The Basilica of the Sacred heart at Notre Dame is beside to which structure?","answers":{"text":["the Main Building"],"answer_start":[279]}}
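In records like these, `answer_start` is a character offset into `context`, so slicing the context at that offset should reproduce the answer text. The following is a small sanity check you could run over your own JSONL lines before training; the helper `misaligned_answers` and the short sample record are made up for illustration.

```python
import json
from typing import List


def misaligned_answers(record: dict) -> List[str]:
    """Return the answer texts whose answer_start does not match the context."""
    context = record["context"]
    answers = record["answers"]
    bad = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        if context[start:start + len(text)] != text:
            bad.append(text)
    return bad


# A short, made-up JSONL line in the same shape as the records above.
line = json.dumps({
    "context": "The Grotto is a replica of the grotto at Lourdes, France.",
    "question": "Where is the original grotto?",
    "answers": {"text": ["Lourdes, France"], "answer_start": [41]},
})
record = json.loads(line)

# An empty list means every offset lines up with its answer text.
print(misaligned_answers(record))
```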

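Note: the preferred format for question answering is JSONL. If you want to use CSV, the answer column should contain stringified JSON with the keys text and answer_start.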
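As a sketch of the CSV variant, the snippet below writes records to CSV with the answers stored as a JSON string, then reads one row back. It uses only the Python standard library; the column names and the `records_to_csv` helper are assumptions chosen to mirror the JSONL example, not a format mandated by AutoTrain.

```python
import csv
import io
import json

records = [
    {
        "context": "The Grotto is a replica of the grotto at Lourdes, France.",
        "question": "Where is the original grotto?",
        "answers": {"text": ["Lourdes, France"], "answer_start": [41]},
    },
]


def records_to_csv(records) -> str:
    """Serialize records to CSV text, storing the answers dict as a JSON string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["context", "question", "answers"])
    writer.writeheader()
    for rec in records:
        writer.writerow({
            "context": rec["context"],
            "question": rec["question"],
            "answers": json.dumps(rec["answers"]),  # stringified JSON
        })
    return buffer.getvalue()


csv_text = records_to_csv(records)

# Reading the row back and parsing the answers column restores the dict.
row = next(csv.DictReader(io.StringIO(csv_text)))
restored = json.loads(row["answers"])
print(restored)
```

Letting the `csv` module handle quoting matters here, because both the context and the JSON string contain commas and quotes.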

Example dataset on the Hugging Face Hub: lhoestq/squad

P.S. You can use the squad and squad v2 data formats with the correct column mappings.

Training locally

To train an extractive QA model locally, you need a config file:

task: extractive-qa
base_model: google-bert/bert-base-uncased
project_name: autotrain-bert-ex-qa1
log: tensorboard
backend: local

data:
  path: lhoestq/squad
  train_split: train
  valid_split: validation
  column_mapping:
    text_column: context
    question_column: question
    answer_column: answers

params:
  max_seq_length: 512
  max_doc_stride: 128
  epochs: 3
  batch_size: 4
  lr: 2e-5
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 1
  mixed_precision: fp16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
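The hub section uses `${HF_USERNAME}` and `${HF_TOKEN}` placeholders. Assuming these are filled in from environment variables of the same names (the syntax suggests this, but it is an assumption here), a small pre-flight check like the following can catch a missing value before training starts. The helper `missing_env_vars` is hypothetical and not part of AutoTrain.

```python
import os
import re
from typing import List, Mapping

# Matches placeholders of the form ${NAME}.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def missing_env_vars(config_text: str, env: Mapping[str, str]) -> List[str]:
    """List ${NAME} placeholders in config_text that have no non-empty value in env."""
    names = PLACEHOLDER.findall(config_text)
    return sorted({name for name in names if not env.get(name)})


hub_section = """
hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
"""

# Check against the real environment when preparing a run.
print(missing_env_vars(hub_section, os.environ))
```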

To train the model, run the following command:

$ autotrain --config config.yaml

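Here, we train a BERT model on the SQuAD dataset using the extractive QA task. The model is trained for 3 epochs with a batch size of 4 and a learning rate of 2e-5. Training is logged with TensorBoard. The model is trained locally and pushed to the Hugging Face Hub after training.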
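The config also sets max_seq_length: 512 and max_doc_stride: 128. A rough sketch of what such settings typically mean for long contexts follows. It assumes the stride is the number of tokens shared by consecutive windows (as with the `stride` argument of Hugging Face tokenizers) and ignores the question and special tokens, which take up part of each window in practice. `context_windows` is a hypothetical helper, not an AutoTrain function.

```python
from typing import List, Tuple


def context_windows(num_tokens: int, max_len: int, stride: int) -> List[Tuple[int, int]]:
    """Split num_tokens context tokens into [start, end) windows of at most
    max_len tokens, where consecutive windows overlap by stride tokens."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must be non-negative and smaller than max_len")
    windows = []
    start = 0
    while True:
        end = min(start + max_len, num_tokens)
        windows.append((start, end))
        if end == num_tokens:
            break
        # Advance so that the next window re-reads the last `stride` tokens.
        start += max_len - stride
    return windows


print(context_windows(num_tokens=1000, max_len=512, stride=128))
```

The overlap gives an answer that straddles a window boundary a chance to appear whole in at least one window.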

Training on Hugging Face Spaces

[Screenshot: AutoTrain Extractive Question Answering on Hugging Face Spaces]

As always, pay special attention to the column mapping.

Parameters

class autotrain.trainers.extractive_question_answering.params.ExtractiveQuestionAnsweringParams

(
    data_path: str = None,
    model: str = 'bert-base-uncased',
    lr: float = 5e-05,
    epochs: int = 3,
    max_seq_length: int = 128,
    max_doc_stride: int = 128,
    batch_size: int = 8,
    warmup_ratio: float = 0.1,
    gradient_accumulation: int = 1,
    optimizer: str = 'adamw_torch',
    scheduler: str = 'linear',
    weight_decay: float = 0.0,
    max_grad_norm: float = 1.0,
    seed: int = 42,
    train_split: str = 'train',
    valid_split: Optional[str] = None,
    text_column: str = 'context',
    question_column: str = 'question',
    answer_column: str = 'answers',
    logging_steps: int = -1,
    project_name: str = 'project-name',
    auto_find_batch_size: bool = False,
    mixed_precision: Optional[str] = None,
    save_total_limit: int = 1,
    token: Optional[str] = None,
    push_to_hub: bool = False,
    eval_strategy: str = 'epoch',
    username: Optional[str] = None,
    log: str = 'none',
    early_stopping_patience: int = 5,
    early_stopping_threshold: float = 0.01
)
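Several of these defaults differ from the values used in the local training config earlier on this page. The sketch below copies a subset of the defaults from the signature and the values from that config's params section into plain dictionaries and lists which ones the config overrides. The `overrides` helper is illustrative only and does not touch AutoTrain itself.

```python
# Defaults copied from the signature above (subset relevant to the config).
defaults = {
    "lr": 5e-05, "epochs": 3, "max_seq_length": 128, "max_doc_stride": 128,
    "batch_size": 8, "gradient_accumulation": 1, "optimizer": "adamw_torch",
    "scheduler": "linear", "mixed_precision": None,
}

# Values from the params section of the local training config.
config_params = {
    "max_seq_length": 512, "max_doc_stride": 128, "epochs": 3, "batch_size": 4,
    "lr": 2e-5, "optimizer": "adamw_torch", "scheduler": "linear",
    "gradient_accumulation": 1, "mixed_precision": "fp16",
}


def overrides(defaults: dict, config: dict) -> dict:
    """Map each overridden parameter to a (default, configured) pair."""
    return {
        name: (defaults[name], value)
        for name, value in config.items()
        if defaults[name] != value
    }


changed = overrides(defaults, config_params)
for name, (default, configured) in changed.items():
    print(f"{name}: default {default!r} -> configured {configured!r}")
```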

Parameters

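  • data_path (str): Path to the dataset.
  • model (str): Name of the pretrained model. Default is 'bert-base-uncased'.
  • lr (float): Learning rate for the optimizer. Default is 5e-05.
  • epochs (int): Number of training epochs. Default is 3.
  • max_seq_length (int): Maximum sequence length of the inputs. Default is 128.
  • max_doc_stride (int): Maximum stride used when splitting long contexts. Default is 128.
  • batch_size (int): Training batch size. Default is 8.
  • warmup_ratio (float): Warmup proportion for the learning rate scheduler. Default is 0.1.
  • gradient_accumulation (int): Number of gradient accumulation steps. Default is 1.
  • optimizer (str): Optimizer type. Default is 'adamw_torch'.
  • scheduler (str): Learning rate scheduler type. Default is 'linear'.
  • weight_decay (float): Weight decay for the optimizer. Default is 0.0.
  • max_grad_norm (float): Maximum gradient norm used for clipping. Default is 1.0.
  • seed (int): Random seed. Default is 42.
  • train_split (str): Name of the training data split. Default is 'train'.
  • valid_split (Optional[str]): Name of the validation data split. Default is None.
  • text_column (str): Name of the column holding the context. Default is 'context'.
  • question_column (str): Name of the column holding the question. Default is 'question'.
  • answer_column (str): Name of the column holding the answers. Default is 'answers'.
  • logging_steps (int): Number of steps between log entries. Default is -1.
  • project_name (str): Project name, used for the output directory. Default is 'project-name'.
  • auto_find_batch_size (bool): Whether to find the batch size automatically. Default is False.
  • mixed_precision (Optional[str]): Mixed precision training mode (fp16, bf16, or None). Default is None.
  • save_total_limit (int): Maximum number of checkpoints to keep. Default is 1.
  • token (Optional[str]): Authentication token for the Hugging Face Hub. Default is None.
  • push_to_hub (bool): Whether to push the model to the Hugging Face Hub. Default is False.
  • eval_strategy (str): Evaluation strategy during training. Default is 'epoch'.
  • username (Optional[str]): Hugging Face username used for authentication. Default is None.
  • log (str): Logging method for experiment tracking. Default is 'none'.
  • early_stopping_patience (int): Number of epochs without improvement after which training stops early. Default is 5.
  • early_stopping_threshold (float): Minimum improvement that counts as progress for early stopping. Default is 0.01.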
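To illustrate how early_stopping_patience and early_stopping_threshold interact, here is a simplified sketch. It assumes a metric where higher is better, evaluated once per epoch, and it is not the code AutoTrain or its underlying trainer actually runs; `stopping_epoch` and the sample scores are hypothetical.

```python
from typing import List, Optional


def stopping_epoch(scores: List[float], patience: int = 5,
                   threshold: float = 0.01) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if its score beats the best
    score so far by more than threshold; patience consecutive epochs
    without improvement trigger the stop.
    """
    best = None
    stale = 0
    for epoch, score in enumerate(scores, start=1):
        if best is None or score > best + threshold:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Made-up validation scores: a big jump, then gains smaller than the threshold.
f1_per_epoch = [0.60, 0.70, 0.705, 0.703, 0.709]
print(stopping_epoch(f1_per_epoch, patience=3, threshold=0.01))
```

With the default patience of 5, the same five scores would not trigger a stop, since only three epochs pass without a sufficient improvement.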
