Extractive Question Answering
Extractive question answering is a task in which a model is trained to extract the answer to a question from a given context. The model learns to predict the start and end positions of the answer span within the context. This task is commonly used in question answering systems to pull relevant information out of large text corpora.
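At inference time, the model assigns each token a start score and an end score, and the highest-scoring valid (start, end) pair is selected as the answer span. A simplified sketch of that decoding step (the logit values below are made up for illustration; real pipelines add further constraints such as masking the question tokens):

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Pick the (start, end) token pair with the highest summed score,
    subject to start <= end and a maximum answer length."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within max_answer_len.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

# Token 1 is the most likely start, token 2 the most likely end.
span = best_span([0.1, 5.0, 0.2], [0.0, 0.1, 4.0])
```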
Preparing your data
To train an extractive question answering model, you need a dataset with the following columns:
- text: the context or passage from which the answer is extracted.
- question: the question whose answer should be extracted.
- answer: the answer span, i.e. the answer text and its start position within the context.
Here is an example of how your dataset should look:
{"context":"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.","question":"To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?","answers":{"text":["Saint Bernadette Soubirous"],"answer_start":[515]}}
{"context":"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.","question":"What is in front of the Notre Dame Main Building?","answers":{"text":["a copper statue of Christ"],"answer_start":[188]}}
{"context":"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.","question":"The Basilica of the Sacred heart at Notre Dame is beside to which structure?","answers":{"text":["the Main Building"],"answer_start":[279]}}
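Since answer_start is a character offset into the context, it is worth verifying the offsets before training. A small sketch of such a check (the record below is a shortened, hypothetical example):

```python
def check_offsets(example):
    """Return True if every answer_start points at its answer text."""
    ctx = example["context"]
    ans = example["answers"]
    return all(
        ctx[start:start + len(text)] == text
        for text, start in zip(ans["text"], ans["answer_start"])
    )

record = {
    "context": "The Grotto is a replica of the grotto at Lourdes, France.",
    "question": "Where is the original grotto?",
    "answers": {"text": ["Lourdes, France"], "answer_start": [41]},
}
ok = check_offsets(record)
```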
Note: the preferred format for question answering is JSONL. If you want to use CSV, the answer column should contain a stringified JSON with the keys text and answer_start.
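For CSV, that means serializing the nested answers dict into a string with json.dumps before writing the row. A minimal sketch (the sample row is illustrative; an in-memory buffer stands in for the actual file):

```python
import csv
import io
import json

rows = [{
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["context", "question", "answers"])
writer.writeheader()
for row in rows:
    # Stringify the nested answers dict so it survives the flat CSV format.
    writer.writerow({**row, "answers": json.dumps(row["answers"])})
```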
Example dataset from the Hugging Face Hub: lhoestq/squad
P.S. You can use both the SQuAD and SQuAD v2 data formats with the correct column mappings.
Local Training
To train an extractive QA model locally, you need a config file:
task: extractive-qa
base_model: google-bert/bert-base-uncased
project_name: autotrain-bert-ex-qa1
log: tensorboard
backend: local

data:
  path: lhoestq/squad
  train_split: train
  valid_split: validation
  column_mapping:
    text_column: context
    question_column: question
    answer_column: answers

params:
  max_seq_length: 512
  max_doc_stride: 128
  epochs: 3
  batch_size: 4
  lr: 2e-5
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 1
  mixed_precision: fp16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
To train the model, run the following command:
$ autotrain --config config.yaml
Here, we train a BERT model on the SQuAD dataset using the extractive QA task. The model is trained for 3 epochs with a batch size of 4 and a learning rate of 2e-5. Training progress is logged with TensorBoard. The model is trained locally and pushed to the Hugging Face Hub after training.
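The max_seq_length and max_doc_stride settings govern how contexts longer than the model input are split into overlapping windows, so an answer truncated at one window boundary reappears intact in the next. A simplified sketch of that sliding-window idea over a token list, under the common convention that the stride is the overlap between consecutive windows (real splitting is handled by the tokenizer; this only illustrates the windowing):

```python
def split_into_windows(tokens, max_len, stride):
    """Split tokens into windows of max_len, each overlapping the
    previous one by stride tokens (assumes max_len > stride)."""
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window already reaches the end of the context
        start += step
    return windows

windows = split_into_windows(list(range(10)), max_len=4, stride=2)
```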
Training on Hugging Face Spaces
As always, pay special attention to the column mapping.
Parameters
class autotrain.trainers.extractive_question_answering.params.ExtractiveQuestionAnsweringParams
< source >( data_path: str = None model: str = 'bert-base-uncased' lr: float = 5e-05 epochs: int = 3 max_seq_length: int = 128 max_doc_stride: int = 128 batch_size: int = 8 warmup_ratio: float = 0.1 gradient_accumulation: int = 1 optimizer: str = 'adamw_torch' scheduler: str = 'linear' weight_decay: float = 0.0 max_grad_norm: float = 1.0 seed: int = 42 train_split: str = 'train' valid_split: Optional = None text_column: str = 'context' question_column: str = 'question' answer_column: str = 'answers' logging_steps: int = -1 project_name: str = 'project-name' auto_find_batch_size: bool = False mixed_precision: Optional = None save_total_limit: int = 1 token: Optional = None push_to_hub: bool = False eval_strategy: str = 'epoch' username: Optional = None log: str = 'none' early_stopping_patience: int = 5 early_stopping_threshold: float = 0.01 )
Parameters
- data_path (str) — Path to the dataset.
- model (str) — Name of the pre-trained model. Default is "bert-base-uncased".
- max_seq_length (int) — Maximum sequence length for the input. Default is 128.
- warmup_ratio (float) — Warmup ratio for the learning rate scheduler. Default is 0.1.
- gradient_accumulation (int) — Number of gradient accumulation steps. Default is 1.
- optimizer (str) — Optimizer type. Default is "adamw_torch".
- max_grad_norm (float) — Maximum gradient norm for clipping. Default is 1.0.
- valid_split (Optional[str]) — Name of the validation data split. Default is None.
- answer_column (str) — Column name for the answers. Default is "answers".
- logging_steps (int) — Number of steps between logging. Default is -1.
- project_name (str) — Project name used for the output directory. Default is "project-name".
- save_total_limit (int) — Maximum number of checkpoints to keep. Default is 1.
- token (Optional[str]) — Authentication token for the Hugging Face Hub. Default is None.
- push_to_hub (bool) — Whether to push the model to the Hugging Face Hub. Default is False.
- eval_strategy (str) — Evaluation strategy during training. Default is "epoch".
- username (Optional[str]) — Hugging Face username for authentication. Default is None.
- early_stopping_threshold (float) — Improvement threshold for early stopping. Default is 0.01.