Handling Multiple-Choice Question Tasks with Transformers and PyTorch

Introduction
Multiple-choice questions (MCQs) are a ubiquitous form of assessment across domains, from education to recruitment. Deep learning, and the arrival of Transformer-based architectures in particular, has revolutionized natural language processing (NLP) and made these models highly effective at handling MCQs. PyTorch, a popular deep learning framework, integrates seamlessly with Transformer models and allows MCQ tasks to be handled efficiently. In this article, we explore how to combine Transformers and PyTorch to tackle MCQ tasks.
Understanding Transformers and PyTorch
Transformers: These models excel at understanding contextual information in a sequence through the self-attention mechanism (a minimal sketch of the mechanism follows below). This ability to capture relationships between different parts of a text is particularly valuable for understanding and answering MCQs effectively.
PyTorch: PyTorch's dynamic computation graphs and user-friendly API simplify the implementation and training of complex neural networks. Its flexibility allows seamless integration with Transformer architectures, enabling streamlined development and experimentation.
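To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention written directly in PyTorch. The tensor sizes and the single-head setup are illustrative assumptions for this snippet only; the pretrained models used later compute multi-head attention internally.
import torch
import torch.nn.functional as F

# Toy batch: 2 sequences, 5 tokens each, hidden size 8 (illustrative sizes)
hidden = torch.randn(2, 5, 8)
W_q, W_k, W_v = (torch.nn.Linear(8, 8) for _ in range(3))
q, k, v = W_q(hidden), W_k(hidden), W_v(hidden)
# Attention weights: how strongly each token attends to every other token
scores = q @ k.transpose(-2, -1) / (8 ** 0.5)
weights = F.softmax(scores, dim=-1)   # shape (2, 5, 5)
context = weights @ v                 # contextualized token representations
print(context.shape)                  # torch.Size([2, 5, 8])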
Advantages of Combining Transformers with PyTorch
Enhanced contextual understanding: Transformers, paired with PyTorch, excel at capturing subtle relationships in text. This lets them grasp the full context of an MCQ and make more accurate predictions.
Transfer learning: Pretrained Transformer models such as BERT, RoBERTa, or ALBERT can be fine-tuned on MCQ datasets with PyTorch. Starting from a pretrained model dramatically reduces training time and data requirements while still delivering strong performance (see the short sketch after this list).
Flexibility and customization: PyTorch makes it easy to customize Transformer models. Researchers and developers can adapt the architecture, loss function, and training procedure to the specific needs of an MCQ task.
State-of-the-art performance: Transformer-based models consistently reach state-of-the-art results on a wide range of NLP benchmarks. Combined with PyTorch's optimization tooling, they deliver high accuracy when predicting the correct answer to an MCQ.
Scalability and efficiency: PyTorch's efficient computation and the parallel processing inside Transformers make the combination scalable. It can process large numbers of MCQs quickly, which suits real-time applications.
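As a small illustration of the transfer-learning point above, a pretrained checkpoint can be loaded with a fresh multiple-choice head and, optionally, partially frozen so that only the head is fine-tuned. This is a hedged sketch: "bert-base-uncased" is an assumed checkpoint, and the encoder attribute name (mc_model.bert) is specific to BERT-style models.
from transformers import AutoModelForMultipleChoice

mc_model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")
# Optionally freeze the pretrained encoder and train only the classification head
for param in mc_model.bert.parameters():
    param.requires_grad = False
trainable = sum(p.numel() for p in mc_model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")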
Code Implementation
Below is a brief walkthrough of how each step of handling an MCQ task with Transformers and PyTorch benefits from their synergy.
- Dataset preparation: Transformers, backed by PyTorch, handle diverse dataset structures well. PyTorch's data-handling utilities simplify dataset organization, keeping each MCQ and its candidate answers together so the model can be trained efficiently.
!pip install datasets transformers evaluate --quiet
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, Trainer, TrainingArguments
import evaluate
import numpy as np
from datasets import load_dataset  # load_metric is deprecated in recent datasets versions
import random
print(transformers.__version__)
# Defining a constant SEED for reproducibility in random operations
SEED = 42
# Setting the seed for the random library to ensure consistent results
random.seed(SEED)
datasets = load_dataset("swag", "regular")
datasets["train"][0]
Output
{'video-id': 'anetv_jkn6uvmqwh4',
'fold-ind': '3416',
'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
'sent2': 'A drum line',
'gold-source': 'gold',
'ending0': 'passes by walking down the street playing their instruments.',
'ending1': 'has heard approaching them.',
'ending2': "arrives and they're outside dancing and asleep.",
'ending3': 'turns the lead singer watches the performance.',
'label': 0}
def show_one(example):
    print(f"Context: {example['sent1']}")
    print(f" A - {example['sent2']} {example['ending0']}")
    print(f" B - {example['sent2']} {example['ending1']}")
    print(f" C - {example['sent2']} {example['ending2']}")
    print(f" D - {example['sent2']} {example['ending3']}")
    print(f"\nGround truth: option {['A', 'B', 'C', 'D'][example['label']]}")
show_one(datasets["train"][15])
Output
Context: Now it's someone's turn to rain blades on his opponent.
A - Someone pats his shoulder and spins wildly.
B - Someone lunges forward through the window.
C - Someone falls to the ground.
D - Someone rolls up his fast run from the water and tosses in the sky.
Ground truth: option C
- Preprocessing: PyTorch's compatibility with Transformer models makes text preprocessing straightforward. Tokenization, encoding, and sequence preparation convert the raw text into the numerical representations a Transformer expects.
model_checkpoint = 'distilbert-base-uncased' # "bert-base-uncased"
batch_size = 4
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")
Output
{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
ending_names = ["ending0", "ending1", "ending2", "ending3"]
def preprocess_function(examples):
    # Repeat each first sentence four times to go with the four possibilities of second sentences.
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    # Grab all second sentences possible for each context.
    question_headers = examples["sent2"]
    second_sentences = [[f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)]
    # Flatten everything
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])
    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    # Un-flatten: regroup the tokenized sequences back into lists of four choices per example
    return {k: [v[i:i+4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
examples = datasets["train"][:5]
features = preprocess_function(examples)
print(len(features["input_ids"]), len(features["input_ids"][0]), [len(x) for x in features["input_ids"][0]])
Output
5 4 [30, 25, 30, 28]
idx = 3
[tokenizer.decode(features["input_ids"][idx][i]) for i in range(4)]
Output
['[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession are playing ping pong and celebrating one left each in quick. [SEP]',
'[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession wait slowly towards the cadets. [SEP]',
'[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession makes a square call and ends by jumping down into snowy streets where fans begin to take their positions. [SEP]',
'[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession play and go back and forth hitting the drums while the audience claps for them. [SEP]']
encoded_datasets = datasets.map(preprocess_function, batched=True)
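Before moving on, it can be worth sanity-checking the mapped dataset: each example should now carry four tokenized sequences, one per candidate ending. The check below is a small illustrative addition, not part of the original notebook.
sample = encoded_datasets["train"][0]
print(len(sample["input_ids"]))                   # expected: 4 (one sequence per ending)
print([len(ids) for ids in sample["input_ids"]])  # token counts per choice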
- Fine-tuning: The synergy between PyTorch and Transformers matters most during fine-tuning. PyTorch's gradient-based optimization and backpropagation efficiently adapt the Transformer's parameters to the specifics of the MCQ task.
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/roberta/modeling_roberta.py#L1266
model = AutoModelForMultipleChoice.from_pretrained(model_checkpoint)
# https://github.com/huggingface/datasets/issues/2165
from torch.utils.data import Dataset, DataLoader, RandomSampler
class HFDataset(Dataset):
    def __init__(self, dset):
        self.dset = dset
    def __getitem__(self, idx):
        x = self.dset[idx]
        return {'input_ids': x['input_ids'],
                'attention_mask': x['attention_mask'],  # ignore token_type_ids
                'label': x['label']}
    def __len__(self):
        return len(self.dset)
train_ds = HFDataset(encoded_datasets['train'])
test_ds = HFDataset(encoded_datasets['validation'])
len(encoded_datasets['train']), len(train_ds)
Output
(73546, 73546)
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch
@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch
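For completeness, the same collator could also be plugged straight into the Hugging Face Trainer imported at the top, instead of the hand-rolled DataLoader defined next. The snippet below is a hedged sketch: the output directory and hyperparameters are illustrative values, and it is not part of the training run shown later.
args = TrainingArguments(
    output_dir="swag-mcq",              # illustrative path
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=5e-5,
)
hf_trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
)
# hf_trainer.train()  # uncomment to launch this alternative training path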
def HFDataLoader(dataset, tokenizer, batch_size=4, shuffle=True, num_workers=2):
    def listdict2dictlist(batch):
        '''
        Input: batch -- list of dict
        Output: dict of list-size-batch
        '''
        d = {}
        keys = batch[0].keys()
        for k in keys:
            d[k] = []
            for i in range(len(batch)):
                d[k].append(batch[i][k])
        return d

    def prepare_sample(sample):
        padding = True
        max_length = None
        pad_to_multiple_of = None
        features = listdict2dictlist(sample)
        batch_size = len(features["input_ids"])
        num_choices = len(features["input_ids"][0])
        flattened_features = {}
        for k, v in features.items():
            if k == 'label':
                continue
            flattened_features[k] = []
            for example in features[k]:  # e.g. k='input_ids'
                for choice in example:   # e.g. 4 choices per example
                    flattened_features[k].append(choice)
        batch = tokenizer.pad(
            flattened_features,
            padding=padding,
            max_length=max_length,
            pad_to_multiple_of=pad_to_multiple_of,
            return_tensors="pt",
        )
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(features['label'], dtype=torch.int64)
        return batch

    sampler = RandomSampler(dataset) if shuffle else None
    return DataLoader(dataset,
                      sampler=sampler,
                      batch_size=batch_size,
                      collate_fn=prepare_sample,
                      num_workers=num_workers)
import os
os.environ['TOKENIZERS_PARALLELISM'] = "false"
train_loader = HFDataLoader(train_ds, tokenizer, batch_size=16)
test_loader = HFDataLoader(test_ds, tokenizer, batch_size=16, shuffle=False)
for x in train_loader:
    print(x)
    break
Output
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'input_ids': tensor([[[ 101, 1037, 2450, ..., 0, 0, 0],
[ 101, 1037, 2450, ..., 0, 0, 0],
[ 101, 1037, 2450, ..., 0, 0, 0],
[ 101, 1037, 2450, ..., 0, 0, 0]],
[[ 101, 2111, 2024, ..., 0, 0, 0],
[ 101, 2111, 2024, ..., 0, 0, 0],
[ 101, 2111, 2024, ..., 0, 0, 0],
[ 101, 2111, 2024, ..., 0, 0, 0]],
[[ 101, 2059, 2007, ..., 0, 0, 0],
[ 101, 2059, 2007, ..., 0, 0, 0],
[ 101, 2059, 2007, ..., 0, 0, 0],
[ 101, 2059, 2007, ..., 0, 0, 0]],
...,
[[ 101, 2002, 17395, ..., 0, 0, 0],
[ 101, 2002, 17395, ..., 0, 0, 0],
[ 101, 2002, 17395, ..., 0, 0, 0],
[ 101, 2002, 17395, ..., 0, 0, 0]],
[[ 101, 1037, 2450, ..., 0, 0, 0],
[ 101, 1037, 2450, ..., 0, 0, 0],
[ 101, 1037, 2450, ..., 0, 0, 0],
[ 101, 1037, 2450, ..., 0, 0, 0]],
[[ 101, 2002, 12668, ..., 0, 0, 0],
[ 101, 2002, 12668, ..., 0, 0, 0],
[ 101, 2002, 12668, ..., 0, 0, 0],
[ 101, 2002, 12668, ..., 0, 0, 0]]]), 'attention_mask': tensor([[[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]],
[[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]],
[[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]],
...,
[[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]],
[[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]],
[[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]]]), 'labels': tensor([1, 0, 0, 2, 2, 0, 1, 0, 2, 3, 1, 3, 2, 0, 0, 2])}
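A quick shape check on the batch printed above confirms what the collator produces: inputs of shape (batch_size, num_choices, seq_len) and a flat vector of labels. This one-liner is an illustrative addition.
print({k: tuple(v.shape) for k, v in x.items()})
# expected: input_ids/attention_mask -> (16, 4, seq_len), labels -> (16,)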
- Training: PyTorch's training utilities, combined with the Transformer architecture, streamline the training loop. The tight integration enables efficient computation and parameter updates, speeding up convergence on the MCQ dataset.
import pytorch_lightning as pl
class PLTransformer(pl.LightningModule):
    def __init__(
        self,
        model_base,
        learning_rate: float = 2e-5,
        adam_epsilon: float = 1e-8,
        warmup_steps: int = 0,
        weight_decay: float = 0.0,
        train_batch_size: int = 32,
        eval_batch_size: int = 32,
        **kwargs,
    ):
        super().__init__()
        # self.save_hyperparameters()  # freezes the run if model_base is passed as an argument, so it stays disabled
        self.model_base = model_base
        self.lr = learning_rate
        self.num_labels = 4  # TODO: hard-coded to SWAG's four answer choices for now

    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
        return self.model_base(input_ids=input_ids, attention_mask=attention_mask, labels=labels, **kwargs)

    def training_step(self, batch, batch_idx):
        outputs = self.forward(**batch)
        loss = outputs.loss
        return loss

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        outputs = self.forward(**batch)
        val_loss, logits = outputs.loss, outputs.logits
        if self.num_labels > 1:
            preds = torch.argmax(logits, axis=1)
        elif self.num_labels == 1:
            preds = logits.squeeze()
        labels = batch["labels"]
        return {"loss": val_loss, "preds": preds, "labels": labels}

    def configure_optimizers(self):
        """Prepare optimizer and schedule (linear warmup and decay)"""
        optimizer = torch.optim.AdamW(self.model_base.parameters(), lr=self.lr)
        return optimizer  # a scheduler could also be returned here (see the sketch below)
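The docstring of configure_optimizers mentions linear warmup and decay, but only the optimizer is returned. If a schedule is wanted, one option is transformers' get_linear_schedule_with_warmup; the variant below is a hedged sketch in which the warmup and total step counts are illustrative placeholders rather than tuned values.
from transformers import get_linear_schedule_with_warmup

def configure_optimizers_with_schedule(self):
    # Hypothetical alternative to configure_optimizers that adds a linear warmup/decay schedule
    optimizer = torch.optim.AdamW(self.model_base.parameters(), lr=self.lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,           # illustrative
        num_training_steps=10_000,    # illustrative; ideally steps_per_epoch * max_epochs
    )
    return [optimizer], [{"scheduler": scheduler, "interval": "step"}]

# PLTransformer.configure_optimizers = configure_optimizers_with_schedule  # opt in if the schedule is wanted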
pl_model = PLTransformer(model)
print(pl_model.to('cpu')(**x))
pl_model.to('cpu').training_step(x, 0)
Output
MultipleChoiceModelOutput(loss=tensor(1.3915, grad_fn=<NllLossBackward0>), logits=tensor([[ 0.0144, 0.0352, 0.0017, 0.0198],
[-0.0346, -0.0176, -0.0254, -0.0258],
[-0.0054, 0.0022, 0.0579, -0.0057],
[-0.0168, -0.0084, -0.0332, 0.0098],
[ 0.0393, 0.0254, 0.0325, 0.0005],
[ 0.0292, 0.0291, 0.0407, 0.0326],
[-0.0220, -0.0277, -0.0461, -0.0345],
[-0.0347, -0.0353, -0.0412, -0.0308],
[ 0.0145, 0.0040, -0.0098, -0.0152],
[ 0.0151, -0.0131, 0.0044, -0.0081],
[-0.0025, -0.0051, 0.0014, -0.0056],
[ 0.0293, 0.0211, 0.0291, 0.0254],
[-0.0377, 0.0128, -0.0248, -0.0133],
[ 0.0255, 0.0315, 0.0295, 0.0504],
[-0.0230, 0.0035, 0.0003, -0.0109],
[ 0.0458, 0.0464, 0.0418, 0.0733]], grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)
tensor(1.3915, grad_fn=<NllLossBackward0>)
trainer = pl.Trainer(
    max_epochs=1,
    accelerator="gpu",
    devices=[0],
    precision='16',
)
trainer.fit(pl_model, train_loader, test_loader)
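Once training finishes, validation accuracy can be estimated with the evaluate library imported earlier. The loop below is a minimal sketch; the device handling and the reuse of test_loader are illustrative choices, not part of the original run.
accuracy_metric = evaluate.load("accuracy")
device = "cuda" if torch.cuda.is_available() else "cpu"
pl_model.to(device).eval()
with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        logits = pl_model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]).logits
        preds = torch.argmax(logits, dim=1)
        accuracy_metric.add_batch(predictions=preds.cpu(), references=batch["labels"].cpu())
print(accuracy_metric.compute())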
Conclusion
The combination of Transformer-based architectures and PyTorch offers a compelling framework for handling MCQ tasks efficiently and accurately. The strengths Transformers bring, including richer contextual understanding and transfer learning, together with PyTorch's flexibility and optimization tooling, make this pairing an excellent choice for building robust MCQ-solving models.
As Transformer architectures and PyTorch continue to evolve, their integration promises further progress in automating MCQ assessment across domains.
In short, Transformers plus PyTorch form a solid foundation for models that handle MCQ tasks efficiently, paving the way for better automated question-answering systems.
Stay connected and support my work through the following platforms:
Hugging Face: For NLP and AI-related projects, you can find my profile at https://huggingface.co/Andyrasika.
LinkedIn: To keep up with my latest projects and posts, you can follow me on LinkedIn: https://www.linkedin.com/in/ankushsingal/.
Requests and questions: If there is a project you would like me to work on, or if you have questions about anything covered in this article, please let me know. I am always looking for ideas for future notebooks and happy to help with any doubts you may have.