Tackling Multiple-Choice Question Tasks with Transformers and PyTorch

Community Article · Published December 25, 2023


Introduction

Multiple-choice questions (MCQs) are a ubiquitous form of assessment across many domains, from education to recruitment. The advent of deep learning, and of Transformer-based architectures in particular, has revolutionized natural language processing (NLP) and made these models remarkably effective at handling MCQs. PyTorch, a popular deep learning framework, integrates seamlessly with Transformer models and makes it possible to tackle MCQ tasks efficiently. In this article, we explore how to combine Transformers and PyTorch to solve MCQ tasks.


Understanding Transformers and PyTorch

Transformers: These models excel at understanding contextual information in a sequence through the self-attention mechanism. This ability to capture relationships between different parts of a text is particularly valuable for understanding and answering MCQs effectively.
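To make the self-attention idea concrete, here is a minimal sketch of single-head scaled dot-product attention in plain PyTorch. The tensor names and sizes are purely illustrative and are not tied to any model used later in this article.

import torch
import torch.nn.functional as F

# Toy batch: 2 sequences, 5 tokens each, hidden size 8 (illustrative numbers)
hidden = torch.randn(2, 5, 8)

# Project the same sequence into queries, keys, and values
q_proj = torch.nn.Linear(8, 8)
k_proj = torch.nn.Linear(8, 8)
v_proj = torch.nn.Linear(8, 8)
q, k, v = q_proj(hidden), k_proj(hidden), v_proj(hidden)

# Scaled dot-product attention: every token attends to every other token
scores = q @ k.transpose(-2, -1) / (8 ** 0.5)   # (2, 5, 5) pairwise relevance
weights = F.softmax(scores, dim=-1)             # attention distribution per token
context = weights @ v                           # context-aware token representations
print(context.shape)                            # torch.Size([2, 5, 8])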

PyTorch: PyTorch's dynamic computation graph and user-friendly API simplify the implementation and training of complex neural networks. Its flexibility allows seamless integration with Transformer architectures, enabling streamlined development and experimentation.
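As a small illustration of the dynamic computation graph, the snippet below builds a different computation path on each call depending on a runtime value, and autograd still tracks gradients through it. The function name is hypothetical and the snippet is independent of the MCQ pipeline below.

import torch

def dynamic_forward(x, depth):
    # The graph is built on the fly: the number of matrix products
    # depends on a runtime value, and autograd still records it.
    w = torch.randn(4, 4, requires_grad=True)
    out = x
    for _ in range(depth):
        out = torch.tanh(out @ w)
    return out.sum(), w

loss, w = dynamic_forward(torch.randn(2, 4), depth=3)
loss.backward()          # gradients flow through the dynamically built graph
print(w.grad.shape)      # torch.Size([4, 4])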

Advantages of Combining Transformers with PyTorch

  1. Enhanced contextual understanding: Transformers combined with PyTorch excel at capturing nuanced relationships in text data. This lets them grasp the full context of an MCQ and make more accurate predictions.

  2. Transfer learning capabilities: Pretrained Transformer models such as BERT, RoBERTa, or ALBERT can be fine-tuned on MCQ datasets with PyTorch. Leveraging pretrained models drastically reduces training time and data requirements while still achieving strong performance.

  3. Flexibility and customization: PyTorch's flexibility makes it easy to customize Transformer models. Researchers and developers can adapt architectures, loss functions, and training procedures to the specific needs of an MCQ task (a small sketch of this kind of customization follows this list).

  4. State-of-the-art performance: Transformer-based models consistently reach state-of-the-art results on a wide range of NLP benchmarks. Combined with PyTorch's optimization tooling, they deliver high accuracy when predicting the correct answer to an MCQ.

  5. Scalability and efficiency: PyTorch's efficient handling of computation and the parallel processing capability of Transformers make them a scalable solution. They can process large volumes of MCQs quickly, which makes them suitable for real-time applications.
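As one hedged example of the customization mentioned in point 3, the sketch below loads a pretrained multiple-choice head and freezes the encoder so that only the classification head receives gradient updates. The checkpoint name and the decision to freeze are illustrative choices, not steps required by the walkthrough that follows.

from transformers import AutoModelForMultipleChoice

# Illustrative checkpoint; any BERT-style encoder with a multiple-choice head works similarly
model = AutoModelForMultipleChoice.from_pretrained("distilbert-base-uncased")

# Freeze the encoder so only the small classification head is trained
for param in model.base_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")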

Code Implementation

Below is a brief walkthrough of each step of handling an MCQ task with Transformers and PyTorch, and of how each step benefits from their synergy.

  1. Dataset preparation: Backed by PyTorch, Transformers handle diverse dataset structures effectively. PyTorch's data-handling utilities simplify dataset organization and ensure that MCQs and their corresponding choices are integrated seamlessly for efficient model training.
!pip install datasets transformers evaluate --quiet

import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, Trainer, TrainingArguments
import evaluate
import numpy as np
from datasets import load_dataset  # load_metric is deprecated and unused; evaluate covers metrics
import random

print(transformers.__version__)

# Defining a constant SEED for reproducibility in random operations
SEED = 42

# Setting the seed for the random library to ensure consistent results
random.seed(SEED)

datasets = load_dataset("swag", "regular")

datasets["train"][0]

Output

{'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': '3416',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'gold-source': 'gold',
 'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'label': 0}
def show_one(example):
    print(f"Context: {example['sent1']}")
    print(f"  A - {example['sent2']} {example['ending0']}")
    print(f"  B - {example['sent2']} {example['ending1']}")
    print(f"  C - {example['sent2']} {example['ending2']}")
    print(f"  D - {example['sent2']} {example['ending3']}")
    print(f"\nGround truth: option {['A', 'B', 'C', 'D'][example['label']]}")

show_one(datasets["train"][15])

Output

Context: Now it's someone's turn to rain blades on his opponent.
  A - Someone pats his shoulder and spins wildly.
  B - Someone lunges forward through the window.
  C - Someone falls to the ground.
  D - Someone rolls up his fast run from the water and tosses in the sky.

Ground truth: option C
  2. Preprocessing: PyTorch's compatibility with Transformer models makes text preprocessing straightforward. This includes tokenization, encoding, and sequence preparation, simplifying the conversion of text data into the numerical representations that Transformers understand.
model_checkpoint = 'distilbert-base-uncased' # "bert-base-uncased"
batch_size = 4
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

Output

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
ending_names = ["ending0", "ending1", "ending2", "ending3"]

def preprocess_function(examples):
    # Repeat each first sentence four times to go with the four possibilities of second sentences.
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    # Grab all second sentences possible for each context.
    question_headers = examples["sent2"]
    second_sentences = [[f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)]
    
    # Flatten everything
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])
    
    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    # Un-flatten
    return {k: [v[i:i+4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}

examples = datasets["train"][:5]
features = preprocess_function(examples)
print(len(features["input_ids"]), len(features["input_ids"][0]), [len(x) for x in features["input_ids"][0]])

Output

5 4 [30, 25, 30, 28]
idx = 3
[tokenizer.decode(features["input_ids"][idx][i]) for i in range(4)]

Output

['[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession are playing ping pong and celebrating one left each in quick. [SEP]',
 '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession wait slowly towards the cadets. [SEP]',
 '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession makes a square call and ends by jumping down into snowy streets where fans begin to take their positions. [SEP]',
 '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession play and go back and forth hitting the drums while the audience claps for them. [SEP]']
encoded_datasets = datasets.map(preprocess_function, batched=True)
  3. Fine-tuning: The synergy between PyTorch and Transformers is crucial during fine-tuning. PyTorch's gradient-based optimization and backpropagation efficiently adjust the Transformer model's parameters to the specific nuances of the MCQ task.
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/roberta/modeling_roberta.py#L1266
model = AutoModelForMultipleChoice.from_pretrained(model_checkpoint)
# https://github.com/huggingface/datasets/issues/2165
from torch.utils.data import Dataset, DataLoader, RandomSampler
 
class HFDataset(Dataset):
    def __init__(self, dset):
        self.dset = dset

    def __getitem__(self, idx):
        x = self.dset[idx]
        return {'input_ids': x['input_ids'],
                'attention_mask': x['attention_mask'], # ignore token_type_ids
                'label' : x['label']}

    def __len__(self):
        return len(self.dset)

train_ds = HFDataset(encoded_datasets['train'])
test_ds = HFDataset(encoded_datasets['validation'])
len(encoded_datasets['train']), len(train_ds)

Output

(73546, 73546)
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch
    
    
def HFDataLoader(dataset, tokenizer, batch_size=4, shuffle=True, num_workers=2):
    
    def listdict2dictlist(batch):
        '''
        Input: batch -- list of dict
        Output: dict of list-size-batch
        '''
        d = {}
        keys = batch[0].keys()
    
        for k in keys:
            d[k] = []
            for i in range(len(batch)):
                d[k].append(batch[i][k])
    
        return d
    
    def prepare_sample(sample):
        
        padding = True
        max_length = None
        pad_to_multiple_of = None

        features = listdict2dictlist(sample)
        batch_size = len(features["input_ids"])
        num_choices = len(features["input_ids"][0])
        
        flattened_features = {}
        for k,v in features.items():
            if k=='label': 
                continue
                
            flattened_features[k] = []
            for example in features[k]: # e.g. k='input_ids'
                for choice in example: # e.g. 4 choices per example
                    flattened_features[k].append(choice)

        
        batch = tokenizer.pad(
            flattened_features,
            padding=padding,
            max_length=max_length,
            pad_to_multiple_of=pad_to_multiple_of,
            return_tensors="pt",
        )
        
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(features['label'], dtype=torch.int64)
        return batch
    
    sampler = RandomSampler(dataset) if shuffle else None
    return DataLoader(dataset,
            sampler=sampler,
            batch_size=batch_size,
            collate_fn=prepare_sample,
            num_workers=num_workers)
import os
os.environ['TOKENIZERS_PARALLELISM'] = "false"

train_loader = HFDataLoader(train_ds, tokenizer,  batch_size=16)
test_loader = HFDataLoader(test_ds, tokenizer,  batch_size=16, shuffle=False)
for x in train_loader:
    print(x)
    break

Output

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'input_ids': tensor([[[  101,  1037,  2450,  ...,     0,     0,     0],
         [  101,  1037,  2450,  ...,     0,     0,     0],
         [  101,  1037,  2450,  ...,     0,     0,     0],
         [  101,  1037,  2450,  ...,     0,     0,     0]],

        [[  101,  2111,  2024,  ...,     0,     0,     0],
         [  101,  2111,  2024,  ...,     0,     0,     0],
         [  101,  2111,  2024,  ...,     0,     0,     0],
         [  101,  2111,  2024,  ...,     0,     0,     0]],

        [[  101,  2059,  2007,  ...,     0,     0,     0],
         [  101,  2059,  2007,  ...,     0,     0,     0],
         [  101,  2059,  2007,  ...,     0,     0,     0],
         [  101,  2059,  2007,  ...,     0,     0,     0]],

        ...,

        [[  101,  2002, 17395,  ...,     0,     0,     0],
         [  101,  2002, 17395,  ...,     0,     0,     0],
         [  101,  2002, 17395,  ...,     0,     0,     0],
         [  101,  2002, 17395,  ...,     0,     0,     0]],

        [[  101,  1037,  2450,  ...,     0,     0,     0],
         [  101,  1037,  2450,  ...,     0,     0,     0],
         [  101,  1037,  2450,  ...,     0,     0,     0],
         [  101,  1037,  2450,  ...,     0,     0,     0]],

        [[  101,  2002, 12668,  ...,     0,     0,     0],
         [  101,  2002, 12668,  ...,     0,     0,     0],
         [  101,  2002, 12668,  ...,     0,     0,     0],
         [  101,  2002, 12668,  ...,     0,     0,     0]]]), 'attention_mask': tensor([[[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]],

        [[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]],

        [[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]],

        ...,

        [[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]],

        [[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]],

        [[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]]), 'labels': tensor([1, 0, 0, 2, 2, 0, 1, 0, 2, 3, 1, 3, 2, 0, 0, 2])}
  4. Training: PyTorch's training utilities, combined with the Transformer architecture, streamline the training process. The tight integration enables efficient computation and parameter updates, accelerating the model's convergence on the MCQ dataset.
import pytorch_lightning as pl
class PLTransformer(pl.LightningModule):
    def __init__(
        self,
        model_base,
        learning_rate: float = 2e-5,
        adam_epsilon: float = 1e-8,
        warmup_steps: int = 0,
        weight_decay: float = 0.0,
        train_batch_size: int = 32,
        eval_batch_size: int = 32,
        **kwargs,
    ):
        super().__init__()

#         self.save_hyperparameters() # cause code to freeze if we have model_base as argument !!
        self.model_base = model_base
        self.lr = learning_rate
        self.num_labels = 4 # TODO: hard code ATM
        
    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwarg):
        return self.model_base(input_ids=input_ids, attention_mask=attention_mask, labels=labels, **kwarg)

    def training_step(self, batch, batch_idx):
        outputs = self.forward(**batch)
        loss = outputs.loss
        return loss

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        outputs = self.forward(**batch)
        val_loss, logits = outputs.loss, outputs.logits

        if self.num_labels > 1:
            preds = torch.argmax(logits, axis=1)
        elif self.num_labels == 1:
            preds = logits.squeeze()

        labels = batch["labels"]

        return {"loss": val_loss, "preds": preds, "labels": labels}


    def configure_optimizers(self):
        """Prepare optimizer and schedule (linear warmup and decay)"""
        optimizer =  torch.optim.AdamW(self.model_base.parameters(), lr=self.lr,)
        return optimizer #, [scheduler]

pl_model = PLTransformer(model)

print(pl_model.to('cpu')(**x))
pl_model.to('cpu').training_step(x, 0)

Output

MultipleChoiceModelOutput(loss=tensor(1.3915, grad_fn=<NllLossBackward0>), logits=tensor([[ 0.0144,  0.0352,  0.0017,  0.0198],
        [-0.0346, -0.0176, -0.0254, -0.0258],
        [-0.0054,  0.0022,  0.0579, -0.0057],
        [-0.0168, -0.0084, -0.0332,  0.0098],
        [ 0.0393,  0.0254,  0.0325,  0.0005],
        [ 0.0292,  0.0291,  0.0407,  0.0326],
        [-0.0220, -0.0277, -0.0461, -0.0345],
        [-0.0347, -0.0353, -0.0412, -0.0308],
        [ 0.0145,  0.0040, -0.0098, -0.0152],
        [ 0.0151, -0.0131,  0.0044, -0.0081],
        [-0.0025, -0.0051,  0.0014, -0.0056],
        [ 0.0293,  0.0211,  0.0291,  0.0254],
        [-0.0377,  0.0128, -0.0248, -0.0133],
        [ 0.0255,  0.0315,  0.0295,  0.0504],
        [-0.0230,  0.0035,  0.0003, -0.0109],
        [ 0.0458,  0.0464,  0.0418,  0.0733]], grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)
tensor(1.3915, grad_fn=<NllLossBackward0>)
trainer = pl.Trainer(
    max_epochs=1,
    accelerator="gpu",
    devices=[0],
    precision='16',
)

trainer.fit(pl_model, train_loader, test_loader)
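Once training finishes, a quick sanity check is to run the fine-tuned model on one validation batch and pick the highest-scoring ending for each question. The sketch below is a minimal example built on the objects defined above (pl_model and test_loader); the move to CPU and the batch-level accuracy computation are my own additions, not part of the original training loop.

import torch

# Quick check on a single validation batch with the fine-tuned model
pl_model = pl_model.to("cpu").eval()
batch = next(iter(test_loader))

with torch.no_grad():
    logits = pl_model(**batch).logits          # shape: (batch_size, num_choices)
    preds = logits.argmax(dim=-1)              # index of the highest-scoring ending

accuracy = (preds == batch["labels"]).float().mean().item()
print(f"Accuracy on this batch: {accuracy:.3f}")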

Conclusion

The combination of Transformer-based architectures and PyTorch offers a compelling framework for handling MCQ tasks efficiently and accurately. The advantages Transformers bring, including enhanced contextual understanding and transfer learning, together with PyTorch's flexibility and optimization tooling, make this pairing an excellent choice for building robust MCQ-solving models.

As Transformer architectures and PyTorch continue to evolve, their integration promises even greater advances in automating MCQ assessment across diverse domains.

In summary, the combination of Transformers and PyTorch is a cornerstone for developing models that handle MCQ tasks effectively, paving the way for better automated question-answering systems.

Stay connected and support my work through various platforms:

Huggingface: For natural language processing and AI-related projects, you can check out my Hugging Face profile at https://huggingface.co/Andyrasika.

LinkedIn: To stay up to date with my latest projects and posts, you can follow me on LinkedIn. Here is the link to my profile: https://www.linkedin.com/in/ankushsingal/.

Requests and questions: If you have a project you would like me to work on, or any questions about the concepts explained in this article, feel free to let me know. I am always looking for new ideas for future notebooks and am happy to help with any questions you might have.
