Annotate Text Data Using Active Learning with Cleanlab

Authored by: Aravind Putrevu

In this notebook, I highlight the use of active learning to improve a fine-tuned Hugging Face Transformer for text classification, while keeping the total number of labels collected from human annotators low. When resource constraints prevent you from acquiring labels for your entire dataset, active learning aims to save both time and money by selecting which examples the data annotators should spend their effort labeling.

What is Active Learning?

Active learning helps prioritize what data to label in order to maximize the performance of a supervised machine learning model trained on that labeled data. The process usually happens iteratively: in each round, active learning tells us which examples we should collect additional annotations for to improve our current model the most under a limited labeling budget. ActiveLab is an active learning algorithm that is particularly useful when the labels coming from human annotators are noisy, and when deciding whether to collect one more annotation for a previously annotated example (whose label looks suspect) or for a not-yet-annotated example. After collecting these new annotations for a batch of data to expand our training dataset, we re-train our model and evaluate its test accuracy.
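To make this concrete before introducing ActiveLab, here is a tiny toy sketch of the simplest form of active learning (plain uncertainty sampling with scikit-learn on synthetic data). It only illustrates the generic score-then-query loop; it is not the ActiveLab procedure used in the rest of this notebook.

# Toy illustration of one active learning round via uncertainty sampling.
# This is NOT ActiveLab; it just shows the core idea: score unlabeled examples,
# then ask annotators to label the ones expected to help the model most.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
# Pretend only 20 examples are labeled so far (10 from each class).
labeled_idx = np.concatenate([np.where(y_toy == 0)[0][:10], np.where(y_toy == 1)[0][:10]])
unlabeled_idx = np.setdiff1d(np.arange(len(y_toy)), labeled_idx)

model = LogisticRegression().fit(X_toy[labeled_idx], y_toy[labeled_idx])
probs = model.predict_proba(X_toy[unlabeled_idx])
uncertainty = 1 - probs.max(axis=1)  # least-confident predictions first
query_idx = unlabeled_idx[np.argsort(-uncertainty)[:10]]
print("Next 10 examples to send to annotators:", query_idx)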

[Figure: ActiveLab overview (ActiveLab thumb.webp)]

In this notebook, I consider a binary text classification task: predicting whether a particular phrase is polite or impolite.

When it comes to collecting additional annotations for a Transformer model, active learning with ActiveLab is far better than random selection. It consistently produces better models, with roughly 50% lower error rates, regardless of the total labeling budget.

The rest of this notebook walks through the open-source code you can use to achieve these results.

Setting up the Environment

!pip install datasets==2.20.0 transformers==4.25.1 scikit-learn==1.1.2 matplotlib==3.5.3 cleanlab
import pandas as pd

pd.set_option("max_colwidth", None)
import numpy as np
import random
import transformers
import datasets
import matplotlib.pyplot as plt

from cleanlab.multiannotator import (
    get_majority_vote_label,
    get_active_learning_scores,
    get_label_quality_multiannotator,
)
from transformers import AutoTokenizer, AutoModel
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_dataset, Dataset, DatasetDict, ClassLabel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from scipy.special import softmax
from datetime import datetime

Collecting and Curating the Data

Here we download the data needed for this notebook.

labeled_data_file = {"labeled": "X_labeled_full.csv"}
unlabeled_data_file = {"unlabeled": "X_unlabeled.csv"}
test_data_file = {"test": "test.csv"}

X_labeled_full = load_dataset("Cleanlab/stanford-politeness", split="labeled", data_files=labeled_data_file)
X_unlabeled = load_dataset("Cleanlab/stanford-politeness", split="unlabeled", data_files=unlabeled_data_file)
test = load_dataset("Cleanlab/stanford-politeness", split="test", data_files=test_data_file)

!wget -nc -O 'extra_annotations.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/extra_annotations.npy?download=true'

extra_annotations = np.load("extra_annotations.npy",allow_pickle=True).item()
X_labeled_full = X_labeled_full.to_pandas()
X_labeled_full.set_index("id", inplace=True)
X_unlabeled = X_unlabeled.to_pandas()
X_unlabeled.set_index("id", inplace=True)
test = test.to_pandas()

Classifying the Politeness of Text

We use the Stanford Politeness Corpus as our dataset.

It is structured as a binary text classification task: classifying whether each phrase is polite or impolite. Human annotators are given a selected text phrase and provide an (imperfect) annotation of its politeness: **0** for impolite and **1** for polite.

We train a Transformer classifier on the annotated data and measure model accuracy on a held-out set of test examples whose ground-truth labels I am confident in, because they stem from a consensus among 5 annotators who each labeled these examples.

As for the training data, we have:

  • X_labeled_full: our initial training set, with just 100 text examples that each have 2 annotations.
  • X_unlabeled: a large set of 1,900 unlabeled text examples that we can consider having annotators label.
  • extra_annotations: the pool of extra annotations we draw from whenever an annotation is requested for an example.

Visualizing the Data

# Multi-annotated Data
X_labeled_full.head()
# Unlabeled Data
X_unlabeled.head()
# extra_annotations contains the annotations that we will use when an additional annotation is requested.
extra_annotations

# Random sample of extra_annotations to see format.
{k: extra_annotations[k] for k in random.sample(list(extra_annotations.keys()), 5)}
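For orientation, extra_annotations maps each text example's id to a dictionary of annotator ids and their politeness labels (0 or 1). The ids below are invented purely to illustrate the assumed structure and do not come from the actual dataset:

# Hypothetical illustration of the extra_annotations structure (ids are made up):
# each key is a text example id, each value maps an annotator id to a 0/1 politeness label.
illustrative_format = {
    1504: {"A0053": 1, "A0070": 0},  # two spare annotations available for example 1504
    3492: {"A0016": 1},              # one spare annotation available for example 3492
}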

Viewing some examples from the test set

>>> num_to_label = {0: "Impolite", 1: "Polite"}
>>> for i in range(2):
...     print(f"{num_to_label[i]} examples:")
...     subset = test[test.label == i][["text"]].sample(n=3, random_state=2)
...     print(subset)
Impolite examples:

text
120  And it's a waste of our time. I can only say it once more: why don't you do something constructive and add some content about your beloved Macedonia?
150  Instead of telling me how wrong my decisions to close certain afds were, why not spend your time working on the current afd backlog <url>? If my decisions were so wrong, why haven't you reopened them?
326  This should have been moved to <url> per the CFD. Why hasn't it been moved?

Polite examples:

text
498  Hi, I have raised the possibility of unprotecting the theophylline page at <url>. What are your thoughts?
132  The page alignment has changed because of some edits. Could you help?
131  I'm glad you are happy with the overall look. Before I label all the streets, are the text size, font style, etc. suitable?

Helper Methods

The following section contains all the helper methods needed for this notebook.

get_idx_to_label is used in the active learning setting, particularly when working with a mix of labeled and unlabeled data. Its main goal is to determine which examples (from both the labeled and unlabeled datasets) should be selected for additional annotation, based on their active learning scores.

# Helper method to get indices of examples with the lowest active learning score to collect more labels for.
def get_idx_to_label(
    X_labeled_full,
    X_unlabeled,
    extra_annotations,
    batch_size_to_label,
    active_learning_scores,
    active_learning_scores_unlabeled=None,
):
    if active_learning_scores_unlabeled is None:
        active_learning_scores_unlabeled = np.array([])

    to_label_idx = []
    to_label_idx_unlabeled = []

    num_labeled = len(active_learning_scores)
    active_learning_scores_combined = np.concatenate((active_learning_scores, active_learning_scores_unlabeled))
    to_label_idx_combined = np.argsort(active_learning_scores_combined)

    # We want to collect the n=batch_size best examples to collect another annotation for.
    i = 0
    while (len(to_label_idx) + len(to_label_idx_unlabeled)) < batch_size_to_label:
        idx = to_label_idx_combined[i]
        # We know this is an already annotated example.
        if idx < num_labeled:
            text_id = X_labeled_full.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx.append(idx)
        # We know this is an example that is currently not annotated.
        else:
            # Subtract off offset to get back original index.
            idx -= num_labeled
            text_id = X_unlabeled.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx_unlabeled.append(idx)
        i += 1

    to_label_idx = np.array(to_label_idx)
    to_label_idx_unlabeled = np.array(to_label_idx_unlabeled)
    return to_label_idx, to_label_idx_unlabeled

get_idx_to_label_random is used in an active learning setting where the data points to annotate are selected randomly rather than based on model uncertainty or learning scores. This approach serves as a baseline to compare against more sophisticated active learning strategies, or can be used when there is no clear way to score the examples.

# Helper method to get indices of random examples to collect more labels for.
def get_idx_to_label_random(X_labeled_full, X_unlabeled, extra_annotations, batch_size_to_label):
    to_label_idx = []
    to_label_idx_unlabeled = []

    # Generate list of indices for both sets of examples.
    labeled_idx = [(x, "labeled") for x in range(len(X_labeled_full))]
    unlabeled_idx = []
    if X_unlabeled is not None:
        unlabeled_idx = [(x, "unlabeled") for x in range(len(X_unlabeled))]
    combined_idx = labeled_idx + unlabeled_idx

    # We want to collect the n=batch_size random examples to collect another annotation for.
    while (len(to_label_idx) + len(to_label_idx_unlabeled)) < batch_size_to_label:
        # Random choice from indices.
        # We time-seed to ensure randomness.
        random.seed(datetime.now().timestamp())
        choice = random.choice(combined_idx)
        idx, which_subset = choice
        # We know this is an already annotated example.
        if which_subset == "labeled":
            text_id = X_labeled_full.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx.append(idx)
            combined_idx.remove(choice)
        # We know this is an example that is currently not annotated.
        else:
            text_id = X_unlabeled.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx_unlabeled.append(idx)
            combined_idx.remove(choice)

    to_label_idx = np.array(to_label_idx)
    to_label_idx_unlabeled = np.array(to_label_idx_unlabeled)
    return to_label_idx, to_label_idx_unlabeled

Below are some utility methods that help us compute standard deviations, choose a specific annotator who has already annotated examples before, and tokenize the text examples.

# Helper method to compute std dev across 2D array of accuracies.
def compute_std_dev(accuracy):
    def compute_std_dev_ind(accs):
        mean = np.mean(accs)
        std_dev = np.std(accs)
        return np.array([mean - std_dev, mean + std_dev])

    std_dev = np.apply_along_axis(compute_std_dev_ind, 0, accuracy)
    return std_dev


# Helper method to select which annotator we should collect another annotation from.
def choose_existing(annotators, existing_annotators):
    for annotator in annotators:
        # If we find one that has already given an annotation, we return it.
        if annotator in existing_annotators:
            return annotator
    # If we don't find an existing, just return a random one.
    choice = random.choice(list(annotators.keys()))
    return choice


# Helper method for Trainer.
def compute_metrics(p):
    logits, labels = p
    pred = np.argmax(logits, axis=1)
    pred_probs = softmax(logits, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    return {"logits": logits, "pred_probs": pred_probs, "accuracy": accuracy}


# Helper method to tokenize text.
def tokenize_function(examples):
    model_name = "distilbert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer(examples["text"], padding="max_length", truncation=True)


# Helper method to tokenize given dataset.
def tokenize_data(data):
    dataset = Dataset.from_dict({"label": data["label"], "text": data["text"].values})
    tokenized_dataset = dataset.map(tokenize_function, batched=True)
    tokenized_dataset = tokenized_dataset.cast_column("label", ClassLabel(names=["0", "1"]))
    return tokenized_dataset

The get_trainer function here sets up a training environment for the text classification task using DistilBERT (a distilled version of BERT that is lighter and faster).

# Helper method to initiate a new Trainer with given train and test sets.
def get_trainer(train_set, test_set):

    # Model params.
    model_name = "distilbert-base-uncased"
    model_folder = "model_training"
    max_training_steps = 300
    num_classes = 2

    # Set training args.
    # We time-seed to ensure randomness between different benchmarking runs.
    training_args = TrainingArguments(
        max_steps=max_training_steps, output_dir=model_folder, seed=int(datetime.now().timestamp())
    )

    # Tokenize train/test set.
    train_tokenized_dataset = tokenize_data(train_set)
    test_tokenized_dataset = tokenize_data(test_set)

    # Initiate a pre-trained model.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_tokenized_dataset,
        eval_dataset=test_tokenized_dataset,
    )
    return trainer

The get_pred_probs function computes out-of-sample predicted probabilities for a given dataset using cross-validation, with extra handling for the unlabeled data.

# Helper method to manually compute cross-validated predicted probabilities needed for ActiveLab.
def get_pred_probs(X, X_unlabeled):
    """Uses cross-validation to obtain out-of-sample predicted probabilities
    for given dataset"""

    # Generate cross-val splits.
    n_splits = 3
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True)
    skf_splits = [[train_index, test_index] for train_index, test_index in skf.split(X=X["text"], y=X["label"])]

    # Initiate empty array to store pred_probs.
    num_examples, num_classes = len(X), len(X.label.value_counts())
    pred_probs = np.full((num_examples, num_classes), np.NaN)
    pred_probs_unlabeled = None

    # If we use up all examples from the initial unlabeled pool, X_unlabeled will be None.
    if X_unlabeled is not None:
        pred_probs_unlabeled = np.full((n_splits, len(X_unlabeled), num_classes), np.NaN)

    # Iterate through cross-validation folds.
    for split_num, split in enumerate(skf_splits):
        train_index, test_index = split

        train_set = X.iloc[train_index]
        test_set = X.iloc[test_index]

        # Get trainer with train/test subsets.
        trainer = get_trainer(train_set, test_set)
        trainer.train()
        eval_metrics = trainer.evaluate()

        # Get pred_probs and insert into dataframe.
        pred_probs_fold = eval_metrics["eval_pred_probs"]
        pred_probs[test_index] = pred_probs_fold

        # Since we don't have labels for the unlabeled pool, we compute pred_probs at each round of CV
        # and then average the results at the end.
        if X_unlabeled is not None:
            dataset_unlabeled = Dataset.from_dict({"text": X_unlabeled["text"].values})
            unlabeled_tokenized_dataset = dataset_unlabeled.map(tokenize_function, batched=True)
            logits = trainer.predict(unlabeled_tokenized_dataset).predictions
            curr_pred_probs_unlabeled = softmax(logits, axis=1)
            pred_probs_unlabeled[split_num] = curr_pred_probs_unlabeled

    # Here we average the pred_probs from each round of CV to get pred_probs for the unlabeled pool.
    if X_unlabeled is not None:
        pred_probs_unlabeled = np.mean(np.array(pred_probs_unlabeled), axis=0)

    return pred_probs, pred_probs_unlabeled

The get_annotator function determines which annotator is best to collect a new annotation from according to a set of criteria, while get_annotation collects the actual annotation for a given example from the chosen annotator; it also removes the collected annotation from the pool so it cannot be chosen again.

# Helper method to determine which annotator to collect annotation from for given example.
def get_annotator(example_id):
    # Update who has already annotated at least one example.
    existing_annotators = set(X_labeled_full.drop("text", axis=1).columns)
    # Returns the annotator we want to collect annotation from.
    # Chooses existing annotators first.
    annotators = extra_annotations[example_id]
    chosen_annotator = choose_existing(annotators, existing_annotators)
    return chosen_annotator


# Helper method to collect an annotation for given text example.
def get_annotation(example_id, chosen_annotator):

    # Collect new annotation.
    new_annotation = extra_annotations[example_id][chosen_annotator]

    # Remove annotation.
    del extra_annotations[example_id][chosen_annotator]

    return new_annotation

Run the following cell to hide the HTML output of the next model training block.

%%html
<style>
    div.output_stderr {
    display: none;
    }
</style>

Methodology

For each round of **active learning**, we (a condensed code sketch of one round follows this list):

  1. Compute ActiveLab consensus labels for each training example, derived from all annotations collected thus far.
  2. Train our Transformer classification model on the current training set, using these consensus labels.
  3. Evaluate test accuracy on the test set (which has high-quality ground-truth labels).
  4. Run cross-validation to get out-of-sample predicted class probabilities from the model for the entire training set and the unlabeled set.
  5. Get ActiveLab active learning scores for every example in the training set and the unlabeled set. These scores estimate how informative it would be to collect another annotation for each example.
  6. Select a subset (n = batch_size) of examples with the lowest active learning scores.
  7. Collect one extra annotation for each of the n selected examples.
  8. Add the new annotations (and the previously unlabeled examples, if they were selected) to our training set for the next iteration.
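To tie the steps together, here is a condensed sketch of a single round assembled from the helper methods defined above; it leaves out the unlabeled-pool bookkeeping and the ActiveLab-consensus branch that the full training loop later in the notebook handles, so treat it as an outline rather than the exact code that is run.

# Condensed sketch of a single round, assembled from the helper methods defined above.
# The full loop later in the notebook additionally handles the unlabeled-pool bookkeeping
# and switches between majority-vote and ActiveLab consensus.
batch_size_to_label = 25  # how many new annotations to collect this round

multiannotator_labels = X_labeled_full.drop(["text"], axis=1)

# Steps 1-3: consensus labels (majority vote shown here), training, and test-set evaluation.
consensus_labels = get_majority_vote_label(multiannotator_labels)
train_set = X_labeled_full[["text"]].copy()
train_set["label"] = consensus_labels
trainer = get_trainer(train_set, test[["text", "label"]])
trainer.train()
accuracy = trainer.evaluate()["eval_accuracy"]

# Steps 4-5: out-of-sample predicted probabilities via cross-validation, then ActiveLab scores.
pred_probs, pred_probs_unlabeled = get_pred_probs(train_set, X_unlabeled)
scores, scores_unlabeled = get_active_learning_scores(multiannotator_labels, pred_probs, pred_probs_unlabeled)

# Steps 6-8: pick the lowest-scoring batch and collect one extra annotation for each chosen example.
chosen_labeled, chosen_unlabeled = get_idx_to_label(
    X_labeled_full, X_unlabeled, extra_annotations, batch_size_to_label, scores, scores_unlabeled
)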

I then compare models trained on data labeled via active learning against models trained on data labeled via **random selection**. For each round of random selection, I use majority-vote consensus (in step 1) instead of ActiveLab consensus, and then simply choose **n** examples at random to collect extra labels for (in step 6), rather than using ActiveLab scores.

More intuition about the ActiveLab consensus labels and active learning scores is shared later in this notebook.

[Figure: ActiveLab workflow (activelab.png)]

Model Training and Evaluation

I first tokenize my train and test sets and then initialize a pretrained DistilBert Transformer model. Fine-tuning DistilBert for 300 training steps achieved a good balance between accuracy and training time on my data. This classifier outputs predicted class probabilities, which I convert into class predictions before evaluating their accuracy.
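As a minimal illustration of that last step (the same conversion the compute_metrics helper above performs inside the Trainer), with made-up values:

import numpy as np
from sklearn.metrics import accuracy_score

# Made-up predicted probabilities for 3 test examples (columns: impolite, polite).
toy_pred_probs = np.array([[0.8, 0.2], [0.3, 0.7], [0.45, 0.55]])
toy_labels = np.array([0, 1, 0])

toy_preds = np.argmax(toy_pred_probs, axis=1)  # probabilities -> class predictions
print(accuracy_score(toy_labels, toy_preds))   # 2 of 3 correct -> 0.666...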

Using Active Learning Scores to Decide What to Annotate Next

In each round of active learning, we fit our Transformer model via 3-fold cross-validation on the current training set. This lets us obtain out-of-sample predicted class probabilities for every example in the training set, and we can also use the trained Transformers to get out-of-sample predicted class probabilities for every example in the unlabeled pool. All of this is done internally by the get_pred_probs helper method. Using out-of-sample predictions helps us avoid bias due to potential overfitting.

Once I have these probabilistic predictions, I pass them into the get_active_learning_scores method from the open-source cleanlab package, which implements the ActiveLab algorithm. This method provides scores for all of our labeled and unlabeled data. Lower scores indicate data points for which collecting one additional label should be most informative to the current model (the scores are directly comparable between labeled and unlabeled data).
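Taken in isolation, the key call looks like the excerpt below (the actual call happens inside the training loop later on; the shapes noted in the comments are my assumptions about this dataset):

# Excerpted from the training loop below, shown here only to illustrate the inputs:
# multiannotator_labels: DataFrame with one row per labeled example and one column per annotator (NaN where missing)
# pred_probs:            (num_labeled_examples, 2) out-of-sample probabilities from cross-validation
# pred_probs_unlabeled:  (num_unlabeled_examples, 2) probabilities for the unlabeled pool
active_learning_scores, active_learning_scores_unlabeled = get_active_learning_scores(
    multiannotator_labels, pred_probs, pred_probs_unlabeled
)
# Lower score = collecting one more annotation for that example should help the current model most.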

I form a batch of the lowest-scoring examples to collect annotations for (via the get_idx_to_label method). Here I always collect exactly the same number of annotations in each round (for both the active learning and random selection approaches). For this application, I also cap the maximum number of annotations per example at 5 (we don't want to keep spending effort on the same example over and over).

Adding New Annotations

combined_example_ids contains the ids of the text examples we want to collect an annotation for. For each of these, we use the get_annotation helper method to collect a new annotation from an annotator. Here we prioritize annotations from annotators who have already annotated another example. If none of the given example's annotators exist in the training set, we choose one at random; in that case, we add a new column to the training set representing the new annotator. Finally, we add the newly collected annotation to the training set. If the corresponding example was previously unlabeled, we also add it to the training set and remove it from the unlabeled pool.

We have now completed one round of collecting new annotations and retraining the Transformer model on the updated training set. We repeat this process for multiple rounds to keep expanding the training dataset and improving the model.

# For this Active Learning demo, we add 25 additional annotations to the training set
# each iteration, for 25 rounds.
num_rounds = 25
batch_size_to_label = 25
model_accuracy_arr = np.full(num_rounds, np.nan)

# The 'selection_method' variable determines if we use ActiveLab or random selection
# to choose the new annotations each round.
selection_method = "random"
# selection_method = 'active_learning'

# Each round we:
# - train our model
# - evaluate on unchanging test set
# - collect and add new annotations to training set
for i in range(num_rounds):

    # X_labeled_full is updated each iteration. We drop the text column which leaves us with just the annotations.
    multiannotator_labels = X_labeled_full.drop(["text"], axis=1)

    # Use majority vote when using random selection to select the consensus label for each example.
    if i == 0 or selection_method == "random":
        consensus_labels = get_majority_vote_label(multiannotator_labels)

    # When using ActiveLab, use cleanlab's CrowdLab to select the consensus label for each example.
    else:
        results = get_label_quality_multiannotator(
            multiannotator_labels,
            pred_probs_labeled,
            calibrate_probs=True,
        )
        consensus_labels = results["label_quality"]["consensus_label"].values

    # We only need the text and label columns.
    train_set = X_labeled_full[["text"]]
    train_set["label"] = consensus_labels
    test_set = test[["text", "label"]]

    # Train our Transformer model on the full set of labeled data to evaluate model accuracy for the current round.
    # This is an optional step for demonstration purposes, in practical applications
    # you may not have ground truth labels.
    trainer = get_trainer(train_set, test_set)
    trainer.train()
    eval_metrics = trainer.evaluate()
    # set statistics
    model_accuracy_arr[i] = eval_metrics["eval_accuracy"]

    # For ActiveLab, we need to run cross-validation to get out-of-sample predicted probabilities.
    if selection_method == "active_learning":
        pred_probs, pred_probs_unlabeled = get_pred_probs(train_set, X_unlabeled)

        # Compute active learning scores.
        active_learning_scores, active_learning_scores_unlabeled = get_active_learning_scores(
            multiannotator_labels, pred_probs, pred_probs_unlabeled
        )

        # Get the indices of examples to collect more labels for.
        chosen_examples_labeled, chosen_examples_unlabeled = get_idx_to_label(
            X_labeled_full,
            X_unlabeled,
            extra_annotations,
            batch_size_to_label,
            active_learning_scores,
            active_learning_scores_unlabeled,
        )

    # We don't need to run cross-validation, just get random examples to collect annotations for.
    if selection_method == "random":
        chosen_examples_labeled, chosen_examples_unlabeled = get_idx_to_label_random(
            X_labeled_full, X_unlabeled, extra_annotations, batch_size_to_label
        )

    unlabeled_example_ids = np.array([])
    # Check to see if we still have unlabeled examples left.
    if X_unlabeled is not None:
        # Get unlabeled text examples we want to collect annotations for.
        new_text = X_unlabeled.iloc[chosen_examples_unlabeled]
        unlabeled_example_ids = new_text.index.values
        num_ex, num_annot = len(new_text), multiannotator_labels.shape[1]
        empty_annot = pd.DataFrame(
            data=np.full((num_ex, num_annot), np.NaN),
            columns=multiannotator_labels.columns,
            index=unlabeled_example_ids,
        )
        new_unlabeled_df = pd.concat([new_text, empty_annot], axis=1)

        # Combine unlabeled text examples with existing, labeled examples.
        X_labeled_full = pd.concat([X_labeled_full, new_unlabeled_df], axis=0)

        # Remove examples from X_unlabeled and check if empty.
        # Once it is empty we set it to None to handle appropriately elsewhere.
        X_unlabeled = X_unlabeled.drop(new_text.index)
        if X_unlabeled.empty:
            X_unlabeled = None

    if selection_method == "active_learning":
        # Update pred_prob arrays with newly added examples if necessary.
        if pred_probs_unlabeled is not None and len(chosen_examples_unlabeled) != 0:
            pred_probs_new = pred_probs_unlabeled[chosen_examples_unlabeled, :]
            pred_probs_labeled = np.concatenate((pred_probs, pred_probs_new))
            pred_probs_unlabeled = np.delete(pred_probs_unlabeled, chosen_examples_unlabeled, axis=0)
        # Otherwise we have nothing to modify.
        else:
            pred_probs_labeled = pred_probs

    # Get combined list of text ID's to relabel.
    labeled_example_ids = X_labeled_full.iloc[chosen_examples_labeled].index.values
    combined_example_ids = np.concatenate([labeled_example_ids, unlabeled_example_ids])

    # Now we collect annotations for the selected examples.
    for example_id in combined_example_ids:
        # Choose which annotator to collect annotation from.
        chosen_annotator = get_annotator(example_id)
        # Collect new annotation.
        new_annotation = get_annotation(example_id, chosen_annotator)
        # New annotator has been selected.
        if chosen_annotator not in X_labeled_full.columns.values:
            empty_col = np.full((len(X_labeled_full),), np.nan)
            X_labeled_full[chosen_annotator] = empty_col

        # Add selected annotation to the training set.
        X_labeled_full.at[example_id, chosen_annotator] = new_annotation

Results

I ran 25 rounds of active learning (annotating a batch of data and retraining the Transformer model), collecting 25 annotations per round. I then repeated all of this, next using random selection to decide which examples to annotate in each round, as a baseline for comparison. Both approaches start from the same initial training set of 100 examples before any additional data is annotated (and hence achieve roughly the same Transformer accuracy in the first round). Because of the inherent stochasticity in training Transformers, I ran the whole process five times (for each data-labeling strategy) and report the standard deviation (shaded region) and mean (solid line) of test accuracies across the five replicate runs.

# Get numpy array of results.
!wget -nc -O 'activelearn_acc.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/activelearn_acc.npy'
!wget -nc -O 'random_acc.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/random_acc.npy'
# Helper method to compute std dev across 2D array of accuracies.
def compute_std_dev(accuracy):
    def compute_std_dev_ind(accs):
        mean = np.mean(accs)
        std_dev = np.std(accs)
        return np.array([mean - std_dev, mean + std_dev])

    std_dev = np.apply_along_axis(compute_std_dev_ind, 0, accuracy)
    return std_dev
>>> al_acc = np.load("activelearn_acc.npy")
>>> rand_acc = np.load("random_acc.npy")

>>> rand_acc_std = compute_std_dev(rand_acc)
>>> al_acc_std = compute_std_dev(al_acc)

>>> plt.plot(range(1, al_acc.shape[1] + 1), np.mean(al_acc, axis=0), label="active learning", color="green")
>>> plt.fill_between(range(1, al_acc.shape[1] + 1), al_acc_std[0], al_acc_std[1], alpha=0.3, color="green")

>>> plt.plot(range(1, rand_acc.shape[1] + 1), np.mean(rand_acc, axis=0), label="random", color="red")
>>> plt.fill_between(range(1, rand_acc.shape[1] + 1), rand_acc_std[0], rand_acc_std[1], alpha=0.1, color="red")

>>> plt.hlines(y=0.9, xmin=1.0, xmax=25.0, color="black", linestyle="dotted")
>>> plt.legend()
>>> plt.xlabel("Round Number")
>>> plt.ylabel("Test Accuracy")
>>> plt.title("ActiveLab vs Random Annotation Selection --- 5 Runs")
>>> plt.savefig("al-results.png")
>>> plt.show()

We see that choosing which data to annotate next has a big impact on model performance. Active learning with ActiveLab consistently outperforms random selection by a significant margin in every round. For example, in round 4, with 275 total annotations in the training set, we obtain 91% accuracy via active learning versus only 76% accuracy without a clever strategy for choosing what to annotate. Overall, Transformer models fit to datasets constructed via active learning have roughly 50% lower error rates, regardless of the total labeling budget!

When annotating data for text classification, you should consider active learning with the option to re-label examples, in order to better account for imperfect annotators.
