Annotate Text Data Using Active Learning with Cleanlab

Authored by: Aravind Putrevu

In this notebook, I highlight the use of active learning to improve a fine-tuned Hugging Face Transformer for text classification, while keeping the total number of labels collected from human annotators low. When resource constraints prevent you from acquiring labels for your entire dataset, active learning aims to save both time and money by selecting which examples the data annotators should spend their effort labeling.

What is Active Learning?

Active learning helps prioritize what data to label in order to maximize the performance of a supervised machine learning model trained on that labeled data. The process usually happens iteratively: in each round, active learning tells us which examples we should collect additional annotations for to improve our current model the most under a limited labeling budget. ActiveLab is an active learning algorithm that is particularly useful when the labels coming from human annotators are noisy, and when deciding whether to collect one more annotation for a previously annotated example (whose label looks suspect) or for a not-yet-annotated example. After collecting these new annotations for a batch of data to expand our training dataset, we re-train our model and evaluate its test accuracy.
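To make this concrete before introducing ActiveLab, here is a tiny toy sketch of the simplest form of active learning (plain uncertainty sampling with scikit-learn on synthetic data). It only illustrates the generic score-then-query loop; it is not the ActiveLab procedure used in the rest of this notebook.

# Toy illustration of one active learning round via uncertainty sampling.
# This is NOT ActiveLab; it just shows the core idea: score unlabeled examples,
# then ask annotators to label the ones expected to help the model most.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
# Pretend only 20 examples are labeled so far (10 from each class).
labeled_idx = np.concatenate([np.where(y_toy == 0)[0][:10], np.where(y_toy == 1)[0][:10]])
unlabeled_idx = np.setdiff1d(np.arange(len(y_toy)), labeled_idx)

model = LogisticRegression().fit(X_toy[labeled_idx], y_toy[labeled_idx])
probs = model.predict_proba(X_toy[unlabeled_idx])
uncertainty = 1 - probs.max(axis=1)  # least-confident predictions first
query_idx = unlabeled_idx[np.argsort(-uncertainty)[:10]]
print("Next 10 examples to send to annotators:", query_idx)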

[Figure: ActiveLab overview (ActiveLab thumb.webp)]

In this notebook, I consider a binary text classification task: predicting whether a particular phrase is polite or impolite.

When it comes to collecting additional annotations for a Transformer model, active learning with ActiveLab is far better than random selection. It consistently produces better models, with roughly 50% lower error rates, regardless of the total labeling budget.

The rest of this notebook walks through the open-source code you can use to achieve these results.

Setting up the Environment

!pip install datasets==2.20.0 transformers==4.25.1 scikit-learn==1.1.2 matplotlib==3.5.3 cleanlab
import pandas as pd

pd.set_option("max_colwidth", None)
import numpy as np
import random
import transformers
import datasets
import matplotlib.pyplot as plt

from cleanlab.multiannotator import (
    get_majority_vote_label,
    get_active_learning_scores,
    get_label_quality_multiannotator,
)
from transformers import AutoTokenizer, AutoModel
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import load_dataset, Dataset, DatasetDict, ClassLabel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from scipy.special import softmax
from datetime import datetime

Collecting and Curating the Data

Here we download the data needed for this notebook.

labeled_data_file = {"labeled": "X_labeled_full.csv"}
unlabeled_data_file = {"unlabeled": "X_unlabeled.csv"}
test_data_file = {"test": "test.csv"}

X_labeled_full = load_dataset("Cleanlab/stanford-politeness", split="labeled", data_files=labeled_data_file)
X_unlabeled = load_dataset("Cleanlab/stanford-politeness", split="unlabeled", data_files=unlabeled_data_file)
test = load_dataset("Cleanlab/stanford-politeness", split="test", data_files=test_data_file)

!wget -nc -O 'extra_annotations.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/extra_annotations.npy?download=true'

extra_annotations = np.load("extra_annotations.npy",allow_pickle=True).item()
X_labeled_full = X_labeled_full.to_pandas()
X_labeled_full.set_index("id", inplace=True)
X_unlabeled = X_unlabeled.to_pandas()
X_unlabeled.set_index("id", inplace=True)
test = test.to_pandas()

Classifying the Politeness of Text

We use the Stanford Politeness Corpus as our dataset.

It is structured as a binary text classification task: classifying whether each phrase is polite or impolite. Human annotators are given a selected text phrase and provide an (imperfect) annotation of its politeness: **0** for impolite and **1** for polite.

We train a Transformer classifier on the annotated data and measure model accuracy on a held-out set of test examples whose ground-truth labels I am confident in, because they stem from a consensus among 5 annotators who each labeled these examples.

As for the training data, we have:

  • X_labeled_full: our initial training set, with just 100 text examples that each have 2 annotations.
  • X_unlabeled: a large set of 1,900 unlabeled text examples that we can consider having annotators label.
  • extra_annotations: the pool of extra annotations we draw from whenever an annotation is requested for an example.

Visualizing the Data

# Multi-annotated Data
X_labeled_full.head()
# Unlabeled Data
X_unlabeled.head()
# extra_annotations contains the annotations that we will use when an additional annotation is requested.
extra_annotations

# Random sample of extra_annotations to see format.
{k: extra_annotations[k] for k in random.sample(list(extra_annotations.keys()), 5)}
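For orientation, extra_annotations maps each text example's id to a dictionary of annotator ids and their politeness labels (0 or 1). The ids below are invented purely to illustrate the assumed structure and do not come from the actual dataset:

# Hypothetical illustration of the extra_annotations structure (ids are made up):
# each key is a text example id, each value maps an annotator id to a 0/1 politeness label.
illustrative_format = {
    1504: {"A0053": 1, "A0070": 0},  # two spare annotations available for example 1504
    3492: {"A0016": 1},              # one spare annotation available for example 3492
}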

Viewing some examples from the test set

>>> num_to_label = {0: "Impolite", 1: "Polite"}
>>> for i in range(2):
...     print(f"{num_to_label[i]} examples:")
...     subset = test[test.label == i][["text"]].sample(n=3, random_state=2)
...     print(subset)
Impolite examples:

text
120  And it's a waste of our time. I can only say it once more: why don't you do something constructive and add some content about your beloved Macedonia?
150  Instead of telling me how wrong my decisions to close certain afds were, why not spend your time working on the current afd backlog <url>? If my decisions were so wrong, why haven't you reopened them?
326  This should have been moved to <url> per the CFD. Why hasn't it been moved?

Polite examples:

text
498  Hi, I have raised the possibility of unprotecting the theophylline page at <url>. What are your thoughts?
132  The page alignment has changed because of some edits. Could you help?
131  I'm glad you are happy with the overall look. Before I label all the streets, are the text size, font style, etc. suitable?

Helper Methods

The following section contains all the helper methods needed for this notebook.

get_idx_to_label is used in the active learning setting, particularly when working with a mix of labeled and unlabeled data. Its main goal is to determine which examples (from both the labeled and unlabeled datasets) should be selected for additional annotation, based on their active learning scores.

# Helper method to get indices of examples with the lowest active learning score to collect more labels for.
def get_idx_to_label(
    X_labeled_full,
    X_unlabeled,
    extra_annotations,
    batch_size_to_label,
    active_learning_scores,
    active_learning_scores_unlabeled=None,
):
    if active_learning_scores_unlabeled is None:
        active_learning_scores_unlabeled = np.array([])

    to_label_idx = []
    to_label_idx_unlabeled = []

    num_labeled = len(active_learning_scores)
    active_learning_scores_combined = np.concatenate((active_learning_scores, active_learning_scores_unlabeled))
    to_label_idx_combined = np.argsort(active_learning_scores_combined)

    # We want to collect the n=batch_size best examples to collect another annotation for.
    i = 0
    while (len(to_label_idx) + len(to_label_idx_unlabeled)) < batch_size_to_label:
        idx = to_label_idx_combined[i]
        # We know this is an already annotated example.
        if idx < num_labeled:
            text_id = X_labeled_full.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx.append(idx)
        # We know this is an example that is currently not annotated.
        else:
            # Subtract off offset to get back original index.
            idx -= num_labeled
            text_id = X_unlabeled.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx_unlabeled.append(idx)
        i += 1

    to_label_idx = np.array(to_label_idx)
    to_label_idx_unlabeled = np.array(to_label_idx_unlabeled)
    return to_label_idx, to_label_idx_unlabeled

get_idx_to_label_random is used in an active learning setting where the data points to annotate are selected randomly rather than based on model uncertainty or learning scores. This approach serves as a baseline to compare against more sophisticated active learning strategies, or can be used when there is no clear way to score the examples.

# Helper method to get indices of random examples to collect more labels for.
def get_idx_to_label_random(X_labeled_full, X_unlabeled, extra_annotations, batch_size_to_label):
    to_label_idx = []
    to_label_idx_unlabeled = []

    # Generate list of indices for both sets of examples.
    labeled_idx = [(x, "labeled") for x in range(len(X_labeled_full))]
    unlabeled_idx = []
    if X_unlabeled is not None:
        unlabeled_idx = [(x, "unlabeled") for x in range(len(X_unlabeled))]
    combined_idx = labeled_idx + unlabeled_idx

    # We want to collect the n=batch_size random examples to collect another annotation for.
    while (len(to_label_idx) + len(to_label_idx_unlabeled)) < batch_size_to_label:
        # Random choice from indices.
        # We time-seed to ensure randomness.
        random.seed(datetime.now().timestamp())
        choice = random.choice(combined_idx)
        idx, which_subset = choice
        # We know this is an already annotated example.
        if which_subset == "labeled":
            text_id = X_labeled_full.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx.append(idx)
            combined_idx.remove(choice)
        # We know this is an example that is currently not annotated.
        else:
            text_id = X_unlabeled.iloc[idx].name
            # Make sure we have an annotation left to collect.
            if text_id in extra_annotations and extra_annotations[text_id]:
                to_label_idx_unlabeled.append(idx)
            combined_idx.remove(choice)

    to_label_idx = np.array(to_label_idx)
    to_label_idx_unlabeled = np.array(to_label_idx_unlabeled)
    return to_label_idx, to_label_idx_unlabeled

Below are some utility methods that help us compute standard deviations, choose a specific annotator who has already annotated examples before, and tokenize the text examples.

# Helper method to compute std dev across 2D array of accuracies.
def compute_std_dev(accuracy):
    def compute_std_dev_ind(accs):
        mean = np.mean(accs)
        std_dev = np.std(accs)
        return np.array([mean - std_dev, mean + std_dev])

    std_dev = np.apply_along_axis(compute_std_dev_ind, 0, accuracy)
    return std_dev


# Helper method to select which annotator we should collect another annotation from.
def choose_existing(annotators, existing_annotators):
    for annotator in annotators:
        # If we find one that has already given an annotation, we return it.
        if annotator in existing_annotators:
            return annotator
    # If we don't find an existing, just return a random one.
    choice = random.choice(list(annotators.keys()))
    return choice


# Helper method for Trainer.
def compute_metrics(p):
    logits, labels = p
    pred = np.argmax(logits, axis=1)
    pred_probs = softmax(logits, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    return {"logits": logits, "pred_probs": pred_probs, "accuracy": accuracy}


# Helper method to tokenize text.
def tokenize_function(examples):
    model_name = "distilbert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer(examples["text"], padding="max_length", truncation=True)


# Helper method to tokenize given dataset.
def tokenize_data(data):
    dataset = Dataset.from_dict({"label": data["label"], "text": data["text"].values})
    tokenized_dataset = dataset.map(tokenize_function, batched=True)
    tokenized_dataset = tokenized_dataset.cast_column("label", ClassLabel(names=["0", "1"]))
    return tokenized_dataset

The get_trainer function here sets up a training environment for the text classification task using DistilBERT (a distilled version of BERT that is lighter and faster).

# Helper method to initiate a new Trainer with given train and test sets.
def get_trainer(train_set, test_set):

    # Model params.
    model_name = "distilbert-base-uncased"
    model_folder = "model_training"
    max_training_steps = 300
    num_classes = 2

    # Set training args.
    # We time-seed to ensure randomness between different benchmarking runs.
    training_args = TrainingArguments(
        max_steps=max_training_steps, output_dir=model_folder, seed=int(datetime.now().timestamp())
    )

    # Tokenize train/test set.
    train_tokenized_dataset = tokenize_data(train_set)
    test_tokenized_dataset = tokenize_data(test_set)

    # Initiate a pre-trained model.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_tokenized_dataset,
        eval_dataset=test_tokenized_dataset,
    )
    return trainer

The get_pred_probs function computes out-of-sample predicted probabilities for a given dataset using cross-validation, with extra handling for the unlabeled data.

# Helper method to manually compute cross-validated predicted probabilities needed for ActiveLab.
def get_pred_probs(X, X_unlabeled):
    """Uses cross-validation to obtain out-of-sample predicted probabilities
    for given dataset"""

    # Generate cross-val splits.
    n_splits = 3
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True)
    skf_splits = [[train_index, test_index] for train_index, test_index in skf.split(X=X["text"], y=X["label"])]

    # Initiate empty array to store pred_probs.
    num_examples, num_classes = len(X), len(X.label.value_counts())
    pred_probs = np.full((num_examples, num_classes), np.NaN)
    pred_probs_unlabeled = None

    # If we use up all examples from the initial unlabeled pool, X_unlabeled will be None.
    if X_unlabeled is not None:
        pred_probs_unlabeled = np.full((n_splits, len(X_unlabeled), num_classes), np.NaN)

    # Iterate through cross-validation folds.
    for split_num, split in enumerate(skf_splits):
        train_index, test_index = split

        train_set = X.iloc[train_index]
        test_set = X.iloc[test_index]

        # Get trainer with train/test subsets.
        trainer = get_trainer(train_set, test_set)
        trainer.train()
        eval_metrics = trainer.evaluate()

        # Get pred_probs and insert into dataframe.
        pred_probs_fold = eval_metrics["eval_pred_probs"]
        pred_probs[test_index] = pred_probs_fold

        # Since we don't have labels for the unlabeled pool, we compute pred_probs at each round of CV
        # and then average the results at the end.
        if X_unlabeled is not None:
            dataset_unlabeled = Dataset.from_dict({"text": X_unlabeled["text"].values})
            unlabeled_tokenized_dataset = dataset_unlabeled.map(tokenize_function, batched=True)
            logits = trainer.predict(unlabeled_tokenized_dataset).predictions
            curr_pred_probs_unlabeled = softmax(logits, axis=1)
            pred_probs_unlabeled[split_num] = curr_pred_probs_unlabeled

    # Here we average the pred_probs from each round of CV to get pred_probs for the unlabeled pool.
    if X_unlabeled is not None:
        pred_probs_unlabeled = np.mean(np.array(pred_probs_unlabeled), axis=0)

    return pred_probs, pred_probs_unlabeled

The get_annotator function determines which annotator is best to collect a new annotation from according to a set of criteria, while get_annotation collects the actual annotation for a given example from the chosen annotator; it also removes the collected annotation from the pool so it cannot be chosen again.

# Helper method to determine which annotator to collect annotation from for given example.
def get_annotator(example_id):
    # Update who has already annotated at least one example.
    existing_annotators = set(X_labeled_full.drop("text", axis=1).columns)
    # Returns the annotator we want to collect annotation from.
    # Chooses existing annotators first.
    annotators = extra_annotations[example_id]
    chosen_annotator = choose_existing(annotators, existing_annotators)
    return chosen_annotator


# Helper method to collect an annotation for given text example.
def get_annotation(example_id, chosen_annotator):

    # Collect new annotation.
    new_annotation = extra_annotations[example_id][chosen_annotator]

    # Remove annotation.
    del extra_annotations[example_id][chosen_annotator]

    return new_annotation

Run the following cell to hide the HTML output of the next model training block.

%%html
<style>
    div.output_stderr {
    display: none;
    }
</style>

Methodology

For each round of **active learning**, we (a condensed code sketch of one round follows this list):

  1. Compute ActiveLab consensus labels for each training example, derived from all annotations collected thus far.
  2. Train our Transformer classification model on the current training set, using these consensus labels.
  3. Evaluate test accuracy on the test set (which has high-quality ground-truth labels).
  4. Run cross-validation to get out-of-sample predicted class probabilities from the model for the entire training set and the unlabeled set.
  5. Get ActiveLab active learning scores for every example in the training set and the unlabeled set. These scores estimate how informative it would be to collect another annotation for each example.
  6. Select a subset (n = batch_size) of examples with the lowest active learning scores.
  7. Collect one extra annotation for each of the n selected examples.
  8. Add the new annotations (and the previously unlabeled examples, if they were selected) to our training set for the next iteration.
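To tie the steps together, here is a condensed sketch of a single round assembled from the helper methods defined above; it leaves out the unlabeled-pool bookkeeping and the ActiveLab-consensus branch that the full training loop later in the notebook handles, so treat it as an outline rather than the exact code that is run.

# Condensed sketch of a single round, assembled from the helper methods defined above.
# The full loop later in the notebook additionally handles the unlabeled-pool bookkeeping
# and switches between majority-vote and ActiveLab consensus.
batch_size_to_label = 25  # how many new annotations to collect this round

multiannotator_labels = X_labeled_full.drop(["text"], axis=1)

# Steps 1-3: consensus labels (majority vote shown here), training, and test-set evaluation.
consensus_labels = get_majority_vote_label(multiannotator_labels)
train_set = X_labeled_full[["text"]].copy()
train_set["label"] = consensus_labels
trainer = get_trainer(train_set, test[["text", "label"]])
trainer.train()
accuracy = trainer.evaluate()["eval_accuracy"]

# Steps 4-5: out-of-sample predicted probabilities via cross-validation, then ActiveLab scores.
pred_probs, pred_probs_unlabeled = get_pred_probs(train_set, X_unlabeled)
scores, scores_unlabeled = get_active_learning_scores(multiannotator_labels, pred_probs, pred_probs_unlabeled)

# Steps 6-8: pick the lowest-scoring batch and collect one extra annotation for each chosen example.
chosen_labeled, chosen_unlabeled = get_idx_to_label(
    X_labeled_full, X_unlabeled, extra_annotations, batch_size_to_label, scores, scores_unlabeled
)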

I then compare models trained on data labeled via active learning against models trained on data labeled via **random selection**. For each round of random selection, I use majority-vote consensus (in step 1) instead of ActiveLab consensus, and then simply choose **n** examples at random to collect extra labels for (in step 6), rather than using ActiveLab scores.

More intuition about the ActiveLab consensus labels and active learning scores is shared later in this notebook.

[Figure: ActiveLab workflow (activelab.png)]

Model Training and Evaluation

I first tokenize my train and test sets and then initialize a pretrained DistilBert Transformer model. Fine-tuning DistilBert for 300 training steps achieved a good balance between accuracy and training time on my data. This classifier outputs predicted class probabilities, which I convert into class predictions before evaluating their accuracy.
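As a minimal illustration of that last step (the same conversion the compute_metrics helper above performs inside the Trainer), with made-up values:

import numpy as np
from sklearn.metrics import accuracy_score

# Made-up predicted probabilities for 3 test examples (columns: impolite, polite).
toy_pred_probs = np.array([[0.8, 0.2], [0.3, 0.7], [0.45, 0.55]])
toy_labels = np.array([0, 1, 0])

toy_preds = np.argmax(toy_pred_probs, axis=1)  # probabilities -> class predictions
print(accuracy_score(toy_labels, toy_preds))   # 2 of 3 correct -> 0.666...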

Using Active Learning Scores to Decide What to Annotate Next

In each round of active learning, we fit our Transformer model via 3-fold cross-validation on the current training set. This lets us obtain out-of-sample predicted class probabilities for every example in the training set, and we can also use the trained Transformers to get out-of-sample predicted class probabilities for every example in the unlabeled pool. All of this is done internally by the get_pred_probs helper method. Using out-of-sample predictions helps us avoid bias due to potential overfitting.

Once I have these probabilistic predictions, I pass them into the get_active_learning_scores method from the open-source cleanlab package, which implements the ActiveLab algorithm. This method provides scores for all of our labeled and unlabeled data. Lower scores indicate data points for which collecting one additional label should be most informative to the current model (the scores are directly comparable between labeled and unlabeled data).
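Taken in isolation, the key call looks like the excerpt below (the actual call happens inside the training loop later on; the shapes noted in the comments are my assumptions about this dataset):

# Excerpted from the training loop below, shown here only to illustrate the inputs:
# multiannotator_labels: DataFrame with one row per labeled example and one column per annotator (NaN where missing)
# pred_probs:            (num_labeled_examples, 2) out-of-sample probabilities from cross-validation
# pred_probs_unlabeled:  (num_unlabeled_examples, 2) probabilities for the unlabeled pool
active_learning_scores, active_learning_scores_unlabeled = get_active_learning_scores(
    multiannotator_labels, pred_probs, pred_probs_unlabeled
)
# Lower score = collecting one more annotation for that example should help the current model most.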

I form a batch of the lowest-scoring examples to collect annotations for (via the get_idx_to_label method). Here I always collect exactly the same number of annotations in each round (for both the active learning and random selection approaches). For this application, I also cap the maximum number of annotations per example at 5 (we don't want to keep spending effort on the same example over and over).

Adding New Annotations

combined_example_ids contains the ids of the text examples we want to collect an annotation for. For each of these, we use the get_annotation helper method to collect a new annotation from an annotator. Here we prioritize annotations from annotators who have already annotated another example. If none of the given example's annotators exist in the training set, we choose one at random; in that case, we add a new column to the training set representing the new annotator. Finally, we add the newly collected annotation to the training set. If the corresponding example was previously unlabeled, we also add it to the training set and remove it from the unlabeled pool.

We have now completed one round of collecting new annotations and retraining the Transformer model on the updated training set. We repeat this process for multiple rounds to keep expanding the training dataset and improving the model.

# For this Active Learning demo, we add 25 additional annotations to the training set
# each iteration, for 25 rounds.
num_rounds = 25
batch_size_to_label = 25
model_accuracy_arr = np.full(num_rounds, np.nan)

# The 'selection_method' variable determines if we use ActiveLab or random selection
# to choose the new annotations each round.
selection_method = "random"
# selection_method = 'active_learning'

# Each round we:
# - train our model
# - evaluate on unchanging test set
# - collect and add new annotations to training set
for i in range(num_rounds):

    # X_labeled_full is updated each iteration. We drop the text column which leaves us with just the annotations.
    multiannotator_labels = X_labeled_full.drop(["text"], axis=1)

    # Use majority vote when using random selection to select the consensus label for each example.
    if i == 0 or selection_method == "random":
        consensus_labels = get_majority_vote_label(multiannotator_labels)

    # When using ActiveLab, use cleanlab's CrowdLab to select the consensus label for each example.
    else:
        results = get_label_quality_multiannotator(
            multiannotator_labels,
            pred_probs_labeled,
            calibrate_probs=True,
        )
        consensus_labels = results["label_quality"]["consensus_label"].values

    # We only need the text and label columns.
    train_set = X_labeled_full[["text"]]
    train_set["label"] = consensus_labels
    test_set = test[["text", "label"]]

    # Train our Transformer model on the full set of labeled data to evaluate model accuracy for the current round.
    # This is an optional step for demonstration purposes, in practical applications
    # you may not have ground truth labels.
    trainer = get_trainer(train_set, test_set)
    trainer.train()
    eval_metrics = trainer.evaluate()
    # set statistics
    model_accuracy_arr[i] = eval_metrics["eval_accuracy"]

    # For ActiveLab, we need to run cross-validation to get out-of-sample predicted probabilities.
    if selection_method == "active_learning":
        pred_probs, pred_probs_unlabeled = get_pred_probs(train_set, X_unlabeled)

        # Compute active learning scores.
        active_learning_scores, active_learning_scores_unlabeled = get_active_learning_scores(
            multiannotator_labels, pred_probs, pred_probs_unlabeled
        )

        # Get the indices of examples to collect more labels for.
        chosen_examples_labeled, chosen_examples_unlabeled = get_idx_to_label(
            X_labeled_full,
            X_unlabeled,
            extra_annotations,
            batch_size_to_label,
            active_learning_scores,
            active_learning_scores_unlabeled,
        )

    # We don't need to run cross-validation, just get random examples to collect annotations for.
    if selection_method == "random":
        chosen_examples_labeled, chosen_examples_unlabeled = get_idx_to_label_random(
            X_labeled_full, X_unlabeled, extra_annotations, batch_size_to_label
        )

    unlabeled_example_ids = np.array([])
    # Check to see if we still have unlabeled examples left.
    if X_unlabeled is not None:
        # Get unlabeled text examples we want to collect annotations for.
        new_text = X_unlabeled.iloc[chosen_examples_unlabeled]
        unlabeled_example_ids = new_text.index.values
        num_ex, num_annot = len(new_text), multiannotator_labels.shape[1]
        empty_annot = pd.DataFrame(
            data=np.full((num_ex, num_annot), np.NaN),
            columns=multiannotator_labels.columns,
            index=unlabeled_example_ids,
        )
        new_unlabeled_df = pd.concat([new_text, empty_annot], axis=1)

        # Combine unlabeled text examples with existing, labeled examples.
        X_labeled_full = pd.concat([X_labeled_full, new_unlabeled_df], axis=0)

        # Remove examples from X_unlabeled and check if empty.
        # Once it is empty we set it to None to handle appropriately elsewhere.
        X_unlabeled = X_unlabeled.drop(new_text.index)
        if X_unlabeled.empty:
            X_unlabeled = None

    if selection_method == "active_learning":
        # Update pred_prob arrays with newly added examples if necessary.
        if pred_probs_unlabeled is not None and len(chosen_examples_unlabeled) != 0:
            pred_probs_new = pred_probs_unlabeled[chosen_examples_unlabeled, :]
            pred_probs_labeled = np.concatenate((pred_probs, pred_probs_new))
            pred_probs_unlabeled = np.delete(pred_probs_unlabeled, chosen_examples_unlabeled, axis=0)
        # Otherwise we have nothing to modify.
        else:
            pred_probs_labeled = pred_probs

    # Get combined list of text ID's to relabel.
    labeled_example_ids = X_labeled_full.iloc[chosen_examples_labeled].index.values
    combined_example_ids = np.concatenate([labeled_example_ids, unlabeled_example_ids])

    # Now we collect annotations for the selected examples.
    for example_id in combined_example_ids:
        # Choose which annotator to collect annotation from.
        chosen_annotator = get_annotator(example_id)
        # Collect new annotation.
        new_annotation = get_annotation(example_id, chosen_annotator)
        # New annotator has been selected.
        if chosen_annotator not in X_labeled_full.columns.values:
            empty_col = np.full((len(X_labeled_full),), np.nan)
            X_labeled_full[chosen_annotator] = empty_col

        # Add selected annotation to the training set.
        X_labeled_full.at[example_id, chosen_annotator] = new_annotation

Results

I ran 25 rounds of active learning (annotating a batch of data and retraining the Transformer model), collecting 25 annotations per round. I then repeated all of this, next using random selection to decide which examples to annotate in each round, as a baseline for comparison. Both approaches start from the same initial training set of 100 examples before any additional data is annotated (and hence achieve roughly the same Transformer accuracy in the first round). Because of the inherent stochasticity in training Transformers, I ran the whole process five times (for each data-labeling strategy) and report the standard deviation (shaded region) and mean (solid line) of test accuracies across the five replicate runs.

# Get numpy array of results.
!wget -nc -O 'activelearn_acc.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/activelearn_acc.npy'
!wget -nc -O 'random_acc.npy' 'https://huggingface.co/datasets/Cleanlab/stanford-politeness/resolve/main/random_acc.npy'
# Helper method to compute std dev across 2D array of accuracies.
def compute_std_dev(accuracy):
    def compute_std_dev_ind(accs):
        mean = np.mean(accs)
        std_dev = np.std(accs)
        return np.array([mean - std_dev, mean + std_dev])

    std_dev = np.apply_along_axis(compute_std_dev_ind, 0, accuracy)
    return std_dev
>>> al_acc = np.load("activelearn_acc.npy")
>>> rand_acc = np.load("random_acc.npy")

>>> rand_acc_std = compute_std_dev(rand_acc)
>>> al_acc_std = compute_std_dev(al_acc)

>>> plt.plot(range(1, al_acc.shape[1] + 1), np.mean(al_acc, axis=0), label="active learning", color="green")
>>> plt.fill_between(range(1, al_acc.shape[1] + 1), al_acc_std[0], al_acc_std[1], alpha=0.3, color="green")

>>> plt.plot(range(1, rand_acc.shape[1] + 1), np.mean(rand_acc, axis=0), label="random", color="red")
>>> plt.fill_between(range(1, rand_acc.shape[1] + 1), rand_acc_std[0], rand_acc_std[1], alpha=0.1, color="red")

>>> plt.hlines(y=0.9, xmin=1.0, xmax=25.0, color="black", linestyle="dotted")
>>> plt.legend()
>>> plt.xlabel("Round Number")
>>> plt.ylabel("Test Accuracy")
>>> plt.title("ActiveLab vs Random Annotation Selection --- 5 Runs")
>>> plt.savefig("al-results.png")
>>> plt.show()

We see that choosing which data to annotate next has a big impact on model performance. Active learning with ActiveLab consistently outperforms random selection by a significant margin in every round. For example, in round 4, with 275 total annotations in the training set, we obtain 91% accuracy via active learning versus only 76% accuracy without a clever strategy for choosing what to annotate. Overall, Transformer models fit to datasets constructed via active learning have roughly 50% lower error rates, regardless of the total labeling budget!

When annotating data for text classification, you should consider active learning with the option to re-label examples, in order to better account for imperfect annotators.
