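Fine-tuning a masked language model

For many NLP applications involving Transformer models, you can simply take a pretrained model from the Hugging Face Hub and fine-tune it directly on your data for the task at hand. Provided that the corpus used for pretraining is not too different from the corpus used for fine-tuning, transfer learning will usually produce good results.

However, there are a few cases where you'll want to first fine-tune the language model on your data, before training a task-specific head. For example, if your dataset contains legal contracts or scientific articles, a vanilla Transformer model like BERT will typically treat the domain-specific words in your corpus as rare tokens, and the resulting performance may be less than satisfactory. By fine-tuning the language model on in-domain data you can boost the performance of many downstream tasks, which means you usually only have to do this step once!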

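This process of fine-tuning a pretrained language model on in-domain data is usually called domain adaptation. It was popularized in 2018 by ULMFiT, one of the first neural architectures (based on LSTMs) to make transfer learning really work for NLP. An example of domain adaptation with ULMFiT is shown in the image below; in this section we'll do something similar, but with a Transformer instead of an LSTM!

By the end of this section you'll have a masked language model on the Hub that can autocomplete sentences as shown below:

Let's dive in!

🙋 If the terms "masked language modeling" and "pretrained model" sound unfamiliar to you, go check out Chapter 1, where we explain all these core concepts, complete with videos!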

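Picking a pretrained model for masked language modeling

To get started, let's pick a suitable pretrained model for masked language modeling. As shown in the following screenshot, you can find a list of candidates by applying the "Fill-Mask" filter on the Hugging Face Hub:

Although the BERT and RoBERTa family of models are the most downloaded, we'll use a model called DistilBERT that can be trained much faster with little to no loss in downstream performance. This model was trained using a special technique called knowledge distillation, where a large "teacher model" like BERT is used to guide the training of a "student model" that has far fewer parameters. An explanation of the details of knowledge distillation would take us too far afield in this section, but if you're interested you can read all about it in Natural Language Processing with Transformers (colloquially known as the Transformers Textbook).

Let's go ahead and download DistilBERT using the AutoModelForMaskedLM class: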

from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

We can see how many parameters this model has by calling the num_parameters() method:

distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

'>>> DistilBERT number of parameters: 67M'
'>>> BERT number of parameters: 110M'

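With around 67 million parameters, DistilBERT is a little more than half the size of the BERT base model, which translates into a substantial speedup in training. Nice! Let's now see what kinds of tokens this model predicts are the most likely completions of a small sample of text: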

text = "This is a great [MASK]."

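As humans, we can imagine many possibilities for the [MASK] token, such as "day", "ride", or "painting". For pretrained models, the predictions depend on the corpus the model was trained on, since it learns to pick up the statistical patterns present in the data. Like BERT, DistilBERT was pretrained on the English Wikipedia and BookCorpus datasets, so we expect the predictions for [MASK] to reflect these domains. To predict the mask we need DistilBERT's tokenizer to produce the inputs for the model, so let's download that from the Hub as well: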

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

With a tokenizer and a model, we can now pass our text example to the model, extract the logits, and print out the top 5 candidates:

import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'

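We can see from the outputs that the model's predictions refer to everyday terms, which is perhaps not surprising given the foundation of English Wikipedia. Let's see how we can change this domain to something a bit more niche: highly polarized movie reviews!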
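As a side note on the top-5 step above, the following minimal sketch shows how raw logits relate to probabilities and how a top-k selection works. It uses only the standard library; the tiny vocabulary and the logit values are made up for illustration and do not come from DistilBERT.

```python
import math

# Hypothetical logits for the [MASK] position over a tiny made-up vocabulary
toy_logits = {"deal": 5.1, "success": 4.7, "idea": 4.2, "banana": -1.3, "the": 0.4}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits.values())  # subtract the max for numerical stability
    exps = {token: math.exp(score - largest) for token, score in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


def top_k(logits, k):
    """Return the k tokens with the highest logits, best first."""
    return sorted(logits, key=logits.get, reverse=True)[:k]


probabilities = softmax(toy_logits)
for token in top_k(toy_logits, 3):
    print(f"{token}: {probabilities[token]:.3f}")
```

Because softmax preserves the ordering of the scores, ranking by logits (as torch.topk does in the snippet above) gives the same candidates as ranking by probability.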

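The dataset

To showcase domain adaptation, we'll use the famous Large Movie Review Dataset (or IMDb for short), a corpus of movie reviews that is often used to benchmark sentiment analysis models. By fine-tuning DistilBERT on this corpus, we expect the language model to adapt its vocabulary from the factual data of Wikipedia that it was pretrained on to the more subjective elements of movie reviews. We can get the data from the Hugging Face Hub with the load_dataset() function from 🤗 Datasets: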

from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

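We can see that the train and test splits each consist of 25,000 reviews, while there is an unlabeled split called unsupervised that contains 50,000 reviews. Let's take a look at a few samples to get an idea of what kind of text we're dealing with. As we've done in previous chapters of the course, we'll chain the Dataset.shuffle() and Dataset.select() functions to create a random sample: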

sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Review: This is your typical Priyadarshan movie--a bunch of loony characters out on some silly mission. His signature climax has the entire cast of the film coming together and fighting each other in some crazy moshpit over hidden money. Whether it is a winning lottery ticket in Malamaal Weekly, black money in Hera Pheri, "kodokoo" in Phir Hera Pheri, etc., etc., the director is becoming ridiculously predictable. Don\'t get me wrong; as clichéd and preposterous his movies may be, I usually end up enjoying the comedy. However, in most his previous movies there has actually been some good humor, (Hungama and Hera Pheri being noteworthy ones). Now, the hilarity of his films is fading as he is using the same formula over and over again.<br /><br />Songs are good. Tanushree Datta looks awesome. Rajpal Yadav is irritating, and Tusshar is not a whole lot better. Kunal Khemu is OK, and Sharman Joshi is the best.'
'>>> Label: 0'

'>>> Review: Okay, the story makes no sense, the characters lack any dimensionally, the best dialogue is ad-libs about the low quality of movie, the cinematography is dismal, and only editing saves a bit of the muddle, but Sam" Peckinpah directed the film. Somehow, his direction is not enough. For those who appreciate Peckinpah and his great work, this movie is a disappointment. Even a great cast cannot redeem the time the viewer wastes with this minimal effort.<br /><br />The proper response to the movie is the contempt that the director San Peckinpah, James Caan, Robert Duvall, Burt Young, Bo Hopkins, Arthur Hill, and even Gig Young bring to their work. Watch the great Peckinpah films. Skip this mess.'
'>>> Label: 0'

'>>> Review: I saw this movie at the theaters when I was about 6 or 7 years old. I loved it then, and have recently come to own a VHS version. <br /><br />My 4 and 6 year old children love this movie and have been asking again and again to watch it. <br /><br />I have enjoyed watching it again too. Though I have to admit it is not as good on a little TV.<br /><br />I do not have older children so I do not know what they would think of it. <br /><br />The songs are very cute. My daughter keeps singing them over and over.<br /><br />Hope this helps.'
'>>> Label: 1'

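Yep, these are certainly movie reviews, and if you're old enough you may even understand the comment in the last review about owning a VHS version 😜! Although we won't need the labels for language modeling, we can already see that a 0 denotes a negative review, while a 1 corresponds to a positive one.

✏️ Try it out! Create a random sample of the unsupervised split and verify that the labels are neither 0 nor 1. While you're at it, you could also check that the labels in the train and test splits are indeed 0 or 1. This is a useful sanity check that every NLP practitioner should perform at the start of a new project!

Now that we've had a quick look at the data, let's dive into preparing it for masked language modeling. As we'll see, there are some additional steps that one needs to take compared to the sequence classification tasks we saw in Chapter 3. Let's go!

Preprocessing the data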

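For both auto-regressive and masked language modeling, a common preprocessing step is to concatenate all the examples and then split the whole corpus into chunks of equal size. This is quite different from our usual approach, where we simply tokenize individual examples. Why concatenate everything together? The reason is that individual examples might get truncated if they're too long, and that would result in losing information that might be useful for the language modeling task!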
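To make the trade-off concrete, here is a small sketch that counts how many tokens survive under each approach. The review lengths and the maximum length are illustrative values chosen for this example, and the helper names are hypothetical.

```python
# Illustrative review lengths (in tokens) and an illustrative maximum length
review_lengths = [200, 559, 192]
max_length = 128


def tokens_kept_with_truncation(lengths, max_length):
    """Each example is cut independently at max_length tokens."""
    return sum(min(length, max_length) for length in lengths)


def tokens_kept_with_chunking(lengths, chunk_size):
    """Examples are concatenated and split into full chunks; only the
    final partial chunk is discarded."""
    total = sum(lengths)
    return (total // chunk_size) * chunk_size


total_tokens = sum(review_lengths)
kept_truncation = tokens_kept_with_truncation(review_lengths, max_length)
kept_chunking = tokens_kept_with_chunking(review_lengths, max_length)
print(f"Total tokens: {total_tokens}")
print(f"Kept with truncation: {kept_truncation}")
print(f"Kept with chunking: {kept_chunking}")
```

With these numbers, truncation keeps well under half of the tokens, while chunking loses only the final partial chunk.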

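So to get started, we'll first tokenize our corpus as usual, but without setting the truncation=True option in our tokenizer. We'll also grab the word IDs if they are available (which they will be if we're using a fast tokenizer, as described in Chapter 6), as we will need them later on to do whole word masking. We'll wrap this in a simple function, and while we're at it we'll remove the text and label columns since we don't need them any longer: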

def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['attention_mask', 'input_ids', 'word_ids'],
        num_rows: 50000
    })
})

Since DistilBERT is a BERT-like model, we can see that the encoded texts consist of the input_ids and attention_mask that we've seen in other chapters, as well as the word_ids we added.
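If the role of word_ids is not obvious, the sketch below may help. The tokens and word IDs are written by hand for illustration (they were not produced by running the tokenizer); the point is only that subword pieces of one word share a word ID, and that special tokens map to None.

```python
import collections

# Hand-written example, not actual tokenizer output
tokens = ["[CLS]", "bro", "##mwell", "high", "is", "far", "fetched", "[SEP]"]
word_ids = [None, 0, 0, 1, 2, 3, 4, None]

# Group token positions by the word they belong to; special tokens have
# a word ID of None and are skipped
word_to_token_positions = collections.defaultdict(list)
for position, word_id in enumerate(word_ids):
    if word_id is not None:
        word_to_token_positions[word_id].append(position)

for word_id, positions in word_to_token_positions.items():
    pieces = [tokens[position] for position in positions]
    print(f"word {word_id}: positions {positions} -> {pieces}")
```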

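Now that we've tokenized our movie reviews, the next step is to group them all together and split the result into chunks. But how big should these chunks be? This will ultimately be determined by the amount of GPU memory that you have available, but a good starting point is to see what the model's maximum context size is. This can be inferred by inspecting the model_max_length attribute of the tokenizer: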

tokenizer.model_max_length

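This value is derived from the tokenizer_config.json file associated with a checkpoint; in this case we can see that the context size is 512 tokens, just like with BERT.

✏️ Try it out! Some Transformer models, like BigBird and Longformer, have a much longer context length than BERT and other early Transformer models. Instantiate the tokenizer for one of these checkpoints and verify that the model_max_length agrees with what's quoted on its model card.

So, in order to run our experiments on GPUs like those found on Google Colab, we'll pick a somewhat smaller chunk size that can fit in memory: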

chunk_size = 128

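Note that using a small chunk size can be detrimental in real-world scenarios, so you should use a size that corresponds to the use case you will apply your model to.

Now comes the fun part. To show how the concatenation works, let's take a few reviews from our tokenized training set and print out the number of tokens per review: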

# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

'>>> Review 0 length: 200'
'>>> Review 1 length: 559'
'>>> Review 2 length: 192'

We can then concatenate all these examples with a simple dictionary comprehension, as follows:

concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

'>>> Concatenated reviews length: 951'

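Great, the total length checks out, so let's now split the concatenated reviews into chunks of the size given by chunk_size. To do so, we iterate over the features in concatenated_examples and use a list comprehension to create slices of each feature. The result is a dictionary of chunks for each feature: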

chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 55'

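As you can see in this example, the last chunk will generally be smaller than the maximum chunk size. There are two main strategies for dealing with this:

Drop the last chunk if it's smaller than chunk_size.
Pad the last chunk until its length equals chunk_size.

We'll take the first approach here, so let's wrap all of the above logic in a single function that we can apply to our tokenized datasets: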

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

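Note that in the last step of group_texts() we create a new labels column which is a copy of the input_ids one. As we'll see shortly, that's because in masked language modeling the objective is to predict randomly masked tokens in the input batch, and by creating a labels column we provide the ground truth for our language model to learn from.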
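For comparison, the second strategy (padding the last chunk) could look roughly like the sketch below. This is not the approach used in this section. The function name group_texts_with_padding is hypothetical, pad_token_id=0 is an assumption (in practice you would pass tokenizer.pad_token_id), and the word_ids column is left out for brevity.

```python
def group_texts_with_padding(examples, chunk_size, pad_token_id=0):
    """Concatenate tokenized examples and split them into chunks, padding the
    final chunk instead of dropping it."""
    input_ids = sum(examples["input_ids"], [])
    attention_mask = sum(examples["attention_mask"], [])
    result = {"input_ids": [], "attention_mask": [], "labels": []}
    for start in range(0, len(input_ids), chunk_size):
        ids = input_ids[start : start + chunk_size]
        mask = attention_mask[start : start + chunk_size]
        n_pad = chunk_size - len(ids)
        result["input_ids"].append(ids + [pad_token_id] * n_pad)
        # Padding positions get attention 0 so the model ignores them
        result["attention_mask"].append(mask + [0] * n_pad)
        # Padding positions get label -100 so they never count in the loss
        result["labels"].append(ids + [-100] * n_pad)
    return result


# Toy input with made-up token IDs
toy_examples = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1]],
}
padded = group_texts_with_padding(toy_examples, chunk_size=4)
print(padded)
```

Padding keeps every token at the cost of a partly empty final chunk; dropping is simpler and loses fewer than chunk_size tokens per processed batch.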

Let's now apply group_texts() to our tokenized datasets using our trusty Dataset.map() function:

lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
        num_rows: 61289
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
        num_rows: 59905
    })
    unsupervised: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
        num_rows: 122963
    })
})

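You can see that grouping and then chunking the texts has produced many more examples than our original 25,000 for the train and test splits. That's because we now have examples involving contiguous tokens that span across multiple examples from the original corpus. You can see this explicitly by looking for the special [SEP] and [CLS] tokens in one of the chunks: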

tokenizer.decode(lm_datasets["train"][1]["input_ids"])

".... at.......... high. a classic line : inspector : i'm here to sack one of your teachers. student : welcome to bromwell high. i expect that many adults of my age think that bromwell high is far fetched. what a pity that it isn't! [SEP] [CLS] homelessness ( or houselessness as george carlin stated ) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. most people think of the homeless"

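In this example you can see two movie reviews sharing a single chunk, one about a high school movie and the other about homelessness. Let's also check out what the labels look like for masked language modeling: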

tokenizer.decode(lm_datasets["train"][1]["labels"])

".... at.......... high. a classic line : inspector : i'm here to sack one of your teachers. student : welcome to bromwell high. i expect that many adults of my age think that bromwell high is far fetched. what a pity that it isn't! [SEP] [CLS] homelessness ( or houselessness as george carlin stated ) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. most people think of the homeless"

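As expected from our group_texts() function above, this looks identical to the decoded input_ids, but then how can our model possibly learn anything? We're missing a key step: inserting [MASK] tokens at random positions in the inputs! Let's see how we can do this on the fly during fine-tuning using a special data collator.

Fine-tuning DistilBERT with the Trainer API

Fine-tuning a masked language model is almost identical to fine-tuning a sequence classification model, like we did in Chapter 3. The only difference is that we need a special data collator that can randomly mask some of the tokens in each batch of texts. Fortunately, 🤗 Transformers comes prepared with a dedicated DataCollatorForLanguageModeling for just this task. We just have to pass it the tokenizer and an mlm_probability argument that specifies what fraction of the tokens to mask. We'll pick 15%, which is the amount used for BERT and a common choice in the literature: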

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

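To see how the random masking works, let's feed a few examples to the data collator. Since it expects a list of dicts, where each dict represents a single chunk of contiguous text, we first iterate over the dataset before feeding the batch to the collator. We remove the "word_ids" key for this data collator as it does not expect it: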

samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

'>>> [CLS] bromwell [MASK] is a cartoon comedy. it ran at the same [MASK] as some other [MASK] about school life, [MASK] as " teachers ". [MASK] [MASK] [MASK] in the teaching [MASK] lead [MASK] to believe that bromwell high\'[MASK] satire is much closer to reality than is " teachers ". the scramble [MASK] [MASK] financially, the [MASK]ful students whogn [MASK] right through [MASK] pathetic teachers\'pomp, the pettiness of the whole situation, distinction remind me of the schools i knew and their students. when i saw [MASK] episode in [MASK] a student repeatedly tried to burn down the school, [MASK] immediately recalled. [MASK]...'

'>>> .... at.. [MASK]... [MASK]... high. a classic line plucked inspector : i\'[MASK] here to [MASK] one of your [MASK]. student : welcome to bromwell [MASK]. i expect that many adults of my age think that [MASK]mwell [MASK] is [MASK] fetched. what a pity that it isn\'t! [SEP] [CLS] [MASK]ness ( or [MASK]lessness as george 宇in stated )公 been an issue for years but never [MASK] plan to help those on the street that were once considered human [MASK] did everything from going to school, [MASK], [MASK] vote for the matter. most people think [MASK] the homeless'

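Nice, it worked! We can see that the [MASK] token has been randomly inserted at various locations in our text. These will be the tokens which our model will have to predict during training, and the beauty of the data collator is that it will randomize the [MASK] insertion with every batch!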
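By default, DataCollatorForLanguageModeling follows the BERT recipe: of the tokens selected for prediction, about 80% are replaced with [MASK], about 10% with a random token, and about 10% are left unchanged. That is probably why a few unexpected tokens show up in the decoded output above. The following is a minimal pure-Python sketch of that rule, not the library's implementation; the function name mask_tokens, the token IDs, and the vocabulary size are assumptions made for illustration.

```python
import random


def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=random):
    """BERT-style masking on a plain list of token IDs.

    Returns (masked_input_ids, labels); labels are -100 wherever the token
    was not selected for prediction.
    """
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        # Never select special tokens; select others with mlm_probability
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token_id  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


# Made-up IDs: 101 and 102 stand in for [CLS] and [SEP], 103 for [MASK]
original = [101] + list(range(2000, 2050)) + [102]
masked, labels = mask_tokens(
    original, mask_token_id=103, vocab_size=30522,
    special_ids={101, 102}, rng=random.Random(0),
)
n_selected = sum(label != -100 for label in labels)
print(f"{n_selected} of {len(original)} tokens were selected for prediction")
```

Note how the labels keep the original token only at selected positions, which is the same -100 convention the whole word masking collator relies on.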

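✏️ Try it out! Run the code snippet above several times to see the random masking happen in front of your very eyes! Also replace the tokenizer.decode() method with tokenizer.convert_ids_to_tokens() to see that sometimes a single token from a given word is masked, and not the others.

One side effect of random masking is that our evaluation metrics will not be deterministic when using the Trainer, since we use the same data collator for the training and test sets. We'll see later, when we look at fine-tuning with 🤗 Accelerate, how we can use the flexibility of a custom evaluation loop to freeze the randomness.

When training models for masked language modeling, one technique that can be used is to mask whole words together, not just individual tokens. This approach is called whole word masking. If we want to use whole word masking, we will need to build a data collator ourselves. A data collator is just a function that takes a list of samples and converts them into a batch, so let's do this now! We'll use the word IDs computed earlier to make a map between word indices and the corresponding tokens, then randomly decide which words to mask and apply that mask on the inputs. Note that the labels are all -100 except for the ones corresponding to masked words.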

import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)

Next, we can try it on the same samples as before:

samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

'>>> [CLS] bromwell high is a cartoon comedy [MASK] it ran at the same time as some other programs about school life, such as " teachers ". my 35 years in the teaching profession lead me to believe that bromwell high\'s satire is much closer to reality than is " teachers ". the scramble to survive financially, the insightful students who can see right through their pathetic teachers\'pomp, the pettiness of the whole situation, all remind me of the schools i knew and their students. when i saw the episode in which a student repeatedly tried to burn down the school, i immediately recalled.....'

'>>> .... [MASK] [MASK] [MASK] [MASK]....... high. a classic line : inspector : i\'m here to sack one of your teachers. student : welcome to bromwell high. i expect that many adults of my age think that bromwell high is far fetched. what a pity that it isn\'t! [SEP] [CLS] homelessness ( or houselessness as george carlin stated ) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. most people think of the homeless'

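✏️ Try it out! Run the code snippet above several times to see the random masking happen in front of your very eyes! Also replace the tokenizer.decode() method with tokenizer.convert_ids_to_tokens() to see that the tokens from a given word are always masked together.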
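If you would rather check this programmatically than by eye, a small helper along the following lines could work. It is a sketch with a hypothetical name, it assumes you kept a copy of each sample's word_ids before the collator popped them, and the token IDs in the usage example are made up.

```python
import itertools


def words_masked_consistently(input_ids, word_ids, mask_token_id):
    """Return True if every word is either fully masked or not masked at all.

    Consecutive positions with the same word ID are treated as one word, so
    word IDs that restart in a new review within the chunk are handled.
    """
    position = 0
    for word_id, run in itertools.groupby(word_ids):
        run_length = len(list(run))
        tokens = input_ids[position : position + run_length]
        position += run_length
        if word_id is None:  # special tokens do not belong to any word
            continue
        if len({token == mask_token_id for token in tokens}) > 1:
            return False
    return True


MASK = 103  # made-up ID standing in for tokenizer.mask_token_id
toy_word_ids = [None, 0, 0, 1, 2, None]
whole_word = [101, MASK, MASK, 2152, 2003, 102]
token_level = [101, MASK, 1234, 2152, 2003, 102]

print(words_masked_consistently(whole_word, toy_word_ids, MASK))
print(words_masked_consistently(token_level, toy_word_ids, MASK))
```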

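Now that we have two data collators, the rest of the fine-tuning steps are standard. Training can take a while on Google Colab if you're not lucky enough to score a mythical P100 GPU 😭, so we'll first downsample the size of the training set to a few thousand examples. Don't worry, we'll still get a pretty decent language model! A quick way to downsample a dataset in 🤗 Datasets is via the Dataset.train_test_split() function that we saw in Chapter 5: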

train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'word_ids'],
        num_rows: 1000
    })
})

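This has automatically created new train and test splits, with the training set size set to 10,000 examples and the validation set to 10% of that. Feel free to increase this if you have a beefy GPU! The next thing we need to do is log in to the Hugging Face Hub. If you're running this code in a notebook, you can do so with the following utility function: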

from huggingface_hub import notebook_login

notebook_login()

which will display a widget where you can enter your credentials. Alternatively, you can run:

huggingface-cli login

in your favorite terminal and log in there.

Once we're logged in, we can specify the arguments for the Trainer:

from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)

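Here we tweaked a few of the default options, including logging_steps to ensure we track the training loss with each epoch. We've also used fp16=True to enable mixed-precision training, which gives us another boost in speed. By default, the Trainer will remove any columns that are not part of the model's forward() method. This means that if you're using the whole word masking collator, you'll also need to set remove_unused_columns=False to ensure we don't lose the word_ids column during training.

Note that you can specify the name of the repository you want to push to with the hub_model_id argument (in particular, you will have to use this argument to push to an organization). For example, when we pushed the model to the huggingface-course organization, we added hub_model_id="huggingface-course/distilbert-finetuned-imdb" to TrainingArguments. By default, the repository used will be in your namespace and named after the output directory you set, so in our case it will be "lewtun/distilbert-base-uncased-finetuned-imdb".

We now have all the ingredients to instantiate the Trainer. Here we just use the standard data_collator, but you can try the whole word masking collator and compare the results as an exercise: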

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

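We're now ready to run trainer.train(), but before doing so let's briefly look at perplexity, which is a common metric to evaluate the performance of language models.

Perplexity for language models

Unlike other tasks like text classification or question answering where we're given a labeled corpus to train on, with language modeling we don't have any explicit labels. So how do we determine what makes a good language model? Like with the autocorrect feature in your phone, a good language model is one that assigns high probabilities to sentences that are grammatically correct, and low probabilities to nonsense sentences. To give you a better idea of what this looks like, you can find whole sets of "autocorrect fails" online, where the model in a person's phone has produced some rather funny (and often inappropriate) completions!

Assuming our test set consists mostly of sentences that are grammatically correct, then one way to measure the quality of our language model is to calculate the probabilities it assigns to the next word in all the sentences of the test set. High probabilities indicate that the model is not "surprised" or "perplexed" by the unseen examples, and suggest it has learned the basic patterns of grammar in the language. There are various mathematical definitions of perplexity, but the one we'll use defines it as the exponential of the cross-entropy loss. Thus, we can calculate the perplexity of our pretrained model by using the Trainer.evaluate() function to compute the cross-entropy loss on the test set and then taking the exponential of the result: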

import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 21.75

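A lower perplexity score means a better language model, and we can see here that our starting model has a somewhat large value. Let's see if we can lower it by fine-tuning! To do that, we first run the training loop: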

trainer.train()

and then compute the resulting perplexity on the test set as before:

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 11.32

Nice! This is quite a reduction in perplexity, which tells us the model has learned something about the domain of movie reviews!
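To build some intuition for these numbers, the sketch below computes perplexity by hand as the exponential of the average negative log-likelihood. The probabilities are invented for illustration; in the real evaluation they would be the probabilities the model assigns to the correct token at each masked position.

```python
import math


def perplexity(token_probabilities):
    """Exponential of the average negative log-likelihood (cross-entropy)."""
    negative_log_likelihoods = [-math.log(p) for p in token_probabilities]
    cross_entropy = sum(negative_log_likelihoods) / len(negative_log_likelihoods)
    return math.exp(cross_entropy)


# Hypothetical probabilities given to the correct token at three positions
confident = [0.9, 0.8, 0.95]
unsure = [0.5, 0.25, 0.125]

print(f"Confident model: {perplexity(confident):.2f}")
print(f"Unsure model: {perplexity(unsure):.2f}")
# A model that always spreads its guess evenly over 4 tokens
print(f"Uniform over 4 tokens: {perplexity([0.25] * 10):.2f}")
```

One way to read the result: a perplexity of k means the model is, on average, about as uncertain as if it were choosing uniformly among k tokens.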

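Once training is finished, we can push the model card with the training information to the Hub (the checkpoints are saved during training itself):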

trainer.push_to_hub()

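✏️ Your turn! Run the training above after changing the data collator to the whole word masking collator. Do you get better results?

In our use case we didn't need to do anything special with the training loop, but in some cases you might need to implement some custom logic. For these applications, you can use 🤗 Accelerate. Let's take a look!

Fine-tuning DistilBERT with 🤗 Accelerate

As we saw with the Trainer, fine-tuning a masked language model is very similar to the text classification example from Chapter 3. In fact, the only subtlety is the use of a special data collator, and we've already covered that earlier in this section!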

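However, we saw that DataCollatorForLanguageModeling also applies random masking with each evaluation, so we'll see some fluctuations in our perplexity scores with each training run. One way to eliminate this source of randomness is to apply the masking once on the whole test set, and then use the default data collator in 🤗 Transformers to collect the batches during evaluation. To see how this works, let's implement a simple function that applies the masking on a batch, similar to our first encounter with DataCollatorForLanguageModeling: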

def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}
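The first line of insert_random_mask() is fairly dense, so here it is in isolation on a toy batch with made-up values. It converts the column-oriented dict of lists that Dataset.map(batched=True) passes in into the list of per-example dicts that a data collator expects.

```python
# A toy batch in the column-oriented layout (values are made up)
batch = {
    "input_ids": [[101, 7, 102], [101, 8, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}

# zip(*batch.values()) walks the columns in parallel, yielding one tuple per
# example; zip(batch, row) then pairs each column name with that row's value
features = [dict(zip(batch, row)) for row in zip(*batch.values())]
print(features)
```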

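Next, we'll apply this function to our test set and drop the unmasked columns so we can replace them with the masked ones. You can use whole word masking by replacing the data_collator above with the appropriate one, in which case you should remove the first line here: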

downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

We can then set up the dataloaders as usual, but we'll use the default_data_collator from 🤗 Transformers for the evaluation set:

from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

From here, we follow the standard steps with 🤗 Accelerate. The first order of business is to load a fresh version of the pretrained model:

model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

Then we need to specify the optimizer; we'll use the standard AdamW:

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

With these objects, we can now prepare everything for training with the Accelerator object:

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Now that our model, optimizer, and dataloaders are configured, we can specify the learning rate scheduler as follows:

from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
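To see what the "linear" schedule does with these settings, here is a small standalone sketch of the learning rate as a function of the optimizer step. It is an illustration, not the library code; the helper name linear_lr is hypothetical, and the step count assumes a single device, so that the dataloader length is the number of training examples divided by the batch size, rounded up.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


# Single-device assumption: 10,000 examples in batches of 64, for 3 epochs
steps_per_epoch = math.ceil(10_000 / 64)
total_steps = 3 * steps_per_epoch

for step in (0, total_steps // 2, total_steps):
    print(f"step {step}: lr = {linear_lr(step, 5e-5, total_steps):.2e}")
```

With no warmup, the learning rate starts at its full value and shrinks steadily to zero by the last step.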

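There is just one last thing to do before training: create a model repository on the Hugging Face Hub! We can use the 🤗 Hub library to first generate the full name of our repo: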

from huggingface_hub import get_full_repo_name

model_name = "distilbert-base-uncased-finetuned-imdb-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'lewtun/distilbert-base-uncased-finetuned-imdb-accelerate'

then create and clone the repository using the Repository class from 🤗 Hub:

from huggingface_hub import Repository

output_dir = model_name
repo = Repository(output_dir, clone_from=repo_name)

With that done, it's just a simple matter of writing out the full training and evaluation loop:

from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

>>> Epoch 0: Perplexity: 11.397545307900472
>>> Epoch 1: Perplexity: 10.904909330983092
>>> Epoch 2: Perplexity: 10.729503505340409

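Cool, we've been able to evaluate perplexity with each epoch, and because the evaluation masks are now fixed, the scores are comparable across multiple training runs!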
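One detail of the evaluation loop that is easy to miss is the combination of loss.repeat(batch_size) and the truncation to len(eval_dataset). The sketch below reproduces that bookkeeping with plain lists, assuming a single process; the helper name and the loss values are made up. The effect is that each batch's loss is weighted by the number of examples it holds, so a smaller final batch counts for less.

```python
def mean_eval_loss(batch_losses, batch_size, dataset_size):
    """Repeat each batch's loss batch_size times, concatenate, truncate to
    the dataset size, and average, mirroring the evaluation loop above."""
    repeated = []
    for loss in batch_losses:
        repeated.extend([loss] * batch_size)
    repeated = repeated[:dataset_size]
    return sum(repeated) / len(repeated)


# 10 examples in batches of 4: the last batch holds only 2 examples
batch_losses = [2.0, 3.0, 4.0]
weighted = mean_eval_loss(batch_losses, batch_size=4, dataset_size=10)
unweighted = sum(batch_losses) / len(batch_losses)
print(f"Weighted by batch size: {weighted}")
print(f"Plain mean of batch losses: {unweighted}")
```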

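Using our fine-tuned model

You can interact with your fine-tuned model either by using its widget on the Hub or locally with the pipeline from 🤗 Transformers. Let's use the latter to download our model using the fill-mask pipeline: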

from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="huggingface-course/distilbert-base-uncased-finetuned-imdb"
)

We can then feed the pipeline our sample text of "This is a great [MASK]." and see what the top 5 predictions are:

preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

'>>> this is a great movie.'
'>>> this is a great film.'
'>>> this is a great story.'
'>>> this is a great movies.'
'>>> this is a great character.'

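Neat! Our model has clearly adapted its weights to predict words that are more strongly associated with movies!

This wraps up our first experiment with training a language model. In section 6 you'll learn how to train an auto-regressive model like GPT-2 from scratch; head over there if you'd like to see how you can pretrain your very own Transformer model!

✏️ Try it out! To quantify the benefits of domain adaptation, fine-tune a classifier on the IMDb labels for both the pretrained and fine-tuned DistilBERT checkpoints. If you need a refresher on text classification, check out Chapter 3.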
