Masked language modeling
Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. This means the model has full access to the tokens on the left and right. Masked language modeling is great for tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.
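For a quick feel for the task, here is a minimal sketch (illustrative only, using the pretrained distilbert/distilroberta-base checkpoint before any finetuning) that asks a fill-mask pipeline to predict the masked token:
>>> from transformers import pipeline
>>> mask_filler = pipeline("fill-mask", model="distilbert/distilroberta-base")
>>> mask_filler("The capital of France is <mask>.", top_k=1)  # typically predicts a token like ' Paris'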
This guide will show you how to:
- Finetune DistilRoBERTa on the r/askscience subset of the ELI5 dataset.
- Use your finetuned model for inference.
To see all architectures and checkpoints compatible with this task, we recommend checking the task page.
Before you begin, make sure you have all the necessary libraries installed:
pip install transformers datasets evaluate
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
>>> from huggingface_hub import notebook_login
>>> notebook_login()
Load ELI5 dataset
Start by loading the first 5000 examples from the ELI5-Category dataset with the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.
>>> from datasets import load_dataset
>>> eli5 = load_dataset("eli5_category", split="train[:5000]")
Split the dataset's train split into a train and test set with the train_test_split method:
>>> eli5 = eli5.train_test_split(test_size=0.2)
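As a quick sanity check (illustrative; splitting 5,000 examples with test_size=0.2 should give 4,000 training and 1,000 test examples), you can look at the resulting split sizes:
>>> len(eli5["train"]), len(eli5["test"])
(4000, 1000)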
Then take a look at an example:
>>> eli5["train"][0]
{'q_id': '7h191n',
'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?',
'selftext': '',
'category': 'Economics',
'subreddit': 'explainlikeimfive',
'answers': {'a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'],
'text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.",
'None yet. It has to be reconciled with a vastly different house bill and then passed again.',
'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?',
'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'],
'score': [21, 19, 5, 3],
'text_urls': [[],
[],
[],
['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']]},
'title_urls': ['url'],
'selftext_urls': ['url']}
While this may look like a lot, you're only really interested in the text field. What's cool about language modeling tasks is you don't need labels (also known as an unsupervised task) because the masked words themselves serve as the labels.
Preprocess
For masked language modeling, the next step is to load a DistilRoBERTa tokenizer to process the text subfield:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base")
You'll notice from the example above that the text field is actually nested inside answers. This means you'll need to extract the text subfield from its nested structure with the flatten method:
>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'q_id': '7h191n',
'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?',
'selftext': '',
'category': 'Economics',
'subreddit': 'explainlikeimfive',
'answers.a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'],
'answers.text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.",
'None yet. It has to be reconciled with a vastly different house bill and then passed again.',
'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?',
'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'],
'answers.score': [21, 19, 5, 3],
'answers.text_urls': [[],
[],
[],
['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']],
'title_urls': ['url'],
'selftext_urls': ['url']}
Each subfield is now a separate column as indicated by the answers prefix, and the text field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.
Here is a first preprocessing function to join the list of strings for each example and tokenize the result:
>>> def preprocess_function(examples):
... return tokenizer([" ".join(x) for x in examples["answers.text"]])
To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and increase the number of processes with num_proc. Remove any columns you don't need:
>>> tokenized_eli5 = eli5.map(
... preprocess_function,
... batched=True,
... num_proc=4,
... remove_columns=eli5["train"].column_names,
... )
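Before moving on, it can help to peek at a few of the tokenized sequence lengths (an illustrative check; the exact numbers depend on the sampled examples):
>>> [len(ids) for ids in tokenized_eli5["train"][:5]["input_ids"]]  # lengths vary widely; some may exceed the model's 512-token limit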
This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.
You can now use a second preprocessing function to:
- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by block_size, which should be both shorter than the maximum input length and short enough for your GPU RAM.
>>> block_size = 128
>>> def group_texts(examples):
... # Concatenate all texts.
... concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
... total_length = len(concatenated_examples[list(examples.keys())[0]])
... # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
... # customize this part to your needs.
... if total_length >= block_size:
... total_length = (total_length // block_size) * block_size
... # Split by chunks of block_size.
... result = {
... k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
... for k, t in concatenated_examples.items()
... }
... return result
Apply the group_texts function over the entire dataset:
>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
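After this step, every example should be exactly block_size tokens long (a quick illustrative check):
>>> len(lm_dataset["train"][0]["input_ids"])
128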
Now create a batch of examples using DataCollatorForLanguageModeling. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
Use the end-of-sequence token as the padding token and specify mlm_probability to randomly mask tokens each time you iterate over the data:
>>> from transformers import DataCollatorForLanguageModeling
>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
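If you're curious what the collator produces, here is a minimal sketch that collates two examples (illustrative only; the masked positions change on every call because masking is random):
>>> features = [lm_dataset["train"][i] for i in range(2)]
>>> batch = data_collator(features)
>>> batch["input_ids"].shape  # torch.Size([2, 128]); ~15% of tokens are selected for masking
>>> batch["labels"].shape  # torch.Size([2, 128]); original ids at masked positions, -100 everywhere else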
For TensorFlow, create the data collator the same way, but return the batches as tf tensors:
>>> from transformers import DataCollatorForLanguageModeling
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")
Train
You're ready to start training your model now! Load DistilRoBERTa with AutoModelForMaskedLM:
>>> from transformers import AutoModelForMaskedLM, TrainingArguments, Trainer
>>> model = AutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base")
At this point, only three steps remain:
- Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model).
- Pass the training arguments to Trainer along with the model, datasets, and data collator.
- Call train() to finetune your model.
>>> training_args = TrainingArguments(
... output_dir="my_awesome_eli5_mlm_model",
... eval_strategy="epoch",
... learning_rate=2e-5,
... num_train_epochs=3,
... weight_decay=0.01,
... push_to_hub=True,
... )
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=lm_dataset["train"],
... eval_dataset=lm_dataset["test"],
... data_collator=data_collator,
... tokenizer=tokenizer,
... )
>>> trainer.train()
Once training is completed, use the evaluate() method to evaluate your model and get its perplexity:
>>> import math
>>> eval_results = trainer.evaluate()
>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 8.76
Then share your model to the Hub with the push_to_hub() method so everyone can use your model:
>>> trainer.push_to_hub()
If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial here!
To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:
>>> from transformers import create_optimizer, AdamWeightDecay
>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
Then you can load DistilRoBERTa with TFAutoModelForMaskedLM:
>>> from transformers import TFAutoModelForMaskedLM
>>> model = TFAutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base")
Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():
>>> tf_train_set = model.prepare_tf_dataset(
... lm_dataset["train"],
... shuffle=True,
... batch_size=16,
... collate_fn=data_collator,
... )
>>> tf_test_set = model.prepare_tf_dataset(
... lm_dataset["test"],
... shuffle=False,
... batch_size=16,
... collate_fn=data_collator,
... )
Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:
>>> import tensorflow as tf
>>> model.compile(optimizer=optimizer) # No loss argument!
The last thing to set up before you start training is a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the PushToHubCallback:
>>> from transformers.keras_callbacks import PushToHubCallback
>>> callback = PushToHubCallback(
... output_dir="my_awesome_eli5_mlm_model",
... tokenizer=tokenizer,
... )
Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to finetune the model:
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])
Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!
For a more in-depth example of how to finetune a model for masked language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.
Inference
Great, now that you've finetuned a model, you can use it for inference!
Come up with some text you'd like the model to fill in the blank with, and use the special <mask> token to indicate the blank:
>>> text = "The Milky Way is a <mask> galaxy."
The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for fill-mask with your model, and pass your text to it. If you like, you can use the top_k parameter to specify how many predictions to return:
>>> from transformers import pipeline
>>> mask_filler = pipeline("fill-mask", "username/my_awesome_eli5_mlm_model")
>>> mask_filler(text, top_k=3)
[{'score': 0.5150994658470154,
'token': 21300,
'token_str': ' spiral',
'sequence': 'The Milky Way is a spiral galaxy.'},
{'score': 0.07087188959121704,
'token': 2232,
'token_str': ' massive',
'sequence': 'The Milky Way is a massive galaxy.'},
{'score': 0.06434620916843414,
'token': 650,
'token_str': ' small',
'sequence': 'The Milky Way is a small galaxy.'}]
Tokenize the text and return the input_ids as PyTorch tensors. You'll also need to specify the position of the <mask> token:
>>> import torch
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> inputs = tokenizer(text, return_tensors="pt")
>>> mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
Pass your inputs to the model and return the logits of the masked token:
>>> from transformers import AutoModelForMaskedLM
>>> model = AutoModelForMaskedLM.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> logits = model(**inputs).logits
>>> mask_token_logits = logits[0, mask_token_index, :]
Then return the three masked tokens with the highest probability and print them out:
>>> top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()
>>> for token in top_3_tokens:
... print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
The Milky Way is a spiral galaxy.
The Milky Way is a massive galaxy.
The Milky Way is a small galaxy.
Tokenize the text and return the input_ids as TensorFlow tensors. You'll also need to specify the position of the <mask> token:
>>> import tensorflow as tf
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> inputs = tokenizer(text, return_tensors="tf")
>>> mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]
Pass your inputs to the model and return the logits of the masked token:
>>> from transformers import TFAutoModelForMaskedLM
>>> model = TFAutoModelForMaskedLM.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> logits = model(**inputs).logits
>>> mask_token_logits = logits[0, mask_token_index, :]
Then return the three masked tokens with the highest probability and print them out:
>>> top_3_tokens = tf.math.top_k(mask_token_logits, 3).indices.numpy()
>>> for token in top_3_tokens:
... print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
The Milky Way is a spiral galaxy.
The Milky Way is a massive galaxy.
The Milky Way is a small galaxy.