Masked language modeling

Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. This means the model has full access to the tokens on the left and right. Masked language modeling is great for tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.
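Before fine-tuning anything, you can get a feel for masked prediction by trying the pretrained distilbert/distilroberta-base checkpoint (the same model this guide finetunes) in a fill-mask pipeline. This is just an illustrative sketch and assumes the checkpoint can be downloaded:

>>> from transformers import pipeline

>>> # Load a fill-mask pipeline with the pretrained (not yet finetuned) checkpoint
>>> mask_filler = pipeline("fill-mask", model="distilbert/distilroberta-base")
>>> # The model uses context on both sides of <mask> to predict the missing token
>>> mask_filler("The capital of France is <mask>.", top_k=1)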

This guide will show you how to:

  1. Finetune DistilRoBERTa on the r/askscience subset of the ELI5 dataset.
  2. Use your finetuned model for inference.

To see all architectures and checkpoints compatible with this task, we recommend checking the task page.

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Load ELI5 dataset

Start by loading the first 5000 examples from the ELI5-Category dataset with the 🤗 Datasets library. This will give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

>>> from datasets import load_dataset

>>> eli5 = load_dataset("eli5_category", split="train[:5000]")

Split the dataset's train split into a train and test set with the train_test_split method:

>>> eli5 = eli5.train_test_split(test_size=0.2)

Then take a look at an example:

>>> eli5["train"][0]
{'q_id': '7h191n',
 'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?',
 'selftext': '',
 'category': 'Economics',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'],
  'text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.",
   'None yet. It has to be reconciled with a vastly different house bill and then passed again.',
   'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?',
   'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'],
  'score': [21, 19, 5, 3],
  'text_urls': [[],
   [],
   [],
   ['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']]},
 'title_urls': ['url'],
 'selftext_urls': ['url']}

While this may look like a lot, you're only really interested in the text field. What's cool about language modeling tasks is that you don't need labels (also known as an unsupervised task), because the masked tokens themselves serve as the labels.
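If you want a quick look at the raw text you'll be training on, you can index into the nested answers field directly (the indices here are arbitrary and purely illustrative):

>>> # Peek at the first answer of the first training example
>>> eli5["train"][0]["answers"]["text"][0]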

Preprocess

For masked language modeling, the next step is to load a DistilRoBERTa tokenizer to process the text subfield:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base")

You'll notice from the example above that the text field is actually nested inside answers. This means you'll need to extract the text subfield from its nested structure with the flatten method:

>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'q_id': '7h191n',
 'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?',
 'selftext': '',
 'category': 'Economics',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'],
 'answers.text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.",
  'None yet. It has to be reconciled with a vastly different house bill and then passed again.',
  'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?',
  'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'],
 'answers.score': [21, 19, 5, 3],
 'answers.text_urls': [[],
  [],
  [],
  ['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']],
 'title_urls': ['url'],
 'selftext_urls': ['url']}

Each subfield is now a separate column, as indicated by the answers prefix, and the text field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.

Here is a first preprocessing function to join the list of strings for each example and tokenize the result:

>>> def preprocess_function(examples):
...     return tokenizer([" ".join(x) for x in examples["answers.text"]])

To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and by increasing the number of processes with num_proc. Remove any columns you don't need:

>>> tokenized_eli5 = eli5.map(
...     preprocess_function,
...     batched=True,
...     num_proc=4,
...     remove_columns=eli5["train"].column_names,
... )

This dataset contains the token sequences, but some of them are longer than the maximum input length for the model.

You can now use a second preprocessing function to:

  • concatenate all the sequences
  • split the concatenated sequences into shorter chunks defined by block_size, which should be both shorter than the maximum input length and short enough for your GPU RAM.

>>> block_size = 128


>>> def group_texts(examples):
...     # Concatenate all texts.
...     concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
...     total_length = len(concatenated_examples[list(examples.keys())[0]])
...     # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
...     # customize this part to your needs.
...     if total_length >= block_size:
...         total_length = (total_length // block_size) * block_size
...     # Split by chunks of block_size.
...     result = {
...         k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
...         for k, t in concatenated_examples.items()
...     }
...     return result

Apply the group_texts function over the entire dataset:

>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
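As an optional sanity check (not part of the original workflow), you can confirm that every example is now a chunk of exactly block_size token ids:

>>> # Each chunk produced by group_texts should be block_size (128) tokens long
>>> len(lm_dataset["train"][0]["input_ids"])
128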

Now create a batch of examples using DataCollatorForLanguageModeling. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

PyTorch

Use the end-of-sequence token as the padding token, and specify mlm_probability to randomly mask tokens each time you iterate over the data:

>>> from transformers import DataCollatorForLanguageModeling

>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
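If you'd like to see the dynamic masking in action, you can collate a couple of chunks and inspect the result; this is just a quick, optional check. Positions that were not masked have their label set to -100 so they are ignored by the loss:

>>> # Collate two chunks; roughly 15% of tokens are masked anew on every call
>>> batch = data_collator([lm_dataset["train"][i] for i in range(2)])
>>> # Count the masked positions (labels other than -100)
>>> (batch["labels"] != -100).sum()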
TensorFlow

Use the end-of-sequence token as the padding token, and specify mlm_probability to randomly mask tokens each time you iterate over the data:

>>> from transformers import DataCollatorForLanguageModeling

>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")

Train

PyTorch

If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial here!

You're ready to start training your model now! Load DistilRoBERTa with AutoModelForMaskedLM:

>>> from transformers import AutoModelForMaskedLM

>>> model = AutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base")

At this point, only three steps remain:

  1. Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model).
  2. Pass the training arguments to Trainer along with the model, datasets, and data collator.
  3. Call train() to finetune your model.
>>> from transformers import TrainingArguments, Trainer

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_eli5_mlm_model",
...     eval_strategy="epoch",
...     learning_rate=2e-5,
...     num_train_epochs=3,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=lm_dataset["train"],
...     eval_dataset=lm_dataset["test"],
...     data_collator=data_collator,
...     tokenizer=tokenizer,
... )

>>> trainer.train()

Once training is completed, use the evaluate() method to evaluate your model and get its perplexity (the exponential of the evaluation loss):

>>> import math

>>> eval_results = trainer.evaluate()
>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 8.76

Then share your model to the Hub with the push_to_hub() method so everyone can use it:

>>> trainer.push_to_hub()
TensorFlow

If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial here!

To finetune a model in TensorFlow, start by setting up an optimizer function, a learning rate schedule, and some training hyperparameters:

>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then you can load DistilRoBERTa with TFAutoModelForMaskedLM:

>>> from transformers import TFAutoModelForMaskedLM

>>> model = TFAutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base")

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

>>> tf_train_set = model.prepare_tf_dataset(
...     lm_dataset["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = model.prepare_tf_dataset(
...     lm_dataset["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument!

The last thing to set up before you start training is a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the PushToHubCallback:

>>> from transformers.keras_callbacks import PushToHubCallback

>>> callback = PushToHubCallback(
...     output_dir="my_awesome_eli5_mlm_model",
...     tokenizer=tokenizer,
... )

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to finetune the model:

>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to finetune a model for masked language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with some text you'd like the model to fill in the blank with, and use the special <mask> token to indicate the blank:

>>> text = "The Milky Way is a <mask> galaxy."

The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for fill-mask with your model, and pass your text to it. If you like, you can use the top_k parameter to specify how many predictions to return:

>>> from transformers import pipeline

>>> mask_filler = pipeline("fill-mask", "username/my_awesome_eli5_mlm_model")
>>> mask_filler(text, top_k=3)
[{'score': 0.5150994658470154,
  'token': 21300,
  'token_str': ' spiral',
  'sequence': 'The Milky Way is a spiral galaxy.'},
 {'score': 0.07087188959121704,
  'token': 2232,
  'token_str': ' massive',
  'sequence': 'The Milky Way is a massive galaxy.'},
 {'score': 0.06434620916843414,
  'token': 650,
  'token_str': ' small',
  'sequence': 'The Milky Way is a small galaxy.'}]
PyTorch

Tokenize the text and return the input_ids as PyTorch tensors. You'll also need to specify the position of the <mask> token:

>>> import torch
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> inputs = tokenizer(text, return_tensors="pt")
>>> mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

Pass your inputs to the model and return the logits of the masked token:

>>> from transformers import AutoModelForMaskedLM

>>> model = AutoModelForMaskedLM.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> logits = model(**inputs).logits
>>> mask_token_logits = logits[0, mask_token_index, :]

Then return the three masked tokens with the highest probability and print them out:

>>> top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()

>>> for token in top_3_tokens:
...     print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
The Milky Way is a spiral galaxy.
The Milky Way is a massive galaxy.
The Milky Way is a small galaxy.
TensorFlow

Tokenize the text and return the input_ids as TensorFlow tensors. You'll also need to specify the position of the <mask> token:

>>> import tensorflow as tf
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> inputs = tokenizer(text, return_tensors="tf")
>>> mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]

Pass your inputs to the model and return the logits of the masked token:

>>> from transformers import TFAutoModelForMaskedLM

>>> model = TFAutoModelForMaskedLM.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> logits = model(**inputs).logits
>>> mask_token_logits = logits[0, mask_token_index, :]

Then return the three masked tokens with the highest probability and print them out:

>>> top_3_tokens = tf.math.top_k(mask_token_logits, 3).indices.numpy()

>>> for token in top_3_tokens:
...     print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
The Milky Way is a spiral galaxy.
The Milky Way is a massive galaxy.
The Milky Way is a small galaxy.