Masked language modeling
Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. This means the model has full access to the tokens on the left and right. Masked language modeling is great for tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.
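As a quick illustration (a minimal sketch using the off-the-shelf `distilbert/distilroberta-base` checkpoint rather than the model you will train below), a fill-mask pipeline predicts the masked token from the context on both sides of it:
>>> from transformers import pipeline
>>> # The best guess for <mask> depends on words to its *right* ("barked", "mailman"),
>>> # context that a purely left-to-right model would not see.
>>> fill = pipeline("fill-mask", model="distilbert/distilroberta-base")
>>> preds = fill("The <mask> barked at the mailman all morning.", top_k=3)
>>> [p["token_str"] for p in preds]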
This guide will show you how to:
- Finetune DistilRoBERTa on the ELI5-Category dataset.
- Use your finetuned model for inference.
To see all architectures and checkpoints compatible with this task, we recommend checking the task page.
Before you begin, make sure you have all the necessary libraries installed:
pip install transformers datasets evaluate
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
>>> from huggingface_hub import notebook_login
>>> notebook_login()
Load ELI5 dataset
Start by loading the first 5000 examples from the ELI5-Category dataset with the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset:
>>> from datasets import load_dataset
>>> eli5 = load_dataset("eli5_category", split="train[:5000]")
Split the dataset's `train` split into a train and test set with the train_test_split method:
>>> eli5 = eli5.train_test_split(test_size=0.2)
Then take a look at an example:
>>> eli5["train"][0]
{'q_id': '7h191n',
'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?',
'selftext': '',
'category': 'Economics',
'subreddit': 'explainlikeimfive',
'answers': {'a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'],
'text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.",
'None yet. It has to be reconciled with a vastly different house bill and then passed again.',
'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?',
'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'],
'score': [21, 19, 5, 3],
'text_urls': [[],
[],
[],
['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']]},
'title_urls': ['url'],
'selftext_urls': ['url']}
While this may look like a lot, you're only really interested in the `text` field. What's cool about language modeling tasks is that you don't need labels (this is also known as an unsupervised task), because the masked word *is* the label.
Preprocess
For masked language modeling, the next step is to load a DistilRoBERTa tokenizer to process the `text` subfield:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base")
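If you want to double check which placeholder token this tokenizer expects (you'll use it later at inference), look at its `mask_token`:
>>> # DistilRoBERTa's tokenizer uses "<mask>" as the mask placeholder.
>>> tokenizer.mask_token
'<mask>'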
You'll notice from the example above that the `text` field is actually nested inside `answers`. This means you'll need to extract the `text` subfield from its nested structure with the `flatten` method:
>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'q_id': '7h191n',
'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?',
'selftext': '',
'category': 'Economics',
'subreddit': 'explainlikeimfive',
'answers.a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'],
'answers.text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.",
'None yet. It has to be reconciled with a vastly different house bill and then passed again.',
'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?',
'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'],
'answers.score': [21, 19, 5, 3],
'answers.text_urls': [[],
[],
[],
['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']],
'title_urls': ['url'],
'selftext_urls': ['url']}
Each subfield is now a separate column, as indicated by the `answers` prefix, and the `text` field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.
Here is a first preprocessing function to join the list of strings for each example and tokenize the result:
>>> def preprocess_function(examples):
... return tokenizer([" ".join(x) for x in examples["answers.text"]])
To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once, and by increasing the number of processes with `num_proc`. Remove any columns you don't need:
>>> tokenized_eli5 = eli5.map(
... preprocess_function,
... batched=True,
... num_proc=4,
... remove_columns=eli5["train"].column_names,
... )
This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.
You can now use a second preprocessing function to:
- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough for your GPU RAM.
>>> block_size = 128
>>> def group_texts(examples):
... # Concatenate all texts.
... concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
... total_length = len(concatenated_examples[list(examples.keys())[0]])
... # We drop the small remainder; we could add padding instead if the model supported it.
... # You can customize this part to your needs.
... if total_length >= block_size:
... total_length = (total_length // block_size) * block_size
... # Split by chunks of block_size.
... result = {
... k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
... for k, t in concatenated_examples.items()
... }
... return result
Apply the `group_texts` function over the entire dataset:
>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
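As a quick sanity check (not part of the original guide), you can confirm that each grouped example is now exactly `block_size` tokens long, since any remainder shorter than `block_size` was dropped:
>>> # Every chunk produced by group_texts contains exactly block_size (128) token ids.
>>> len(lm_dataset["train"][0]["input_ids"])
128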
Now create a batch of examples using DataCollatorForLanguageModeling. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
For PyTorch, use the end-of-sequence token as the padding token and specify `mlm_probability` to randomly mask tokens each time you iterate over the data:
>>> from transformers import DataCollatorForLanguageModeling
>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
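To see what the collator actually produces (an illustrative check, not part of the original guide), collate a couple of examples and inspect the batch: about 15% of the token positions get a label (most of them replaced by the mask token in `input_ids`), while `labels` is `-100` everywhere else so those positions are ignored by the loss:
>>> batch = data_collator([lm_dataset["train"][i] for i in range(2)])
>>> # Number of positions replaced with the <mask> token in the inputs.
>>> (batch["input_ids"] == tokenizer.mask_token_id).sum()
>>> # Number of positions the model is asked to predict (labels != -100),
>>> # roughly mlm_probability (15%) of all tokens.
>>> (batch["labels"] != -100).sum()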
For TensorFlow, use the end-of-sequence token as the padding token, specify `mlm_probability` to randomly mask tokens each time you iterate over the data, and have the collator return TensorFlow tensors:
>>> from transformers import DataCollatorForLanguageModeling
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")
Train
You're ready to start training your model now! For PyTorch, load DistilRoBERTa with AutoModelForMaskedLM:
>>> from transformers import AutoModelForMaskedLM
>>> model = AutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base")
At this point, only three steps remain:
- Define your training hyperparameters in TrainingArguments. The only required parameter is `output_dir`, which specifies where to save your model. You'll push the model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model).
- Pass the training arguments to Trainer along with the model, datasets, and data collator.
- Call train() to finetune your model.
>>> from transformers import TrainingArguments, Trainer
>>> training_args = TrainingArguments(
... output_dir="my_awesome_eli5_mlm_model",
... eval_strategy="epoch",
... learning_rate=2e-5,
... num_train_epochs=3,
... weight_decay=0.01,
... push_to_hub=True,
... )
>>> trainer = Trainer(
... model=model,
... args=training_args,
... train_dataset=lm_dataset["train"],
... eval_dataset=lm_dataset["test"],
... data_collator=data_collator,
... tokenizer=tokenizer,
... )
>>> trainer.train()
Once training is completed, use the evaluate() method to evaluate your model and get its perplexity:
>>> import math
>>> eval_results = trainer.evaluate()
>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 8.76
Then share your model to the Hub with the push_to_hub() method so everyone can use it:
>>> trainer.push_to_hub()
If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial here! To finetune a model in TensorFlow, start by setting up an optimizer function and some training hyperparameters:
>>> from transformers import create_optimizer, AdamWeightDecay
>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
Then you can load DistilRoBERTa with TFAutoModelForMaskedLM:
>>> from transformers import TFAutoModelForMaskedLM
>>> model = TFAutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base")
Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():
>>> tf_train_set = model.prepare_tf_dataset(
... lm_dataset["train"],
... shuffle=True,
... batch_size=16,
... collate_fn=data_collator,
... )
>>> tf_test_set = model.prepare_tf_dataset(
... lm_dataset["test"],
... shuffle=False,
... batch_size=16,
... collate_fn=data_collator,
... )
Configure the model for training with `compile`. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:
>>> import tensorflow as tf
>>> model.compile(optimizer=optimizer) # No loss argument!
The last thing to set up before you start training is a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the PushToHubCallback:
>>> from transformers.keras_callbacks import PushToHubCallback
>>> callback = PushToHubCallback(
... output_dir="my_awesome_eli5_mlm_model",
... tokenizer=tokenizer,
... )
Finally, you're ready to start training your model! Call `fit` with your training and validation datasets, the number of epochs, and your callback to finetune the model:
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])
Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!
For a more in-depth example of how to finetune a model for masked language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.
Inference
Great, now that you've finetuned a model, you can use it for inference!
Come up with some text you'd like the model to fill in the blank with, and use the special `<mask>` token to indicate the blank:
>>> text = "The Milky Way is a <mask> galaxy."
The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a `pipeline` for fill-mask with your model, and pass your text to it. If you like, you can use the `top_k` parameter to specify how many predictions to return:
>>> from transformers import pipeline
>>> mask_filler = pipeline("fill-mask", "username/my_awesome_eli5_mlm_model")
>>> mask_filler(text, top_k=3)
[{'score': 0.5150994658470154,
'token': 21300,
'token_str': ' spiral',
'sequence': 'The Milky Way is a spiral galaxy.'},
{'score': 0.07087188959121704,
'token': 2232,
'token_str': ' massive',
'sequence': 'The Milky Way is a massive galaxy.'},
{'score': 0.06434620916843414,
'token': 650,
'token_str': ' small',
'sequence': 'The Milky Way is a small galaxy.'}]
Tokenize the text and return the `input_ids` as PyTorch tensors. You'll also need to specify the position of the `<mask>` token:
>>> import torch
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> inputs = tokenizer(text, return_tensors="pt")
>>> mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
Pass your inputs to the model and return the `logits` of the masked token:
>>> from transformers import AutoModelForMaskedLM
>>> model = AutoModelForMaskedLM.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> logits = model(**inputs).logits
>>> mask_token_logits = logits[0, mask_token_index, :]
Then return the three masked tokens with the highest probability and print them out:
>>> top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()
>>> for token in top_3_tokens:
... print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
The Milky Way is a spiral galaxy.
The Milky Way is a massive galaxy.
The Milky Way is a small galaxy.
For TensorFlow, tokenize the text and return the `input_ids` as TensorFlow tensors. You'll also need to specify the position of the `<mask>` token:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> inputs = tokenizer(text, return_tensors="tf")
>>> mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]
Pass your inputs to the model and return the `logits` of the masked token:
>>> from transformers import TFAutoModelForMaskedLM
>>> model = TFAutoModelForMaskedLM.from_pretrained("username/my_awesome_eli5_mlm_model")
>>> logits = model(**inputs).logits
>>> mask_token_logits = logits[0, mask_token_index, :]
Then return the three masked tokens with the highest probability and print them out:
>>> top_3_tokens = tf.math.top_k(mask_token_logits, 3).indices.numpy()
>>> for token in top_3_tokens:
... print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
The Milky Way is a spiral galaxy.
The Milky Way is a massive galaxy.
The Milky Way is a small galaxy.