Causal language modeling

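There are two types of language modeling: causal and masked. This guide covers causal language modeling. Causal language models are frequently used for text generation. You can use these models for creative applications such as a choose-your-own text adventure, or for an intelligent coding assistant like Copilot or CodeParrot.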

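Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.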
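
As a rough illustration of "only attend to tokens on the left", the sketch below builds the lower-triangular mask that this rule implies. The function name `causal_mask` is made up for this example; models such as GPT-2 apply an equivalent mask inside their attention layers, on tensors rather than Python lists.

```python
def causal_mask(seq_len):
    """Row i marks which positions token i may attend to (1 = visible, 0 = hidden)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


# Each token sees itself and everything to its left, never the future.
for row in causal_mask(4):
    print(row)
```

Reading the rows top to bottom, the first token sees only itself and the last token sees the whole sequence.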

This guide will show you how to:

  1. Finetune DistilGPT2 on the r/askscience subset of the ELI5 dataset.
  2. Use your finetuned model for inference.

To see all architectures and checkpoints compatible with this task, we recommend checking the task page.

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

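Load ELI5 dataset

Start by loading the first 5000 examples from the ELI5-Category dataset with the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.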

>>> from datasets import load_dataset

>>> eli5 = load_dataset("eli5_category", split="train[:5000]")

Split the dataset's `train` split into a train and test set with the train_test_split method:

>>> eli5 = eli5.train_test_split(test_size=0.2)

Then take a look at an example:

>>> eli5["train"][0]
{'q_id': '7h191n',
 'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?',
 'selftext': '',
 'category': 'Economics',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'],
  'text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.",
   'None yet. It has to be reconciled with a vastly different house bill and then passed again.',
   'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?',
   'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'],
  'score': [21, 19, 5, 3],
  'text_urls': [[],
   [],
   [],
   ['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']]},
 'title_urls': ['url'],
 'selftext_urls': ['url']}

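While this may look like a lot, you're only really interested in the `text` field. What's convenient about language modeling tasks is that you don't need labels (this is also known as an unsupervised task), because the next word *is* the label.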
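
To make "the next word is the label" concrete, here is a minimal sketch that turns one sentence into (context, label) pairs. It splits on whitespace purely for readability; the tutorial itself works on subword token IDs, and `next_word_pairs` is a hypothetical helper, not part of any library.

```python
def next_word_pairs(text):
    """Return (context, next_word) pairs for every position after the first word."""
    words = text.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]


pairs = next_word_pairs("The tax bill is 500 pages long")
for context, label in pairs:
    # The label is simply the word that follows the context in the raw text.
    print(" ".join(context), "->", label)
```

No human annotation is involved: every training target comes from the text itself.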

Preprocess

The next step is to load a DistilGPT2 tokenizer to process the `text` subfield:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

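You'll notice from the example above that the `text` field is actually nested inside `answers`. This means you'll need to extract the `text` subfield from its nested structure with the `flatten` method: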

>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'q_id': '7h191n',
 'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?',
 'selftext': '',
 'category': 'Economics',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'],
 'answers.text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.",
  'None yet. It has to be reconciled with a vastly different house bill and then passed again.',
  'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?',
  'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'],
 'answers.score': [21, 19, 5, 3],
 'answers.text_urls': [[],
  [],
  [],
  ['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']],
 'title_urls': ['url'],
 'selftext_urls': ['url']}
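
The renaming that `flatten` performs can be pictured with a small pure-Python sketch. This only mimics the dotted column names on a plain dictionary; it is not how 🤗 Datasets implements the method, and `flatten_record` is a hypothetical helper.

```python
def flatten_record(record, prefix=""):
    """Lift nested dict fields to the top level, joining names with a dot."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so that answers -> text becomes "answers.text".
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


record = {"q_id": "7h191n", "answers": {"a_id": ["dqnds8l"], "score": [21]}}
flat = flatten_record(record)
print(sorted(flat))
```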

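Each subfield is now a separate column, as indicated by the `answers` prefix, and the `text` field is now a list. Instead of tokenizing each sentence separately, convert the list to a string so you can tokenize them jointly.

Here is a first preprocessing function that joins the list of strings for each example and tokenizes the result: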

>>> def preprocess_function(examples):
...     return tokenizer([" ".join(x) for x in examples["answers.text"]])
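
The joining step can be checked without a tokenizer. The sketch below applies the same `" ".join(...)` expression to a small hand-written batch shaped like the flattened dataset; the batch is a shortened stand-in for illustration, not real output of `map`.

```python
# A batch as passed to a batched map function: one list of answers per example.
batch = {
    "answers.text": [
        ["None yet.", "It has to be reconciled first."],
        ["Does this apply to 2017 taxes?"],
    ]
}

# One string per example, with that example's answers separated by a space.
joined = [" ".join(answers) for answers in batch["answers.text"]]
print(joined)
```

The tokenizer then receives one string per example rather than one string per answer.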

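To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once, and by increasing the number of processes with `num_proc`. Remove any columns you don't need: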

>>> tokenized_eli5 = eli5.map(
...     preprocess_function,
...     batched=True,
...     num_proc=4,
...     remove_columns=eli5["train"].column_names,
... )

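This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.

You can now use a second preprocessing function to:

  • concatenate all the sequences
  • split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough for your GPU RAM.

>>> block_size = 128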


>>> def group_texts(examples):
...     # Concatenate all texts.
...     concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
...     total_length = len(concatenated_examples[list(examples.keys())[0]])
...     # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
...     # customize this part to your needs.
...     if total_length >= block_size:
...         total_length = (total_length // block_size) * block_size
...     # Split by chunks of block_size.
...     result = {
...         k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
...         for k, t in concatenated_examples.items()
...     }
...     result["labels"] = result["input_ids"].copy()
...     return result
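
To see what `group_texts` does to a batch, the sketch below runs the same logic on a tiny hand-made batch. The only changes are assumptions made for the demo: `block_size` is passed as an argument instead of being read from a global, and it is set to 4 so the result is easy to read.

```python
def group_texts(examples, block_size):
    # Concatenate every column across the batch into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result


toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
grouped = group_texts(toy_batch, block_size=4)
print(grouped["input_ids"])
```

Ten tokens go in and two blocks of four come out; the last two tokens are discarded, and block boundaries no longer line up with the original examples.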

Apply the `group_texts` function over the entire dataset:

>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

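Now create a batch of examples using DataCollatorForLanguageModeling. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.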

PyTorch

Use the end-of-sequence token as the padding token and set `mlm=False`. This will use the inputs as labels shifted to the right by one element:

>>> from transformers import DataCollatorForLanguageModeling

>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
TensorFlow

Use the end-of-sequence token as the padding token and set `mlm=False`. This will use the inputs as labels shifted to the right by one element:

>>> from transformers import DataCollatorForLanguageModeling

>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
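
The collation step can be sketched in plain Python. To the best of our understanding, with `mlm=False` the collator pads each batch to its longest sequence and copies the inputs into the labels, marking padded positions with -100 so the loss ignores them; the one-position shift is then applied when the model computes the loss. The helpers below (`collate_causal_lm`, `shifted_pairs`) are hypothetical and simplified: they mask only the padding they add, whereas the real collator masks every position equal to the pad token ID.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def collate_causal_lm(batch, pad_id):
    """Pad to the longest sequence in the batch and copy inputs into labels."""
    max_len = max(len(seq) for seq in batch)
    input_ids, labels = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        labels.append(seq + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "labels": labels}


def shifted_pairs(input_ids, labels):
    """Pair the input at position t with the label at t + 1, skipping ignored labels."""
    return [(x, y) for x, y in zip(input_ids[:-1], labels[1:]) if y != IGNORE_INDEX]


collated = collate_causal_lm([[5, 6, 7], [8, 9]], pad_id=0)
print(collated["labels"])
print(shifted_pairs(collated["input_ids"][0], collated["labels"][0]))
```

In this tutorial every block already has length `block_size`, so padding rarely triggers, but the same mechanism applies to variable-length inputs.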

Train

PyTorch

If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial.

You're ready to start training your model now! Load DistilGPT2 with AutoModelForCausalLM:

>>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

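At this point, only three steps remain:

  1. Define your training hyperparameters in TrainingArguments. The only required parameter is `output_dir`, which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model).
  2. Pass the training arguments to Trainer along with the model, datasets, and data collator.
  3. Call train() to finetune your model.

>>> training_args = TrainingArguments(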
...     output_dir="my_awesome_eli5_clm-model",
...     eval_strategy="epoch",
...     learning_rate=2e-5,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=lm_dataset["train"],
...     eval_dataset=lm_dataset["test"],
...     data_collator=data_collator,
...     tokenizer=tokenizer,
... )

>>> trainer.train()

Once training is completed, use the evaluate() method to evaluate your model and get its perplexity:

>>> import math

>>> eval_results = trainer.evaluate()
>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 49.61
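
Perplexity is the exponential of the mean per-token cross-entropy, which is why the snippet above exponentiates `eval_loss`. The sketch below computes it from the probabilities a model assigned to the correct tokens; the probability values are invented for illustration and `perplexity` is a hypothetical helper.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that spreads its belief evenly over 4 candidates at every step.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that is usually confident about the correct token.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
print(uniform, confident)
```

A perplexity near 50 can therefore be read as the model being, on average, about as uncertain as a uniform choice among roughly 50 tokens; lower is better.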

Then share your model to the Hub with the push_to_hub() method so everyone can use your model:

>>> trainer.push_to_hub()
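TensorFlow

If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial.

To finetune a model in TensorFlow, start by setting up an optimizer function, a learning rate schedule, and some training hyperparameters:

>>> from transformers import create_optimizer, AdamWeightDecay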

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then you can load DistilGPT2 with TFAutoModelForCausalLM:

>>> from transformers import TFAutoModelForCausalLM

>>> model = TFAutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

>>> tf_train_set = model.prepare_tf_dataset(
...     lm_dataset["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = model.prepare_tf_dataset(
...     lm_dataset["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

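Configure the model for training with `compile`. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: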

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument!

To upload the model to the Hub during training, specify where to push your model and tokenizer in a PushToHubCallback:

>>> from transformers.keras_callbacks import PushToHubCallback

>>> callback = PushToHubCallback(
...     output_dir="my_awesome_eli5_clm-model",
...     tokenizer=tokenizer,
... )

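Finally, you're ready to start training your model! Call `fit` with your training and validation datasets, the number of epochs, and your callback to finetune the model: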

>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])

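Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to finetune a model for causal language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.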

Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with a prompt you'd like to generate text from:

>>> prompt = "Somatic hypermutation allows the immune system to"

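The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a `pipeline` for text generation with your model, and pass your text to it: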

>>> from transformers import pipeline

>>> generator = pipeline("text-generation", model="username/my_awesome_eli5_clm-model")
>>> generator(prompt)
[{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}]
PyTorch

Tokenize the text and return the `input_ids` as PyTorch tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="pt").input_ids

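Use the generate() method to generate the text. For more details about the different text generation strategies and the parameters that control generation, check out the Text generation strategies page.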

>>> from transformers import AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("username/my_awesome_eli5_clm-model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

Decode the generated token IDs back into text:

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. A study on how the immune system"]
TensorFlow

Tokenize the text and return the `input_ids` as TensorFlow tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="tf").input_ids

Use the generate() method to generate the text. For more details about the different text generation strategies and the parameters that control generation, check out the Text generation strategies page.

>>> from transformers import TFAutoModelForCausalLM

>>> model = TFAutoModelForCausalLM.from_pretrained("username/my_awesome_eli5_clm-model")
>>> outputs = model.generate(input_ids=inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

Decode the generated token IDs back into text:

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Somatic hypermutation allows the immune system to detect the presence of other viruses as they become more prevalent. Therefore, researchers have identified a high proportion of human viruses. The proportion of virus-associated viruses in our study increases with age. Therefore, we propose a simple algorithm to detect the presence of these new viruses in our samples as a sign of improved immunity. A first study based on this algorithm, which will be published in Science on Friday, aims to show that this finding could translate into the development of a better vaccine that is more effective for']
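
Both `generate` calls above pass `do_sample=True`, `top_k=50`, and `top_p=0.95`. The sketch below illustrates the idea behind those two filters on a made-up next-token distribution: keep the `top_k` most likely tokens, then keep the smallest set of them whose cumulative probability reaches `top_p`, and sample from what is left. It is a simplified picture under those assumptions, not the library's implementation, which operates on logits tensors; all names and probabilities here are hypothetical.

```python
import random


def renormalize(pairs):
    """Rescale (token, prob) pairs so the probabilities sum to 1."""
    total = sum(p for _, p in pairs)
    return [(token, p / total) for token, p in pairs]


def top_k_top_p_filter(probs, top_k, top_p):
    # Top-k: keep only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    ranked = renormalize(ranked)
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return dict(renormalize(kept))


def sample(probs, rng):
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


next_token_probs = {"adapt": 0.5, "react": 0.3, "detect": 0.15, "sleep": 0.05}
filtered = top_k_top_p_filter(next_token_probs, top_k=3, top_p=0.75)
print(filtered)
print(sample(filtered, random.Random(0)))
```

Smaller values of `top_k` or `top_p` make the output more conservative, while larger values allow more varied continuations.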