合成数据集生成技术：Self-Instruct

社区文章发布于 2024年5月15日

这篇文章是关于合成数据生成技术系列的一部分。您可能还想查看 Awesome Synthetic (text) datasets，我将在此收集这些文章。

为了训练大型语言模型（LLM）更好地遵循指令或作为聊天模型运行，您通常需要一个包含指令和响应组合的数据集。由于手动创建这些数据可能非常耗时，因此越来越多的人使用 LLM 来生成这些数据。

最简单地，您可以使用大型语言模型（LLM）生成对手写提示/指令的响应，从而创建合成指令遵循数据集。然而，对于许多应用程序而言，您可能希望在最终数据集中包含大量的提示。在确保多样性的同时手动创建所有这些数据将是具有挑战性的。有多种方法可以尝试消除这个瓶颈。

在本博客文章中，我将讨论论文 SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions 中概述的技术，正如论文标题所示，该技术旨在克服手动生成指令的需求。

Self-Instruct 是一个框架，旨在帮助语言模型提高遵循自然语言指令的能力。它通过利用模型自身的生成来创建大量指令数据。通过 Self-Instruct，无需依赖大量的手动标注即可提升语言模型的指令遵循能力。来源

该论文概述了一个流程，从一个初始的指令种子数据集开始，逐步扩展到一个更大的合成生成指令数据集。

作者在论文中包含了生成指令的步骤以及用于清理数据的过滤步骤。由于我们的目标是关注特定论文的核心技术，因此我们只关注指令生成部分。此步骤还可以与自论文发布以来引入的其他数据过滤方法（或您自己的自定义过滤器）结合使用。

指令生成

回到我们最初的挑战：如何在不手动编写所有指令的情况下生成它们？如上图所示，这些步骤包括从原始种子中采样，过滤种子任务以查看它们是否是分类任务，然后生成新的指令。新指令生成后，它们会被过滤并添加到任务池中。通过这种方式，您可以不断地从初始种子任务中创建新指令并扩大种子任务池。使用数据过滤步骤旨在确保提示仍具有多样性，并避免向数据集中添加高度重复的指令。

这在实践中是什么样子的？

让我们看一个来自175个初始种子任务数据集的例子。

{"id": "seed_task_0",
"name": "breakfast_suggestion",
"instruction": "Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?",
"instances": [{"input": "", "output": "Yes, you can have 1 oatmeal banana protein shake and 4 strips of bacon. The oatmeal banana protein shake may contain 1/2 cup oatmeal, 60 grams whey protein powder, 1/2 medium banana, 1tbsp flaxseed oil and 1/2 cup watter, totalling about 550 calories. The 4 strips of bacon contains about 200 calories."}],
"is_classification": false}

如您所见，此行包含相当标准的指令，例如“有什么……”，一些响应（即实例字段），以及一个指示是否为分类任务的标签。论文概述了两种从这些数据生成新指令的主要方法。如果指令是分类任务，则使用一种提示方法；如果它是标准生成任务，则使用另一种提示。让我们从额外的非分类任务指令提示如何呈现开始。

> Come up with a series of tasks:
> 
> Task 1: {instruction for existing task 1}
> Task 2: {instruction for existing task 2}
> Task 3: {instruction for existing task 3}
> Task 4: {instruction for existing task 4}
> Task 5: {instruction for existing task 5}
> Task 6: {instruction for existing task 6}
> Task 7: {instruction for existing task 7}
> Task 8: {instruction for existing task 8}
> Task 9:

如您所见，该提示为 LLM 提供了一些任务示例，并鼓励模型生成新的指令。需要注意的一个关键细节是：在原始论文中，作者使用的是 GPT3，而不是经过指令微调/聊天模型。由于这不是一个经过指令微调的模型，因此以结构化格式提供少量示例的提示通常能更好地引导模型生成有用的内容。

我们可以看看这个过程在实践中是什么样子的（使用 huggingface_hub 和 BigScience Bloom 模型代替 GPT-3）

from huggingface_hub import InferenceClient

client = InferenceClient('bigscience/bloom')

def encode_prompt(prompt_instructions, classification=False):
    """Encode multiple prompt instructions into a single string."""
    if classification:
        prompt = "Come up with a series of classification tasks. Try to specify the possible output labels when possible.\n"
    else:
        prompt = "Come up with a series of tasks:\n"
    for idx, instruction in enumerate(prompt_instructions):
        instruction = re.sub(r"\s+", " ", instruction).strip().rstrip(":")
        prompt += f"{idx+1}. {instruction}\n"
    prompt += f"{len(prompt_instructions) + 1}."
    return prompt

prompt = encode_prompt(dataset['instruction']) #

对于非分类任务，这会产生一个看起来像这样的提示

Come up with a series of tasks:
1. Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?
2. What is the relation between the given pairs?
3. Generate a one-sentence description for each of the following people.
4. Describe a situation in which the given stereotype can harm you.
5. Generate an appropriate subjective title for the following email
6. How do you answer this question in a job interview?
7. Brainstorm a list of possible New Year's resolutions.
8. Explain the following idiom to me, and try to give me some examples.
9.

然后我们可以将此提示传递给 LLM。

client.text_generation(prompt, return_full_text=False, temperature=0.7, max_new_tokens=512)
>>>  Think of a time when you were incredibly confident, and explain why.\n10. What is the difference between a real and normal friend?

我们可以看到 LLM 会响应新的指令（我们还会从 LLM 获得一些额外的文本）。如果想在实践中使用此功能，我们可以做更多工作来优化生成参数（温度等）。

文本分类任务的过程和提示略有不同。为了避免 LLM 仅仅返回标签标记，他们将标签放在前面，然后显示生成该标签的文本，例如这样：

Instruction: Find out if the given text is positive about the company discussed.
Class Label: Positive
Input: Hugging Face is a wonderful platform for machine learning developers.

注意：您可以在此处找到包含这些示例的笔记本。

使用 Self-Instruct

这篇论文在学术研究（超过 1,000 次引用）和社区对该方法的实际采用方面都产生了非常大的影响（您可以在此处找到一些引用该方法的日期集）。

Self Instruct 方法有几种实现

官方 GitHub 仓库：https://github.com/yizhongw/self-instruct
Distilabel 实现
airoboros：Self Instruct 的修改版本。

在实践中，这种方法的大多数使用已经不再严格遵循论文中概述的提示/方法。由于该论文发表以来，开源和闭源的指令遵循模型的质量已显著提高，因此使用它更直接地提示模型生成新指令通常更有意义。

尽管论文中概述的具体方法经常被调整，但该论文对于更好地理解如何进行合成数据生成仍然很有帮助。

社区

通过拖放到文本输入框、粘贴或点击此处上传图片、音频和视频。

点击或粘贴此处以上传图片

· 注册或登录发表评论