Generate a Preference Dataset with distilabel


Authors: David Berenstein, Sara Han Díaz

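In this tutorial, we will use distilabel to generate a synthetic preference dataset for DPO, ORPO, or RLHF. distilabel is a synthetic data and AI feedback framework for engineers who need fast, reliable, and scalable pipelines based on verified research papers. See the distilabel documentation for details.

To generate the responses and evaluate them, we will use the serverless HF Inference API, which is integrated with distilabel. It is free but rate-limited, and it lets you test and evaluate over 150,000 public models, or your own private models, through simple HTTP requests, with fast inference hosted on Hugging Face shared infrastructure. If you need more compute, you can deploy your own inference endpoint with Hugging Face Inference Endpoints.

Finally, to curate the data further, we will use Argilla, which lets us provide human feedback on data quality. Argilla is a collaboration tool for AI engineers and domain experts who need to build high-quality datasets for their projects. See the Argilla documentation for details.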

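Getting started

Install the dependencies

To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will use the **free but rate-limited Hugging Face serverless Inference API**, so it has to be installed as an extra distilabel dependency. You can install everything by running the following commands: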

!pip install "distilabel[hf-inference-endpoints]"
!pip install "transformers~=4.0" "torch~=2.0"

Let's import the required libraries:

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    LoadDataFromHub,
    GroupColumns,
    FormatTextGenerationDPO,
    PreferenceToArgilla,
)
from distilabel.steps.tasks import TextGeneration, UltraFeedback

You need an HF_TOKEN to use the HF Inference Endpoints. Log in to use it directly within this notebook.

import os
from huggingface_hub import login

login(token=os.getenv("HF_TOKEN"), add_to_git_credential=True)
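
If HF_TOKEN is not set, os.getenv returns None and the problem only surfaces later in the notebook. As a minimal sketch, a small guard can fail early with a readable message. The helper name require_hf_token is hypothetical and not part of huggingface_hub.

```python
import os


def require_hf_token(env=os.environ):
    """Return HF_TOKEN from `env`, or raise if it is missing or empty.

    Hypothetical helper for illustration; not part of huggingface_hub.
    """
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before running the notebook.")
    return token


# Possible usage with the login call shown above:
# login(token=require_hf_token(), add_to_git_credential=True)
```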

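(Optional) Deploy Argilla

You can skip this step or replace it with any other data evaluation tool, but model quality suffers when data quality is poor, so we recommend looking at your data. If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla by following the Argilla deployment guide.

You also need to install Argilla as an extra distilabel dependency.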

!pip install "distilabel[argilla, hf-inference-endpoints]"

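Define the pipeline

To generate our preference dataset, we need to define a Pipeline with all the necessary steps. Below, we go through each step in detail.

Load the dataset

We will use the argilla/10Kprompts-mini dataset from the Hugging Face Hub as source data.

  • Component: LoadDataFromHub
  • Input columns: instruction and topic, the same as in the loaded dataset
  • Output columns: instruction and topic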
load_dataset = LoadDataFromHub(
    repo_id="argilla/10Kprompts-mini",
    num_examples=1,
    pipeline=Pipeline(name="showcase-pipeline"),
)
load_dataset.load()
next(load_dataset.process())
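
The call to next() above works because a step's process method is a generator that yields data in batches. The plain-Python sketch below mimics that behavior for a generator step, assuming each yielded item is a (batch, is_last_batch) pair. The function generate_batches and the sample rows are illustrative placeholders, not distilabel's implementation or real dataset content.

```python
def generate_batches(rows, batch_size=2):
    """Toy generator step: yield (batch, is_last_batch) pairs of row dicts."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        is_last_batch = start + batch_size >= len(rows)
        yield batch, is_last_batch


# Placeholder rows with the same column names as the dataset above.
sample_rows = [
    {"instruction": f"placeholder instruction {i}", "topic": "placeholder topic"}
    for i in range(3)
]

# next() pulls only the first batch, just like next(load_dataset.process()).
first_batch, is_last = next(generate_batches(sample_rows))
print(len(first_batch), is_last)
```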

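Generate responses

We need to generate responses for the given instructions. We will use two different models that are available on the Hugging Face Hub through the serverless Inference API: meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mixtral-8x7B-Instruct-v0.1. We will also specify the generation parameters for each model.

  • Component: TextGeneration task with LLMs using InferenceEndpointsLLM
  • Input columns: instruction
  • Output columns: generation, distilabel_metadata, and model_name for each model

To fit your use case and improve the results, you can use any other LLM of your choice.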

>>> generate_responses = [
...     TextGeneration(
...         llm=InferenceEndpointsLLM(
...             model_id="meta-llama/Meta-Llama-3-8B-Instruct",
...             tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
...             generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
...         ),
...         pipeline=Pipeline(name="showcase-pipeline"),
...     ),
...     TextGeneration(
...         llm=InferenceEndpointsLLM(
...             model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
...             tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
...             generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
...         ),
...         pipeline=Pipeline(name="showcase-pipeline"),
...     ),
... ]
>>> for task in generate_responses:
...     task.load()
...     print(next(task.process([{"instruction": "Which are the top cities in Spain?"}])))
[{'instruction': 'Which are the top cities in Spain?', 'generation': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\n\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\n10. 
**San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\n\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.', 'distilabel_metadata': {'raw_output_text_generation_0': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\n\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\n10. 
**San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\n\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.'}, 'model_name': 'meta-llama/Meta-Llama-3-8B-Instruct'}]
[{'instruction': 'Which are the top cities in Spain?', 'generation': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\n\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\n\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\n\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\n\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\n\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\n\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\n\n7. 
Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\n\nThese are just a few of the many wonderful cities in Spain.', 'distilabel_metadata': {'raw_output_text_generation_0': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\n\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\n\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\n\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\n\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\n\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\n\n6. 
Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\n\n7. Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\n\nThese are just a few of the many wonderful cities in Spain.'}, 'model_name': 'mistralai/Mixtral-8x7B-Instruct-v0.1'}]

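Group the responses

The task that evaluates the responses needs a list of generations as input. However, each model's response was saved in the generation column of the subsets text_generation_0 and text_generation_1. We will combine these two columns into a single column in the default subset.

  • Component: GroupColumns
  • Input columns: generation and model_name from text_generation_0 and text_generation_1
  • Output columns: generations and model_names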
group_responses = GroupColumns(
    columns=["generation", "model_name"],
    output_columns=["generations", "model_names"],
    pipeline=Pipeline(name="showcase-pipeline"),
)
next(
    group_responses.process(
        [
            {
                "generation": "Madrid",
                "model_name": "meta-llama/Meta-Llama-3-8B-Instruct",
            },
        ],
        [
            {
                "generation": "Barcelona",
                "model_name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
            }
        ],
    )
)
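
To make the grouping concrete, here is a plain-Python sketch of the same idea: rows at the same position in each input batch are merged, and the values of each grouped column are collected into a list under the new column name. The function group_columns is a simplified stand-in written for this tutorial, not distilabel's implementation.

```python
def group_columns(columns, output_columns, *batches):
    """Merge parallel batches row by row, collecting `columns` into lists."""
    grouped = []
    for rows in zip(*batches):  # rows at the same position in every batch
        merged = {}
        # Keep the columns that are not being grouped.
        for row in rows:
            merged.update({k: v for k, v in row.items() if k not in columns})
        # Collect each grouped column into a list under its output name.
        for column, output in zip(columns, output_columns):
            merged[output] = [row.get(column) for row in rows]
        grouped.append(merged)
    return grouped


result = group_columns(
    ["generation", "model_name"],
    ["generations", "model_names"],
    [{"generation": "Madrid", "model_name": "meta-llama/Meta-Llama-3-8B-Instruct"}],
    [{"generation": "Barcelona", "model_name": "mistralai/Mixtral-8x7B-Instruct-v0.1"}],
)
print(result)
```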

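Evaluate the responses

To build our preference dataset, we need to evaluate the responses generated by the models. We will use meta-llama/Meta-Llama-3-70B-Instruct for this, applying the UltraFeedback task, which judges the responses along different dimensions (helpfulness, honesty, instruction-following, truthfulness).

  • Component: UltraFeedback task with LLMs using InferenceEndpointsLLM
  • Input columns: instruction and generations
  • Output columns: ratings, rationales, distilabel_metadata, and model_name

To fit your use case and improve the results, you can use any other LLM of your choice.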

evaluate_responses = UltraFeedback(
    aspect="overall-rating",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
    ),
    pipeline=Pipeline(name="showcase-pipeline"),
)
evaluate_responses.load()
next(
    evaluate_responses.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
            }
        ]
    )
)
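
A rated row is only useful as a preference example if the ratings actually separate the responses. The sketch below shows one possible check, assuming that the ratings column is a list with one entry per generation and that a rating may occasionally be missing (represented here as None). The helper has_clear_preference is hypothetical and not part of distilabel.

```python
def has_clear_preference(row):
    """Return True if `row` has at least two ratings, none missing, and no tie."""
    ratings = row.get("ratings") or []
    if len(ratings) < 2 or any(rating is None for rating in ratings):
        return False
    # At least one response must be rated strictly higher than another.
    return max(ratings) > min(ratings)


print(has_clear_preference({"ratings": [5, 1]}))  # ratings differ
print(has_clear_preference({"ratings": [3, 3]}))  # tie, no preference
```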

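Convert to a preference dataset

  • You can automatically convert it to a preference dataset with chosen and rejected columns.
    • Component: FormatTextGenerationDPO step
    • Input columns: instruction, generations, generation_models, and ratings
    • Output columns: prompt, prompt_id, chosen, chosen_model, chosen_rating, rejected, rejected_model, and rejected_rating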
format_dpo = FormatTextGenerationDPO(pipeline=Pipeline(name="showcase-pipeline"))
format_dpo.load()
next(
    format_dpo.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
                "generation_models": [
                    "Meta-Llama-3-8B-Instruct",
                    "Mixtral-8x7B-Instruct-v0.1",
                ],
                "ratings": [5, 1],
            }
        ]
    )
)
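  • Or you can use Argilla to manually label the data and convert it to a preference dataset.
    • Component: PreferenceToArgilla step
    • Input columns: instruction, generations, generation_models, and ratings
    • Output columns: instruction, generations, generation_models, and ratings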
to_argilla = PreferenceToArgilla(
    dataset_name="preference-dataset",
    dataset_workspace="argilla",
    api_url="https://[your-owner-name]-[your-space-name].hf.space",
    api_key="[your-api-key]",
    num_generations=2,
)
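
For the automatic conversion, the core idea can be sketched in plain Python: the highest-rated generation becomes chosen and a lower-rated one becomes rejected. This is a simplified illustration under stated assumptions, not the FormatTextGenerationDPO implementation. It always takes the lowest-rated response as rejected, returns plain strings rather than chat-formatted messages, and omits prompt_id. The function name to_preference_pair is hypothetical.

```python
def to_preference_pair(row):
    """Build a chosen/rejected pair from a rated row (simplified sketch)."""
    ratings = row["ratings"]
    indices = range(len(ratings))
    best = max(indices, key=ratings.__getitem__)   # index of the highest rating
    worst = min(indices, key=ratings.__getitem__)  # index of the lowest rating
    return {
        "prompt": row["instruction"],
        "chosen": row["generations"][best],
        "chosen_model": row["generation_models"][best],
        "chosen_rating": ratings[best],
        "rejected": row["generations"][worst],
        "rejected_model": row["generation_models"][worst],
        "rejected_rating": ratings[worst],
    }


pair = to_preference_pair(
    {
        "instruction": "What's the capital of Spain?",
        "generations": ["Madrid", "Barcelona"],
        "generation_models": ["Meta-Llama-3-8B-Instruct", "Mixtral-8x7B-Instruct-v0.1"],
        "ratings": [5, 1],
    }
)
print(pair["chosen"], pair["rejected"])
```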

Run the pipeline

Below, you can see the full pipeline definition:

with Pipeline(name="generate-dataset") as pipeline:

    load_dataset = LoadDataFromHub(repo_id="argilla/10Kprompts-mini")

    generate_responses = [
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-8B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
    ]

    group_responses = GroupColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    evaluate_responses = UltraFeedback(
        aspect="overall-rating",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
    )

    format_dpo = FormatTextGenerationDPO()

    to_argilla = PreferenceToArgilla(
        dataset_name="preference-dataset",
        dataset_workspace="argilla",
        api_url="https://[your-owner-name]-[your-space-name].hf.space",
        api_key="[your-api-key]",
        num_generations=2,
    )

    for task in generate_responses:
        load_dataset.connect(task)
        task.connect(group_responses)
    group_responses.connect(evaluate_responses)
    evaluate_responses.connect(format_dpo, to_argilla)
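
The connect calls above describe a directed acyclic graph: the loader fans out to both generation tasks, their outputs are grouped and evaluated, and the evaluation fans out to the two final steps. As a rough sketch, the same structure can be written as a dependency map and ordered with the standard library. The step labels below reuse the variable names and subset names from this tutorial for illustration and are not read from the Pipeline object.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key maps to the set of steps that must run before it.
dependencies = {
    "text_generation_0": {"load_dataset"},
    "text_generation_1": {"load_dataset"},
    "group_responses": {"text_generation_0", "text_generation_1"},
    "evaluate_responses": {"group_responses"},
    "format_dpo": {"evaluate_responses"},
    "to_argilla": {"evaluate_responses"},
}

# One valid execution order; steps with no mutual dependency may appear in either order.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```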

Now let's run the pipeline and generate the preference dataset.

distiset = pipeline.run()

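Let's check the preference dataset! If you have loaded the data into Argilla, you can start annotating in the Argilla UI.

You can push the dataset to the Hub to share it with the community, and embed it to explore the data.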

distiset.push_to_hub("[your-owner-name]/example-preference-dataset")

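Conclusion

In this tutorial, we walked through the detailed steps of building a pipeline that generates a preference dataset with distilabel. You can customize this pipeline for your own use case, share your datasets with the community through the Hugging Face Hub, or use them to train a model with DPO or ORPO.

We started from a dataset of prompts and generated responses with two different models through the serverless Hugging Face Inference API. Next, we evaluated those responses with a third model, following the UltraFeedback criteria. Finally, we converted the data to a preference dataset and used Argilla for further curation.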
