Generate a preference dataset with distilabel

Authors: David Berenstein, Sara Han Díaz

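In this tutorial, we will use distilabel to generate a synthetic preference dataset for DPO, ORPO, or RLHF. distilabel is a framework for synthetic data and AI feedback, built for engineers who need fast, reliable, and scalable pipelines based on verified research papers. See the distilabel documentation for details.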

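To generate the responses and evaluate them, we will use the serverless HF Inference API, which is integrated with distilabel. The service is free but rate-limited. It lets you test and evaluate over 150,000 public models, or your own private models, through simple HTTP requests, with fast inference hosted on Hugging Face's shared infrastructure. If you need more compute, you can deploy your own endpoint with Hugging Face Inference Endpoints.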

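Finally, to curate the data further, we will use Argilla, which lets us provide human feedback on data quality. Argilla is a collaboration tool for AI engineers and domain experts who need to build high-quality datasets for their projects. See the Argilla documentation for details.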

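Getting started

Install dependencies

To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. Because we will use the free but rate-limited Hugging Face serverless Inference API, it has to be installed as a distilabel extra. Run the following commands to install everything: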

!pip install "distilabel[hf-inference-endpoints]"
!pip install "transformers~=4.0" "torch~=2.0"

Let's make the required imports:

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    LoadDataFromHub,
    GroupColumns,
    FormatTextGenerationDPO,
    PreferenceToArgilla,
)
from distilabel.steps.tasks import TextGeneration, UltraFeedback

You need an HF_TOKEN to use HF Inference Endpoints. Log in so you can use it directly within this notebook.

import os
from huggingface_hub import login

login(token=os.getenv("HF_TOKEN"), add_to_git_credential=True)
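
The login call above reads the token with os.getenv, which returns None when HF_TOKEN is not set; in that case login falls back to prompting for a token interactively. The following is a minimal sketch, not part of the original notebook, of a helper that makes the missing-token case explicit before logging in. The name resolve_hf_token is hypothetical.

```python
import os


def resolve_hf_token(env=None):
    """Return the HF_TOKEN value, or None when it is missing or blank.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if resolve_hf_token() is None:
    # Hypothetical handling: warn early instead of failing later at inference time.
    print("HF_TOKEN is not set; login() will ask for a token interactively.")
```

You could then pass the returned value to login(token=...) in place of the direct os.getenv call.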

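(Optional) Deploy Argilla

You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from poor-quality data, so we do recommend looking at your data. If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla by following the Argilla deployment guide.

You also need to install Argilla as a distilabel extra.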

!pip install "distilabel[argilla, hf-inference-endpoints]"

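Define the pipeline

To generate our preference dataset, we need to define a Pipeline with all the necessary steps. Below, we go over each step in detail.

Load the dataset

We will use the argilla/10Kprompts-mini dataset from the Hugging Face Hub as source data.

  • Component: LoadDataFromHub
  • Input columns: instruction and topic, the same as in the loaded dataset
  • Output columns: instruction and topic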
load_dataset = LoadDataFromHub(
    repo_id="argilla/10Kprompts-mini",
    num_examples=1,
    pipeline=Pipeline(name="showcase-pipeline"),
)
load_dataset.load()
next(load_dataset.process())

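Generate responses

We need to generate responses for the given instructions. We will use two different models available on the Hugging Face Hub through the serverless Inference API: meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mixtral-8x7B-Instruct-v0.1. We will also set the generation parameters for each model.

  • Component: TextGeneration task with LLMs using InferenceEndpointsLLM
  • Input columns: instruction
  • Output columns: generation, distilabel_metadata, and model_name for each model

To fit your use case and improve the results, you can use any other LLM of your choice.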

>>> generate_responses = [
...     TextGeneration(
...         llm=InferenceEndpointsLLM(
...             model_id="meta-llama/Meta-Llama-3-8B-Instruct",
...             tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
...             generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
...         ),
...         pipeline=Pipeline(name="showcase-pipeline"),
...     ),
...     TextGeneration(
...         llm=InferenceEndpointsLLM(
...             model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
...             tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
...             generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
...         ),
...         pipeline=Pipeline(name="showcase-pipeline"),
...     ),
... ]
>>> for task in generate_responses:
...     task.load()
...     print(next(task.process([{"instruction": "Which are the top cities in Spain?"}])))
[{'instruction': 'Which are the top cities in Spain?', 'generation': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\n\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\n10. 
**San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\n\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.', 'distilabel_metadata': {'raw_output_text_generation_0': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\n\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\n10. 
**San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\n\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.'}, 'model_name': 'meta-llama/Meta-Llama-3-8B-Instruct'}]
[{'instruction': 'Which are the top cities in Spain?', 'generation': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\n\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\n\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\n\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\n\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\n\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\n\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\n\n7. 
Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\n\nThese are just a few of the many wonderful cities in Spain.', 'distilabel_metadata': {'raw_output_text_generation_0': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\n\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\n\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\n\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\n\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\n\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\n\n6. 
Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\n\n7. Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\n\nThese are just a few of the many wonderful cities in Spain.'}, 'model_name': 'mistralai/Mixtral-8x7B-Instruct-v0.1'}]

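Group the responses

The task that evaluates the responses needs a list of generations as input. However, each model's response was saved in the generation column of the subsets text_generation_0 and text_generation_1. We will combine these two columns into a single column in the default subset.

  • Component: GroupColumns
  • Input columns: generation and model_name from text_generation_0 and text_generation_1
  • Output columns: generations and model_names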
group_responses = GroupColumns(
    columns=["generation", "model_name"],
    output_columns=["generations", "model_names"],
    pipeline=Pipeline(name="showcase-pipeline"),
)
next(
    group_responses.process(
        [
            {
                "generation": "Madrid",
                "model_name": "meta-llama/Meta-Llama-3-8B-Instruct",
            },
        ],
        [
            {
                "generation": "Barcelona",
                "model_name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
            }
        ],
    )
)
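
To make the grouping step more concrete, here is a simplified, dependency-free sketch of the idea: rows at the same position in each branch are merged, and the selected columns are collected into list-valued output columns. This is an illustration only, not distilabel's implementation; the function name group_columns and the exact handling of the remaining columns are assumptions.

```python
def group_columns(branches, columns, output_columns):
    """Merge same-position rows from several branches into list-valued columns."""
    grouped = []
    for rows in zip(*branches):
        merged = {}
        # Keep columns that are not being grouped (e.g. the shared instruction).
        for row in rows:
            merged.update({k: v for k, v in row.items() if k not in columns})
        # Collect each grouped column into a list, one entry per branch.
        for column, output_column in zip(columns, output_columns):
            merged[output_column] = [row[column] for row in rows]
        grouped.append(merged)
    return grouped


llama_rows = [
    {
        "instruction": "What's the capital of Spain?",
        "generation": "Madrid",
        "model_name": "meta-llama/Meta-Llama-3-8B-Instruct",
    }
]
mixtral_rows = [
    {
        "instruction": "What's the capital of Spain?",
        "generation": "Barcelona",
        "model_name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    }
]

result = group_columns(
    [llama_rows, mixtral_rows],
    columns=["generation", "model_name"],
    output_columns=["generations", "model_names"],
)
print(result[0])
```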

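Evaluate the responses

To build our preference dataset, we need to evaluate the responses generated by the models. We will use meta-llama/Meta-Llama-3-70B-Instruct for this, applying the UltraFeedback task, which judges the responses along different dimensions (helpfulness, honesty, instruction-following, truthfulness).

  • Component: UltraFeedback task with LLMs using InferenceEndpointsLLM
  • Input columns: instruction, generations
  • Output columns: ratings, rationales, distilabel_metadata, model_name

To fit your use case and improve the results, you can use any other LLM of your choice.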

evaluate_responses = UltraFeedback(
    aspect="overall-rating",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
    ),
    pipeline=Pipeline(name="showcase-pipeline"),
)
evaluate_responses.load()
next(
    evaluate_responses.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
            }
        ]
    )
)
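
A preference pair only carries signal when the two responses received different ratings. The sketch below is a hypothetical post-processing check, not something the original notebook does: it assumes that a rating may be missing (None) when the judge's output cannot be parsed, and treats ties as unusable. The name has_usable_ratings is made up for illustration.

```python
def has_usable_ratings(row):
    """Return True when a row has at least two ratings, none missing, and not all equal."""
    ratings = row.get("ratings") or []
    if len(ratings) < 2 or any(rating is None for rating in ratings):
        return False
    # A tie gives no basis for picking a preferred response.
    return max(ratings) != min(ratings)


rows = [
    {"generations": ["Madrid", "Barcelona"], "ratings": [5, 1]},
    {"generations": ["Madrid", "Madrid."], "ratings": [5, 5]},
    {"generations": ["Madrid", "Barcelona"], "ratings": [None, 4]},
]
usable = [row for row in rows if has_usable_ratings(row)]
print(len(usable))
```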

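Convert to a preference dataset

  • You can automatically convert it into a preference dataset with chosen and rejected columns.
    • Component: FormatTextGenerationDPO step
    • Input columns: instruction, generations, generation_models, ratings
    • Output columns: prompt, prompt_id, chosen, chosen_model, chosen_rating, rejected, rejected_model, rejected_rating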
format_dpo = FormatTextGenerationDPO(pipeline=Pipeline(name="showcase-pipeline"))
format_dpo.load()
next(
    format_dpo.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
                "generation_models": [
                    "Meta-Llama-3-8B-Instruct",
                    "Mixtral-8x7B-Instruct-v0.1",
                ],
                "ratings": [5, 1],
            }
        ]
    )
)
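  • Or you can use Argilla to manually label the data and convert it into a preference dataset.
    • Component: PreferenceToArgilla step
    • Input columns: instruction, generations, generation_models, ratings
    • Output columns: instruction, generations, generation_models, ratings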
to_argilla = PreferenceToArgilla(
    dataset_name="preference-dataset",
    dataset_workspace="argilla",
    api_url="https://[your-owner-name]-[your-space-name].hf.space",
    api_key="[your-api-key]",
    num_generations=2,
)
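
The automatic conversion into chosen and rejected columns described above can be pictured with a small, dependency-free sketch. It is a simplified illustration, not distilabel's code: it assumes the highest-rated response becomes chosen and the lowest-rated becomes rejected, stores plain strings rather than chat messages, and derives prompt_id as a SHA-256 hash of the prompt. The function name to_preference_row is hypothetical.

```python
import hashlib


def to_preference_row(row):
    """Turn one rated row into a chosen/rejected preference row (simplified)."""
    ratings = row["ratings"]
    generations = row["generations"]
    # Model names are optional; fall back to None placeholders.
    models = row.get("generation_models") or [None] * len(generations)

    # Assumption: best rating wins, worst rating loses.
    chosen_idx = max(range(len(ratings)), key=ratings.__getitem__)
    rejected_idx = min(range(len(ratings)), key=ratings.__getitem__)

    prompt = row["instruction"]
    return {
        "prompt": prompt,
        "prompt_id": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "chosen": generations[chosen_idx],
        "chosen_model": models[chosen_idx],
        "chosen_rating": ratings[chosen_idx],
        "rejected": generations[rejected_idx],
        "rejected_model": models[rejected_idx],
        "rejected_rating": ratings[rejected_idx],
    }


preference = to_preference_row(
    {
        "instruction": "What's the capital of Spain?",
        "generations": ["Madrid", "Barcelona"],
        "generation_models": [
            "Meta-Llama-3-8B-Instruct",
            "Mixtral-8x7B-Instruct-v0.1",
        ],
        "ratings": [5, 1],
    }
)
print(preference["chosen"], preference["rejected"])
```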

Run the pipeline

Below, you can see the full pipeline definition:

with Pipeline(name="generate-dataset") as pipeline:

    load_dataset = LoadDataFromHub(repo_id="argilla/10Kprompts-mini")

    generate_responses = [
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-8B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
    ]

    group_responses = GroupColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    evaluate_responses = UltraFeedback(
        aspect="overall-rating",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
    )

    format_dpo = FormatTextGenerationDPO()

    to_argilla = PreferenceToArgilla(
        dataset_name="preference-dataset",
        dataset_workspace="argilla",
        api_url="https://[your-owner-name]-[your-space-name].hf.space",
        api_key="[your-api-key]",
        num_generations=2,
    )

    for task in generate_responses:
        load_dataset.connect(task)
        task.connect(group_responses)
    group_responses.connect(evaluate_responses)
    evaluate_responses.connect(format_dpo, to_argilla)
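
The connect calls above wire the steps into a directed acyclic graph. As a rough sketch of the resulting topology, using only the Python standard library (Python 3.9+) and not distilabel itself, the dependencies can be written down and sorted so that every step appears after the steps it depends on. The step names below are illustrative labels, not names guaranteed by distilabel.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps it depends on (its predecessors).
dependencies = {
    "text_generation_0": {"load_dataset"},
    "text_generation_1": {"load_dataset"},
    "group_responses": {"text_generation_0", "text_generation_1"},
    "evaluate_responses": {"group_responses"},
    "format_dpo": {"evaluate_responses"},
    "to_argilla": {"evaluate_responses"},
}

# static_order() yields each step only after all of its predecessors.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two generation steps both read from the loader and feed the grouping step, and the evaluation output fans out to the DPO formatter and to Argilla.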

Now let's run the pipeline and generate the preference dataset.

distiset = pipeline.run()

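Let's check the preference dataset! If you have loaded the data into Argilla, you can start annotating in the Argilla UI.

You can push the dataset to the Hub to share it with the community, and embed it to explore the data.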

distiset.push_to_hub("[your-owner-name]/example-preference-dataset")

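Conclusions

In this tutorial, we walked through the detailed steps of building a pipeline that generates a preference dataset with distilabel. You can customize this pipeline for your own use case, share your datasets with the community through the Hugging Face Hub, or use them to train a model with DPO or ORPO.

We used a dataset of prompts to generate responses with two different models through the serverless Hugging Face Inference API. Next, we evaluated those responses with a third model, following the UltraFeedback standards. Finally, we converted the data into a preference dataset and used Argilla for further curation.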
