Generate a preference dataset with distilabel
Authored by: David Berenstein and Sara Han Díaz
- Libraries: argilla, hf-inference-endpoints
- Components: LoadDataFromHub, TextGeneration, UltraFeedback, GroupColumns, FormatTextGenerationDPO, PreferenceToArgilla, InferenceEndpointsLLM
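In this tutorial, we will use distilabel to generate a synthetic preference dataset for DPO, ORPO, or RLHF. distilabel is a synthetic data and AI feedback framework for engineers who need fast, reliable, and scalable pipelines based on verified research papers. See the distilabel documentation for details.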
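To generate the responses and evaluate them, we will use the serverless HF Inference API integrated with distilabel. It is free but rate-limited, and it lets you test and evaluate over 150,000 public models, or your own private models, via simple HTTP requests, with fast inference hosted on Hugging Face shared infrastructure. If you need more compute power, you can deploy your own inference endpoint with Hugging Face Inference Endpoints.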
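Finally, to further curate the data, we will use Argilla, which allows us to provide human feedback on the data quality. Argilla is a collaboration tool for AI engineers and domain experts who need to build high-quality datasets for their projects. See the Argilla documentation for details.
Getting started
Install the dependencies
To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. Because we will use the free but rate-limited Hugging Face serverless Inference API, we need to install it as an extra distilabel dependency. You can install them by running the following commands: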
!pip install "distilabel[hf-inference-endpoints]"
!pip install "transformers~=4.0" "torch~=2.0"
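After installing, it can help to confirm which versions ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("distilabel", "transformers", "torch"):
    print(name, installed_version(name) or "not installed")
```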
Let's make the required imports:
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    LoadDataFromHub,
    GroupColumns,
    FormatTextGenerationDPO,
    PreferenceToArgilla,
)
from distilabel.steps.tasks import TextGeneration, UltraFeedback
You will need an HF_TOKEN to use the HF Inference Endpoints. Log in to use it directly within this notebook.
import os
from huggingface_hub import login
login(token=os.getenv("HF_TOKEN"), add_to_git_credential=True)
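If `HF_TOKEN` is unset, `os.getenv` returns `None` and the problem may only surface later. As a hedged sketch, a small guard can fail early with a clear message; `read_hf_token` is a hypothetical helper, not part of `huggingface_hub`.

```python
import os
from typing import Mapping


def read_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the HF_TOKEN value, raising a clear error if it is unset or empty."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before running the notebook.")
    return token


# Demo with an explicit mapping and a placeholder value instead of the real environment.
print(read_hf_token({"HF_TOKEN": " hf_placeholder "}))
```

The returned value could then be passed as the `token` argument of the `login` call above.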
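(Optional) Deploy Argilla
You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer if the data quality is poor, so we recommend looking at your data. If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla by following the Argilla deployment guide.
Along with that, you will need to install Argilla as a distilabel extra.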
!pip install "distilabel[argilla, hf-inference-endpoints]"
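Define the pipeline
To generate our preference dataset, we will need to define a Pipeline with all the necessary steps. Below, we will go over each step in detail.
Load the dataset
We will use the argilla/10Kprompts-mini dataset from the Hugging Face Hub as source data.
- Component: LoadDataFromHub
- Input columns: instruction and topic, the same as in the loaded dataset
- Output columns: instruction and topic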
load_dataset = LoadDataFromHub(
    repo_id="argilla/10Kprompts-mini",
    num_examples=1,
    pipeline=Pipeline(name="showcase-pipeline"),
)
load_dataset.load()
next(load_dataset.process())
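Throughout the tutorial, data moves between steps as batches of rows, where each row is a dictionary keyed by column name. The sketch below mimics that general shape with a plain generator that yields a batch together with a flag marking the last batch. It is an illustration under that assumption, not distilabel's implementation, and the `topic` values are made up.

```python
from typing import Any, Dict, Iterator, List, Tuple

Row = Dict[str, Any]


def iter_batches(rows: List[Row], batch_size: int) -> Iterator[Tuple[List[Row], bool]]:
    """Yield (batch, is_last_batch) pairs from a list of rows."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start : start + batch_size]
        yield batch, start + batch_size >= len(rows)


# Made-up rows that only mirror the column names used in this tutorial.
rows = [
    {"instruction": "Which are the top cities in Spain?", "topic": "travel"},
    {"instruction": "What's the capital of Spain?", "topic": "geography"},
    {"instruction": "Name a Spanish painter.", "topic": "art"},
]
batches = list(iter_batches(rows, batch_size=2))
for batch, is_last in batches:
    print(len(batch), is_last)
```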
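Generate responses
We need to generate the responses for the given instructions. We will use two different models available on the Hugging Face Hub through the serverless Inference API: meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mixtral-8x7B-Instruct-v0.1. We will also indicate the generation parameters for each model.
- Component: TextGeneration task with LLMs using InferenceEndpointsLLM
- Input columns: instruction
- Output columns: generation, distilabel_metadata, model_name for each model
Depending on your use case and to improve the results, you can use any other LLM of your choice.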
>>> generate_responses = [
...     TextGeneration(
...         llm=InferenceEndpointsLLM(
...             model_id="meta-llama/Meta-Llama-3-8B-Instruct",
...             tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
...             generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
...         ),
...         pipeline=Pipeline(name="showcase-pipeline"),
...     ),
...     TextGeneration(
...         llm=InferenceEndpointsLLM(
...             model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
...             tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
...             generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
...         ),
...         pipeline=Pipeline(name="showcase-pipeline"),
...     ),
... ]
>>> for task in generate_responses:
...     task.load()
...     print(next(task.process([{"instruction": "Which are the top cities in Spain?"}])))
[{'instruction': 'Which are the top cities in Spain?', 'generation': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\n\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\n10. 
**San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\n\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.', 'distilabel_metadata': {'raw_output_text_generation_0': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\n\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\n10. 
**San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\n\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.'}, 'model_name': 'meta-llama/Meta-Llama-3-8B-Instruct'}] [{'instruction': 'Which are the top cities in Spain?', 'generation': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\n\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\n\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\n\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\n\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\n\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\n\n6. 
Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\n\n7. Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\n\nThese are just a few of the many wonderful cities in Spain.', 'distilabel_metadata': {'raw_output_text_generation_0': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\n\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\n\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\n\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\n\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\n\n5. 
Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\n\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\n\n7. Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\n\nThese are just a few of the many wonderful cities in Spain.'}, 'model_name': 'mistralai/Mixtral-8x7B-Instruct-v0.1'}]
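The raw output above is hard to scan because each row carries the full generation. As a small sketch, the helper below prints one short preview line per row; `summarize_generations` is a hypothetical name, and it assumes rows with the `generation` and `model_name` columns shown above.

```python
from typing import Any, Dict, List


def summarize_generations(batch: List[Dict[str, Any]], width: int = 60) -> List[str]:
    """Return one 'model_name: preview' line per row in a task output batch."""
    lines = []
    for row in batch:
        # Collapse newlines and repeated spaces so the preview fits on one line.
        text = " ".join(str(row["generation"]).split())
        preview = text if len(text) <= width else text[: width - 3] + "..."
        lines.append(f"{row['model_name']}: {preview}")
    return lines


# Shortened stand-in for one row of the output shown above.
batch = [
    {
        "instruction": "Which are the top cities in Spain?",
        "generation": "Spain is a country with a rich culture,\n\nhistory, and architecture.",
        "model_name": "meta-llama/Meta-Llama-3-8B-Instruct",
    }
]
for line in summarize_generations(batch):
    print(line)
```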
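Group the responses
The task that evaluates the responses needs a list of generations as input. However, each model response was saved in the generation column of the subsets text_generation_0 and text_generation_1. We will combine these two columns into a single column in the default subset.
- Component: GroupColumns
- Input columns: generation and model_name from text_generation_0 and text_generation_1
- Output columns: generations and model_names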
group_responses = GroupColumns(
    columns=["generation", "model_name"],
    output_columns=["generations", "model_names"],
    pipeline=Pipeline(name="showcase-pipeline"),
)
next(
    group_responses.process(
        [
            {
                "generation": "Madrid",
                "model_name": "meta-llama/Meta-Llama-3-8B-Instruct",
            },
        ],
        [
            {
                "generation": "Barcelona",
                "model_name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
            }
        ],
    )
)
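To make the grouping concrete, here is a simplified pure-Python stand-in for what this step does conceptually: rows at the same position in each branch are merged, and the selected columns are collected into lists. This is a sketch for illustration, not the library's implementation; the function `group_columns` below and its behaviour for non-grouped columns (taken from the first branch) are assumptions.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def group_columns(
    *branches: List[Row], columns: List[str], output_columns: List[str]
) -> List[Row]:
    """Merge aligned rows from several branches, collecting each column into a list."""
    grouped = []
    for rows in zip(*branches):  # rows at the same position belong to the same instruction
        # Keep the non-grouped columns from the first branch.
        merged = {key: value for key, value in rows[0].items() if key not in columns}
        for column, output_column in zip(columns, output_columns):
            merged[output_column] = [row[column] for row in rows]
        grouped.append(merged)
    return grouped


question = "What's the capital of Spain?"
branch_0 = [{"instruction": question, "generation": "Madrid", "model_name": "model-a"}]
branch_1 = [{"instruction": question, "generation": "Barcelona", "model_name": "model-b"}]

result = group_columns(
    branch_0,
    branch_1,
    columns=["generation", "model_name"],
    output_columns=["generations", "model_names"],
)
print(result)
```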
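Evaluate the responses
To build our preference dataset, we need to evaluate the responses generated by the models. We will use meta-llama/Meta-Llama-3-70B-Instruct for this, applying the UltraFeedback task, which judges the responses according to different dimensions (helpfulness, honesty, instruction-following, truthfulness).
- Component: UltraFeedback task with LLMs using InferenceEndpointsLLM
- Input columns: instruction, generations
- Output columns: ratings, rationales, distilabel_metadata, model_name
Depending on your use case and to improve the results, you can use any other LLM of your choice.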
evaluate_responses = UltraFeedback(
    aspect="overall-rating",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
    ),
    pipeline=Pipeline(name="showcase-pipeline"),
)
evaluate_responses.load()
next(
    evaluate_responses.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
            }
        ]
    )
)
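The ratings column is aligned with the generations column, so the two can be zipped to inspect how the judge ranked each response. The sketch below assumes numeric ratings and, as a further assumption, treats a rating of `None` as "the judge output could not be parsed" and sorts it last. The helper name and the example ratings are made up.

```python
from typing import Any, Dict, List, Optional, Tuple


def rank_generations(row: Dict[str, Any]) -> List[Tuple[str, Optional[float]]]:
    """Pair each generation with its rating, sorted best to worst, unrated last."""
    pairs = list(zip(row["generations"], row["ratings"]))
    # Rated pairs sort first (False < True), then by descending rating.
    return sorted(pairs, key=lambda pair: (pair[1] is None, -(pair[1] or 0)))


# Hypothetical judged row; the ratings are illustrative, not real model output.
row = {
    "instruction": "What's the capital of Spain?",
    "generations": ["Barcelona", "Madrid", "Seville"],
    "ratings": [1, 5, None],
}
ranked = rank_generations(row)
print(ranked)
```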
Convert to a preference dataset
- You can automatically convert it to a preference dataset with the chosen and rejected columns.
  - Component: FormatTextGenerationDPO step
  - Input columns: instruction, generations, generation_models, ratings
  - Output columns: prompt, prompt_id, chosen, chosen_model, chosen_rating, rejected, rejected_model, rejected_rating
format_dpo = FormatTextGenerationDPO(pipeline=Pipeline(name="showcase-pipeline"))
format_dpo.load()
next(
    format_dpo.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
                "generation_models": [
                    "Meta-Llama-3-8B-Instruct",
                    "Mixtral-8x7B-Instruct-v0.1",
                ],
                "ratings": [5, 1],
            }
        ]
    )
)
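The idea behind this conversion can be sketched in a few lines: the best-rated generation becomes chosen and a lower-rated one becomes rejected. The version below is a deliberate simplification that always takes the lowest-rated generation as rejected and derives `prompt_id` from a SHA-256 hash of the prompt. Both choices are assumptions for illustration; the actual step may select the rejected response differently and may store chosen and rejected in a chat-message format.

```python
import hashlib
from typing import Any, Dict


def to_preference_pair(row: Dict[str, Any]) -> Dict[str, Any]:
    """Build a chosen/rejected record from one rated row (simplified sketch)."""
    ratings = row["ratings"]  # assumed to be numeric, one per generation
    best = max(range(len(ratings)), key=ratings.__getitem__)
    worst = min(range(len(ratings)), key=ratings.__getitem__)
    if best == worst:
        raise ValueError("need at least two generations with different ratings")
    prompt = row["instruction"]
    return {
        "prompt": prompt,
        # Assumption: a stable identifier derived from the prompt text.
        "prompt_id": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "chosen": row["generations"][best],
        "chosen_rating": ratings[best],
        "rejected": row["generations"][worst],
        "rejected_rating": ratings[worst],
    }


pair = to_preference_pair(
    {
        "instruction": "What's the capital of Spain?",
        "generations": ["Madrid", "Barcelona"],
        "ratings": [5, 1],
    }
)
print(pair["chosen"], pair["rejected"])
```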
- Or you can use Argilla to manually label the data and convert it to a preference dataset.
  - Component: PreferenceToArgilla step
  - Input columns: instruction, generations, generation_models, ratings
  - Output columns: instruction, generations, generation_models, ratings
to_argilla = PreferenceToArgilla(
    dataset_name="preference-dataset",
    dataset_workspace="argilla",
    api_url="https://[your-owner-name]-[your-space-name].hf.space",
    api_key="[your-api-key]",
    num_generations=2,
)
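Rather than pasting the API key into the notebook, the connection settings could be read from the environment. The following is a minimal sketch; the helper `argilla_settings` is hypothetical, and the environment variable names `ARGILLA_API_URL` and `ARGILLA_API_KEY` are assumptions chosen for this example.

```python
import os
from typing import Dict, Mapping


def argilla_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect Argilla connection settings from the environment (names are assumptions)."""
    missing = [name for name in ("ARGILLA_API_URL", "ARGILLA_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "api_url": env["ARGILLA_API_URL"].rstrip("/"),  # drop a trailing slash, if any
        "api_key": env["ARGILLA_API_KEY"],
    }


# Demo with an explicit mapping and placeholder values.
settings = argilla_settings(
    {
        "ARGILLA_API_URL": "https://example-owner-example-space.hf.space/",
        "ARGILLA_API_KEY": "placeholder-key",
    }
)
print(settings["api_url"])
```

The resulting dictionary has the same `api_url` and `api_key` keys as the arguments used above, so it could be unpacked into the step's constructor.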
Run the pipeline
Below, you can see the full pipeline definition:
with Pipeline(name="generate-dataset") as pipeline:
    # Load the prompts from the Hub
    load_dataset = LoadDataFromHub(repo_id="argilla/10Kprompts-mini")
    # Generate one response per model
    generate_responses = [
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-8B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
    ]
    # Collect the responses of both models into list columns
    group_responses = GroupColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )
    # Rate the grouped responses with a third model
    evaluate_responses = UltraFeedback(
        aspect="overall-rating",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
    )
    # Format the rated responses as chosen/rejected pairs and send them to Argilla
    format_dpo = FormatTextGenerationDPO()
    to_argilla = PreferenceToArgilla(
        dataset_name="preference-dataset",
        dataset_workspace="argilla",
        api_url="https://[your-owner-name]-[your-space-name].hf.space",
        api_key="[your-api-key]",
        num_generations=2,
    )
    # Connect the steps: the dataset feeds both generation tasks, which feed the grouping step
    for task in generate_responses:
        load_dataset.connect(task)
        task.connect(group_responses)
    group_responses.connect(evaluate_responses)
    evaluate_responses.connect(format_dpo, to_argilla)
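The `connect` calls above define a directed acyclic graph of steps. As an illustration only, the same dependencies can be written down by hand and ordered with the standard library's `graphlib` (Python 3.9+). The step labels are informal names chosen for this sketch, and this is not how distilabel schedules its steps internally.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of steps it depends on (its predecessors).
dependencies = {
    "text_generation_0": {"load_dataset"},
    "text_generation_1": {"load_dataset"},
    "group_responses": {"text_generation_0", "text_generation_1"},
    "evaluate_responses": {"group_responses"},
    "format_dpo": {"evaluate_responses"},
    "to_argilla": {"evaluate_responses"},
}

# A valid execution order: every step appears after all of its predecessors.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Steps with no dependency between them, such as the two generation tasks or the two final steps, may appear in either relative order.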
Let's now run the pipeline and generate the preference dataset.
distiset = pipeline.run()
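Let's check the preference dataset! If you have loaded the data to Argilla, you can start annotating in the Argilla UI.
You can push the dataset to the Hub to share it with the community and embed it to explore the data.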
distiset.push_to_hub("[your-owner-name]/example-preference-dataset")
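Before using the pairs for training, you might want to keep only records where the judge expressed a clear preference. The sketch below filters plain dictionaries that carry the chosen_rating and rejected_rating columns described earlier; the helper name, the `min_gap` threshold, and the sample records are illustrative assumptions.

```python
from typing import Any, Dict, List


def keep_clear_preferences(
    records: List[Dict[str, Any]], min_gap: float = 1.0
) -> List[Dict[str, Any]]:
    """Keep records whose chosen rating beats the rejected rating by at least min_gap."""
    kept = []
    for record in records:
        chosen = record.get("chosen_rating")
        rejected = record.get("rejected_rating")
        if chosen is None or rejected is None:
            continue  # skip records without usable ratings
        if chosen - rejected >= min_gap:
            kept.append(record)
    return kept


# Made-up records for demonstration.
records = [
    {"prompt": "a", "chosen_rating": 5, "rejected_rating": 1},
    {"prompt": "b", "chosen_rating": 4, "rejected_rating": 4},
    {"prompt": "c", "chosen_rating": None, "rejected_rating": 2},
]
filtered = keep_clear_preferences(records)
print([record["prompt"] for record in filtered])
```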
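Conclusions
In this tutorial, we showcased the detailed steps to build a pipeline for generating a preference dataset using distilabel. You can customize this pipeline for your own use cases and share your datasets with the community through the Hugging Face Hub, or use them to train a model with DPO or ORPO.
We used a dataset containing prompts to generate responses with two different models through the serverless Hugging Face Inference API. Next, we evaluated the responses with a third model, following the UltraFeedback standards. Finally, we converted the data to a preference dataset and used Argilla for further curation.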