Data annotation with Argilla Spaces

Authored by: Moritz Laurer

This notebook demonstrates a workflow for systematically evaluating LLM outputs and creating LLM training data. You can start by using this notebook to evaluate the zero-shot performance of your favorite LLM on your task without any fine-tuning. If you want to improve performance, you can then easily reuse this workflow to create training data.

Example use case: code generation. In this tutorial, we demonstrate how to create high-quality test and train data for a code generation task. The same workflow can, however, be adapted to any other task that is relevant to your specific use case.

In this notebook, we:

  1. Download data for the example task.
  2. Prompt two LLMs to respond to these tasks. This produces "synthetic data" to speed up manual data creation.
  3. Create an Argilla annotation interface on HF Spaces to compare and evaluate the outputs of the two LLMs.
  4. Upload the example data and the zero-shot LLM responses to the Argilla annotation interface.
  5. Download the annotated data.

You can adapt this notebook to your own needs, for example by using a different LLM and API provider in step (2) or by adjusting the annotation task in step (3).

Install required packages and connect to the HF Hub

!pip install argilla~=2.0.0
!pip install transformers~=4.40.0
!pip install datasets~=2.19.0
!pip install huggingface_hub~=0.23.2
# Login to the HF Hub. We recommend using this login method 
# to avoid the need to explicitly store your HF token in variables 
import huggingface_hub
!git config --global credential.helper store
huggingface_hub.login(add_to_git_credential=True)

Download example task data

First, we download an example dataset that contains code generation tasks for LLMs. We want to evaluate how well two different LLMs perform on these code generation tasks. We use instructions from the bigcode/self-oss-instruct-sc2-exec-filter-50k dataset, which was used to train the StarCoder2-Instruct model.

>>> from datasets import load_dataset

>>> # Small sample for faster testing
>>> dataset_codetask = load_dataset("bigcode/self-oss-instruct-sc2-exec-filter-50k", split="train[:3]")
>>> print("Dataset structure:\n", dataset_codetask, "\n")

>>> # We are only interested in the instructions/prompts provided in the dataset
>>> instructions_lst = dataset_codetask["instruction"]
>>> print("Example instructions:\n", instructions_lst[:2])
Dataset structure:
 Dataset({
    features: ['fingerprint', 'sha1', 'seed', 'response', 'concepts', 'prompt', 'instruction', 'id'],
    num_rows: 3
}) 

Example instructions:
 ['Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None.', 'Write a Python function `check_collision` that takes a list of `rectangles` as input and checks if there are any collisions between any two rectangles. A rectangle is represented as a tuple (x, y, w, h) where (x, y) is the top-left corner of the rectangle, `w` is the width, and `h` is the height.\n\nThe function should return True if any pair of rectangles collide, and False otherwise. Use an iterative approach and check for collisions based on the bounding box collision detection algorithm. If a collision is found, return True immediately without checking for more collisions.']

Prompt two LLMs on the example task

Format instructions with a `chat_template`

Before sending the instructions to an LLM API, we need to format them with the correct `chat_template` for each model we want to evaluate. In practice, this means wrapping the instructions in special tokens. See the chat templating documentation for details.

>>> # Apply correct chat formatting to instructions from the dataset
>>> from transformers import AutoTokenizer

>>> models_to_compare = ["mistralai/Mixtral-8x7B-Instruct-v0.1", "meta-llama/Meta-Llama-3-70B-Instruct"]


>>> def format_prompt(prompt, tokenizer):
...     messages = [{"role": "user", "content": prompt}]
...     messages_tokenized = tokenizer.apply_chat_template(
...         messages, tokenize=False, add_generation_prompt=True, return_tensors="pt"
...     )
...     return messages_tokenized


>>> prompts_formatted_dic = {}
>>> for model in models_to_compare:
...     tokenizer = AutoTokenizer.from_pretrained(model)

...     prompt_formatted = []
...     for instruction in instructions_lst:
...         prompt_formatted.append(format_prompt(instruction, tokenizer))

...     prompts_formatted_dic.update({model: prompt_formatted})


>>> print(
...     f"\nFirst prompt formatted for {models_to_compare[0]}:\n\n",
...     prompts_formatted_dic[models_to_compare[0]][0],
...     "\n\n",
... )
>>> print(
...     f"First prompt formatted for {models_to_compare[1]}:\n\n",
...     prompts_formatted_dic[models_to_compare[1]][0],
...     "\n\n",
... )
First prompt formatted for mistralai/Mixtral-8x7B-Instruct-v0.1:

 [INST] Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None. [/INST] 


First prompt formatted for meta-llama/Meta-Llama-3-70B-Instruct:

 <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Send the instructions to the HF Inference API

Now we can send the instructions to the APIs of both LLMs to obtain outputs we can evaluate. We first define some parameters for generating the responses correctly. Hugging Face's LLM APIs are powered by Text Generation Inference (TGI) containers. See the TGI OpenAPI specification here and the explanations of the different parameters in the Transformers generation docs.

generation_params = dict(
    # we use low temperature and top_p to reduce creativity and increase likelihood of highly probable tokens
    temperature=0.2,
    top_p=0.60,
    top_k=None,
    repetition_penalty=1.0,
    do_sample=True,
    max_new_tokens=512 * 2,
    return_full_text=False,
    seed=42,
    # details=True,
    # stop=["<|END_OF_TURN_TOKEN|>"],
    # grammar={"type": "json"}
    max_time=None,
    stream=False,
    use_cache=False,
    wait_for_model=False,
)

Now we can make standard API requests to the Serverless Inference API (docs). Note that the Serverless Inference API is mostly intended for testing and is rate-limited. For testing without rate limits, you can create your own API via HF Dedicated Endpoints (docs). See also the corresponding recipes in our Open-Source AI Cookbook.

The code below will be updated once the recipe on the Inference API is finished.
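
For reference, the same call can also be made through the `InferenceClient` from `huggingface_hub` instead of raw HTTP requests. The snippet below is only a minimal sketch of that alternative (the parameters mirror the TGI parameters defined above); the rest of this notebook sticks to `requests`.

# Alternative sketch (not used below): querying the serverless Inference API via huggingface_hub's InferenceClient
from huggingface_hub import InferenceClient

inference_client = InferenceClient(model=models_to_compare[0])
example_output = inference_client.text_generation(
    prompts_formatted_dic[models_to_compare[0]][0],
    max_new_tokens=1024,
    temperature=0.2,
    top_p=0.60,
    repetition_penalty=1.0,
    do_sample=True,
    seed=42,
    return_full_text=False,
)
print(example_output)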

>>> import requests
>>> from tqdm.auto import tqdm


>>> # Hint: use asynchronous API calls (and dedicated endpoints) to increase speed
>>> def query(payload=None, api_url=None):
...     response = requests.post(api_url, headers=headers, json=payload)
...     return response.json()


>>> headers = {"Authorization": f"Bearer {huggingface_hub.get_token()}"}

>>> output_dic = {}
>>> for model in models_to_compare:
...     # Create API urls for each model
...     # When using dedicated endpoints, you can reuse the same code and simply replace this URL
...     api_url = "https://api-inference.huggingface.co/models/" + model

...     # send requests to API
...     output_lst = []
...     for prompt in tqdm(prompts_formatted_dic[model]):
...         output = query(payload={"inputs": prompt, "parameters": {**generation_params}}, api_url=api_url)
...         output_lst.append(output[0]["generated_text"])

...     output_dic.update({model: output_lst})

>>> print(f"---First generation of {models_to_compare[0]}:\n{output_dic[models_to_compare[0]][0]}\n\n")
>>> print(f"---First generation of {models_to_compare[1]}:\n{output_dic[models_to_compare[1]][0]}")
---First generation of mistralai/Mixtral-8x7B-Instruct-v0.1:
Here's a Python function that meets your requirements:

```python
def get_value(matrix, indices):
    try:
        return matrix[indices[0]][indices[1]]
    except IndexError:
        return None
```

This function takes a matrix (represented by a list of lists) and a tuple of indices as input. It first tries to access the value at the given indices in the matrix. If the indices are out of range, it catches the `IndexError` exception and returns `None`.


---First generation of meta-llama/Meta-Llama-3-70B-Instruct:
Here is a Python function that does what you described:
```
def get_value(matrix, indices):
    try:
        row, col = indices
        return matrix[row][col]
    except IndexError:
        return None
```
Here's an explanation of how the function works:

1. The function takes two arguments: `matrix` (a list of lists) and `indices` (a tuple of two integers, representing the row and column indices).
2. The function tries to access the value at the specified indices using `matrix[row][col]`.
3. If the indices are out of range (i.e., `row` or `col` is greater than the length of the corresponding dimension of the matrix), an `IndexError` exception is raised.
4. The `except` block catches the `IndexError` exception and returns `None` instead of raising an error.

Here's an example usage of the function:
```
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

print(get_value(matrix, (0, 0)))  # prints 1
print(get_value(matrix, (1, 1)))  # prints 5
print(get_value(matrix, (3, 0)))  # prints None (out of range)
print(get_value(matrix, (0, 3)))  # prints None (out of range)
```
I hope this helps! Let me know if you have any questions.

Store the LLM outputs in a dataset

We can now store the LLM outputs in a dataset together with the original instructions.

# create a HF dataset with the instructions and model outputs
from datasets import Dataset

dataset = Dataset.from_dict(
    {
        "instructions": instructions_lst,
        "response_model_1": output_dic[models_to_compare[0]],
        "response_model_2": output_dic[models_to_compare[1]],
    }
)

dataset

Create and configure your Argilla dataset

We use Argilla, a collaboration tool for AI engineers and domain experts who need to build high-quality datasets for their projects.

We run Argilla via an HF Space, which you can set up in just a few clicks without any local setup. You can create an HF Argilla Space by following these instructions. For further configuration of HF Argilla Spaces, see also the detailed documentation. If you prefer, you can also run Argilla locally via Argilla's Docker containers (see the Argilla docs).

Argilla login screen

Programmatically interact with Argilla

Before we can tailor the dataset to our specific task and upload the data that will be shown in the UI, we need to set up a few things first.

Connect this notebook to Argilla: we can now connect this notebook to Argilla to programmatically configure your dataset and upload/download data.

# After starting the Argilla Space (or local docker container) you can connect to the Space with the code below.
import argilla as rg

client = rg.Argilla(
    api_url="https://username-spacename.hf.space",  # Locally: "https://:6900"
    api_key="your-apikey",  # You'll find it in the UI "My Settings > API key"
    # To use a private HF Argilla Space, also pass your HF token
    headers={"Authorization": f"Bearer {huggingface_hub.get_token()}"},
)
user = client.me
user

Write good annotator guidelines

Writing good guidelines for your human annotators is just as important (and difficult) as writing good training code. Good instructions should fulfill the following criteria:

  • Simple and clear: the guidelines should be simple and clear enough to be understood by someone who knows nothing about your task. Always ask at least one colleague to re-read the guidelines to make sure there are no ambiguities.
  • Reproducible and explicit: all the information required to do the annotation task should be contained in the guidelines. A common mistake is to create informal interpretations of the guidelines during conversations with selected annotators. Future annotators will not have this information and might do the task differently than intended if it is not made explicit in the guidelines.
  • Short and comprehensive: the guidelines should be as short as possible, while containing all the necessary information. Annotators tend not to read long guidelines properly, so try to keep them as short as possible while remaining comprehensive.

Note that creating annotator guidelines is an iterative process. It is good practice to do a few dozen annotations yourself and refine the guidelines based on the lessons you learn from the data before assigning the task to others. Versioning the guidelines as the task evolves can also help. See more tips in this blog post.

annotator_guidelines = """\
Your task is to evaluate the responses of two LLMs to code generation tasks. 

First, you need to score each response on a scale from 0 to 7. You add points to your final score based on the following criteria:
- Add up to +2 points, if the code is properly commented, with inline comments and doc strings for functions.
- Add up to +2 points, if the code contains a good example for testing. 
- Add up to +3 points, if the code runs and works correctly. Copy the code into an IDE and test it with at least two different inputs. Attribute one point if the code is overall correct, but has some issues. Attribute three points if the code is fully correct and robust against different scenarios. 
Your resulting final score can be any value between 0 to 7. 

If both responses have a final score of <= 4, select one response and correct it manually in the text field. 
The corrected response must fulfill all criteria from above. 
"""

rating_tooltip = """\
- Add up to +2 points, if the code is properly commented, with inline comments and doc strings for functions.
- Add up to +2 points, if the code contains a good example for testing. 
- Add up to +3 points, if the code runs and works correctly. Copy the code into an IDE and test it with at least two different inputs. Attribute one point if the code works mostly correctly, but has some issues. Attribute three points if the code is fully correct and robust against different scenarios. 
"""

Cumulative ratings vs. Likert scales: note that the guidelines above ask annotators to do a cumulative rating by adding points for explicit criteria. An alternative is a "Likert scale", where annotators rate a response on a continuous scale, e.g. from 1 (very bad) over 3 (mediocre) to 5 (very good). We generally recommend cumulative ratings, because they force you and the annotators to make quality criteria explicit, whereas simply rating a response as a "4" (good) is ambiguous and will be interpreted differently by different annotators.
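
To make the cumulative logic concrete, here is a tiny illustrative calculation (the individual point values are made up for this example):

# Illustrative only: scoring one response with the cumulative rubric from the guidelines above
score = 0
score += 2  # code is properly commented, with docstrings
score += 1  # contains a test example, but not a very thorough one
score += 1  # code runs, but fails on one of the two test inputs
print(score)  # 4 -> if both responses score 4 or lower, the guidelines ask for a manual correction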

Tailor your Argilla dataset to your specific task

Now we can create our own `code-llm` task with the fields, questions, and metadata needed for annotation. See the Argilla docs for more information on configuring Argilla datasets.

dataset_argilla_name = "code-llm"
workspace_name = "argilla"
reuse_existing_dataset = False  # for easier iterative testing

# Configure your dataset settings
settings = rg.Settings(
    # The overall annotation guidelines, which human annotators can refer back to inside of the interface
    guidelines="my guidelines",
    fields=[
        rg.TextField(name="instruction", title="Instruction:", use_markdown=True, required=True),
        rg.TextField(
            name="generation_1",
            title="Response model 1:",
            use_markdown=True,
            required=True,
        ),
        rg.TextField(
            name="generation_2",
            title="Response model 2:",
            use_markdown=True,
            required=True,
        ),
    ],
    # These are the questions we ask annotators about the fields in the dataset
    questions=[
        rg.RatingQuestion(
            name="score_response_1",
            title="Your score for the response of model 1:",
            description="0=very bad, 7=very good",
            values=[0, 1, 2, 3, 4, 5, 6, 7],
            required=True,
        ),
        rg.RatingQuestion(
            name="score_response_2",
            title="Your score for the response of model 2:",
            description="0=very bad, 7=very good",
            values=[0, 1, 2, 3, 4, 5, 6, 7],
            required=True,
        ),
        rg.LabelQuestion(
            name="which_response_corrected",
            title="If both responses score below 4, select a response to correct:",
            description="Select the response you will correct in the text field below.",
            labels=["Response 1", "Response 2", "Combination of both", "Neither"],
            required=False,
        ),
        rg.TextQuestion(
            name="correction",
            title="Paste the selected response below and correct it manually:",
            description="Your corrected response must fulfill all criteria from the annotation guidelines.",
            use_markdown=True,
            required=False,
        ),
        rg.TextQuestion(
            name="comments",
            title="Annotator Comments",
            description="Add any additional comments here. E.g.: edge cases, issues with the interface etc.",
            use_markdown=True,
            required=False,
        ),
    ],
    metadata=[
        rg.TermsMetadataProperty(
            name="source-dataset",
            title="Original dataset source",
        ),
    ],
    allow_extra_metadata=False,
)

if reuse_existing_dataset:
    dataset_argilla = client.datasets(dataset_argilla_name, workspace=workspace_name)
else:
    dataset_argilla = rg.Dataset(
        name=dataset_argilla_name,
        settings=settings,
        workspace=workspace_name,
    )
    if client.datasets(dataset_argilla_name, workspace=workspace_name) is not None:
        client.datasets(dataset_argilla_name, workspace=workspace_name).delete()
    dataset_argilla = dataset_argilla.create()

dataset_argilla

After running the code above, you will see the new custom `code-llm` dataset in Argilla (in addition to any other datasets you may have created before).

Load data into Argilla

At this point, the dataset is still empty. Let's load some data with the code below.

# Iterate over the samples in the dataset
records = [
    rg.Record(
        fields={
            "instruction": example["instructions"],
            "generation_1": example["response_model_1"],
            "generation_2": example["response_model_2"],
        },
        metadata={
            "source-dataset": "bigcode/self-oss-instruct-sc2-exec-filter-50k",
        },
        # Optional: add suggestions from an LLM-as-a-judge system
        # They will be indicated with a sparkle icon and shown as pre-filled responses
        # It will speed up manual annotation
        # suggestions=[
        #     rg.Suggestion(
        #         question_name="score_response_1",
        #         value=example["llm_judge_rating"],
        #         agent="llama-3-70b-instruct",
        #     ),
        # ],
    )
    for example in dataset
]

try:
    dataset_argilla.records.log(records)
except Exception as e:
    print("Exception:", e)

The Argilla annotation UI will look similar to this:

Argilla UI

Annotate

That's it, we have created our Argilla dataset and we can now start annotating in the UI! By default, a record is completed once it has 1 annotation. Check out these guides on how to automatically distribute the annotation task and how to annotate in Argilla.
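
If you want several annotators to review each record before it counts as completed, you can raise that threshold in the dataset settings. The snippet below is only a sketch assuming Argilla 2.x's task distribution settings; it would need to be passed to the `rg.Settings(...)` defined earlier, before the dataset is created.

# Sketch (assumption: Argilla 2.x task distribution settings).
# To require 2 submitted annotations per record instead of the default 1, pass
#   rg.Settings(..., distribution=rg.TaskDistribution(min_submitted=2))
# when configuring the dataset above, before calling dataset.create().
two_annotations_per_record = rg.TaskDistribution(min_submitted=2)
print(two_annotations_per_record)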

Important: if you use Argilla in an HF Space, you need to activate persistent storage so that your data is stored safely and is not automatically deleted after a while. For production settings, make sure that persistent storage is activated before making any annotations to avoid data loss.

Download annotated data

After annotating, you can pull the data from Argilla and simply store and process it in any tabular format (see the docs here). You can also download filtered versions of the dataset (docs); a short sketch follows after the export code below.

annotated_dataset = client.datasets(dataset_argilla_name, workspace=workspace_name)

hf_dataset = annotated_dataset.records.to_datasets()

# This HF dataset can then be formatted, stored and processed into any tabular data format
hf_dataset.to_pandas()
# Store the dataset locally
hf_dataset.to_csv("argilla-dataset-local.csv")  # Save as CSV
# hf_dataset.to_json("argilla-dataset-local.json")  # Save as JSON
# hf_dataset.save_to_disk("argilla-dataset-local")  # Save as a `datasets.Dataset` in the local filesystem
# hf_dataset.to_parquet()  # Save as Parquet
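
You can also pull only a filtered subset of the records, for example those that annotators have already submitted. This is a rough sketch assuming Argilla 2.x's query/filter API; adjust the condition to your needs.

# Sketch (assumption: Argilla 2.x query/filter API): only pull records with a submitted response
query_submitted = rg.Query(filter=rg.Filter(("response.status", "==", "submitted")))
submitted_records = [record for record in annotated_dataset.records(query=query_submitted)]
print(f"Number of submitted records: {len(submitted_records)}")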

Next steps

That's it! You've created synthetic LLM data with the HF Inference API, created a dataset in Argilla, uploaded the LLM data to Argilla, evaluated/corrected the data, and downloaded it after annotation in a simple tabular format for downstream use.

We have specifically designed the pipeline and the interface for two main use cases:

  1. Evaluation: you can now simply use the numeric scores in the `score_response_1` and `score_response_2` columns to calculate which model was better overall. You can also inspect responses with very low or very high ratings for a detailed error analysis. As you test or train different models, you can reuse this pipeline and track improvements of different models (a minimal analysis sketch follows after this list).
  2. Training: once you have annotated enough data, you can create a train-test split from the data and fine-tune your own model. You can either use highly rated response texts for supervised fine-tuning with the TRL SFTTrainer, or you can directly use the ratings for preference fine-tuning techniques like DPO with the TRL DPOTrainer. See the TRL docs for the pros and cons of the different LLM fine-tuning techniques.
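
As an illustration of use case (1), the sketch below computes an average score per model in pandas. The exact column names in an Argilla export depend on the export schema, so inspect the columns first and treat the names below as assumptions.

# Sketch: compare average annotator scores per model (column names are assumptions;
# inspect the exported columns and adapt to your actual schema).
import numpy as np

df = hf_dataset.to_pandas()
print("Exported columns:", list(df.columns))


def to_scores(value):
    """Best-effort conversion of one cell into a list of numeric scores."""
    if value is None:
        return []
    if isinstance(value, (list, tuple, np.ndarray)):
        return [float(v) for v in value if isinstance(v, (int, float, np.number))]
    if isinstance(value, (int, float, np.number)):
        return [float(value)]
    return []


for col in [c for c in df.columns if c.startswith("score_response")]:
    scores = [s for cell in df[col] for s in to_scores(cell)]
    if scores:
        print(f"{col}: mean score = {sum(scores) / len(scores):.2f} over {len(scores)} annotations")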

Adapt and improve: Many things can be improved to tailor this pipeline to your specific use case. For example, you can prompt an LLM to evaluate the outputs of the two LLMs with instructions very similar to the guidelines for human annotators (the "LLM-as-a-judge" approach). This can help further speed up your evaluation pipeline. See our LLM-as-a-judge recipe for an example implementation and our overall Open-Source AI Cookbook for many other ideas. A rough sketch of the idea is included below.
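
As a rough illustration of the LLM-as-a-judge idea, the sketch below reuses the annotator guidelines, the `query` helper, and the `generation_params` from above to ask one LLM for a 0-7 score. The prompt wording and the regex-based score parsing are assumptions for illustration, not the recipe's actual implementation.

# Rough LLM-as-a-judge sketch (assumptions: judge prompt wording, regex-based score parsing)
import re

judge_model = "meta-llama/Meta-Llama-3-70B-Instruct"
judge_api_url = "https://api-inference.huggingface.co/models/" + judge_model
judge_tokenizer = AutoTokenizer.from_pretrained(judge_model)


def judge_response(instruction, response):
    judge_prompt = (
        f"{annotator_guidelines}\n\n"
        f"Task:\n{instruction}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        "Reply with only your final score as an integer between 0 and 7."
    )
    formatted = format_prompt(judge_prompt, judge_tokenizer)
    output = query(payload={"inputs": formatted, "parameters": {**generation_params}}, api_url=judge_api_url)
    match = re.search(r"\d+", output[0]["generated_text"])
    return int(match.group()) if match else None


# Example: let the judge score the first response of model 1
print(judge_response(instructions_lst[0], output_dic[models_to_compare[0]][0]))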
