Data Annotation with Argilla Spaces


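Authored by: Moritz Laurer

This notebook illustrates a workflow for systematically evaluating LLM outputs and creating LLM training data. You can start by using this notebook to evaluate the zero-shot performance of your favorite LLM on your task, without any fine-tuning. If you then want to improve performance, you can reuse the same workflow to create training data.

Example use case: code generation. This tutorial demonstrates how to create high-quality test and training data for code generation tasks. The same workflow can, however, be adapted to any other task relevant to your use case.

In this notebook, we:

  1. Download data for the example task.
  2. Prompt two LLMs to respond to these tasks. This produces "synthetic data" that speeds up manual data creation.
  3. Create an Argilla annotation interface on HF Spaces to compare and evaluate the outputs of the two LLMs.
  4. Upload the example data and the zero-shot LLM responses to the Argilla annotation interface.
  5. Download the annotated data.

You can adapt this notebook to your needs, for example by using a different LLM and API provider for step (2), or by adapting the annotation task in step (3).

Install required packages and connect to HF Hub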

!pip install argilla~=2.0.0
!pip install transformers~=4.40.0
!pip install datasets~=2.19.0
!pip install huggingface_hub~=0.23.2
# Login to the HF Hub. We recommend using this login method 
# to avoid the need to explicitly store your HF token in variables 
import huggingface_hub
!git config --global credential.helper store
huggingface_hub.login(add_to_git_credential=True)

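Download example task data

First, we download an example dataset that contains code generation tasks for LLMs. We want to evaluate how well two different LLMs perform on these tasks. We use instructions from the bigcode/self-oss-instruct-sc2-exec-filter-50k dataset, which was used to train the StarCoder2-Instruct model.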

>>> from datasets import load_dataset

>>> # Small sample for faster testing
>>> dataset_codetask = load_dataset("bigcode/self-oss-instruct-sc2-exec-filter-50k", split="train[:3]")
>>> print("Dataset structure:\n", dataset_codetask, "\n")

>>> # We are only interested in the instructions/prompts provided in the dataset
>>> instructions_lst = dataset_codetask["instruction"]
>>> print("Example instructions:\n", instructions_lst[:2])
Dataset structure:
 Dataset({
    features: ['fingerprint', 'sha1', 'seed', 'response', 'concepts', 'prompt', 'instruction', 'id'],
    num_rows: 3
}) 

Example instructions:
 ['Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None.', 'Write a Python function `check_collision` that takes a list of `rectangles` as input and checks if there are any collisions between any two rectangles. A rectangle is represented as a tuple (x, y, w, h) where (x, y) is the top-left corner of the rectangle, `w` is the width, and `h` is the height.\n\nThe function should return True if any pair of rectangles collide, and False otherwise. Use an iterative approach and check for collisions based on the bounding box collision detection algorithm. If a collision is found, return True immediately without checking for more collisions.']

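Prompt two LLMs with the example tasks

Formatting the instructions with chat_template

Before sending the instructions to an LLM API, we need to format them with the correct chat_template for each model we want to evaluate. This essentially means wrapping model-specific special tokens around the instruction. See the documentation on chat templates for details.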

>>> # Apply correct chat formatting to instructions from the dataset
>>> from transformers import AutoTokenizer

>>> models_to_compare = ["mistralai/Mixtral-8x7B-Instruct-v0.1", "meta-llama/Meta-Llama-3-70B-Instruct"]


>>> def format_prompt(prompt, tokenizer):
...     messages = [{"role": "user", "content": prompt}]
...     # tokenize=False returns the formatted prompt as a plain string, not token ids
...     prompt_formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
...     return prompt_formatted


>>> prompts_formatted_dic = {}
>>> for model in models_to_compare:
...     tokenizer = AutoTokenizer.from_pretrained(model)

...     prompt_formatted = []
...     for instruction in instructions_lst:
...         prompt_formatted.append(format_prompt(instruction, tokenizer))

...     prompts_formatted_dic.update({model: prompt_formatted})


>>> print(
...     f"\nFirst prompt formatted for {models_to_compare[0]}:\n\n",
...     prompts_formatted_dic[models_to_compare[0]][0],
...     "\n\n",
... )
>>> print(
...     f"First prompt formatted for {models_to_compare[1]}:\n\n",
...     prompts_formatted_dic[models_to_compare[1]][0],
...     "\n\n",
... )
First prompt formatted for mistralai/Mixtral-8x7B-Instruct-v0.1:

 [INST] Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None. [/INST] 


First prompt formatted for meta-llama/Meta-Llama-3-70B-Instruct:

 <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
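
To make the effect of a chat template concrete, here is a minimal sketch that reproduces the two formats printed above with plain string formatting. The function names are hypothetical, and the token layout is simply copied from the printed output; the tokenizer's `apply_chat_template` remains the authoritative source, since real templates also cover system prompts and multi-turn conversations.

```python
def format_mixtral_style(instruction: str) -> str:
    """Wrap a single user turn in Mixtral-style instruction markers."""
    return f"[INST] {instruction} [/INST]"


def format_llama3_style(instruction: str) -> str:
    """Wrap a single user turn in Llama-3-style header and end-of-turn tokens."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{instruction}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


example = "Write a Python function named `get_value`."
print(format_mixtral_style(example))
print(format_llama3_style(example))
```

The same instruction ends up wrapped in very different special tokens, which is why each model needs its own formatted copy of the prompts.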

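Send the instructions to the HF Inference API

Now we can send the instructions to the APIs of both LLMs to get outputs we can evaluate. We first define some parameters for generating the responses. Hugging Face's LLM APIs are powered by Text Generation Inference (TGI) containers. See the TGI OpenAPI specification and the Transformers generation parameters documentation for explanations of the different parameters.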

generation_params = dict(
    # we use low temperature and top_p to reduce creativity and increase likelihood of highly probable tokens
    temperature=0.2,
    top_p=0.60,
    top_k=None,
    repetition_penalty=1.0,
    do_sample=True,
    max_new_tokens=512 * 2,
    return_full_text=False,
    seed=42,
    # details=True,
    # stop=["<|END_OF_TURN_TOKEN|>"],
    # grammar={"type": "json"}
    max_time=None,
    stream=False,
    use_cache=False,
    wait_for_model=False,
)
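
To build intuition for what `temperature` and `top_p` do, the sketch below applies both to a toy next-token distribution. It illustrates the general sampling technique and is not TGI's implementation; the logits and the helper names are made up for this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose cumulative mass reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so that the surviving probabilities sum to 1
    return {index: prob / total for index, prob in kept}


toy_logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical scores for four candidate tokens

for temperature in (1.0, 0.2):
    probs = softmax_with_temperature(toy_logits, temperature)
    candidates = top_p_filter(probs, top_p=0.60)
    print(f"temperature={temperature}: probs={[round(p, 3) for p in probs]}, kept tokens={sorted(candidates)}")
```

With these toy values, two tokens survive the `top_p=0.60` cut at temperature 1.0, while at temperature 0.2 the top token alone already exceeds the threshold, so sampling becomes close to deterministic.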

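Now we can make standard API requests to the Serverless Inference API (docs). Note that the Serverless Inference API is mainly intended for testing and is rate limited. For testing without rate limits, you can create your own API via HF Dedicated Endpoints (docs). See also our corresponding tutorials in the Open-Source AI Cookbook.

The code below will be updated once the Inference API recipe is finished.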

>>> import requests
>>> from tqdm.auto import tqdm


>>> # Hint: use asynchronous API calls (and dedicated endpoints) to increase speed
>>> def query(payload=None, api_url=None):
...     response = requests.post(api_url, headers=headers, json=payload)
...     return response.json()


>>> headers = {"Authorization": f"Bearer {huggingface_hub.get_token()}"}

>>> output_dic = {}
>>> for model in models_to_compare:
...     # Create API urls for each model
...     # When using dedicated endpoints, you can reuse the same code and simply replace this URL
...     api_url = "https://api-inference.huggingface.co/models/" + model

...     # send requests to API
...     output_lst = []
...     # Use the prompts that were formatted with this model's own chat template
...     for prompt in tqdm(prompts_formatted_dic[model]):
...         output = query(payload={"inputs": prompt, "parameters": {**generation_params}}, api_url=api_url)
...         output_lst.append(output[0]["generated_text"])

...     output_dic.update({model: output_lst})

>>> print(f"---First generation of {models_to_compare[0]}:\n{output_dic[models_to_compare[0]][0]}\n\n")
>>> print(f"---First generation of {models_to_compare[1]}:\n{output_dic[models_to_compare[1]][0]}")
---First generation of mistralai/Mixtral-8x7B-Instruct-v0.1:
Here's a Python function that meets your requirements:

```python
def get_value(matrix, indices):
    try:
        return matrix[indices[0]][indices[1]]
    except IndexError:
        return None
```

This function takes a matrix (represented by a list of lists) and a tuple of indices as input. It first tries to access the value at the given indices in the matrix. If the indices are out of range, it catches the `IndexError` exception and returns `None`.


---First generation of meta-llama/Meta-Llama-3-70B-Instruct:
Here is a Python function that does what you described:
```
def get_value(matrix, indices):
    try:
        row, col = indices
        return matrix[row][col]
    except IndexError:
        return None
```
Here's an explanation of how the function works:

1. The function takes two arguments: `matrix` (a list of lists) and `indices` (a tuple of two integers, representing the row and column indices).
2. The function tries to access the value at the specified indices using `matrix[row][col]`.
3. If the indices are out of range (i.e., `row` or `col` is greater than the length of the corresponding dimension of the matrix), an `IndexError` exception is raised.
4. The `except` block catches the `IndexError` exception and returns `None` instead of raising an error.

Here's an example usage of the function:
```
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

print(get_value(matrix, (0, 0)))  # prints 1
print(get_value(matrix, (1, 1)))  # prints 5
print(get_value(matrix, (3, 0)))  # prints None (out of range)
print(get_value(matrix, (0, 3)))  # prints None (out of range)
```
I hope this helps! Let me know if you have any questions.
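
Because the serverless API is rate limited and a model may still be loading, a request can come back with an error payload instead of a list of generations, in which case `output[0]["generated_text"]` fails. The sketch below wraps a query function in retries with exponential backoff and uses a thread pool to send several prompts concurrently, in the spirit of the hint in the code above. `query_with_retries`, `generate_all`, and the stand-in `flaky_query` are hypothetical names, and the assumption that errors arrive as a dictionary with an `"error"` key should be checked against the API's current behaviour.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def query_with_retries(query_fn, payload, max_retries=3, base_delay=1.0):
    """Call query_fn(payload), retrying with exponential backoff on error payloads."""
    output = None
    for attempt in range(max_retries + 1):
        output = query_fn(payload)
        # Assumption: success is a non-empty list like [{"generated_text": "..."}]
        if isinstance(output, list) and output and isinstance(output[0], dict):
            if "generated_text" in output[0]:
                return output[0]["generated_text"]
        if attempt < max_retries:
            time.sleep(base_delay * 2**attempt)  # wait 1x, 2x, 4x, ... the base delay
    raise RuntimeError(f"No generation after {max_retries + 1} attempts: {output}")


def generate_all(query_fn, prompts, parameters=None, max_workers=4, **retry_kwargs):
    """Send prompts concurrently; results come back in the same order as prompts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(
                query_with_retries,
                query_fn,
                {"inputs": prompt, "parameters": parameters or {}},
                **retry_kwargs,
            )
            for prompt in prompts
        ]
        return [future.result() for future in futures]


# Stand-in for the real `query` function: fails once per prompt, then succeeds
calls = {}


def flaky_query(payload):
    prompt = payload["inputs"]
    calls[prompt] = calls.get(prompt, 0) + 1
    if calls[prompt] == 1:
        return {"error": "Model is loading"}
    return [{"generated_text": f"response to {prompt}"}]


print(generate_all(flaky_query, ["a", "b"], base_delay=0.01))
```

In the notebook, `query_fn` could be something like `lambda payload: query(payload=payload, api_url=api_url)`, with `parameters=generation_params`. Keep `max_workers` small on the serverless API, since more parallel requests also hit the rate limit sooner.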

Store the LLM outputs in a dataset

We can now store the LLM outputs in a dataset alongside the original instructions.

# create a HF dataset with the instructions and model outputs
from datasets import Dataset

dataset = Dataset.from_dict(
    {
        "instructions": instructions_lst,
        "response_model_1": output_dic[models_to_compare[0]],
        "response_model_2": output_dic[models_to_compare[1]],
    }
)

dataset

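Create and configure your Argilla dataset

We use Argilla, a collaboration tool for AI engineers and domain experts who need to build high-quality datasets for their projects.

We run Argilla via an HF Space, which you can set up in just a few clicks without any local setup. You can create the HF Argilla Space by following these instructions. For further configuration of the HF Argilla Space, see also the detailed documentation. If you prefer, you can also run Argilla locally via Argilla's Docker containers (see the Argilla docs).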

[Image: Argilla login screen]

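Programmatically interact with Argilla

Before we can tailor the dataset to our specific task and upload the data that will be shown in the UI, we need to set up a few things first.

Connecting this notebook to Argilla: We can now connect this notebook to Argilla to programmatically configure the dataset and upload/download data.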

# After starting the Argilla Space (or local docker container) you can connect to the Space with the code below.
import argilla as rg

client = rg.Argilla(
    api_url="https://username-spacename.hf.space",  # Locally: "http://localhost:6900"
    api_key="your-apikey",  # You'll find it in the UI "My Settings > API key"
    # To use a private HF Argilla Space, also pass your HF token
    headers={"Authorization": f"Bearer {huggingface_hub.get_token()}"},
)
user = client.me
user

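Write good annotator guidelines

Writing good guidelines for your human annotators is just as important (and difficult) as writing good training code. Good guidelines should fulfill the following criteria:

  • Simple and clear: The guidelines should be simple and clear to people who do not know anything about your task yet. Always ask at least one colleague to reread the guidelines to make sure there are no ambiguities.
  • Reproducible and explicit: All information needed to do the annotation task should be contained in the guidelines. A common mistake is to create informal interpretations of the guidelines in conversations with selected annotators. Future annotators will not have this information and might do the task differently than intended if it is not made explicit in the guidelines.
  • Short and comprehensive: The guidelines should be as short as possible while containing all necessary information. Annotators tend not to read long guidelines properly, so keep them as short as you can without leaving anything out.

Note that creating annotator guidelines is an iterative process. It is good practice to do a few dozen annotations yourself and refine the guidelines based on what you learn from the data before assigning the task to others. Versioning the guidelines also helps as the task evolves over time. See further tips in this blog post.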

annotator_guidelines = """\
Your task is to evaluate the responses of two LLMs to code generation tasks. 

First, you need to score each response on a scale from 0 to 7. You add points to your final score based on the following criteria:
- Add up to +2 points, if the code is properly commented, with inline comments and doc strings for functions.
- Add up to +2 points, if the code contains a good example for testing. 
- Add up to +3 points, if the code runs and works correctly. Copy the code into an IDE and test it with at least two different inputs. Attribute one point if the code is overall correct, but has some issues. Attribute three points if the code is fully correct and robust against different scenarios. 
Your resulting final score can be any value between 0 and 7. 

If both responses have a final score of <= 4, select one response and correct it manually in the text field. 
The corrected response must fulfill all criteria from above. 
"""

rating_tooltip = """\
- Add up to +2 points, if the code is properly commented, with inline comments and doc strings for functions.
- Add up to +2 points, if the code contains a good example for testing. 
- Add up to +3 points, if the code runs and works correctly. Copy the code into an IDE and test it with at least two different inputs. Attribute one point if the code works mostly correctly, but has some issues. Attribute three points if the code is fully correct and robust against different scenarios. 
"""

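Cumulative ratings vs. Likert scales: Note that the guidelines above ask annotators to do cumulative ratings by adding points for explicit criteria. An alternative approach is a "Likert scale", where annotators are asked to rate responses on an ordinal scale, e.g. from 1 (very bad) to 3 (mediocre) to 5 (very good). We generally recommend cumulative ratings, because they force you and the annotators to make the quality criteria explicit, whereas simply rating a response as "4" (good) is ambiguous and will be interpreted differently by different annotators.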
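
The arithmetic behind a cumulative rating can be sketched as follows. The criterion keys and the `cumulative_score` helper are hypothetical; the maximum points per criterion and the correction threshold are taken from the guidelines above.

```python
# Maximum points per criterion, taken from the annotator guidelines
MAX_POINTS = {"comments": 2, "test_example": 2, "correctness": 3}


def cumulative_score(points):
    """Sum the points per criterion after checking each against its allowed range."""
    total = 0
    for criterion, max_points in MAX_POINTS.items():
        awarded = points.get(criterion, 0)
        if not 0 <= awarded <= max_points:
            raise ValueError(f"{criterion}: expected 0 to {max_points}, got {awarded}")
        total += awarded
    return total


# Response 1: well commented, no test example, mostly correct code
score_1 = cumulative_score({"comments": 2, "test_example": 0, "correctness": 1})
# Response 2: no comments, good test example, mostly correct code
score_2 = cumulative_score({"comments": 0, "test_example": 2, "correctness": 1})

# Per the guidelines, a manual correction is requested when both scores are <= 4
needs_correction = score_1 <= 4 and score_2 <= 4
print(score_1, score_2, needs_correction)  # 3 3 True
```

Both responses end up with the same total here, but the per-criterion breakdown shows why, which is exactly the information a single Likert rating would hide.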

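Tailor your Argilla dataset to your specific task

We can now create our own code-llm task with the fields, questions, and metadata required for annotation. For more information on configuring an Argilla dataset, see the Argilla documentation.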

dataset_argilla_name = "code-llm"
workspace_name = "argilla"
reuse_existing_dataset = False  # for easier iterative testing

# Configure your dataset settings
settings = rg.Settings(
    # The overall annotation guidelines, which human annotators can refer back to inside of the interface
    guidelines=annotator_guidelines,
    fields=[
        rg.TextField(name="instruction", title="Instruction:", use_markdown=True, required=True),
        rg.TextField(
            name="generation_1",
            title="Response model 1:",
            use_markdown=True,
            required=True,
        ),
        rg.TextField(
            name="generation_2",
            title="Response model 2:",
            use_markdown=True,
            required=True,
        ),
    ],
    # These are the questions we ask annotators about the fields in the dataset
    questions=[
        rg.RatingQuestion(
            name="score_response_1",
            title="Your score for the response of model 1:",
            description="0=very bad, 7=very good",
            values=[0, 1, 2, 3, 4, 5, 6, 7],
            required=True,
        ),
        rg.RatingQuestion(
            name="score_response_2",
            title="Your score for the response of model 2:",
            description="0=very bad, 7=very good",
            values=[0, 1, 2, 3, 4, 5, 6, 7],
            required=True,
        ),
        rg.LabelQuestion(
            name="which_response_corrected",
            title="If both responses score 4 or lower, select a response to correct:",
            description="Select the response you will correct in the text field below.",
            labels=["Response 1", "Response 2", "Combination of both", "Neither"],
            required=False,
        ),
        rg.TextQuestion(
            name="correction",
            title="Paste the selected response below and correct it manually:",
            description="Your corrected response must fulfill all criteria from the annotation guidelines.",
            use_markdown=True,
            required=False,
        ),
        rg.TextQuestion(
            name="comments",
            title="Annotator Comments",
            description="Add any additional comments here. E.g.: edge cases, issues with the interface etc.",
            use_markdown=True,
            required=False,
        ),
    ],
    metadata=[
        rg.TermsMetadataProperty(
            name="source-dataset",
            title="Original dataset source",
        ),
    ],
    allow_extra_metadata=False,
)

if reuse_existing_dataset:
    dataset_argilla = client.datasets(dataset_argilla_name, workspace=workspace_name)
else:
    dataset_argilla = rg.Dataset(
        name=dataset_argilla_name,
        settings=settings,
        workspace=workspace_name,
    )
    if client.datasets(dataset_argilla_name, workspace=workspace_name) is not None:
        client.datasets(dataset_argilla_name, workspace=workspace_name).delete()
    dataset_argilla = dataset_argilla.create()

dataset_argilla

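After running the code above, you will see the new custom code-llm dataset in Argilla (along with any other datasets you may have created before).

Load the data to Argilla

At this point, the dataset is still empty. Let's load some data with the code below.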

# Iterate over the samples in the dataset
records = [
    rg.Record(
        fields={
            "instruction": example["instructions"],
            "generation_1": example["response_model_1"],
            "generation_2": example["response_model_2"],
        },
        metadata={
            "source-dataset": "bigcode/self-oss-instruct-sc2-exec-filter-50k",
        },
        # Optional: add suggestions from an LLM-as-a-judge system
        # They will be indicated with a sparkle icon and shown as pre-filled responses
        # It will speed up manual annotation
        # suggestions=[
        #     rg.Suggestion(
        #         question_name="score_response_1",
        #         value=example["llm_judge_rating"],
        #         agent="llama-3-70b-instruct",
        #     ),
        # ],
    )
    for example in dataset
]

try:
    dataset_argilla.records.log(records)
except Exception as e:
    print("Exception:", e)
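
The broad `try`/`except` above reports a problem only after the upload has been attempted. As a complement, the rows can be checked before the records are built, since all three text fields are marked as required in the settings. The sketch below is plain Python with a hypothetical `find_invalid_rows` helper and made-up sample rows; the column names are the ones used in the `Dataset.from_dict` call earlier.

```python
REQUIRED_COLUMNS = ("instructions", "response_model_1", "response_model_2")


def find_invalid_rows(rows):
    """Return (row_index, column) pairs where a required text value is missing or empty."""
    problems = []
    for index, row in enumerate(rows):
        for column in REQUIRED_COLUMNS:
            value = row.get(column)
            if not isinstance(value, str) or not value.strip():
                problems.append((index, column))
    return problems


sample_rows = [
    {"instructions": "Write a function.", "response_model_1": "def f(): ...", "response_model_2": "def g(): ..."},
    {"instructions": "Write a class.", "response_model_1": "", "response_model_2": "class A: ..."},
]
print(find_invalid_rows(sample_rows))  # [(1, 'response_model_1')]
```

Iterating over a `datasets.Dataset` yields one dictionary per row, so calling the helper on the `dataset` object created earlier should work in the same way.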

The Argilla UI for annotation will look similar to this:

[Image: Argilla UI]

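Annotate

That's it, we have created our Argilla dataset and can now start annotating in the UI! By default, a record is marked as completed once it has received 1 annotation. See these guides on how to automatically distribute the annotation task and how to annotate in Argilla.

Important: If you use Argilla in an HF Space, you need to activate persistent storage so that your data is stored safely and is not automatically deleted after a while. For production settings, make sure persistent storage is activated before making any annotations to avoid data loss.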

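Download annotated data

After annotating, you can pull the data from Argilla and simply store and process it locally in any tabular format (see the docs here). You can also download a filtered version of the dataset (docs).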

annotated_dataset = client.datasets(dataset_argilla_name, workspace=workspace_name)

hf_dataset = annotated_dataset.records.to_datasets()

# This HF dataset can then be formatted, stored and processed into any tabular data format
hf_dataset.to_pandas()
# Store the dataset locally
hf_dataset.to_csv("argilla-dataset-local.csv")  # Save as CSV
# hf_dataset.to_json("argilla-dataset-local.json")  # Save as JSON
# hf_dataset.save_to_disk("argilla-dataset-local")  # Save as a `datasets.Dataset` in the local filesystem
# hf_dataset.to_parquet("argilla-dataset-local.parquet")  # Save as Parquet

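Next steps

That's it! You have created synthetic LLM data with the HF Inference API, created a dataset in Argilla, uploaded the LLM data to Argilla, evaluated/corrected the data, and after annotation downloaded the data in a simple tabular format for downstream use.

We have specifically designed the pipeline and the interface for two main use cases:

  1. Evaluation: You can now simply use the numeric scores in the score_response_1 and score_response_2 columns to calculate which model performed better overall. You can also inspect responses with very low or very high ratings for a detailed error analysis. As you test or train different models, you can reuse this pipeline and track improvements of different models over time.
  2. Training: After annotating enough data, you can create a train-test split from the data and fine-tune your own model. You can either use highly rated response texts for supervised fine-tuning with the TRL SFTTrainer, or you can directly use the ratings for preference-tuning techniques like DPO with the TRL DPOTrainer. See the TRL docs for the pros and cons of different LLM fine-tuning techniques.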
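
Both use cases can be sketched with plain Python on the exported rows. The example below assumes each row has been reduced to a dictionary with the keys `instruction`, `generation_1`, `generation_2`, `score_response_1`, and `score_response_2`; the actual column names in the Argilla export may differ (for instance, responses may be nested per annotator), so treat the keys and the sample scores as placeholders. The `to_preference_pairs` helper is hypothetical and produces the prompt/chosen/rejected layout commonly used for preference tuning.

```python
from statistics import mean

annotated_rows = [  # made-up scores for illustration only
    {"instruction": "task A", "generation_1": "a1", "generation_2": "a2", "score_response_1": 5, "score_response_2": 7},
    {"instruction": "task B", "generation_1": "b1", "generation_2": "b2", "score_response_1": 6, "score_response_2": 3},
    {"instruction": "task C", "generation_1": "c1", "generation_2": "c2", "score_response_1": 4, "score_response_2": 4},
]

# Evaluation: average score per model
mean_1 = mean(row["score_response_1"] for row in annotated_rows)
mean_2 = mean(row["score_response_2"] for row in annotated_rows)
print(f"model 1: {mean_1:.2f}, model 2: {mean_2:.2f}")


def to_preference_pairs(rows):
    """Build prompt/chosen/rejected records; ties carry no preference and are skipped."""
    pairs = []
    for row in rows:
        s1, s2 = row["score_response_1"], row["score_response_2"]
        if s1 == s2:
            continue
        if s1 > s2:
            chosen, rejected = "generation_1", "generation_2"
        else:
            chosen, rejected = "generation_2", "generation_1"
        pairs.append({"prompt": row["instruction"], "chosen": row[chosen], "rejected": row[rejected]})
    return pairs


# Training: preference pairs derived from the ratings
print(to_preference_pairs(annotated_rows))
```

With only a handful of annotated rows, a difference in mean scores says little; the comparison becomes meaningful once enough tasks have been rated.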

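Adapt and improve: Many things can be improved to tailor this pipeline to your specific use case. For example, you can prompt an LLM to evaluate the outputs of the two LLMs with instructions very similar to the guidelines for human annotators ("LLM-as-a-judge" approach). This can help speed up your evaluation pipeline further. See our LLM-as-a-judge recipe for an example implementation, and our overall Open-Source AI Cookbook for more ideas.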
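
As a rough sketch of the LLM-as-a-judge idea, the snippet below builds a judge prompt from the same scoring criteria and parses a numeric score out of the judge's reply. The prompt wording, the `Final score: N` reply convention, and both helper names are assumptions made for illustration; they are not part of the pipeline above.

```python
import re

JUDGE_CRITERIA = """\
- Add up to +2 points, if the code is properly commented, with inline comments and doc strings for functions.
- Add up to +2 points, if the code contains a good example for testing.
- Add up to +3 points, if the code runs and works correctly."""


def build_judge_prompt(instruction, response):
    """Combine the scoring criteria, the task, and the response to evaluate."""
    return (
        "You are evaluating an LLM response to a code generation task.\n"
        f"Score it from 0 to 7 using these criteria:\n{JUDGE_CRITERIA}\n\n"
        f"Task:\n{instruction}\n\nResponse:\n{response}\n\n"
        "Explain your reasoning, then end with a line of the form 'Final score: N'."
    )


def parse_judge_score(reply):
    """Extract the score from the reply; return None if it is missing or out of range."""
    match = re.search(r"Final score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 7 else None


print(parse_judge_score("The code is correct but uncommented.\nFinal score: 3"))  # 3
print(parse_judge_score("I cannot decide."))  # None
```

A parsed score could then be attached to a record as a suggestion, as in the commented-out `rg.Suggestion` block earlier, so that annotators review the judge's rating instead of starting from scratch.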
