Data Annotation with Argilla Spaces

Authored by: Moritz Laurer

This notebook demonstrates a workflow for systematically evaluating LLM outputs and creating LLM training data. You can start by using this notebook to evaluate the zero-shot performance of your favourite LLM on your task without any fine-tuning. If you want to improve performance, you can then easily reuse this workflow to create training data.

Example use case: code generation. In this tutorial, we demonstrate how to create high-quality test and train data for code generation tasks. The same workflow can, however, be adapted to any other task relevant to your specific use case.

In this notebook, we:

  1. Download the data for the example task.
  2. Prompt two LLMs to respond to these tasks. This produces "synthetic data" to speed up manual data creation.
  3. Create an Argilla annotation interface on HF Spaces to compare and evaluate the outputs of the two LLMs.
  4. Upload the example data and the zero-shot LLM responses to the Argilla annotation interface.
  5. Download the annotated data.

You can adapt this notebook to your needs, e.g., by using a different LLM and API provider in step (2) or by adapting the annotation task in step (3).

Install required packages and connect to the HF Hub

!pip install argilla~=2.0.0
!pip install transformers~=4.40.0
!pip install datasets~=2.19.0
!pip install huggingface_hub~=0.23.2
# Login to the HF Hub. We recommend using this login method 
# to avoid the need to explicitly store your HF token in variables 
import huggingface_hub
!git config --global credential.helper store
huggingface_hub.login(add_to_git_credential=True)
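
To check that the login worked before moving on, you can run a quick optional sanity check (whoami is part of the huggingface_hub library installed above):

# Optional: confirm that the notebook is authenticated against the HF Hub
print("Logged in as:", huggingface_hub.whoami()["name"])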

Download data for the example task

First, we download an example dataset that contains code generation tasks for LLMs. We want to evaluate how well two different LLMs perform on these code generation tasks. We use instructions from the bigcode/self-oss-instruct-sc2-exec-filter-50k dataset, which was used to train the StarCoder2-Instruct model.

>>> from datasets import load_dataset

>>> # Small sample for faster testing
>>> dataset_codetask = load_dataset("bigcode/self-oss-instruct-sc2-exec-filter-50k", split="train[:3]")
>>> print("Dataset structure:\n", dataset_codetask, "\n")

>>> # We are only interested in the instructions/prompts provided in the dataset
>>> instructions_lst = dataset_codetask["instruction"]
>>> print("Example instructions:\n", instructions_lst[:2])
Dataset structure:
 Dataset({
    features: ['fingerprint', 'sha1', 'seed', 'response', 'concepts', 'prompt', 'instruction', 'id'],
    num_rows: 3
}) 

Example instructions:
 ['Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None.', 'Write a Python function `check_collision` that takes a list of `rectangles` as input and checks if there are any collisions between any two rectangles. A rectangle is represented as a tuple (x, y, w, h) where (x, y) is the top-left corner of the rectangle, `w` is the width, and `h` is the height.\n\nThe function should return True if any pair of rectangles collide, and False otherwise. Use an iterative approach and check for collisions based on the bounding box collision detection algorithm. If a collision is found, return True immediately without checking for more collisions.']

Prompt two LLMs on the example task

Formatting the instructions with a chat_template

Before we can send the instructions to an LLM API, we need to format the instructions with the correct chat_template for each of the models we want to evaluate. This essentially entails wrapping some special tokens around the instructions. See the documentation on chat templates for details.

>>> # Apply correct chat formatting to instructions from the dataset
>>> from transformers import AutoTokenizer

>>> models_to_compare = ["mistralai/Mixtral-8x7B-Instruct-v0.1", "meta-llama/Meta-Llama-3-70B-Instruct"]


>>> def format_prompt(prompt, tokenizer):
...     messages = [{"role": "user", "content": prompt}]
...     messages_tokenized = tokenizer.apply_chat_template(
...         messages, tokenize=False, add_generation_prompt=True, return_tensors="pt"
...     )
...     return messages_tokenized


>>> prompts_formatted_dic = {}
>>> for model in models_to_compare:
...     tokenizer = AutoTokenizer.from_pretrained(model)

...     prompt_formatted = []
...     for instruction in instructions_lst:
...         prompt_formatted.append(format_prompt(instruction, tokenizer))

...     prompts_formatted_dic.update({model: prompt_formatted})


>>> print(
...     f"\nFirst prompt formatted for {models_to_compare[0]}:\n\n",
...     prompts_formatted_dic[models_to_compare[0]][0],
...     "\n\n",
... )
>>> print(
...     f"First prompt formatted for {models_to_compare[1]}:\n\n",
...     prompts_formatted_dic[models_to_compare[1]][0],
...     "\n\n",
... )
First prompt formatted for mistralai/Mixtral-8x7B-Instruct-v0.1:

 [INST] Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None. [/INST] 


First prompt formatted for meta-llama/Meta-Llama-3-70B-Instruct:

 <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Send the instructions to the HF Inference API

Now we can send the instructions to the APIs of both LLMs to obtain outputs that we can evaluate. We first define some parameters for generating the responses correctly. Hugging Face's LLM APIs are powered by Text Generation Inference (TGI) containers. See the TGI OpenAPI specification and the explanations of the different parameters in the Transformers generation parameters documentation.

generation_params = dict(
    # we use low temperature and top_p to reduce creativity and increase likelihood of highly probable tokens
    temperature=0.2,
    top_p=0.60,
    top_k=None,
    repetition_penalty=1.0,
    do_sample=True,
    max_new_tokens=512 * 2,
    return_full_text=False,
    seed=42,
    # details=True,
    # stop=["<|END_OF_TURN_TOKEN|>"],
    # grammar={"type": "json"}
    max_time=None,
    stream=False,
    use_cache=False,
    wait_for_model=False,
)

We can now make standard API requests to the Serverless Inference API (docs). Note that the Serverless Inference API is mostly intended for testing and is rate limited. For testing without rate limits, you can create your own API via HF Dedicated Endpoints (docs). See also the corresponding tutorials in our Open-Source AI Cookbook.

The code below will be updated once the corresponding Inference API recipe is finished.
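
If you prefer a higher-level client over raw requests, the huggingface_hub library also provides an InferenceClient that works with both the Serverless Inference API and Dedicated Endpoints. The sketch below is purely illustrative (the parameter values mirror generation_params above) and is not a drop-in replacement for the code that follows:

from huggingface_hub import InferenceClient

# Point the client at a model on the Serverless API, or pass the URL of a Dedicated Endpoint instead
client_tgi = InferenceClient(model=models_to_compare[0], token=huggingface_hub.get_token())

# text_generation sends a single prompt and returns the generated string
example_output = client_tgi.text_generation(
    prompts_formatted_dic[models_to_compare[0]][0],
    temperature=0.2,
    top_p=0.60,
    repetition_penalty=1.0,
    do_sample=True,
    max_new_tokens=1024,
    return_full_text=False,
    seed=42,
)
print(example_output)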

>>> import requests
>>> from tqdm.auto import tqdm


>>> # Hint: use asynchronous API calls (and dedicated endpoints) to increase speed
>>> def query(payload=None, api_url=None):
...     response = requests.post(api_url, headers=headers, json=payload)
...     return response.json()


>>> headers = {"Authorization": f"Bearer {huggingface_hub.get_token()}"}

>>> output_dic = {}
>>> for model in models_to_compare:
...     # Create API urls for each model
...     # When using dedicated endpoints, you can reuse the same code and simply replace this URL
...     api_url = "https://api-inference.huggingface.co/models/" + model

...     # send requests to API
...     output_lst = []
...     for prompt in tqdm(prompts_formatted_dic[model]):
...         output = query(payload={"inputs": prompt, "parameters": {**generation_params}}, api_url=api_url)
...         output_lst.append(output[0]["generated_text"])

...     output_dic.update({model: output_lst})

>>> print(f"---First generation of {models_to_compare[0]}:\n{output_dic[models_to_compare[0]][0]}\n\n")
>>> print(f"---First generation of {models_to_compare[1]}:\n{output_dic[models_to_compare[1]][0]}")
---First generation of mistralai/Mixtral-8x7B-Instruct-v0.1:
Here's a Python function that meets your requirements:

```python
def get_value(matrix, indices):
    try:
        return matrix[indices[0]][indices[1]]
    except IndexError:
        return None
```

This function takes a matrix (represented by a list of lists) and a tuple of indices as input. It first tries to access the value at the given indices in the matrix. If the indices are out of range, it catches the `IndexError` exception and returns `None`.


---First generation of meta-llama/Meta-Llama-3-70B-Instruct:
Here is a Python function that does what you described:
```
def get_value(matrix, indices):
    try:
        row, col = indices
        return matrix[row][col]
    except IndexError:
        return None
```
Here's an explanation of how the function works:

1. The function takes two arguments: `matrix` (a list of lists) and `indices` (a tuple of two integers, representing the row and column indices).
2. The function tries to access the value at the specified indices using `matrix[row][col]`.
3. If the indices are out of range (i.e., `row` or `col` is greater than the length of the corresponding dimension of the matrix), an `IndexError` exception is raised.
4. The `except` block catches the `IndexError` exception and returns `None` instead of raising an error.

Here's an example usage of the function:
```
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

print(get_value(matrix, (0, 0)))  # prints 1
print(get_value(matrix, (1, 1)))  # prints 5
print(get_value(matrix, (3, 0)))  # prints None (out of range)
print(get_value(matrix, (0, 3)))  # prints None (out of range)
```
I hope this helps! Let me know if you have any questions.
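
The hint in the code cell above mentions speeding things up with asynchronous or parallel calls. Below is a minimal sketch that parallelizes the requests with a thread pool, reusing the query helper and generation_params from above; this is mainly useful with Dedicated Endpoints, since the rate-limited Serverless API may reject many concurrent requests:

from concurrent.futures import ThreadPoolExecutor


def query_model(prompt, api_url):
    # Wrap the query helper so it can be mapped over a list of prompts
    output = query(payload={"inputs": prompt, "parameters": {**generation_params}}, api_url=api_url)
    return output[0]["generated_text"]


output_dic_parallel = {}
for model in models_to_compare:
    api_url = "https://api-inference.huggingface.co/models/" + model
    prompts = prompts_formatted_dic[model]
    # A small pool keeps the request volume reasonable; tune max_workers to your endpoint
    with ThreadPoolExecutor(max_workers=4) as executor:
        output_dic_parallel[model] = list(executor.map(query_model, prompts, [api_url] * len(prompts)))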

Store the LLM outputs in a dataset

We can now store the LLM outputs in a dataset together with the original instructions.

# create a HF dataset with the instructions and model outputs
from datasets import Dataset

dataset = Dataset.from_dict(
    {
        "instructions": instructions_lst,
        "response_model_1": output_dic[models_to_compare[0]],
        "response_model_2": output_dic[models_to_compare[1]],
    }
)

dataset

Create and configure your Argilla dataset

We use Argilla, a collaboration tool for AI engineers and domain experts who need to build high-quality datasets for their projects.

We run Argilla via a HF Space, which you can set up with just a few clicks without any local setup. You can create the HF Argilla Space by following these instructions. For further configuration of HF Argilla Spaces, see also the detailed documentation. If you prefer, you can also run Argilla locally via Argilla's docker containers (see the Argilla docs).

Argilla login screen
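
If you prefer to create the Space from code instead of clicking through the UI, the huggingface_hub library can duplicate the Argilla template Space for you. A minimal sketch, assuming the template repo id referenced in the Argilla docs (it may change over time) and a placeholder target repo id:

from huggingface_hub import duplicate_space

# Duplicate the Argilla template Space into your own account
# "your-username/my-argilla-space" is a placeholder; replace it with your own repo id
new_space = duplicate_space(
    "argilla/argilla-template-space",
    to_id="your-username/my-argilla-space",
    private=False,
)
print(new_space)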

Programmatically interact with Argilla

Before we can tailor the dataset to our specific task and upload the data that will be shown in the UI, we first need to set up a few things.

Connect this notebook to Argilla: We can now connect this notebook to Argilla to programmatically configure your dataset and upload/download data.

# After starting the Argilla Space (or local docker container) you can connect to the Space with the code below.
import argilla as rg

client = rg.Argilla(
    api_url="https://username-spacename.hf.space",  # Locally: "http://localhost:6900"
    api_key="your-apikey",  # You'll find it in the UI "My Settings > API key"
    # To use a private HF Argilla Space, also pass your HF token
    headers={"Authorization": f"Bearer {huggingface_hub.get_token()}"},
)
user = client.me
user

Write good annotator guidelines

Writing good guidelines for your human annotators is just as important (and difficult) as writing good training code. Good instructions should fulfill the following criteria:

  • Simple and clear: The guidelines should be simple and clear enough to be understood by people who do not know anything about your task. Always ask at least one colleague to reread the guidelines to make sure there are no ambiguities.
  • Reproducible and explicit: All information required to perform the annotation task should be contained in the guidelines. A common mistake is to create informal interpretations of the guidelines during conversations with selected annotators. Future annotators will not have this information and might perform the task differently than intended if it is not made explicit in the guidelines.
  • Short and comprehensive: The guidelines should be as short as possible, while containing all necessary information. Annotators tend not to read long guidelines properly, so try to keep them as short as possible while remaining comprehensive.

Note that creating annotator guidelines is an iterative process. It is good practice to do a few dozen annotations yourself and refine the guidelines based on your learnings from the data before assigning the task to anyone else. Versioning the guidelines as the task evolves over time also helps. See more tips in this blog post.

annotator_guidelines = """\
Your task is to evaluate the responses of two LLMs to code generation tasks. 

First, you need to score each response on a scale from 0 to 7. You add points to your final score based on the following criteria:
- Add up to +2 points, if the code is properly commented, with inline comments and doc strings for functions.
- Add up to +2 points, if the code contains a good example for testing. 
- Add up to +3 points, if the code runs and works correctly. Copy the code into an IDE and test it with at least two different inputs. Attribute one point if the code is overall correct, but has some issues. Attribute three points if the code is fully correct and robust against different scenarios. 
Your resulting final score can be any value between 0 to 7. 

If both responses have a final score of <= 4, select one response and correct it manually in the text field. 
The corrected response must fulfill all criteria from above. 
"""

rating_tooltip = """\
- Add up to +2 points, if the code is properly commented, with inline comments and doc strings for functions.
- Add up to +2 points, if the code contains a good example for testing. 
- Add up to +3 points, if the code runs and works correctly. Copy the code into an IDE and test it with at least two different inputs. Attribute one point if the code works mostly correctly, but has some issues. Attribute three points if the code is fully correct and robust against different scenarios. 
"""

Cumulative ratings vs. Likert scales: Note that the guidelines above ask annotators to do cumulative ratings by adding points for explicit criteria. An alternative approach is a "Likert scale", where annotators are asked to rate responses on a continuous scale, e.g., from 1 (very bad) over 3 (mediocre) to 5 (very good). We generally recommend cumulative ratings, because they force you and the annotators to make quality criteria explicit, while just rating a response as "4" (good) is ambiguous and will be interpreted differently by different annotators.
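
For comparison, if you did want a Likert scale, only the values and description of the rating question would change. A minimal sketch using the same rg.RatingQuestion API as the code cell below:

# Alternative: a 1-5 Likert-scale question instead of the cumulative 0-7 rating used in this recipe
likert_question = rg.RatingQuestion(
    name="likert_response_1",
    title="Overall quality of the response of model 1:",
    description="1=very bad, 3=mediocre, 5=very good",
    values=[1, 2, 3, 4, 5],
    required=True,
)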

Tailor your Argilla dataset to your specific task

We can now create our own code-llm task with the fields, questions and metadata required for annotation. For more information on configuring the Argilla dataset, see the Argilla docs.

dataset_argilla_name = "code-llm"
workspace_name = "argilla"
reuse_existing_dataset = False  # for easier iterative testing

# Configure your dataset settings
settings = rg.Settings(
    # The overall annotation guidelines, which human annotators can refer back to inside of the interface
    guidelines="my guidelines",
    fields=[
        rg.TextField(name="instruction", title="Instruction:", use_markdown=True, required=True),
        rg.TextField(
            name="generation_1",
            title="Response model 1:",
            use_markdown=True,
            required=True,
        ),
        rg.TextField(
            name="generation_2",
            title="Response model 2:",
            use_markdown=True,
            required=True,
        ),
    ],
    # These are the questions we ask annotators about the fields in the dataset
    questions=[
        rg.RatingQuestion(
            name="score_response_1",
            title="Your score for the response of model 1:",
            description="0=very bad, 7=very good",
            values=[0, 1, 2, 3, 4, 5, 6, 7],
            required=True,
        ),
        rg.RatingQuestion(
            name="score_response_2",
            title="Your score for the response of model 2:",
            description="0=very bad, 7=very good",
            values=[0, 1, 2, 3, 4, 5, 6, 7],
            required=True,
        ),
        rg.LabelQuestion(
            name="which_response_corrected",
            title="If both responses score below 4, select a response to correct:",
            description="Select the response you will correct in the text field below.",
            labels=["Response 1", "Response 2", "Combination of both", "Neither"],
            required=False,
        ),
        rg.TextQuestion(
            name="correction",
            title="Paste the selected response below and correct it manually:",
            description="Your corrected response must fulfill all criteria from the annotation guidelines.",
            use_markdown=True,
            required=False,
        ),
        rg.TextQuestion(
            name="comments",
            title="Annotator Comments",
            description="Add any additional comments here. E.g.: edge cases, issues with the interface etc.",
            use_markdown=True,
            required=False,
        ),
    ],
    metadata=[
        rg.TermsMetadataProperty(
            name="source-dataset",
            title="Original dataset source",
        ),
    ],
    allow_extra_metadata=False,
)

if reuse_existing_dataset:
    dataset_argilla = client.datasets(dataset_argilla_name, workspace=workspace_name)
else:
    dataset_argilla = rg.Dataset(
        name=dataset_argilla_name,
        settings=settings,
        workspace=workspace_name,
    )
    if client.datasets(dataset_argilla_name, workspace=workspace_name) is not None:
        client.datasets(dataset_argilla_name, workspace=workspace_name).delete()
    dataset_argilla = dataset_argilla.create()

dataset_argilla

After running the code above, you will see the new custom code-llm dataset in Argilla (alongside any other datasets you might have created before).

Load the data into Argilla

At this point, the dataset is still empty. Let's load some data into it with the code below.

# Iterate over the samples in the dataset
records = [
    rg.Record(
        fields={
            "instruction": example["instructions"],
            "generation_1": example["response_model_1"],
            "generation_2": example["response_model_2"],
        },
        metadata={
            "source-dataset": "bigcode/self-oss-instruct-sc2-exec-filter-50k",
        },
        # Optional: add suggestions from an LLM-as-a-judge system
        # They will be indicated with a sparkle icon and shown as pre-filled responses
        # It will speed up manual annotation
        # suggestions=[
        #     rg.Suggestion(
        #         question_name="score_response_1",
        #         value=example["llm_judge_rating"],
        #         agent="llama-3-70b-instruct",
        #     ),
        # ],
    )
    for example in dataset
]

try:
    dataset_argilla.records.log(records)
except Exception as e:
    print("Exception:", e)

The Argilla UI for annotation will look similar to this:

Argilla UI

Annotate

That's it, we have created our Argilla dataset and we can now start annotating in the UI! By default, records are marked as completed once they have 1 annotation. See these guides on how to automatically distribute the annotation task and on annotating in Argilla.

Important: If you use Argilla in a HF Space, you need to activate persistent storage so that your data is stored safely and is not deleted automatically after a while. For production settings, make sure that persistent storage is activated before making any annotations to avoid data loss.
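
Persistent storage can be activated in the settings of your Space in the UI. If you prefer to request it from code, the huggingface_hub client exposes a helper for Space storage; the sketch below is an assumption-heavy illustration (the Space repo id is a placeholder, and the storage tiers are those offered by HF Spaces at the time of writing):

from huggingface_hub import HfApi

api = HfApi()
# Request the smallest persistent storage tier for your Argilla Space
# "your-username/your-argilla-space" is a placeholder; replace it with your Space repo id
api.request_space_storage(repo_id="your-username/your-argilla-space", storage="small")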

Download annotated data

After annotating, you can pull the data from Argilla and simply store and process it locally in any tabular format (see the docs here). You can also download a filtered version of the dataset (docs).

annotated_dataset = client.datasets(dataset_argilla_name, workspace=workspace_name)

hf_dataset = annotated_dataset.records.to_datasets()

# This HF dataset can then be formatted, stored and processed into any tabular data format
hf_dataset.to_pandas()
# Store the dataset locally
hf_dataset.to_csv("argilla-dataset-local.csv")  # Save as CSV
# hf_dataset.to_json("argilla-dataset-local.json")  # Save as JSON
# hf_dataset.save_to_disk("argilla-dataset-local")  # Save as a `datasets.Dataset` in the local filesystem
# hf_dataset.to_parquet()  # Save as Parquet

Next steps

That's it! You have created synthetic LLM data with the HF Inference API, created a dataset in Argilla, uploaded the LLM data to Argilla, evaluated/corrected the data, and after annotation you have downloaded the data in a simple tabular format for downstream use.

We have specifically designed the pipeline and the interface for two main use cases:

  1. Evaluation: You can now simply use the numeric scores in the score_response_1 and score_response_2 columns to calculate which model was better overall. You can also inspect responses that were rated very low or very high for a detailed error analysis. As you test or train different models, you can reuse this pipeline and track improvements of different models over time (see the sketch after this list for a minimal score comparison).
  2. Training: Once you have annotated enough data, you can create a train-test split from the data and fine-tune your own model. You can either use highly rated response texts for supervised fine-tuning with the TRL SFTTrainer, or you can directly use the ratings for preference-tuning techniques like DPO with the TRL DPOTrainer. See the TRL docs for the pros and cons of the different LLM fine-tuning techniques.
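
As a starting point for the evaluation use case, here is a minimal sketch that compares the mean scores of the two models. The score column names are assumptions based on the question names defined above; inspect the columns of your own export and adapt them accordingly:

# The column names below are assumptions; adapt them to the actual columns of your export
df = hf_dataset.to_pandas()
score_cols = ["score_response_1.responses", "score_response_2.responses"]


def to_score(value):
    # Responses may be stored as a single value or as a list of values from several annotators
    if isinstance(value, (list, tuple)):
        value = value[0] if len(value) > 0 else None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


for col in score_cols:
    if col not in df.columns:
        print(f"Column {col} not found; available columns: {list(df.columns)}")
        continue
    scores = [s for s in (to_score(v) for v in df[col]) if s is not None]
    print(col, "mean score:", sum(scores) / len(scores) if scores else "no scores yet")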

Adapt and improve: Many things can be improved to tailor this pipeline to your specific use case. For example, you could prompt an LLM to evaluate the outputs of the two LLMs with instructions that are very similar to the guidelines for human annotators (an "LLM-as-a-judge" approach). This can help further speed up your evaluation pipeline. See our LLM-as-a-judge recipe for an example implementation of LLM-as-a-judge and our overall Open-Source AI Cookbook for many other ideas.
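
As an illustration of the LLM-as-a-judge idea, here is a minimal sketch that reuses the query helper and the annotator guidelines from above to ask an LLM for a score. The judge model and the exact prompt wording are assumptions; for best results you would also apply the judge model's chat template as shown earlier, and the linked recipe provides a fuller implementation:

# Hypothetical judge model; any sufficiently strong instruction-tuned LLM can be used
judge_model = "meta-llama/Meta-Llama-3-70B-Instruct"
judge_api_url = "https://api-inference.huggingface.co/models/" + judge_model


def judge_response(instruction, response):
    # Reuse the human annotator guidelines as the core of the judge prompt
    judge_prompt = (
        annotator_guidelines
        + f"\n\nTask:\n{instruction}\n\nResponse:\n{response}\n\n"
        + "Reply with only your final score as a single integer between 0 and 7."
    )
    output = query(payload={"inputs": judge_prompt, "parameters": {**generation_params}}, api_url=judge_api_url)
    return output[0]["generated_text"].strip()


print(judge_response(instructions_lst[0], output_dic[models_to_compare[0]][0]))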
