Formatting Datasets for Chat Template Compatibility

Community Article, published June 28, 2024

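When fine-tuning a conversational model on a dataset, the data must be formatted correctly so that it works with whatever chat template the tokenizer uses. This article walks through a Python function that converts the nroggendorff/mayo dataset on Hugging Face into a compatible format.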

The `format_prompts` function

Here is a breakdown of the format_prompts function:

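# Assumes `tokenizer` is already defined, e.g. a Hugging Face tokenizer
# that has a chat template.
def format_prompts(examples):
    texts = []
    for text in examples['text']:
        conversation = []
        # Each turn ends with '<|end|>', so splitting yields alternating
        # user and bot segments plus a trailing leftover piece.
        parts = text.split('<|end|>')
        for i in range(0, len(parts) - 1, 2):
            # Even index: user prompt; odd index: bot response.
            prompt = parts[i].replace("<|user|>", "")
            response = parts[i + 1].replace("<|bot|>", "")
            conversation.append({"role": "user", "content": prompt})
            conversation.append({"role": "assistant", "content": response})
        # Render the conversation as a string; tokenization happens later.
        formatted_conversation = tokenizer.apply_chat_template(conversation, tokenize=False)
        texts.append(formatted_conversation)
    return {"text": texts}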

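The function takes an examples argument, which is expected to be a dictionary with a "text" key holding a list of conversation strings.

We initialize an empty list called texts to store the formatted conversations.
We iterate over each text in examples['text'].
- We split text on the delimiter '<|end|>', which breaks the conversation into parts.
- We iterate over parts with a step of 2, assuming that even indices hold user prompts and odd indices hold bot responses.
- We extract prompt and response by removing the "<|user|>" and "<|bot|>" tags, respectively.
- We append prompt and response to the conversation list as dictionaries with "role" and "content" keys.
Once all parts are processed, we apply the chat template to conversation with tokenizer.apply_chat_template(), setting tokenize to False so that no tokenization happens at this stage.
We append formatted_conversation to the texts list.
Finally, we return a dictionary whose "text" key holds the list of formatted conversations.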
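The parsing steps can be tried in isolation, without a tokenizer. The sketch below is only an illustration: the helper name parse_conversation and the sample string are hypothetical and not part of the original article, and the sample assumes the dataset rows follow the '<|user|>…<|end|><|bot|>…<|end|>' layout that format_prompts expects.

```python
def parse_conversation(text):
    """Split one raw conversation string into a list of chat messages.

    Hypothetical helper mirroring the inner loop of format_prompts.
    """
    conversation = []
    parts = text.split('<|end|>')
    # The piece after the final '<|end|>' is a leftover, so the loop
    # stops one element short of the end of the list.
    for i in range(0, len(parts) - 1, 2):
        prompt = parts[i].replace("<|user|>", "")
        response = parts[i + 1].replace("<|bot|>", "")
        conversation.append({"role": "user", "content": prompt})
        conversation.append({"role": "assistant", "content": response})
    return conversation


# Made-up sample in the assumed layout, with two user/bot exchanges.
sample = "<|user|>Hi<|end|><|bot|>Hello!<|end|><|user|>Bye<|end|><|bot|>See you<|end|>"
messages = parse_conversation(sample)
for message in messages:
    print(message)
```

Note that the tags are removed but surrounding whitespace is left untouched, so any newlines around a turn in the raw text would carry over into the message content.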

Usage

To use the format_prompts function, pass it to the dataset's map method with batched=True so that it receives batches of examples.

from datasets import load_dataset

dataset = load_dataset("nroggendorff/mayo", split="train")
dataset = dataset.map(format_prompts, batched=True)

dataset['text'][2] # Check to see if the fields were formatted correctly
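Before mapping over the full dataset, it can help to dry-run the function on a tiny hand-written batch. The sketch below is an assumption-laden illustration: StandInTokenizer is a hypothetical class, not part of transformers, and the ChatML-like layout it produces is chosen only for the demo. A real tokenizer would render whatever chat template its model defines.

```python
class StandInTokenizer:
    """Hypothetical stand-in that mimics only the call used by format_prompts."""

    def apply_chat_template(self, conversation, tokenize=False):
        # ChatML-like layout, assumed for illustration only.
        return "".join(
            f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
            for m in conversation
        )


tokenizer = StandInTokenizer()


def format_prompts(examples):
    # Same logic as the article's function, repeated so the sketch runs on its own.
    texts = []
    for text in examples['text']:
        conversation = []
        parts = text.split('<|end|>')
        for i in range(0, len(parts) - 1, 2):
            prompt = parts[i].replace("<|user|>", "")
            response = parts[i + 1].replace("<|bot|>", "")
            conversation.append({"role": "user", "content": prompt})
            conversation.append({"role": "assistant", "content": response})
        texts.append(tokenizer.apply_chat_template(conversation, tokenize=False))
    return {"text": texts}


# With batched=True, map passes a dict of lists shaped like this one.
batch = {"text": ["<|user|>Hi<|end|><|bot|>Hello!<|end|>"]}
result = format_prompts(batch)
print(result["text"][0])
```

The returned dictionary has one formatted string per input row, which is the shape map needs in order to overwrite the "text" column.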

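Applying this formatting step makes the dataset compatible with a variety of chat templates, which makes it easier to fine-tune conversational models for different use cases.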
