Templates

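The chat pipeline guide introduced TextGenerationPipeline and the concept of a chat prompt, or chat template, for conversing with a model. Underneath this high-level pipeline is the apply_chat_template method. A chat template is part of the tokenizer, and it specifies how to convert a conversation into a single tokenizable string in the format the model expects.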

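In the example below, Mistral-7B-Instruct and Zephyr-7B are finetuned from the same base model, but they are trained with different chat formats. Without chat templates, you have to write formatting code for each model by hand, and even minor errors can hurt performance. Chat templates offer a universal way to format chat inputs for any model.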

Mistral

Zephyr
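
As a rough illustration of why hand-written formatting is error-prone, the sketch below formats the same conversation in two styles using plain Python. The Zephyr-style layout follows the output shown later in this guide; the Mistral-style layout is an approximation of the `[INST]` format, and its exact whitespace is an assumption. `format_zephyr_style` and `format_mistral_style` are hypothetical helpers, not part of Transformers.

```python
# Hypothetical helpers that imitate two chat formats by hand.
# They are not Transformers APIs; real templates live in the tokenizer.

def format_zephyr_style(messages, eos="</s>"):
    # Each turn: a role tag on its own line, then the content and an EOS token.
    return "".join(f"<|{m['role']}|>\n{m['content']}{eos}\n" for m in messages)


def format_mistral_style(messages, bos="<s>", eos="</s>"):
    # Approximation: user turns are wrapped in [INST] ... [/INST],
    # assistant turns follow directly and end with an EOS token.
    text = bos
    for m in messages:
        if m["role"] == "user":
            text += f"[INST] {m['content']} [/INST]"
        elif m["role"] == "assistant":
            text += f"{m['content']}{eos} "
        else:
            raise ValueError(f"unsupported role in this sketch: {m['role']}")
    return text


chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

zephyr_text = format_zephyr_style(chat)
mistral_text = format_mistral_style(chat)
print(zephyr_text)
print(mistral_text)
```

The two strings carry the same conversation but differ in every delimiter, which is the kind of detail a chat template takes care of for you.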

This guide explores apply_chat_template and chat templates in more detail.

apply_chat_template

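Chats should be structured as a list of dictionaries with role and content keys. The role key specifies the speaker (usually you or the system), and the content key contains the message. For the system role, content is a high-level description of how the model should behave and respond when you chat with it.

Pass your messages to apply_chat_template to tokenize and format them. Set add_generation_prompt to True to append the tokens that mark the start of the assistant's reply.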

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto", torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
# Render the chat with the model's template and tokenize it in one step.
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
# Decode the token IDs to inspect the formatted prompt.
print(tokenizer.decode(tokenized_chat[0]))

<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>

Now pass the tokenized chat to generate() to generate a response.

outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=128)
print(tokenizer.decode(outputs[0]))

<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>
Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all.

add_generation_prompt

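The add_generation_prompt parameter adds the tokens that indicate the start of a response. This ensures the chat model generates an assistant response instead of continuing the user's message.

Not all models require generation prompts. Some models, like Llama, don't have any special tokens before the assistant response. In this case, add_generation_prompt has no effect.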

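# Assumes `tokenizer` uses a ChatML-style chat template, which is what the output below reflects.
messages = [{"role": "user", "content": "Hi there!"}, {"role": "assistant", "content": "Nice to meet you!"}, {"role": "user", "content": "Can I ask a question?"}]
formatted_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
print(formatted_chat)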

<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
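
To make the difference concrete, here is a minimal pure-Python sketch that mimics a ChatML-style template with and without the generation prompt. `render_chatml` is a hypothetical helper written for illustration; Transformers itself renders templates with Jinja, so treat this as a model of the behavior rather than the implementation.

```python
def render_chatml(messages, add_generation_prompt=False):
    # Wrap every message in <|im_start|>role ... <|im_end|>.
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model writes a reply
        # instead of continuing the user's message.
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"},
]

without_prompt = render_chatml(messages, add_generation_prompt=False)
with_prompt = render_chatml(messages, add_generation_prompt=True)
print(with_prompt)
```

In this sketch, enabling the flag changes only the tail of the string: an opening `<|im_start|>assistant` line is appended after the last user message.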

continue_final_message

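The continue_final_message parameter controls whether the final message in the chat should be continued instead of starting a new message. It removes the end-of-sequence tokens so that the model continues generating from the final message.

This is useful for "prefilling" a model response. In the example below, the model generates text that continues the JSON string rather than starting a new message. When you know how a reply should begin, prefilling can improve instruction-following accuracy.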

chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]

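# return_tensors="pt" gives generate() tensors rather than Python lists.
formatted_chat = tokenizer.apply_chat_template(chat, tokenize=True, return_dict=True, return_tensors="pt", continue_final_message=True)
model.generate(**formatted_chat.to(model.device))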

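You shouldn't use add_generation_prompt and continue_final_message together. The former adds tokens that start a new message, while the latter removes end-of-sequence tokens. Using them together raises an error.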
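
The sketch below imitates this behavior for a ChatML-style format in plain Python: the final message is left open (no `<|im_end|>`) when continuing, and combining both flags is rejected. `render_chatml` and its error message are hypothetical; the real logic lives inside apply_chat_template.

```python
def render_chatml(messages, add_generation_prompt=False, continue_final_message=False):
    if add_generation_prompt and continue_final_message:
        # One flag starts a new message, the other keeps the last one open,
        # so they contradict each other.
        raise ValueError(
            "add_generation_prompt and continue_final_message are mutually exclusive"
        )
    parts = []
    for i, m in enumerate(messages):
        is_last = i == len(messages) - 1
        turn = f"<|im_start|>{m['role']}\n{m['content']}"
        if not (continue_final_message and is_last):
            turn += "<|im_end|>\n"  # close every turn except a prefilled final one
        parts.append(turn)
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]

prefilled = render_chatml(chat, continue_final_message=True)
print(prefilled)
```

Because the rendered string ends inside the assistant turn, a model would pick up generation right after the prefilled text.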

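TextGenerationPipeline sets add_generation_prompt to True by default to start a new message. However, if the final message in the chat has the "assistant" role, it assumes the message is a prefill and switches to continue_final_message=True instead. This is because most models don't support multiple consecutive assistant messages. To override this behavior, explicitly pass continue_final_message to the pipeline.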
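
The default described above can be summarized as a small decision function. This is a sketch of the rule as stated in this guide, not the pipeline's actual source; `choose_generation_flags` is a hypothetical name.

```python
def choose_generation_flags(messages, continue_final_message=None):
    """Return (add_generation_prompt, continue_final_message) for a chat."""
    if continue_final_message is None:
        # No explicit choice: treat a trailing assistant message as a prefill.
        continue_final_message = bool(messages) and messages[-1]["role"] == "assistant"
    # The two flags are never enabled together.
    return (not continue_final_message, continue_final_message)


user_last = [{"role": "user", "content": "Hi"}]
assistant_last = user_last + [{"role": "assistant", "content": "Ahoy"}]

new_turn = choose_generation_flags(user_last)
prefill = choose_generation_flags(assistant_last)
override = choose_generation_flags(assistant_last, continue_final_message=False)
print(new_turn, prefill, override)
```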

Multiple templates

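A model may have several different templates for different use cases. For example, a model may have separate templates for regular chat, tool use, and RAG.

When there are multiple templates, the chat template is a dictionary. Each key corresponds to the name of a template. apply_chat_template handles multiple templates based on their names. In most cases, it looks for a template named default, and raises an error if it can't find one.

For tool calling, if the user passes a tools parameter and a tool_use template exists, the tool-calling template is used instead of default.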

To access templates with other names, pass the template name to the chat_template parameter in apply_chat_template. For example, if you're using a RAG template, set chat_template="rag".
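
The lookup rules above can be sketched as follows. This mirrors the behavior described in this section rather than the library's implementation; `select_template` and the placeholder template strings are hypothetical, and the sketch only handles template names.

```python
def select_template(templates, chat_template=None, tools=None):
    """Pick a template string from a dict of named templates."""
    if chat_template is not None:
        # An explicit name always wins, e.g. chat_template="rag".
        if chat_template not in templates:
            raise ValueError(f"no template named {chat_template!r}")
        return templates[chat_template]
    if tools is not None and "tool_use" in templates:
        # Tools were passed and a dedicated tool-calling template exists.
        return templates["tool_use"]
    if "default" in templates:
        return templates["default"]
    raise ValueError("no template named 'default' and no template name was given")


templates = {
    "default": "<default template>",
    "tool_use": "<tool-use template>",
    "rag": "<rag template>",
}

picked_default = select_template(templates)
picked_tools = select_template(templates, tools=[{"name": "get_weather"}])
picked_rag = select_template(templates, chat_template="rag")
print(picked_default, picked_tools, picked_rag)
```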

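Managing multiple templates can be confusing though, so we recommend using a single template for all use cases. Use Jinja statements like if tools is defined and {% macro %} definitions to wrap multiple code paths in a single template.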

Template selection

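It is important to set a chat template format that matches the format the model was pretrained on, otherwise performance may suffer. Even if you train the model further, performance is best when the chat tokens are kept constant.

If you're training a model from scratch or finetuning a model for chat, however, you have more freedom in choosing a template. For example, ChatML is a popular format that is flexible enough to handle many use cases. It even includes support for generation prompts, but it doesn't add beginning-of-sequence (BOS) or end-of-sequence (EOS) tokens. If your model expects BOS and EOS tokens, set add_special_tokens=True and make sure to add them to your template.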

{%- for message in messages %}
    {{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}
{%- endfor %}

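Set the template with the following logic to support generation prompts. The template wraps each message in <|im_start|> and <|im_end|> tokens and writes the role as a plain string. This lets you easily customize the roles you want to train with.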

tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"

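The user, system, and assistant roles are the standard roles in chat templates. We recommend using these roles when it makes sense, especially if you're using your model with TextGenerationPipeline.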

<|im_start|>system
You are a helpful chatbot that will do its best not to say anything so stupid that people tweet about it.<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
I'm doing great!<|im_end|>

Model training

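Training a model with a chat template is a good way to ensure the template matches the tokens the model is trained on. Apply the chat template as a preprocessing step to your dataset. Set add_generation_prompt=False because the additional tokens that prompt an assistant response aren't helpful during training.

An example of preprocessing a dataset with a chat template is shown below.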

from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

chat1 = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    {"role": "assistant", "content": "The sun."}
]
chat2 = [
    {"role": "user", "content": "Which is bigger, a virus or a bacterium?"},
    {"role": "assistant", "content": "A bacterium."}
]

dataset = Dataset.from_dict({"chat": [chat1, chat2]})
dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
print(dataset['formatted_chat'][0])

<|user|>
Which is bigger, the moon or the sun?</s>
<|assistant|>
The sun.</s>

After this step, you can continue following the training recipe for causal language models using the formatted_chat column.

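Some tokenizers add special <bos> and <eos> tokens. Chat templates should already include all the necessary special tokens, and adding extra special tokens is often incorrect or duplicative, which hurts model performance. When you format text with apply_chat_template(tokenize=False) and tokenize it in a separate step, set add_special_tokens=False in that tokenization call to avoid duplicating them.

formatted_chat = tokenizer.apply_chat_template(messages, tokenize=False)
tokenized_chat = tokenizer(formatted_chat, add_special_tokens=False)

This isn't an issue when you use apply_chat_template(tokenize=True).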
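
A toy example of the duplication problem: the encoder below is a stand-in for a real tokenizer (it simply splits on whitespace and optionally prepends a BOS token), and the rendered chat string is assumed to already start with `<s>` because the template added it. `toy_encode` and the sample string are hypothetical.

```python
BOS = "<s>"


def toy_encode(text, add_special_tokens=True):
    # Stand-in for a tokenizer: whitespace split, optional BOS in front.
    tokens = text.split()
    return [BOS] + tokens if add_special_tokens else tokens


# Assume the chat template already emitted the BOS token.
rendered = "<s> [INST] Hello [/INST]"

duplicated = toy_encode(rendered, add_special_tokens=True)
correct = toy_encode(rendered, add_special_tokens=False)
print(duplicated)
print(correct)
```

The first call produces two BOS tokens at the start of the sequence, which is the duplication the paragraph above warns about.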
