Pitfalls in Tokenizer Behavior: Details Every Developer Should Know

Community Article · Published April 18, 2025

This is a continuously updated post about things you need to know when developing with tokenizers. These are lessons I've learned in practice, and I hope they help you avoid the mistakes I've made.

BOS token

1. Not all tokenizers have a BOS token

For example, Qwen/Qwen2.5-0.5B has no bos_token:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
>>> tokenizer.bos_token is not None
False

microsoft/Phi-3-mini-128k-instruct, on the other hand, does have one.

>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
>>> tokenizer.bos_token is not None
True
>>> tokenizer.bos_token
'<s>'

2. A tokenizer may have a BOS token but not use it

For example, microsoft/Phi-3-mini-128k-instruct and CohereLabs/aya-expanse-8b both have a bos_token, but only CohereLabs/aya-expanse-8b actually uses it.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
>>> tokenizer.bos_token, tokenizer.bos_token_id
('<s>', 1)
>>> input_ids = tokenizer("Beautiful is better than ugly")["input_ids"]
>>> input_ids
[25685, 338, 2253, 1135, 22769]
>>> tokenizer.bos_token_id in input_ids
False

>>> input_ids = tokenizer.apply_chat_template([{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."}])
>>> input_ids
[32010, 1724, 338, 2253, 1135, 22769, 29973, 32007, 32001, 25685, 29889, 32007, 32000]
>>> tokenizer.bos_token_id in input_ids
False
>>> tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b")
>>> tokenizer.bos_token, tokenizer.bos_token_id
('<BOS_TOKEN>', 5)
>>> input_ids = tokenizer("Beautiful is better than ugly")["input_ids"]
>>> input_ids
[5, 82653, 1801, 5329, 2924, 82092]
>>> tokenizer.bos_token_id in input_ids
True

>>> input_ids = tokenizer.apply_chat_template([{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."}])
>>> input_ids
[5, 255000, 255006, 11214, 1801, 5329, 2924, 82092, 38, 255001, 255000, 255007, 82653, 21, 255001]
>>> tokenizer.bos_token_id in input_ids
True
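
If you need to handle arbitrary models, a quick check like the following tells you whether a tokenizer actually prepends its BOS token (uses_bos_token is a hypothetical helper written for illustration, not part of transformers):

>>> def uses_bos_token(tokenizer):
...     """Return True if encoding plain text prepends the BOS token."""
...     if tokenizer.bos_token_id is None:
...         return False
...     return tokenizer("test")["input_ids"][0] == tokenizer.bos_token_id
...
>>> uses_bos_token(tokenizer)  # CohereLabs/aya-expanse-8b
True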

EOS token

3. Tokenizing doesn't add an EOS token

When you tokenize a string, an EOS token is not automatically appended.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
>>> tokenizer.eos_token, tokenizer.eos_token_id
('<|endoftext|>', 151643)
>>> input_ids = tokenizer("Beautiful is better than ugly")["input_ids"]
>>> input_ids
[46518, 374, 2664, 1091, 27261]
>>> input_ids[-1] == tokenizer.eos_token_id
False
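
If your training samples should end with EOS, append it yourself. A minimal sketch (some data collators also handle this for you):

>>> input_ids + [tokenizer.eos_token_id]  # append EOS manually
[46518, 374, 2664, 1091, 27261, 151643]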

4. Applying the chat template may add an EOS token, but sometimes it doesn't, and sometimes it adds one that isn't at the end

Applying a chat template may add an EOS token, but not always, and not always at the end. The behavior varies from model to model, so it's worth checking where the EOS token actually lands (see the helper sketch after this list).

  • Some templates add EOS at the end, like meta-llama/Llama-3.2-1B-Instruct:

    >>> from transformers import AutoTokenizer
    >>> messages = [
    ...     {"role": "user", "content": "What is better than ugly?"},
    ...     {"role": "assistant", "content": "Beautiful."},
    ... ]
    >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
    >>> tokenizer.eos_token, tokenizer.eos_token_id
    ('<|eot_id|>', 128009)
    >>> input_ids = tokenizer.apply_chat_template(messages)
    >>> input_ids
    [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 972, 5186, 220, 2366, 20, 271, 128009, 128006, 882, 128007, 271, 3923, 374, 2731, 1109, 28360, 30, 128009, 128006, 78191, 128007, 271, 47618, 13, 128009]
    >>> input_ids[-1] == tokenizer.eos_token_id
    True
    
  • Some don't add EOS at all, like databricks/dbrx-instruct:

    >>> tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct")
    >>> tokenizer.eos_token, tokenizer.eos_token_id
    ('<|endoftext|>', 100257)
    >>> input_ids = tokenizer.apply_chat_template(messages)
    >>> input_ids
    [100278, 9125, 198, 2675, 527, 6078, 46913, 11, 3549, 555, 423, 2143, 78889, 13, 1472, 1051, 1566, 6177, 304, 6790, 220, 2366, 18, 13, 1472, 4320, 4860, 3196, 389, 2038, 2561, 709, 311, 430, 1486, 627, 57489, 15843, 36, 66024, 77273, 50, 5257, 66024, 57828, 43486, 2794, 23233, 29863, 11, 719, 3493, 17879, 14847, 311, 810, 6485, 323, 1825, 84175, 4860, 627, 2675, 7945, 449, 5370, 9256, 11, 505, 4477, 311, 11058, 320, 985, 51594, 369, 2082, 10215, 2001, 6227, 311, 1005, 55375, 449, 2082, 11, 4823, 11, 323, 12920, 4390, 7, 2675, 656, 539, 617, 1972, 7394, 828, 2680, 477, 2082, 11572, 17357, 13, 1472, 5766, 23473, 67247, 323, 3493, 24770, 39555, 389, 20733, 13650, 13, 1472, 656, 539, 3493, 5609, 24142, 11, 45319, 11, 477, 3754, 9908, 323, 656, 539, 82791, 713, 3649, 315, 701, 4967, 828, 29275, 2028, 374, 701, 1887, 10137, 11, 51346, 701, 14847, 13, 3234, 539, 5905, 433, 11, 1120, 6013, 311, 279, 1217, 13, 1442, 499, 1505, 6261, 7556, 922, 420, 1984, 11, 3009, 13, 1472, 1288, 387, 30438, 36001, 323, 6118, 430, 3445, 539, 45391, 420, 627, 57489, 9503, 4276, 386, 72983, 4230, 3083, 10245, 45613, 52912, 21592, 66873, 6781, 38873, 3247, 45613, 3507, 20843, 9109, 393, 3481, 691, 1863, 5257, 3247, 14194, 13575, 68235, 13, 100279, 198, 100278, 882, 198, 3923, 374, 2731, 1109, 28360, 30, 100279, 198, 100278, 78191, 198, 47618, 13, 100279]
    >>> input_ids[-1] == tokenizer.eos_token_id
    False
    
  • Some add EOS, but not at the very end, like Qwen/Qwen2.5-0.5B-Instruct:

    >>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
    >>> tokenizer.eos_token, tokenizer.eos_token_id
    ('<|im_end|>', 151645)
    >>> input_ids = tokenizer.apply_chat_template(messages)
    >>> input_ids
    [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 3838, 374, 2664, 1091, 27261, 30, 151645, 198, 151644, 77091, 198, 46518, 13, 151645, 198]
    >>> input_ids[-1] == tokenizer.eos_token_id
    False
    >>> input_ids[-2] == tokenizer.eos_token_id
    True
    
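Given this variety, a small helper makes it easy to see where the EOS token ends up for any model (eos_positions is a hypothetical helper written for illustration, not part of transformers):

>>> def eos_positions(tokenizer, messages):
...     """Return the indices at which the EOS token appears after templating."""
...     input_ids = tokenizer.apply_chat_template(messages)
...     return [i for i, token_id in enumerate(input_ids) if token_id == tokenizer.eos_token_id]
...
>>> eos_positions(tokenizer, messages)  # Qwen/Qwen2.5-0.5B-Instruct
[19, 30, 37]
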

PAD token

5. When pad_token equals eos_token

Setting pad_token to the same value as eos_token is common, but it calls for extra care when masking or preparing labels.

For example:

labels = input_ids.clone()
labels[input_ids == tokenizer.pad_token_id] = -100  # ⚠️ Not safe if PAD == EOS

If pad_token_id == eos_token_id, this also masks the actual eos_token occurrences, which usually carry meaning and should not be ignored. When the two share an ID, make sure your masking logic doesn't inadvertently remove valid eos_token positions.
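
One safer pattern is to mask using the attention mask returned by the tokenizer instead of matching on pad_token_id. A minimal sketch, assuming you're padding a batch for causal LM training:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer.pad_token = tokenizer.eos_token  # the common PAD == EOS setup

texts = ["Beautiful is better than ugly", "Simple is better than complex."]
batch = tokenizer(texts, padding=True, return_tensors="pt")

labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # ✅ masks only true padding, keeps EOS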

Chat templates

6. Applying the chat template is not a homomorphism with respect to concatenation

In other words, you cannot apply the template to the prompt and the completion separately and then concatenate the results; that will not produce the correct output.

This means you should never apply the chat template to a standalone completion.

completion = tokenizer.apply_chat_template(completion)  # ❌ No

For the prompt, you should use continue_final_message=True or add_generation_prompt=True:

prompt = tokenizer.apply_chat_template(prompt, continue_final_message=True)  # ✅ OK
prompt = tokenizer.apply_chat_template(prompt, add_generation_prompt=True)   # ✅ OK
prompt = tokenizer.apply_chat_template(prompt)                               # ❌ NO
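
If you need both prompt and completion ids for training, one common approach is to template the prompt with add_generation_prompt=True and take the completion as the suffix of the fully templated conversation. A sketch, assuming the templated prompt is a prefix of the templated conversation (true for most templates, but worth asserting):

prompt = [{"role": "user", "content": "What is better than ugly?"}]
full = prompt + [{"role": "assistant", "content": "Beautiful."}]
prompt_ids = tokenizer.apply_chat_template(prompt, add_generation_prompt=True)
full_ids = tokenizer.apply_chat_template(full)
assert full_ids[:len(prompt_ids)] == prompt_ids  # verify the prefix assumption
completion_ids = full_ids[len(prompt_ids):]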

7. Chat templating and tokenization don't compose, because of special tokens

In other words, because of special tokens (especially the BOS token), you can't apply the chat template and then tokenize the result as-is.

If you try this:

>>> text = tokenizer.apply_chat_template(messages, tokenize=False)
>>> tokenizer(text)  # ❌ NO

it won't give the expected result, because both apply_chat_template and the tokenizer add special tokens. Instead, disable the addition of special tokens when tokenizing:

>>> text = tokenizer.apply_chat_template(messages, tokenize=False)
>>> tokenizer(text, add_special_tokens=False)  # ✅ OK

Example with CohereLabs/aya-expanse-8b:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b")
>>> messages = [
...     {"role": "user", "content": "What is better than ugly?"},
...     {"role": "assistant", "content": "Beautiful."},
... ]
>>> text = tokenizer.apply_chat_template(messages, tokenize=False)
>>> tokenizer(text)["input_ids"]  # ❌ No
[5, 5, 255000, 255006, 11214, 1801, 5329, 2924, 82092, 38, 255001, 255000, 255007, 82653, 21, 255001]
>>> tokenizer(text, add_special_tokens=False)["input_ids"]  # ✅ OK
[5, 255000, 255006, 11214, 1801, 5329, 2924, 82092, 38, 255001, 255000, 255007, 82653, 21, 255001]

8. Adding a chat template isn't enough; you also need to update the EOS token

When fine-tuning a base model and adding a chat template, the template usually includes a special "end of turn" token, for example <|im_end|> in the case of Qwen/Qwen2.5-0.5B-Instruct. This token marks the end of a message in the chat format.

Here's a simplified example of a chat template, written in Jinja syntax:

{%- for message in messages %}
    {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- endfor %}

It's critical to update the model's EOS token to match the "end of turn" token used in the template. If they don't match, you can run into problems like endless generation.

tokenizer.chat_template = """\
{%- for message in messages %}
    {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- endfor %}
"""
tokenizer.eos_token = "<|im_end|>"  # ⚠️ Critical step
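
Depending on your setup, the model may also need to know about the new EOS token. A sketch, assuming a causal LM fine-tuning setup; if the token isn't already in the vocabulary, add it and resize the embeddings:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Register <|im_end|> as the EOS token; add it to the vocab if missing.
num_added = tokenizer.add_special_tokens({"eos_token": "<|im_end|>"})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

# Keep the model's config and generation config in sync with the tokenizer.
model.config.eos_token_id = tokenizer.eos_token_id
model.generation_config.eos_token_id = tokenizer.eos_token_id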
