WordPiece 分词

WordPiece 是 Google 开发的用于预训练 BERT 的分词算法。它已被用于许多基于 BERT 的 Transformer 模型，例如 DistilBERT、MobileBERT、Funnel Transformers 和 MPNET。它在训练方面与 BPE 非常相似，但实际的分词过程有所不同。

💡 本节深入探讨 WordPiece，甚至展示了完整的实现。如果您只想了解分词算法的概述，可以跳到最后。

训练算法

⚠️ Google 从未开源其 WordPiece 训练算法的实现，因此以下内容是我们根据已发布文献的最佳猜测。它可能不完全准确。

与 BPE 一样，WordPiece 从一个小词汇表开始，其中包括模型使用的特殊标记和初始字母表。由于它通过添加前缀（如 BERT 的 ##）来识别子词，因此每个单词最初都是通过将该前缀添加到单词中的所有字符来分割的。例如，"word" 的分割方式如下

w ##o ##r ##d

因此，初始字母表包含单词开头出现的所有字符，以及单词内部以 WordPiece 前缀开头的字符。

然后，与 BPE 一样，WordPiece 学习合并规则。主要区别在于选择要合并的配对的方式。WordPiece 不会选择出现频率最高的配对，而是使用以下公式计算每个配对的分数 $\mathrm{score} = (\mathrm{freq\_of\_pair}) / (\mathrm{freq\_of\_first\_element} \times \mathrm{freq\_of\_second\_element})$

通过将配对的频率除以其各个部分的频率乘积，该算法优先考虑合并单个部分在词汇表中出现频率较低的配对。例如，即使配对 ("un", "##able") 在词汇表中出现频率很高，它也不会一定合并，因为配对 "un" 和 "##able" 可能分别出现在许多其他单词中，并且具有较高的频率。相反，像 ("hu", "##gging") 这样的配对可能更快地合并（假设单词“hugging”在词汇表中经常出现），因为 "hu" 和 "##gging" 可能分别出现频率较低。

让我们看一下我们在 BPE 训练示例中使用的相同词汇表

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

这里的分割将是

("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##g" "##s", 5)

因此，初始词汇表将是 ["b", "h", "p", "##g", "##n", "##s", "##u"]（如果我们暂时忽略特殊标记）。出现频率最高的配对是 ("##u", "##g")（出现 20 次），但 "##u" 的单个频率非常高，因此它的分数不是最高（为 1 / 36）。所有包含 "##u" 的配对实际上都具有相同的分数（1 / 36），因此最高的分数属于配对 ("##g", "##s") - 唯一不包含 "##u" 的配对 - 为 1 / 20，第一个学习到的合并规则是 ("##g", "##s") -> ("##gs")。

请注意，当我们合并时，会删除两个标记之间的 ##，因此我们将 "##gs" 添加到词汇表中，并在语料库的单词中应用合并。

Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs"]
Corpus: ("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##gs", 5)

此时，"##u" 位于所有可能的配对中，因此它们都具有相同的分数。假设在这种情况下，第一个配对被合并，因此 ("h", "##u") -> "hu"。这将我们带到

Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu"]
Corpus: ("hu" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("hu" "##gs", 5)

然后下一个最佳分数由 ("hu", "##g") 和 ("hu", "##gs") 共享（分数为 1/15，相比之下，所有其他配对的分数为 1/21），因此具有最高分数的第一个配对被合并

Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug"]
Corpus: ("hug", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("hu" "##gs", 5)

我们将继续这样，直到达到所需的词汇表大小。

✏️ **现在轮到你了！** 下一个合并规则是什么？

分词算法

WordPiece 和 BPE 在词语切分方面有所不同，WordPiece 仅保存最终的词汇表，而不保存学习到的合并规则。从要切分的词语开始，WordPiece 会找到词汇表中从开头开始的最长子词，然后进行切分。例如，如果我们使用上面示例中学习到的词汇表，对于词语 "hugs"，从开头开始的词汇表中最长子词是 "hug"，因此我们在那里进行切分并得到 ["hug", "##s"]。然后我们继续使用 "##s"，它在词汇表中，因此 "hugs" 的词语切分结果为 ["hug", "##s"]。

使用 BPE，我们会按照学习到的合并顺序进行合并并将其切分为 ["hu", "##gs"]，因此编码不同。

再举一个例子，我们来看看词语 "bugs" 如何进行切分。"b" 是从词语开头开始的词汇表中最长子词，因此我们在那里进行切分并得到 ["b", "##ugs"]。然后 "##u" 是从 "##ugs" 开头开始的词汇表中最长子词，因此我们在那里进行切分并得到 ["b", "##u, "##gs"]。最后，"##gs" 在词汇表中，因此最后一个列表就是 "bugs" 的词语切分结果。

当词语切分到无法在词汇表中找到子词的阶段时，整个词语将被切分为未知词 — 例如，"mug" 将被切分为 ["[UNK]"]，"bum" 也是如此（即使我们可以从 "b" 和 "##u" 开始，"##m" 不在词汇表中，最终的切分结果将是 ["[UNK]"]，而不是 ["b", "##u", "[UNK]"]）。这是与 BPE 的另一个区别，BPE 只会将词汇表中没有的单个字符归类为未知词。

✏️ 现在轮到你了！ 词语 "pugs" 将如何进行切分？

实现 WordPiece

现在让我们来看看 WordPiece 算法的实现。与 BPE 一样，这只是一个教学示例，你无法在大型语料库上使用它。

我们将使用与 BPE 示例相同的语料库。

corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

首先，我们需要将语料库预切分为词语。由于我们正在复制 WordPiece 切分器（例如 BERT），因此我们将使用 bert-base-cased 切分器进行预切分。

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

然后我们计算语料库中每个词语的频率，就像我们进行预切分一样。

from collections import defaultdict

word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

word_freqs

defaultdict(
    int, {'This': 3, 'is': 2, 'the': 1, 'Hugging': 1, 'Face': 1, 'Course': 1, '.': 4, 'chapter': 1, 'about': 1,
    'tokenization': 1, 'section': 1, 'shows': 1, 'several': 1, 'tokenizer': 1, 'algorithms': 1, 'Hopefully': 1,
    ',': 1, 'you': 1, 'will': 1, 'be': 1, 'able': 1, 'to': 1, 'understand': 1, 'how': 1, 'they': 1, 'are': 1,
    'trained': 1, 'and': 1, 'generate': 1, 'tokens': 1})

正如我们之前所见，字母表是由所有词语的首字母以及以 ## 为前缀的词语中出现的其他字母组成的唯一集合。

alphabet = []
for word in word_freqs.keys():
    if word[0] not in alphabet:
        alphabet.append(word[0])
    for letter in word[1:]:
        if f"##{letter}" not in alphabet:
            alphabet.append(f"##{letter}")

alphabet.sort()
alphabet

print(alphabet)

['##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k', '##l', '##m', '##n', '##o', '##p', '##r', '##s',
 '##t', '##u', '##v', '##w', '##y', '##z', ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'g', 'h', 'i', 's', 't', 'u',
 'w', 'y']

我们还在该词汇表的开头添加模型使用的特殊词元。对于 BERT，它是 ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] 列表。

vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()

接下来，我们需要将每个词语进行切分，并将所有非首字母用 ## 作为前缀。

splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs.keys()
}

现在我们已经准备好进行训练，让我们编写一个函数来计算每对词语的分数。我们需要在训练的每个步骤中使用它。

def compute_pair_scores(splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            letter_freqs[split[0]] += freq
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq
        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores

让我们看看初始切分后该字典的一部分。

pair_scores = compute_pair_scores(splits)
for i, key in enumerate(pair_scores.keys()):
    print(f"{key}: {pair_scores[key]}")
    if i >= 5:
        break

('T', '##h'): 0.125
('##h', '##i'): 0.03409090909090909
('##i', '##s'): 0.02727272727272727
('i', '##s'): 0.1
('t', '##h'): 0.03571428571428571
('##h', '##e'): 0.011904761904761904

现在，找到得分最高的词对只需要一个快速循环。

best_pair = ""
max_score = None
for pair, score in pair_scores.items():
    if max_score is None or max_score < score:
        best_pair = pair
        max_score = score

print(best_pair, max_score)

('a', '##b') 0.2

因此，要学习的第一个合并是 ('a', '##b') -> 'ab'，我们将 'ab' 添加到词汇表中。

vocab.append("ab")

为了继续，我们需要在 splits 字典中应用该合并。让我们为此编写另一个函数。

def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                merge = a + b[2:] if b.startswith("##") else a + b
                split = split[:i] + [merge] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

我们可以看看第一个合并的结果。

splits = merge_pair("a", "##b", splits)
splits["about"]

['ab', '##o', '##u', '##t']

现在我们拥有了所有必需的工具，可以循环执行，直到我们学习了所有想要的合并。让我们目标词汇表大小为 70。

vocab_size = 70
while len(vocab) < vocab_size:
    scores = compute_pair_scores(splits)
    best_pair, max_score = "", None
    for pair, score in scores.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score
    splits = merge_pair(*best_pair, splits)
    new_token = (
        best_pair[0] + best_pair[1][2:]
        if best_pair[1].startswith("##")
        else best_pair[0] + best_pair[1]
    )
    vocab.append(new_token)

然后我们可以查看生成的词汇表。

print(vocab)

['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', '##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k',
 '##l', '##m', '##n', '##o', '##p', '##r', '##s', '##t', '##u', '##v', '##w', '##y', '##z', ',', '.', 'C', 'F', 'H',
 'T', 'a', 'b', 'c', 'g', 'h', 'i', 's', 't', 'u', 'w', 'y', 'ab', '##fu', 'Fa', 'Fac', '##ct', '##ful', '##full', '##fully',
 'Th', 'ch', '##hm', 'cha', 'chap', 'chapt', '##thm', 'Hu', 'Hug', 'Hugg', 'sh', 'th', 'is', '##thms', '##za', '##zat',
 '##ut']

正如我们所见，与 BPE 相比，此切分器更快地学习了词语的一部分作为词元。

💡 在相同语料库上使用 train_new_from_iterator() 不会产生完全相同的词汇表。这是因为 🤗 Tokenizers 库没有实现 WordPiece 进行训练（因为我们不完全确定其内部机制），而是使用 BPE 代替。

要对新文本进行切分，我们先对其进行预切分，然后进行切分，最后对每个词语应用切分算法。也就是说，我们查找第一个词语开头开始的最大子词并进行切分，然后对第二部分重复该过程，依此类推，直到处理完该词语的剩余部分以及文本中的后续词语。

def encode_word(word):
    tokens = []
    while len(word) > 0:
        i = len(word)
        while i > 0 and word[:i] not in vocab:
            i -= 1
        if i == 0:
            return ["[UNK]"]
        tokens.append(word[:i])
        word = word[i:]
        if len(word) > 0:
            word = f"##{word}"
    return tokens

让我们在一个在词汇表中的词语和另一个不在词汇表中的词语上进行测试。

print(encode_word("Hugging"))
print(encode_word("HOgging"))

['Hugg', '##i', '##n', '##g']
['[UNK]']

现在，让我们编写一个函数来对文本进行切分。

def tokenize(text):
    pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    encoded_words = [encode_word(word) for word in pre_tokenized_text]
    return sum(encoded_words, [])

我们可以对任何文本进行尝试。

tokenize("This is the Hugging Face course!")

['Th', '##i', '##s', 'is', 'th', '##e', 'Hugg', '##i', '##n', '##g', 'Fac', '##e', 'c', '##o', '##u', '##r', '##s',
 '##e', '[UNK]']

这就是 WordPiece 算法！现在让我们来看看 Unigram。

NLP 课程

WordPiece 分词

训练算法

分词算法

实现 WordPiece