Fast tokenizers in the QA pipeline


We will now dive into the `question-answering` pipeline and see how to leverage the offsets to grab the answer to the question at hand from the context, a bit like we did for the grouped entities in the previous section. Then we will see how to deal with very long contexts that end up being truncated. You can skip this section if you're not interested in the question answering task.

Using the `question-answering` pipeline

As we saw in Chapter 1, we can use the `question-answering` pipeline like this to get the answer to a question:

from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)
{'score': 0.97773,
 'start': 78,
 'end': 105,
 'answer': 'Jax, PyTorch and TensorFlow'}

Unlike the other pipelines, which can't truncate and split texts that are longer than the maximum length accepted by the model (and thus may miss information at the end of a document), this pipeline can deal with very long contexts and will return the answer to the question even if it's at the end:

long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question_answerer(question=question, context=long_context)
{'score': 0.97149,
 'start': 1892,
 'end': 1919,
 'answer': 'Jax, PyTorch and TensorFlow'}

Let's see how it does all of this!

Using a model for question answering

Like with any other pipeline, we start by tokenizing our input and then send it through the model. The checkpoint used by default for the `question-answering` pipeline is `distilbert-base-cased-distilled-squad` (the "squad" in the name comes from the dataset the model was fine-tuned on; we'll talk more about the SQuAD dataset in Chapter 7):

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

Note that we tokenize the question and the context as a pair, with the question first.

An example of tokenization of question and context
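
If you want to see this pairing for yourself, one quick check (a small sketch using the `tokenizer` and `inputs` we just created) is to decode the input IDs back into text; the result follows the `[CLS] question [SEP] context [SEP]` layout illustrated above:

# Decode the tokenized pair to inspect its layout:
# [CLS] question tokens [SEP] context tokens [SEP]
print(tokenizer.decode(inputs["input_ids"][0]))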

Models for question answering work a little differently from the models we've seen so far. Using the picture above as an example, the model has been trained to predict the index of the token where the answer starts (here 21) and the index of the token where the answer ends (here 24). This is why those models don't return one tensor of logits but two: one for the logits corresponding to the start token of the answer, and one for the logits corresponding to the end token of the answer. Since in this case we have only one input containing 66 tokens, we get:

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
torch.Size([1, 66]) torch.Size([1, 66])

To convert those logits into probabilities, we will apply a softmax function, but before that we need to make sure we mask the indices that are not part of the context. Our input is `[CLS] question [SEP] context [SEP]`, so we need to mask the tokens of the question as well as the `[SEP]` token. We'll keep the `[CLS]` token, however, as some models use it to indicate that the answer is not in the context.

Since we will apply a softmax afterward, we just need to replace the logits we want to mask with a large negative number. Here, we use `-10000`:

import torch

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000

Now that we have properly masked the logits corresponding to positions we don't want to predict, we can apply the softmax:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

At this stage, we could take the argmax of the start and end probabilities, but we might end up with a `start_index` that is greater than the `end_index`, so we need to take a few more precautions. We will compute the probabilities of each possible `start_index` and `end_index` where `start_index <= end_index`, then take the tuple `(start_index, end_index)` with the highest probability.

Assuming the events "The answer starts at `start_index`" and "The answer ends at `end_index`" to be independent, the probability that the answer starts at `start_index` and ends at `end_index` is:

$$\mathrm{start\_probabilities}[\mathrm{start\_index}] \times \mathrm{end\_probabilities}[\mathrm{end\_index}]$$

So, to compute all the scores, we just need to compute all the products $\mathrm{start\_probabilities}[\mathrm{start\_index}] \times \mathrm{end\_probabilities}[\mathrm{end\_index}]$ where `start_index <= end_index`.

First let's compute all the possible products:

scores = start_probabilities[:, None] * end_probabilities[None, :]

Then we'll mask the values where `start_index > end_index` by setting them to `0` (the other probabilities are all positive numbers). The `torch.triu()` function returns the upper triangular part of the 2D tensor passed as an argument, so it will do that masking for us:

scores = torch.triu(scores)

Now we just have to get the index of the maximum. Since PyTorch returns the index in the flattened tensor, we need to use the floor division `//` and modulus `%` operations to get the `start_index` and `end_index`:

max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(scores[start_index, end_index])

We're not quite done yet, but at least we already have the correct score for the answer (you can check this by comparing it to the first result in the previous section):

0.97773

✏️ Try it out! Compute the start and end indices of the five most likely answers.
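
One possible way to approach this exercise (a sketch reusing the `scores` tensor computed above) is to take the top 5 values of the flattened score matrix and recover their 2D indices with the same floor division and modulus trick:

# Top 5 scores in the flattened upper-triangular score matrix
top_scores, top_indices = torch.topk(scores.flatten(), k=5)
for score, idx in zip(top_scores, top_indices):
    start_index = idx.item() // scores.shape[1]
    end_index = idx.item() % scores.shape[1]
    print(start_index, end_index, score.item())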

We have the `start_index` and `end_index` of the answer in terms of tokens, so now we just need to convert them to character indices in the context. This is where the offsets will be super useful. We can grab them and use them like we did in the token classification task:

inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]

Now we just have to format everything to get our result:

result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index],
}
print(result)
{'answer': 'Jax, PyTorch and TensorFlow',
 'start': 78,
 'end': 105,
 'score': 0.97773}

Great! That's the same as in our first example!

✏️ Try it out! Use the best scores you computed earlier to show the five most likely answers. To check your results, go back to the first pipeline and pass in `top_k=5` when calling it.
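
One possible way to check this (a sketch building on the `offsets` computed above) is to map each of the five best token spans back to characters, then compare with the pipeline called with `top_k=5`:

# Map the five best token spans back to character spans in the context
top_scores, top_indices = torch.topk(scores.flatten(), k=5)
for score, idx in zip(top_scores, top_indices):
    start_index = idx.item() // scores.shape[1]
    end_index = idx.item() % scores.shape[1]
    start_char, _ = offsets[start_index]
    _, end_char = offsets[end_index]
    print({"answer": context[start_char:end_char], "score": score.item()})

# Compare with the pipeline's own top 5 answers
question_answerer(question=question, context=context, top_k=5)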

Handling long contexts

If we try to tokenize the question and the long context we used as an example previously, we get a number of tokens higher than the maximum length used in the `question-answering` pipeline (which is 384):

inputs = tokenizer(question, long_context)
print(len(inputs["input_ids"]))
461

So we'll need to truncate our inputs at that maximum length. There are several ways to do this, but we don't want to truncate the question, only the context. Since the context is the second sentence, we'll use the `"only_second"` truncation strategy. The problem that arises then is that the answer to the question may not be in the truncated context. Here, for instance, we picked a question whose answer is toward the end of the context, and when we truncate it, that answer is not present:

inputs = tokenizer(question, long_context, max_length=384, truncation="only_second")
print(tokenizer.decode(inputs["input_ids"]))
"""
[CLS] Which deep learning libraries back [UNK] Transformers? [SEP] [UNK] Transformers : State of the Art NLP

[UNK] Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

[UNK] Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internal [SEP]
"""

This means the model will have a hard time picking the correct answer. To fix this, the `question-answering` pipeline allows us to split the context into smaller chunks, specifying the maximum length. To make sure we don't split the context at exactly the wrong place to make it possible to find the answer, it also includes some overlap between the chunks.

We can have the tokenizer (fast or slow) do this for us by adding `return_overflowing_tokens=True`, and we can specify the overlap we want with the `stride` argument. Here is an example using a shorter sentence:

sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
'[CLS] This sentence is not [SEP]'
'[CLS] is not too long [SEP]'
'[CLS] too long but we [SEP]'
'[CLS] but we are going [SEP]'
'[CLS] are going to split [SEP]'
'[CLS] to split it anyway [SEP]'
'[CLS] it anyway. [SEP]'

As we can see, the sentence has been split into chunks in such a way that each entry in `inputs["input_ids"]` has at most 6 tokens (we would need to add padding to have the last entry be the same size as the others), and there is an overlap of 2 tokens between each of the entries.

Let's take a closer look at the result of the tokenization:

print(inputs.keys())
dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])

As expected, we get the input IDs and an attention mask. The last key, `overflow_to_sample_mapping`, is a map that tells us which sentence each of the results corresponds to. Here we have 7 results that all come from the (only) sentence we passed to the tokenizer:

print(inputs["overflow_to_sample_mapping"])
[0, 0, 0, 0, 0, 0, 0]

This is more useful when we tokenize several sentences together. For instance, this:

sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])

gets us:

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

which means the first sentence is split into 7 chunks as before, and the next 4 chunks come from the second sentence.

Now let's go back to our long context. By default the `question-answering` pipeline uses a maximum length of 384, as we mentioned earlier, and a stride of 128, which correspond to the way the model was fine-tuned (you can adjust those parameters by passing `max_seq_len` and `stride` arguments when calling the pipeline). We will thus use those parameters when tokenizing. We'll also add padding (to have samples of the same length, so we can build tensors) as well as ask for the offsets:

inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

Those `inputs` will contain the input IDs and attention masks the model expects, as well as the offsets and the `overflow_to_sample_mapping` we just talked about. Since those two are not parameters used by the model, we'll pop them out of the `inputs` (and we won't store the map, since it's not useful here) before converting it to a tensor:

_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)
torch.Size([2, 384])

Our long context was split in two, which means that after it goes through our model, we will have two sets of start and end logits:

outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
torch.Size([2, 384]) torch.Size([2, 384])

Like before, we first mask the tokens that are not part of the context before taking the softmax. We also mask all the padding tokens (as flagged by the attention mask):

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000

Then we can use the softmax to convert our logits to probabilities:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

The next step is similar to what we did for the short context, but we repeat it for each of our two chunks. We attribute a score to all possible spans of answer, then take the span with the best score:

candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()

    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)
[(0, 18, 0.33867), (173, 184, 0.97149)]

Those two candidates correspond to the best answers the model was able to find in each chunk. The model is far more confident the right answer is in the second part (which is a good sign!). Now we just have to map those two token spans to spans of characters in the context (we only need to map the second one to have our answer, but it's interesting to see what the model has picked in the first chunk).

✏️ Try it out! Adapt the code above to return the scores and spans of the five most likely answers (in total, not per chunk).
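
One way to tackle this exercise (a sketch reusing the `start_probabilities` and `end_probabilities` computed above) is to collect the top candidates of every chunk and then keep the five best ones overall:

# Gather (chunk_index, start_idx, end_idx, score) tuples across all chunks
all_candidates = []
for chunk_index, (start_probs, end_probs) in enumerate(
    zip(start_probabilities, end_probabilities)
):
    scores = torch.triu(start_probs[:, None] * end_probs[None, :])
    top_scores, top_indices = torch.topk(scores.flatten(), k=5)
    for score, idx in zip(top_scores, top_indices):
        start_idx = idx.item() // scores.shape[1]
        end_idx = idx.item() % scores.shape[1]
        all_candidates.append((chunk_index, start_idx, end_idx, score.item()))

# Keep the five best spans overall
all_candidates = sorted(all_candidates, key=lambda c: c[3], reverse=True)[:5]
print(all_candidates)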

The `offsets` we grabbed earlier is actually a list of lists of offsets, with one list per chunk of text:

for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)
{'answer': '\n🤗 Transformers: State of the Art NLP', 'start': 0, 'end': 37, 'score': 0.33867}
{'answer': 'Jax, PyTorch and TensorFlow', 'start': 1892, 'end': 1919, 'score': 0.97149}

If we ignore the first result, we get the same result as our pipeline for this long context. Yay!

✏️ Try it out! Use the best scores you computed before to show the five most likely answers (for the whole context, not each chunk). To check your results, go back to the first pipeline and pass in `top_k=5` when calling it.
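
One possible check (a sketch, assuming the `all_candidates` list built in the sketch after the previous exercise): map each span back to characters using the offsets of its chunk, then compare with the pipeline called with `top_k=5`:

# Map each candidate back to a character span using the offsets of its chunk
for chunk_index, start_idx, end_idx, score in all_candidates:
    start_char, _ = offsets[chunk_index][start_idx]
    _, end_char = offsets[chunk_index][end_idx]
    print({"answer": long_context[start_char:end_char], "score": score})

# Compare with the pipeline's own top 5 answers for the long context
question_answerer(question=question, context=long_context, top_k=5)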

This concludes our deep dive into the tokenizer's capabilities. We will put all of this into practice again in the next chapter, when we show you how to fine-tune a model on a range of common NLP tasks.
