Fast tokenizers in the QA pipeline


We will now dive into the question-answering pipeline and see how to leverage the offsets to grab the answer to the question at hand from the context, a bit like we did for the grouped entities in the previous section. Then we will see how to deal with very long contexts that end up being truncated. You can skip this section if you are not interested in the question answering task.

Using the question-answering pipeline

As we saw in Chapter 1, we can use the question-answering pipeline like this to get the answer to a question:

from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)
{'score': 0.97773,
 'start': 78,
 'end': 105,
 'answer': 'Jax, PyTorch and TensorFlow'}

Unlike the other pipelines, which cannot truncate and split texts that are longer than the maximum length accepted by the model (and may thus miss information at the end of a document), this pipeline can handle very long contexts and will return the answer to the question even if it is at the end:

long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question_answerer(question=question, context=long_context)
{'score': 0.97149,
 'start': 1892,
 'end': 1919,
 'answer': 'Jax, PyTorch and TensorFlow'}

Let's see how it does all of this!

Using a model for question answering

Like with any other pipeline, we start by tokenizing our input and then send it through the model. The checkpoint used by default for the question-answering pipeline is distilbert-base-cased-distilled-squad (the "squad" in the name comes from the dataset the model was fine-tuned on; we will talk more about the SQuAD dataset in Chapter 7):

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

Note that we tokenize the question and the context as a pair, with the question first.

An example of tokenization of question and context
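
As a quick check (this is not part of the pipeline code), we can decode the input IDs to see this question-then-context layout for ourselves:

# Decoding the input IDs shows the question followed by the context in a single sequence
print(tokenizer.decode(inputs["input_ids"][0]))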

Models for question answering work a little differently from the models we have seen so far. Using the picture above as an example, the model has been trained to predict the index of the token where the answer starts (here 21) and the index of the token where the answer ends (here 24). This is why these models do not return one tensor of logits but two: one for the logits corresponding to the start token of the answer, and one for the logits corresponding to the end token of the answer. Since in this case we have only one input containing 66 tokens, we get:

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
torch.Size([1, 66]) torch.Size([1, 66])

To convert those logits into probabilities, we will apply a softmax function, but before that we need to make sure we mask the indices that are not part of the context. Our input is [CLS] question [SEP] context [SEP], so we need to mask the tokens of the question as well as the [SEP] tokens. We will keep the [CLS] token, however, as some models use it to indicate that the answer is not in the context.

Since we will apply a softmax afterward, we just need to replace the logits we want to mask with a large negative number. Here, we use -10000:

import torch

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000

Now that we have properly masked the logits corresponding to positions we do not want to predict, we can apply the softmax:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

At this stage, we could take the argmax of the start and end probabilities, but we might end up with a start index that is greater than the end index, so we need to take a few more precautions. We will compute the probability of each possible start_index and end_index where start_index <= end_index, then take the tuple (start_index, end_index) with the highest probability.

Assuming the events "the answer starts at start_index" and "the answer ends at end_index" to be independent, the probability that the answer starts at start_index and ends at end_index is start_probabilities[start_index] × end_probabilities[end_index].

So, to compute all the scores, we just need to compute all the products start_probabilities[start_index] × end_probabilities[end_index] where start_index <= end_index.

Let's first compute all the possible products:

scores = start_probabilities[:, None] * end_probabilities[None, :]

Then we will mask the values where start_index > end_index by setting them to 0 (the other probabilities are all positive numbers). The torch.triu() function returns the upper triangular part of the 2D tensor passed as an argument, so it will do that masking for us:

scores = torch.triu(scores)

Now we just have to get the index of the maximum. Since PyTorch will return the index in the flattened tensor, we need to use the floor division // and modulus % operations to get the start_index and end_index:

max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(scores[start_index, end_index])

We are not quite done yet, but at least we already have the correct score for the answer (you can check this by comparing it to the first result in the previous section):

0.97773

✏️ Try it out! Compute the start and end indices for the five most likely answers.
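
One possible way to start on this (a sketch that reuses the upper-triangular scores matrix computed above; top_values, top_indices, top_start, and top_end are just illustrative names):

# Take the 5 highest entries of the flattened scores matrix
top_values, top_indices = scores.flatten().topk(5)
for value, flat_idx in zip(top_values, top_indices):
    top_start = flat_idx.item() // scores.shape[1]
    top_end = flat_idx.item() % scores.shape[1]
    print(top_start, top_end, value.item())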

We have the start_index and end_index of the answer in terms of tokens, so now we just need to convert them to character indices in the context. This is where the offsets will be super useful. We can grab them and use them like we did in the token classification task:

inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]

Now we just have to format everything to get our result:

result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index],
}
print(result)
{'answer': 'Jax, PyTorch and TensorFlow',
 'start': 78,
 'end': 105,
 'score': 0.97773}

Great! That's the same as in our first example!

✏️ Try it out! Use the best scores you computed earlier to show the five most likely answers. To check your results, go back to the first pipeline and pass in top_k=5 when calling it.
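
A sketch of one way to do this, building on top_values and top_indices from the previous sketch and the offsets computed above (again, the variable names are illustrative); the last line uses the top_k argument mentioned in the exercise to compare with the pipeline:

# Map each of the top-5 token spans back to characters in the context
for value, flat_idx in zip(top_values, top_indices):
    top_start = flat_idx.item() // scores.shape[1]
    top_end = flat_idx.item() % scores.shape[1]
    start_char, _ = offsets[top_start]
    _, end_char = offsets[top_end]
    print({"answer": context[start_char:end_char], "score": value.item()})

# Compare with the pipeline's own top 5
question_answerer(question=question, context=context, top_k=5)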

Handling long contexts

If we try to tokenize the question and the long context we used as an example previously, we will get a number of tokens higher than the maximum length used in the question-answering pipeline (which is 384):

inputs = tokenizer(question, long_context)
print(len(inputs["input_ids"]))
461

So we will need to truncate our inputs at that maximum length. There are several ways we can do this, but we do not want to truncate the question, only the context. Since the context is the second sentence, we will use the "only_second" truncation strategy. The problem that arises then is that the answer to the question may not be in the truncated context. Here, for instance, we picked a question whose answer is toward the end of the context, and when we truncate it that answer is not present:

inputs = tokenizer(question, long_context, max_length=384, truncation="only_second")
print(tokenizer.decode(inputs["input_ids"]))
"""
[CLS] Which deep learning libraries back [UNK] Transformers? [SEP] [UNK] Transformers : State of the Art NLP

[UNK] Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

[UNK] Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internal [SEP]
"""

This means the model will have a hard time picking the correct answer. To fix this, the question-answering pipeline allows us to split the context into smaller chunks, specifying the maximum length. To make sure we do not split the context at exactly the wrong place to make it possible to find the answer, it also includes some overlap between the chunks.

We can have the tokenizer (fast or slow) do this for us by adding return_overflowing_tokens=True, and we can specify the overlap we want with the stride argument. Here is an example, using a smaller sentence:

sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
'[CLS] This sentence is not [SEP]'
'[CLS] is not too long [SEP]'
'[CLS] too long but we [SEP]'
'[CLS] but we are going [SEP]'
'[CLS] are going to split [SEP]'
'[CLS] to split it anyway [SEP]'
'[CLS] it anyway. [SEP]'

As we can see, the sentence has been split into chunks in such a way that each entry in inputs["input_ids"] has at most 6 tokens (we would need to add padding to make the last entry the same size as the others) and there is an overlap of 2 tokens between each of the entries.
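
As an aside, the padding mentioned in the parenthesis above can be requested directly from the tokenizer. Here is a minimal sketch that does not modify the inputs we continue with below (padded_inputs is just an illustrative name):

# Pad every chunk to the length of the longest one (6 tokens here)
padded_inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2, padding=True
)
# Padding positions show up as 0s in the attention mask of the shorter last chunk
print(padded_inputs["attention_mask"][-1])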

Let's take a closer look at the result of the tokenization:

print(inputs.keys())
dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])

As expected, we get input IDs and an attention mask. The last key, overflow_to_sample_mapping, is a map that tells us which sentence each of the results corresponds to. Here we have 7 results that all come from the (only) sentence we passed to the tokenizer:

print(inputs["overflow_to_sample_mapping"])
[0, 0, 0, 0, 0, 0, 0]

This is more useful when we tokenize several sentences together. For instance, this:

sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])

gets us:

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

which means the first sentence is split into 7 chunks as before, and the next 4 chunks come from the second sentence.

Now let's go back to our long context. By default, the question-answering pipeline uses a maximum length of 384, as we mentioned earlier, and a stride of 128, which correspond to the way the model was fine-tuned (you can adjust those parameters by passing max_seq_len and stride arguments when calling the pipeline). We will thus use those parameters when tokenizing. We will also add padding (to have samples of the same length, so we can build tensors) as well as ask for the offsets:

inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

Those inputs will contain the input IDs and attention masks the model expects, as well as the offsets and the overflow_to_sample_mapping we just talked about. Since those two are not parameters used by the model, we will pop them out of the inputs (and we will not store the map, since it is not useful here) before converting the rest to tensors:

_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)
torch.Size([2, 384])

Our long context was split in two, which means that after it goes through our model, we will have two sets of start and end logits:

outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
torch.Size([2, 384]) torch.Size([2, 384])

Like before, we first mask the tokens that are not part of the context before taking the softmax. We also mask all the padding tokens (as flagged by the attention mask):

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000

Then we can use the softmax to convert our logits to probabilities:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

The next step is similar to what we did for the small context, but we repeat it for each of our two chunks. We attribute a score to all possible spans of answer, then take the span with the best score:

candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()

    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)
[(0, 18, 0.33867), (173, 184, 0.97149)]

Those two candidates correspond to the best answers the model was able to find in each chunk. The model is far more confident that the right answer is in the second part (which is a good sign!). Now we just have to map those two token spans to spans of characters in the context (we only need to map the second one to get our answer, but it is interesting to see what the model picked in the first chunk).

✏️ Try it out! Adapt the code above to return the scores and spans of the five most likely answers (in total, not per chunk).

The offsets we grabbed earlier are actually a list of offsets, with one list per chunk of text:

for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)
{'answer': '\n🤗 Transformers: State of the Art NLP', 'start': 0, 'end': 37, 'score': 0.33867}
{'answer': 'Jax, PyTorch and TensorFlow', 'start': 1892, 'end': 1919, 'score': 0.97149}

If we ignore the first result, we get the same result as our pipeline for this long context. Great!

✏️ Try it out! Use the best scores you computed before to show the five most likely answers (for the whole context, not each chunk). To check your results, go back to the first pipeline and pass in top_k=5 when calling it.
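
One possible starting point (a sketch that collects each chunk's top candidates and keeps the five best overall; all_candidates and the loop variables are illustrative names):

# Gather (chunk, start, end, score) candidates across both chunks
all_candidates = []
for chunk_idx, (start_probs, end_probs) in enumerate(zip(start_probabilities, end_probabilities)):
    chunk_scores = torch.triu(start_probs[:, None] * end_probs[None, :])
    # The overall top 5 is necessarily contained in the union of each chunk's top 5
    top_values, top_indices = chunk_scores.flatten().topk(5)
    for value, flat_idx in zip(top_values, top_indices):
        start_idx = flat_idx.item() // chunk_scores.shape[1]
        end_idx = flat_idx.item() % chunk_scores.shape[1]
        all_candidates.append((chunk_idx, start_idx, end_idx, value.item()))

# Keep the five best candidates overall and map them back to characters
for chunk_idx, start_idx, end_idx, score in sorted(all_candidates, key=lambda c: c[-1], reverse=True)[:5]:
    start_char, _ = offsets[chunk_idx][start_idx]
    _, end_char = offsets[chunk_idx][end_idx]
    print({"answer": long_context[start_char:end_char], "start": start_char, "end": end_char, "score": score})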

And that concludes our deep dive into the tokenizer's capabilities. We will put all of this into practice again in the next chapter, when we show you how to fine-tune a model on a range of common NLP tasks.