Caching

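Imagine you're having a conversation with someone who, instead of remembering what they said earlier, has to start from scratch every time you respond. That would be slow and inefficient, right?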

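You can extend this analogy to Transformer models. Autoregressive generation can be slow because the model predicts one token at a time, and each new prediction depends on all of the previous context.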

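To predict the 1000th token, the model needs information from the previous 999 tokens. This information is computed through matrix multiplications over the token representations.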

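To predict the 1001st token, you need the same information from the previous 999 tokens in addition to the information from the 1000th token. Without caching, the model has to recompute these matrix multiplications over and over for every new token!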

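A key-value (KV) cache eliminates this inefficiency by storing the kv pairs that the attention layers computed for previously processed tokens. The stored kv pairs are retrieved from the cache and reused for subsequent tokens, which avoids recomputing them.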
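The idea can be sketched with a toy single-head attention step. This is a minimal illustration, not how Transformers implements it: plain Python lists stand in for tensors, and `project`, `attend`, `step_with_cache`, and `step_without_cache` are hypothetical helpers invented for this example. It shows that projecting only the newest token and appending it to a cache gives the same attention output as re-projecting the whole sequence at every step.

```python
import math

def project(token, scale):
    """Hypothetical stand-in for a learned key or value projection."""
    return [scale * x for x in token]

def attend(query, keys, values):
    """Scaled dot-product attention of one query vector over all keys/values."""
    norm = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / norm for key in keys]
    # Numerically stable softmax over the scores
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Toy 2-dimensional token representations
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

def step_without_cache(tokens_so_far):
    # Every step re-projects all previous tokens
    keys = [project(t, 0.5) for t in tokens_so_far]
    values = [project(t, 2.0) for t in tokens_so_far]
    return attend(tokens_so_far[-1], keys, values)

key_cache, value_cache = [], []

def step_with_cache(new_token):
    # Only the newest token is projected; earlier kv pairs are reused
    key_cache.append(project(new_token, 0.5))
    value_cache.append(project(new_token, 2.0))
    return attend(new_token, key_cache, value_cache)

cached_outputs = [step_with_cache(t) for t in tokens]
uncached_outputs = [step_without_cache(tokens[: i + 1]) for i in range(len(tokens))]
```

Both lists hold the same attention outputs, but the cached version performs one key and one value projection per step instead of one per token seen so far.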

Caching should only be used for inference. Enabling it during training may cause unexpected errors.

Cache class

When you use the Transformers Cache class, the self-attention module performs several key steps to integrate past and present information.

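The attention module concatenates the current kv pairs with the past kv pairs stored in the cache. This creates attention weights with the shape (new_tokens_length, past_kv_length + new_tokens_length). The current and past kv pairs are essentially combined to compute the attention scores, which ensures the model is aware of both the previous context and the current input.
When the forward method is called iteratively, it's crucial that the attention mask shape matches the combined length of the past and current kv pairs. The attention mask should have the shape (batch_size, past_kv_length + new_tokens_length). This is typically handled internally in generate(), but keep it in mind if you want to implement your own generation loop with Cache! The attention mask should hold the values for both the past and the current tokens.
It is also important to be aware of cache_position. If you want to reuse a prefilled Cache with the forward method, you have to pass a valid cache_position value, which indicates the input positions in the sequence. cache_position is unaffected by padding, and it always adds one more position for each token. For example, if the kv cache contains 10 tokens, regardless of whether they are pad tokens, the cache position for the next token should be torch.tensor([10]).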
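The shape and position bookkeeping above can be written out as a small sketch. `attention_shapes` is a hypothetical helper made up for this illustration and is not part of Transformers; it only restates the rules from the text, with a plain Python list standing in for the cache_position tensor.

```python
def attention_shapes(batch_size, past_kv_length, new_tokens_length):
    """Return the shapes described above for one forward call with a cache."""
    total_length = past_kv_length + new_tokens_length
    return {
        "attention_weights": (new_tokens_length, total_length),
        "attention_mask": (batch_size, total_length),
        # Positions of the new tokens in the full sequence; padding does not shift them
        "cache_position": list(range(past_kv_length, total_length)),
    }

# Prefill: 10 prompt tokens processed with an empty cache
prefill = attention_shapes(batch_size=1, past_kv_length=0, new_tokens_length=10)
# Decode: one new token on top of the 10 cached tokens
decode = attention_shapes(batch_size=1, past_kv_length=10, new_tokens_length=1)
print(decode["attention_mask"], decode["cache_position"])
```

In the decode step the mask covers all 11 positions while the only new cache position is 10, which matches the torch.tensor([10]) example above.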

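The example below demonstrates how to create a generation loop with DynamicCache. As discussed, the attention mask is a concatenation of the past and current token values, and 1 is added to the cache position for the next token.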

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache

model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_id)

past_key_values = DynamicCache()
messages = [{"role": "user", "content": "Hello, what's your name."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda:0")

generated_ids = inputs.input_ids
cache_position = torch.arange(inputs.input_ids.shape[1], dtype=torch.int64, device="cuda:0")
max_new_tokens = 10

for _ in range(max_new_tokens):
    outputs = model(**inputs, cache_position=cache_position, past_key_values=past_key_values, use_cache=True)
    # Greedily sample one next token
    next_token_ids = outputs.logits[:, -1:].argmax(-1)
    generated_ids = torch.cat([generated_ids, next_token_ids], dim=-1)
    # Prepare inputs for the next generation step by keeping only the unprocessed tokens (here, the single new token)
    # and extending the attention mask for the new token, as explained above
    attention_mask = inputs["attention_mask"]
    attention_mask = torch.cat([attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1)
    inputs = {"input_ids": next_token_ids, "attention_mask": attention_mask}
    cache_position = cache_position[-1:] + 1 # add one more position for the next token

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
"[INST] Hello, what's your name. [/INST]  Hello! My name is LLaMA,"

Legacy cache format

Before the Cache class, the cache was stored as a tuple of tuples of tensors. This format is dynamic because it grows as text is generated, similar to DynamicCache.
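To make the nesting concrete, here is a rough sketch of that layout. It assumes the common arrangement of one (key, value) pair per layer with tensors shaped (batch_size, num_heads, seq_len, head_dim). Shape tuples stand in for real tensors, and `make_legacy_cache` and `append_tokens` are hypothetical helpers written only for this illustration.

```python
def make_legacy_cache(num_layers, batch_size, num_heads, seq_len, head_dim):
    """Build the nested-tuple layout: one (key, value) pair per layer."""
    shape = (batch_size, num_heads, seq_len, head_dim)
    return tuple((shape, shape) for _ in range(num_layers))

def append_tokens(legacy_cache, new_tokens=1):
    """Grow every key and value along the sequence dimension (index 2)."""
    return tuple(
        tuple((b, h, s + new_tokens, d) for (b, h, s, d) in layer)
        for layer in legacy_cache
    )

cache = make_legacy_cache(num_layers=2, batch_size=1, num_heads=4, seq_len=5, head_dim=8)
cache = append_tokens(cache)  # one more generated token
key_shape, value_shape = cache[0]
print(len(cache), key_shape)
```

The outer tuple is indexed by layer, the inner tuple holds the key and the value, and only the sequence dimension changes as generation proceeds.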

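If your project depends on this legacy format, you can convert between DynamicCache and a tuple of tuples with the DynamicCache.from_legacy_cache() and DynamicCache.to_legacy_cache() functions, as shown below. This is helpful if you have custom logic for manipulating a cache in a specific format.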

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)

# `return_dict_in_generate=True` is required to return the cache, and `return_legacy_cache=True` forces the returned cache
# into the legacy format
generation_outputs = model.generate(**inputs, return_dict_in_generate=True, return_legacy_cache=True, max_new_tokens=5)

cache = DynamicCache.from_legacy_cache(generation_outputs.past_key_values)
legacy_format_cache = cache.to_legacy_cache()
