Transformers documentation


This model was released on 2020-05-22 and added to Hugging Face Transformers on 2020-11-16.

RAG

PyTorch FlashAttention

Retrieval-augmented generation (RAG) combines a pretrained language model (parametric memory) with an external data source accessed through a pretrained neural retriever (non-parametric memory). At inference time, RAG retrieves relevant passages and conditions generation on them. This typically makes answers more factual and lets you update the model's knowledge by swapping the index, without retraining the whole model.

You can find all of the original RAG checkpoints under the AI at Meta organization.

This model was contributed by ola13.

Click on the RAG models in the right sidebar for more examples of how to apply RAG to different language tasks.

The example below demonstrates how to generate text with AutoModel.

AutoModel
import torch
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base", dataset="wiki_dpr", index_name="compressed"
)

model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq",
    retriever=retriever,
    dtype="auto",
    attn_implementation="flash_attention_2",
)
input_dict = tokenizer.prepare_seq2seq_batch("How many people live in Paris?", return_tensors="pt")
generated = model.generate(input_ids=input_dict["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])

Quantization reduces the memory footprint by storing the weights in lower precision. Refer to the Quantization overview for supported backends. The example below uses bitsandbytes to quantize the weights to 4 bits.

import torch
from transformers import BitsAndBytesConfig, RagTokenizer, RagRetriever, RagSequenceForGeneration

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base", dataset="wiki_dpr", index_name="compressed"
)

model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq",
    retriever=retriever,
    quantization_config=bnb,   # quantizes generator weights
    device_map="auto",
)
input_dict = tokenizer.prepare_seq2seq_batch("How many people live in Paris?", return_tensors="pt")
generated = model.generate(input_ids=input_dict["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])

RagConfig

class transformers.RagConfig

( vocab_size = None is_encoder_decoder = True prefix = None bos_token_id = None pad_token_id = None eos_token_id = None decoder_start_token_id = None title_sep = ' / ' doc_sep = ' // ' n_docs = 5 max_combined_length = 300 retrieval_vector_size = 768 retrieval_batch_size = 8 dataset = 'wiki_dpr' dataset_split = 'train' index_name = 'compressed' index_path = None passages_path = None use_dummy_dataset = False reduce_loss = False label_smoothing = 0.0 do_deduplication = True exclude_bos_score = False do_marginalize = False output_retrieved = False use_cache = True dataset_revision = None **kwargs )

Parameters

  • title_sep (str, optional, defaults to " / ") — Separator inserted between the title and the text of the retrieved document when calling RagRetriever.
  • doc_sep (str, optional, defaults to " // ") — Separator inserted between the text of the retrieved document and the original input when calling RagRetriever.
  • n_docs (int, optional, defaults to 5) — Number of documents to retrieve.
  • max_combined_length (int, optional, defaults to 300) — Maximum length of the contextualized input returned by the __call__() method of RagRetriever.
  • retrieval_vector_size (int, optional, defaults to 768) — Dimensionality of the document embeddings indexed by RagRetriever.
  • retrieval_batch_size (int, optional, defaults to 8) — Retrieval batch size, defined as the number of queries issued concurrently to the faiss index wrapped by RagRetriever.
  • dataset (str, optional, defaults to "wiki_dpr") — A dataset identifier of the indexed dataset in HuggingFace Datasets (list all available datasets and ids with datasets.list_datasets()).
  • dataset_split (str, optional, defaults to "train") — Which split of the dataset to load.
  • index_name (str, optional, defaults to "compressed") — The index name of the index associated with the dataset. One of "legacy", "exact" and "compressed".
  • index_path (str, optional) — The path to the serialized faiss index on disk.
  • passages_path (str, optional) — A path to text passages compatible with the faiss index. Required if using LegacyIndex.
  • use_dummy_dataset (bool, optional, defaults to False) — Whether to load a "dummy" variant of the dataset specified by dataset.
  • label_smoothing (float, optional, defaults to 0.0) — Only relevant if return_loss is set to True. Controls the epsilon parameter value for label smoothing in the loss calculation. If set to 0, no label smoothing is performed.
  • do_marginalize (bool, optional, defaults to False) — If True, the logits are marginalized over all documents by making use of torch.nn.functional.log_softmax.
  • reduce_loss (bool, optional, defaults to False) — Whether or not to reduce the NLL loss using the torch.Tensor.sum operation.
  • do_deduplication (bool, optional, defaults to True) — Whether or not to deduplicate the generations from different context documents for a given input. Has to be set to False if used while training with a distributed backend.
  • exclude_bos_score (bool, optional, defaults to False) — Whether or not to disregard the BOS token when computing the loss.
  • output_retrieved (bool, optional, defaults to False) — If set to True, retrieved_doc_embeds, retrieved_doc_ids, context_input_ids and context_attention_mask are returned. See the returned tensors for more detail.
  • use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/value attentions (not used by all models).
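
When do_marginalize is enabled, the per-document token logits are combined with the document scores through a log-softmax over documents, i.e. log p(token) = logsumexp over d of (log p(d) + log p(token | d)). A minimal, framework-free sketch of this marginalization (the function names here are illustrative, not part of the library):

```python
import math

def log_softmax(xs):
    # Numerically stable log-softmax over a plain Python list.
    m = max(xs)
    z = math.log(sum(math.exp(x - m) for x in xs))
    return [x - m - z for x in xs]

def marginalize(per_doc_token_logits, doc_logits):
    """Combine per-document next-token logits into one distribution.

    per_doc_token_logits: n_docs x vocab_size raw scores
    doc_logits: n_docs raw retrieval scores
    Returns vocab_size log-probabilities: log sum_d p(d) * p(token | d).
    """
    doc_logprobs = log_softmax(doc_logits)
    token_logprobs = [log_softmax(row) for row in per_doc_token_logits]
    vocab = len(per_doc_token_logits[0])
    out = []
    for t in range(vocab):
        terms = [doc_logprobs[d] + token_logprobs[d][t] for d in range(len(doc_logits))]
        m = max(terms)
        out.append(m + math.log(sum(math.exp(x - m) for x in terms)))
    return out

# Two retrieved documents, a toy vocabulary of three tokens.
logp = marginalize([[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]], [1.0, 0.5])
print([round(x, 3) for x in logp])
```

The actual model performs the same log-sum-exp pattern with torch.nn.functional.log_softmax, batched over tensors.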

RagConfig stores the configuration of a RagModel. The configuration object inherits from PreTrainedConfig and can be used to control the model outputs. Read the documentation of PreTrainedConfig for more information.

from_question_encoder_generator_configs

( question_encoder_config: PreTrainedConfig generator_config: PreTrainedConfig **kwargs ) RagConfig

Returns

RagConfig

An instance of a configuration object

Instantiate a RagConfig (or a derived class) from a pretrained question encoder model configuration and a generator model configuration.

RagTokenizer

class transformers.RagTokenizer

( question_encoder generator )

Rag specific outputs

class transformers.models.rag.modeling_rag.RetrievAugLMMarginOutput

( loss: torch.FloatTensor | None = None logits: torch.FloatTensor | None = None doc_scores: torch.FloatTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None retrieved_doc_embeds: torch.FloatTensor | None = None retrieved_doc_ids: torch.LongTensor | None = None context_input_ids: torch.LongTensor | None = None context_attention_mask: torch.LongTensor | None = None question_encoder_last_hidden_state: torch.FloatTensor | None = None question_enc_hidden_states: tuple[torch.FloatTensor, ...] | None = None question_enc_attentions: tuple[torch.FloatTensor, ...] | None = None generator_enc_last_hidden_state: torch.FloatTensor | None = None generator_enc_hidden_states: tuple[torch.FloatTensor, ...] | None = None generator_enc_attentions: tuple[torch.FloatTensor, ...] | None = None generator_dec_hidden_states: tuple[torch.FloatTensor, ...] | None = None generator_dec_attentions: tuple[torch.FloatTensor, ...] | None = None generator_cross_attentions: tuple[torch.FloatTensor, ...] | None = None )

Parameters

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss.
  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head. The score is possibly marginalized over all documents for each vocabulary token.
  • doc_scores (torch.FloatTensor of shape (batch_size, config.n_docs)) — Score between each retrieved document embeddings (see retrieved_doc_embeds) and question_encoder_last_hidden_state.
  • past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide.

    Contains precomputed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.

  • retrieved_doc_embeds (torch.FloatTensor of shape (batch_size, config.n_docs, hidden_size), optional, returned when output_retrieved=True) — Embedded documents retrieved by the retriever. Is used with question_encoder_last_hidden_state to compute the doc_scores.
  • retrieved_doc_ids (torch.LongTensor of shape (batch_size, config.n_docs), optional, returned when output_retrieved=True) — The indexes of the embedded documents retrieved by the retriever.
  • context_input_ids (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Input ids post-processed from the retrieved documents and the question encoder input_ids by the retriever.
  • context_attention_mask (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Attention mask post-processed from the retrieved documents and the question encoder input_ids by the retriever.
  • question_encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the question encoder pooled output of the model.
  • question_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden states of the question encoder at the output of each layer plus the initial embedding outputs.

  • question_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the question encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • generator_enc_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the generator encoder of the model.
  • generator_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden states of the generator encoder at the output of each layer plus the initial embedding outputs.

  • generator_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the generator encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • generator_dec_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden states of the generator decoder at the output of each layer plus the initial embedding outputs.

  • generator_dec_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the generator decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • generator_cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Cross-attentions weights of the generator decoder, after the attention softmax, used to compute the weighted average in the cross-attention heads.

Base class for retriever augmented marginalized models outputs.

class transformers.models.rag.modeling_rag.RetrievAugLMOutput

( logits: torch.FloatTensor | None = None doc_scores: torch.FloatTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None retrieved_doc_embeds: torch.FloatTensor | None = None retrieved_doc_ids: torch.LongTensor | None = None context_input_ids: torch.LongTensor | None = None context_attention_mask: torch.LongTensor | None = None question_encoder_last_hidden_state: torch.FloatTensor | None = None question_enc_hidden_states: tuple[torch.FloatTensor, ...] | None = None question_enc_attentions: tuple[torch.FloatTensor, ...] | None = None generator_enc_last_hidden_state: torch.FloatTensor | None = None generator_enc_hidden_states: tuple[torch.FloatTensor, ...] | None = None generator_enc_attentions: tuple[torch.FloatTensor, ...] | None = None generator_dec_hidden_states: tuple[torch.FloatTensor, ...] | None = None generator_dec_attentions: tuple[torch.FloatTensor, ...] | None = None generator_cross_attentions: tuple[torch.FloatTensor, ...] | None = None )

Parameters

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head. The score is possibly marginalized over all documents for each vocabulary token.
  • doc_scores (torch.FloatTensor of shape (batch_size, config.n_docs)) — Score between each retrieved document embeddings (see retrieved_doc_embeds) and question_encoder_last_hidden_state.
  • past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide.

    Contains precomputed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.

  • retrieved_doc_embeds (torch.FloatTensor of shape (batch_size, config.n_docs, hidden_size), optional, returned when output_retrieved=True) — Embedded documents retrieved by the retriever. Is used with question_encoder_last_hidden_state to compute the doc_scores.
  • retrieved_doc_ids (torch.LongTensor of shape (batch_size, config.n_docs), optional, returned when output_retrieved=True) — The indexes of the embedded documents retrieved by the retriever.
  • context_input_ids (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Input ids post-processed from the retrieved documents and the question encoder input_ids by the retriever.
  • context_attention_mask (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Attention mask post-processed from the retrieved documents and the question encoder input_ids by the retriever.
  • question_encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the question encoder pooled output of the model.
  • question_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden states of the question encoder at the output of each layer plus the initial embedding outputs.

  • question_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the question encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • generator_enc_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the generator encoder of the model.
  • generator_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden states of the generator encoder at the output of each layer plus the initial embedding outputs.

  • generator_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the generator encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • generator_dec_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden states of the generator decoder at the output of each layer plus the initial embedding outputs.

  • generator_dec_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the generator decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • generator_cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Cross-attentions weights of the generator decoder, after the attention softmax, used to compute the weighted average in the cross-attention heads.

RagRetriever

class transformers.RagRetriever

( config question_encoder_tokenizer generator_tokenizer index = None init_retrieval = True )

Parameters

  • config (RagConfig) — The configuration of the RAG model this Retriever is used with. Contains parameters indicating which Index to build. You can load your own custom dataset with config.index_name="custom" or use a canonical one (default) from the datasets library with config.index_name="wiki_dpr" for example.
  • question_encoder_tokenizer (PreTrainedTokenizer) — The tokenizer that was used to tokenize the question. It is used to decode the question and then use the generator_tokenizer.
  • generator_tokenizer (PreTrainedTokenizer) — The tokenizer used for the generator part of the RagModel.
  • index (Index, optional, defaults to the one defined by the configuration) — If specified, use this index instead of the one built using the configuration

Retriever used to get documents from vector queries. It retrieves the documents embeddings as well as the documents contents, and it formats them to be used with a RagModel.

Example

>>> # To load the default "wiki_dpr" dataset with 21M passages from wikipedia (index name is 'compressed' or 'exact')
>>> from transformers import RagRetriever

>>> retriever = RagRetriever.from_pretrained(
...     "facebook/dpr-ctx_encoder-single-nq-base", dataset="wiki_dpr", index_name="compressed"
... )

>>> # To load your own indexed dataset built with the datasets library. More info on how to build the indexed dataset in examples/rag/use_own_knowledge_dataset.py
>>> from transformers import RagRetriever

>>> dataset = (
...     ...
... )  # dataset must be a datasets.Datasets object with columns "title", "text" and "embeddings", and it must have a supported index (e.g., Faiss or other index types depending on your setup)
>>> retriever = RagRetriever.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base", indexed_dataset=dataset)

>>> # To load your own indexed dataset built with the datasets library that was saved on disk. More info in examples/rag/use_own_knowledge_dataset.py
>>> from transformers import RagRetriever

>>> dataset_path = "path/to/my/dataset"  # dataset saved via *dataset.save_to_disk(...)*
>>> index_path = "path/to/my/index"  # index saved via *dataset.get_index("embeddings").save(...)*
>>> retriever = RagRetriever.from_pretrained(
...     "facebook/dpr-ctx_encoder-single-nq-base",
...     index_name="custom",
...     passages_path=dataset_path,
...     index_path=index_path,
... )

>>> # To load the legacy index built originally for Rag's paper
>>> from transformers import RagRetriever

>>> retriever = RagRetriever.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base", index_name="legacy")

init_retrieval

( )

Retriever initialization function. It loads the index into memory.

postprocess_docs

( docs input_strings prefix n_docs return_tensors = None ) tuple(tensors)

Parameters

  • docs (dict) — Retrieved documents.
  • input_strings (str) — Input strings decoded by preprocess_query.
  • prefix (str) — Prefix added at the beginning of each input, typically used with T5-based models.

Returns

tuple(tensors)

a tuple consisting of two elements: contextualized input_ids and a compatible attention_mask.

Postprocesses the retrieved docs and combines them with input_strings.
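
With the default configuration, the contextualized string for each document roughly follows prefix + title + " / " + text + " // " + input (see title_sep and doc_sep in RagConfig). A simplified sketch of that combination step, with an illustrative helper name:

```python
def combine(doc_title, doc_text, input_string, prefix="", title_sep=" / ", doc_sep=" // "):
    # Mirrors the default RAG formatting: "<title> / <text> // <question>".
    out = prefix + doc_title + title_sep + doc_text + doc_sep + input_string
    # The retriever also collapses doubled spaces in the assembled string.
    return out.replace("  ", " ")

print(combine("Paris", "Paris is the capital of France.", "How many people live in Paris?"))
```

Each of the batch_size * n_docs combined strings is then tokenized with the generator tokenizer, truncated to max_combined_length, to produce context_input_ids and context_attention_mask.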

retrieve

( question_hidden_states: ndarray n_docs: int ) tuple[np.ndarray, np.ndarray, list[dict]]

Parameters

  • question_hidden_states (np.ndarray of shape (batch_size, vector_size)) — A batch of query vectors to retrieve with.
  • n_docs (int) — The number of docs retrieved per query.

Returns

tuple[np.ndarray, np.ndarray, list[dict]]

A tuple with the following objects:

  • retrieved_doc_embeds (np.ndarray of shape (batch_size, n_docs, dim)) — The retrieval embeddings of the retrieved docs per query.
  • doc_ids (np.ndarray of shape (batch_size, n_docs)) — The ids of the documents in the index.
  • doc_dicts (list[dict]) — The retrieved_doc_embeds examples per query.

Retrieves documents for the specified question_hidden_states.
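
Conceptually, retrieval scores every indexed document embedding by inner product with the query vector and keeps the n_docs best matches. A framework-free sketch of that step, using a linear scan where the real retriever uses a faiss index:

```python
def retrieve_top_n(question_vec, index_embeds, n_docs):
    """Linear-scan stand-in for a faiss inner-product search.

    question_vec: query embedding (list of floats)
    index_embeds: list of document embeddings
    Returns (doc_ids, doc_embeds) for the n_docs highest-scoring documents.
    """
    scores = [sum(q * d for q, d in zip(question_vec, doc)) for doc in index_embeds]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:n_docs]
    return top, [index_embeds[i] for i in top]

ids, embeds = retrieve_top_n([1.0, 0.0], [[0.1, 0.9], [0.8, 0.2], [0.7, 0.7]], n_docs=2)
print(ids)  # ids of the two highest inner-product documents
```

The returned ids and embeddings correspond to doc_ids and retrieved_doc_embeds above; the doc_dicts entries carry the matching titles and texts.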

RagModel

class transformers.RagModel

( config: transformers.configuration_utils.PreTrainedConfig | None = None question_encoder: transformers.modeling_utils.PreTrainedModel | None = None generator: transformers.modeling_utils.PreTrainedModel | None = None retriever: transformers.models.rag.retrieval_rag.RagRetriever | None = None **kwargs )

Parameters

  • config (PreTrainedConfig, optional) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
  • question_encoder (PreTrainedModel, optional) — The model used to encode the question into hidden states for retrieval.
  • generator (PreTrainedModel, optional) — The model used to generate text conditioned on the retrieved documents.
  • retriever (RagRetriever, optional) — The component responsible for retrieving documents from a knowledge base given the encoded question.

The bare RAG model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None encoder_outputs: tuple[tuple[torch.FloatTensor]] | None = None decoder_input_ids: torch.LongTensor | None = None decoder_attention_mask: torch.BoolTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None doc_scores: torch.FloatTensor | None = None context_input_ids: torch.LongTensor | None = None context_attention_mask: torch.LongTensor | None = None use_cache: bool | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None output_retrieved: bool | None = None n_docs: int | None = None **kwargs ) transformers.models.rag.modeling_rag.RetrievAugLMOutput or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. RagConfig, used to initialize the model, specifies which generator to use; it also specifies a compatible generator tokenizer. Use that tokenizer class to obtain the indices.

    What are input IDs?

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consists of (generator_enc_last_hidden_state, optional: generator_enc_hidden_states, optional: generator_enc_attentions). generator_enc_last_hidden_state of shape (batch_size, n_docs * sequence_length, hidden_size) is a sequence of hidden states at the output of the last layer of the generator's encoder.

    Used by the (RagModel) model during decoding.

  • decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary, provided for generation tasks. None by default; construct them as per the instructions for the generator model you're using with your RAG instance.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are decoder input IDs?

  • decoder_attention_mask (torch.BoolTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. A causal mask will also be used by default.
  • past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at an earlier stage of decoding, when use_cache=True or config.use_cache=True.

    Only Cache instances are allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default.

    The model will output the same cache format as the one fed as input.

    If past_key_values is used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).

  • doc_scores (torch.FloatTensor of shape (batch_size, config.n_docs)) — Score between each retrieved document embedding (see retrieved_doc_embeds) and question_encoder_last_hidden_state. If the model was not initialized with a retriever, doc_scores has to be provided to the forward pass. doc_scores can be computed via question_encoder_last_hidden_state and retrieved_doc_embeds; see the examples for more information.
  • context_input_ids (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Input IDs post-processed from the retrieved documents and the question encoder input_ids by the retriever. If the model was not initialized with a retriever, context_input_ids has to be provided to the forward pass; context_input_ids are returned by __call__().
  • context_attention_mask (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Attention mask post-processed from the retrieved documents and the question encoder input_ids by the retriever. If the model was not initialized with a retriever, context_attention_mask has to be provided to the forward pass; context_attention_mask is returned by __call__().
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding.
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • output_retrieved (bool, optional) — Whether or not to return retrieved_doc_embeds, retrieved_doc_ids, context_input_ids and context_attention_mask. See returned tensors for more detail.
  • n_docs (int, optional) — The number of documents to retrieve.

Returns

transformers.models.rag.modeling_rag.RetrievAugLMOutput or tuple(torch.FloatTensor)

A transformers.models.rag.modeling_rag.RetrievAugLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements, depending on the configuration (RagConfig) and inputs.

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head. The score is possibly marginalized over all documents for each vocabulary token.

  • doc_scores (torch.FloatTensor of shape (batch_size, config.n_docs)) — Score between each retrieved document embedding (see retrieved_doc_embeds) and question_encoder_last_hidden_state.

  • past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide.

    Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.

  • retrieved_doc_embeds (torch.FloatTensor of shape (batch_size, config.n_docs, hidden_size), optional, returned when output_retrieved=True) — Embedded documents retrieved by the retriever. Is used with question_encoder_last_hidden_state to compute the doc_scores.

  • retrieved_doc_ids (torch.LongTensor of shape (batch_size, config.n_docs), optional, returned when output_retrieved=True) — The indexes of the embedded documents retrieved by the retriever.

  • context_input_ids (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Input ids post-processed from the retrieved documents and the question encoder input_ids by the retriever.

  • context_attention_mask (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Attention mask post-processed from the retrieved documents and the question encoder input_ids by the retriever.

  • question_encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the question encoder pooled output of the model.

  • question_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden states of the question encoder at the output of each layer plus the initial embedding outputs.

  • question_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the question encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • generator_enc_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the generator encoder of the model.

  • generator_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden states of the generator encoder at the output of each layer plus the initial embedding outputs.

  • generator_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the generator encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • generator_dec_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden states of the generator decoder at the output of each layer plus the initial embedding outputs.

  • generator_dec_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the generator decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • generator_cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Cross-attentions weights of the generator decoder, after the attention softmax, used to compute the weighted average in the cross-attention heads.

The RagModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example

>>> from transformers import AutoTokenizer, RagRetriever, RagModel
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-base")
>>> retriever = RagRetriever.from_pretrained(
...     "facebook/rag-token-base", index_name="exact", use_dummy_dataset=True
... )
>>> # initialize with RagRetriever to do everything in one forward call
>>> model = RagModel.from_pretrained("facebook/rag-token-base", retriever=retriever)

>>> inputs = tokenizer("How many people live in Paris?", return_tensors="pt")
>>> outputs = model(input_ids=inputs["input_ids"])

RagSequenceForGeneration

class transformers.RagSequenceForGeneration

( config: transformers.configuration_utils.PreTrainedConfig | None = None question_encoder: transformers.modeling_utils.PreTrainedModel | None = None generator: transformers.modeling_utils.PreTrainedModel | None = None retriever: transformers.models.rag.retrieval_rag.RagRetriever | None = None **kwargs )

Parameters

  • config (PreTrainedConfig, optional) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
  • question_encoder (PreTrainedModel, optional) — The model used to encode the question into hidden states for retrieval.
  • generator (PreTrainedModel, optional) — The model used to generate text conditioned on the retrieved documents.
  • retriever (RagRetriever, optional) — The component responsible for retrieving documents from a knowledge base given the encoded question.

A RAG-sequence model implementation. It performs RAG-sequence specific marginalization in the forward pass.
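
Concretely, RAG-sequence scores each candidate output sequence against every retrieved document and marginalizes at the sequence level: log p(y|x) = logsumexp over d of (log p(d|x) + log p(y|x, d)). A minimal sketch of that combination step (the names here are illustrative):

```python
import math

def rag_sequence_logprob(doc_logpriors, seq_logprobs_per_doc):
    """Sequence-level marginalization used by RAG-sequence.

    doc_logpriors: log p(d | x) for each retrieved document
    seq_logprobs_per_doc: log p(y | x, d) for the same documents
    Returns log p(y | x) = logsumexp_d(log p(d | x) + log p(y | x, d)).
    """
    terms = [lp + ls for lp, ls in zip(doc_logpriors, seq_logprobs_per_doc)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Two documents: doc priors 0.7 / 0.3, per-doc sequence probabilities 0.5 / 0.1.
score = rag_sequence_logprob([math.log(0.7), math.log(0.3)], [math.log(0.5), math.log(0.1)])
print(round(math.exp(score), 3))  # 0.7*0.5 + 0.3*0.1 = 0.38
```

This contrasts with RAG-token, which marginalizes over documents at every generation step rather than once per full sequence.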

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None encoder_outputs: tuple[tuple[torch.Tensor]] | None = None decoder_input_ids: torch.LongTensor | None = None decoder_attention_mask: torch.BoolTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None context_input_ids: torch.LongTensor | None = None context_attention_mask: torch.LongTensor | None = None doc_scores: torch.FloatTensor | None = None use_cache: bool | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None output_retrieved: bool | None = None exclude_bos_score: bool | None = None reduce_loss: bool | None = None labels: torch.LongTensor | None = None n_docs: int | None = None **kwargs ) transformers.models.rag.modeling_rag.RetrievAugLMMarginOutput or tuple(torch.FloatTensor)

参数

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — 词汇表中输入序列 token 的索引。用于初始化模型的 RagConfig 指定了要使用的生成器,同时也指定了兼容的生成器分词器。请使用该分词器类来获取索引。

    什么是 input IDs?

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — 避免在 padding token 索引上执行 attention 的掩码。掩码值选择在 [0, 1] 中:

    • 1 表示未被掩码的 token,
    • 0 表示被掩码的 token。

    什么是 attention masks?

  • encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — 由 (generator_enc_last_hidden_state, 可选: generator_enc_hidden_states, 可选: generator_enc_attentions) 组成的元组。generator_enc_last_hidden_state 的形状为 (batch_size, n_docs * sequence_length, hidden_size),是生成器编码器最后一层输出的隐藏状态序列。

    由 RagModel 在解码过程中使用。

  • decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — 词汇表中解码器输入序列 token 的索引,为生成任务提供。默认为 None,请按照您与 RAG 实例搭配使用的生成器模型的说明进行构建。

    可以使用 AutoTokenizer 获取索引。有关详细信息,请参阅 PreTrainedTokenizer.encode() 和 PreTrainedTokenizer.__call__()。

    什么是 decoder input IDs?

  • decoder_attention_mask (torch.BoolTensor of shape (batch_size, target_sequence_length), optional) — 默认行为:生成一个忽略 decoder_input_ids 中 pad token 的张量。也会默认使用因果掩码。
  • past_key_values (~cache_utils.Cache, optional) — 可以用于加速顺序解码的预计算隐藏状态(自注意力块和交叉注意力块中的键和值)。这通常由模型在解码的早期阶段返回的 past_key_values 组成,当 use_cache=Trueconfig.use_cache=True 时。

    只允许 Cache 实例作为输入,请参阅我们的 kv cache 指南。如果不传递 past_key_values,默认将初始化 DynamicCache

    模型将输出与输入相同的缓存格式。

    如果使用 past_key_values,用户可以选择只输入最后的、尚未将其过去键值状态提供给此模型的 input_ids,形状为 (batch_size, unprocessed_length),而不必输入形状为 (batch_size, sequence_length) 的全部 input_ids。

  • context_input_ids (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — 由检索到的文档和问题编码器 input_ids 经检索器后处理的输入 ID。如果模型未用 retriever 初始化,则必须在 forward pass 中提供 context_input_idscontext_input_ids__call__() 返回。
  • context_attention_mask (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — 由检索到的文档和问题编码器 input_ids 经检索器后处理的 attention mask。如果模型未用 retriever 初始化,则必须在 forward pass 中提供 context_attention_mask。context_attention_mask 由 __call__() 返回。
  • doc_scores (torch.FloatTensor of shape (batch_size, config.n_docs)) — 在每个检索到的文档嵌入(请参阅 retrieved_doc_embeds)和 question_encoder_last_hidden_state 之间的分数。如果模型未初始化 retriever,则必须在 forward pass 中提供 doc_scoresdoc_scores 可以通过 question_encoder_last_hidden_stateretrieved_doc_embeds 计算,请参阅示例了解更多信息。
  • use_cache (bool, optional) — 如果设置为 True,则返回 past_key_values 键值状态,并可用于加速解码(参见 past_key_values)。
  • output_attentions (bool, optional) — 是否返回所有注意力层的注意力张量。有关更多详细信息,请参阅返回张量下的 attentions
  • output_hidden_states (bool, optional) — 是否返回所有层的隐藏状态。有关更多详细信息,请参阅返回张量下的 hidden_states
  • output_retrieved (bool, optional) — 是否返回 retrieved_doc_embedsretrieved_doc_idscontext_input_idscontext_attention_mask。有关更多详细信息,请参阅返回张量。
  • exclude_bos_score (bool, optional) — 仅当传递了 labels 时才相关。如果为 True,则在计算损失时将忽略 BOS 标记的分数。
  • reduce_loss (bool, optional) — 仅当传递了 labels 时才相关。如果为 True,则使用 torch.Tensor.sum 操作对 NLL 损失进行缩减(求和)。
  • labels (torch.LongTensor, shape (batch_size, sequence_length), optional) — 用于计算掩码语言建模损失的标签。索引应在 [0, ..., config.vocab_size] 范围内或为 -100(请参阅 input_ids 的文档字符串)。索引设置为 -100 的标记将被忽略(掩码),仅为标签在 [0, ..., config.vocab_size] 范围内的标记计算损失。
  • n_docs (int, optional) — 要检索的文档数量。
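上面的 doc_scores 参数可以由 question_encoder_last_hidden_state 与 retrieved_doc_embeds 的内积得到(即后文示例中 torch.bmm 的语义)。下面用纯 Python 给出一个最小草图,仅用于说明该计算;compute_doc_scores 为假设的示意函数,并非库 API:

```python
def compute_doc_scores(question_hidden, retrieved_doc_embeds):
    """question_hidden: [batch][hidden];retrieved_doc_embeds: [batch][n_docs][hidden]。
    返回 [batch][n_docs] 的内积得分,对应 torch.bmm(q.unsqueeze(1), d.transpose(1, 2)).squeeze(1)。"""
    return [
        [sum(a * b for a, b in zip(q, doc)) for doc in docs]
        for q, docs in zip(question_hidden, retrieved_doc_embeds)
    ]

# batch_size=1、hidden_size=3、n_docs=2 的玩具示例
q = [[1.0, 0.0, 2.0]]
docs = [[[1.0, 1.0, 1.0], [0.5, 0.0, 0.5]]]
print(compute_doc_scores(q, docs))  # [[3.0, 1.5]]
```

实际使用中应直接按后文示例用 torch.bmm 在 GPU 上批量计算。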

返回

transformers.models.rag.modeling_rag.RetrievAugLMMarginOutputtuple(torch.FloatTensor)

一个 transformers.models.rag.modeling_rag.RetrievAugLMMarginOutput 或一个 torch.FloatTensor 元组(如果传递了 return_dict=False 或当 config.return_dict=False 时),具体取决于配置(RagConfig)和输入。

  • loss (torch.FloatTensor,形状为 (1,)可选,当提供 labels 时返回) — 语言建模损失。

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — 语言建模头的预测分数。该分数可能已对所有文档进行了边际化。

  • doc_scores (torch.FloatTensor of shape (batch_size, config.n_docs)) — 在每个检索到的文档嵌入(请参阅 retrieved_doc_embeds)和 question_encoder_last_hidden_state 之间的分数。

  • past_key_values (Cache, optional, 当传递 use_cache=True 或当 config.use_cache=True 时返回) — 它是 Cache 实例。更多详情,请参阅我们的 kv cache 指南

    包含解码器的预计算隐藏状态(注意力块中的键和值),可用于(请参阅 past_key_values 输入)加速顺序解码。

  • retrieved_doc_embeds (torch.FloatTensor of shape (batch_size, config.n_docs, hidden_size), optional, returned when output_retrieved=True) — 由检索器检索到的嵌入文档。与 question_encoder_last_hidden_state 一起用于计算 doc_scores

  • retrieved_doc_ids (torch.LongTensor of shape (batch_size, config.n_docs), optional, returned when output_retrieved=True) — 由检索器检索到的嵌入文档的索引。

  • context_input_ids (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — 由检索器从检索到的文档和问题编码器 input_ids 后处理的输入 ID。

  • context_attention_mask (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — 由检索器从检索到的文档和问题编码器 input_ids 后处理的 attention mask。

  • question_encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — 问题编码器最后一层输出的隐藏状态序列,即模型的池化输出。

  • question_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — 形状为 (batch_size, sequence_length, hidden_size)torch.FloatTensor 元组(一个用于嵌入的输出,一个用于每个层的输出)。

    问题编码器的隐藏状态,在每个层的输出以及初始嵌入输出。

  • question_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — 形状为 (batch_size, num_heads, sequence_length, sequence_length)torch.FloatTensor 元组(一个用于每个层)。

    问题编码器的 attention 权重,在 attention softmax 之后,用于在自注意力头中计算加权平均。

  • generator_enc_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — 模型生成器编码器的最后一层输出处的隐藏状态序列。

  • generator_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — 形状为 (batch_size, sequence_length, hidden_size)torch.FloatTensor 元组(一个用于嵌入的输出,一个用于每个层的输出)。

    生成器编码器的隐藏状态,在每个层的输出以及初始嵌入输出。

  • generator_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — 形状为 (batch_size, num_heads, sequence_length, sequence_length)torch.FloatTensor 元组(一个用于每个层)。

    生成器编码器的 attention 权重,在 attention softmax 之后,用于在自注意力头中计算加权平均。

  • generator_dec_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — 形状为 (batch_size, sequence_length, hidden_size)torch.FloatTensor 元组(一个用于嵌入的输出,一个用于每个层的输出)。

    生成器解码器的隐藏状态,在每个层的输出以及初始嵌入输出。

  • generator_dec_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — 形状为 (batch_size, num_heads, sequence_length, sequence_length)torch.FloatTensor 元组(一个用于每个层)。

    生成器解码器的 attention 权重,在 attention softmax 之后,用于在自注意力头中计算加权平均。

  • generator_cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — 形状为 (batch_size, num_heads, sequence_length, sequence_length)torch.FloatTensor 元组(一个用于每个层)。

    生成器解码器的交叉 attention 权重,在 attention softmax 之后,用于在交叉注意力头中计算加权平均。

transformers.RagSequenceForGenerationforward 方法,覆盖了 __call__ 特殊方法。

虽然 forward pass 的实现需要在此函数中定义,但之后应调用 Module 实例而非此函数,因为前者会负责运行预处理和后处理步骤,而后者会静默地忽略它们。
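返回值中 logits 的"对所有文档边际化",在 RAG-sequence 中发生在整条序列级别:log p(y|x) = logsumexp_z(log p(z|x) + log p(y|z,x)),其中 p(z|x) 来自 doc_scores 的 softmax。下面是一个纯 Python 的最小草图(marginalize_sequence 为假设的示意函数,非库实现):

```python
import math

def marginalize_sequence(doc_scores, seq_logprobs):
    """doc_scores: 每篇文档的检索得分;seq_logprobs: 各文档条件下整条序列的 log p(y|z, x)。"""
    # softmax 把检索得分归一化为文档后验 log p(z|x)
    m = max(doc_scores)
    log_z = math.log(sum(math.exp(s - m) for s in doc_scores)) + m
    doc_logprobs = [s - log_z for s in doc_scores]
    # logsumexp 合并各文档分支,得到 log p(y|x)
    terms = [dl + sl for dl, sl in zip(doc_logprobs, seq_logprobs)]
    t = max(terms)
    return math.log(sum(math.exp(x - t) for x in terms)) + t

# 两篇文档检索得分相同;序列似然分别为 0.5 与 0.25,边际似然为 0.5*0.5 + 0.5*0.25 = 0.375
print(marginalize_sequence([0.0, 0.0], [math.log(0.5), math.log(0.25)]))  # ≈ log(0.375)
```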

示例

>>> from transformers import AutoTokenizer, RagRetriever, RagSequenceForGeneration
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/rag-sequence-nq")
>>> retriever = RagRetriever.from_pretrained(
...     "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
... )
>>> # initialize with RagRetriever to do everything in one forward call
>>> model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

>>> inputs = tokenizer("How many people live in Paris?", return_tensors="pt")
>>> targets = tokenizer(text_target="In Paris, there are 10 million people.", return_tensors="pt")
>>> input_ids = inputs["input_ids"]
>>> labels = targets["input_ids"]
>>> outputs = model(input_ids=input_ids, labels=labels)

>>> # or use retriever separately
>>> model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq")
>>> # 1. Encode
>>> question_hidden_states = model.question_encoder(input_ids)[0]
>>> # 2. Retrieve
>>> docs_dict = retriever(input_ids.numpy(), question_hidden_states.detach().numpy(), return_tensors="pt")
>>> doc_scores = torch.bmm(
...     question_hidden_states.unsqueeze(1), docs_dict["retrieved_doc_embeds"].float().transpose(1, 2)
... ).squeeze(1)
>>> # 3. Forward to generator
>>> outputs = model(
...     context_input_ids=docs_dict["context_input_ids"],
...     context_attention_mask=docs_dict["context_attention_mask"],
...     doc_scores=doc_scores,
...     decoder_input_ids=labels,
... )

generate


( input_ids: torch.LongTensor | None = None attention_mask: torch.LongTensor | None = None context_input_ids: torch.LongTensor | None = None context_attention_mask: torch.LongTensor | None = None doc_scores: torch.FloatTensor | None = None do_deduplication: bool | None = None num_return_sequences: int | None = None num_beams: int | None = None n_docs: int | None = None **model_kwargs ) torch.LongTensor, shape (batch_size * num_return_sequences, sequence_length)

参数

  • input_ids (torch.LongTensor, shape (batch_size, sequence_length), optional) — 作为生成任务提示的序列。如果未传递 input_ids,则必须提供 context_input_ids
  • attention_mask (torch.Tensor, shape (batch_size, sequence_length), optional) — 用于避免对填充标记索引执行注意力的掩码。掩码值选择在 [0, 1] 中:

    • 1 表示未掩码的标记,
    • 0 表示已掩码的标记。

    什么是注意力掩码?

  • context_input_ids (torch.LongTensor, shape (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 中后处理的输入 ID。
  • context_attention_mask (torch.LongTensor, shape (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 中后处理的注意力掩码。

    如果模型未用 retriever 初始化,或者未给出 input_ids,则必须向前向传递提供 context_input_ids 和 context_attention_mask。它们由 __call__() 返回。

  • doc_scores (torch.FloatTensor, shape (batch_size, config.n_docs)) — 检索到的文档嵌入(请参阅 retrieved_doc_embeds)与 question_encoder_last_hidden_state 之间的分数。

    如果模型未初始化 retriever 或未给出 input_ids,则必须将 doc_scores 传递到前向传递。doc_scores__call__() 返回。

  • do_deduplication (bool, optional) — 是否对来自不同上下文文档、针对同一输入的生成结果进行去重。在使用分布式后端进行训练时必须设置为 False。
  • num_return_sequences (int, optional, 默认为 1) — 批次中每个元素独立计算的返回序列数。请注意,这不是传递给 generator 的 generate() 函数的值,在那里我们会将 num_return_sequences 设置为 num_beams。
  • num_beams (int, optional, 默认为 1) — 用于束搜索的束数。1 表示无束搜索。
  • n_docs (int, optional, 默认为 config.n_docs) — 要检索的文档数量和/或要生成答案的文档数量。
  • kwargs (dict[str, Any], optional) — 额外的 kwargs 将被传递给 generate()

返回

torch.LongTensor, shape (batch_size * num_return_sequences, sequence_length)

生成的序列。第二个维度(序列长度)等于 max_length,或者如果所有批次由于 eos_token_id 而提前完成,则会更短。

实现了 RAG 序列“详尽”解码。有关如何设置其他生成输入参数的信息,请参阅 generate() 文档。
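"详尽(thorough)"解码的大致流程是:先对每篇检索文档分别运行 beam search 得到候选序列,按需去重(do_deduplication),再用对所有文档边际化后的似然为每个候选重新打分。下面用纯 Python 勾勒这一流程;thorough_decode、generate_per_doc、score_candidate 均为假设的示意名称,并非库 API:

```python
def thorough_decode(docs, generate_per_doc, score_candidate, do_deduplication=True):
    candidates = []
    for doc in docs:  # 1) 对每篇文档独立生成候选序列
        candidates.extend(generate_per_doc(doc))
    if do_deduplication:  # 2) 去除来自不同文档的重复候选
        candidates = list(dict.fromkeys(candidates))
    # 3) 用对所有文档边际化后的得分重新打分,取最优候选
    return max(candidates, key=lambda c: score_candidate(c, docs))

# 玩具示例:两篇文档各自生成两个候选,"paris" 的边际得分最高
docs = ["doc-1", "doc-2"]
generate_per_doc = lambda d: ["paris", "london"] if d == "doc-1" else ["paris", "rome"]
score_candidate = lambda c, ds: {"paris": 0.9, "london": 0.4, "rome": 0.5}[c]
print(thorough_decode(docs, generate_per_doc, score_candidate))  # paris
```

真实实现中,第 3 步的打分即对候选序列在所有检索文档上运行一次 forward 并计算边际似然。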

RagTokenForGeneration

class transformers.RagTokenForGeneration


( config: transformers.configuration_utils.PreTrainedConfig | None = None question_encoder: transformers.modeling_utils.PreTrainedModel | None = None generator: transformers.modeling_utils.PreTrainedModel | None = None retriever: transformers.models.rag.retrieval_rag.RagRetriever | None = None **kwargs )

参数

  • config (PreTrainedConfig, optional) — 模型的配置类,包含模型的所有参数。使用 config 文件初始化不会加载与模型相关的权重,只会加载配置。请查看 from_pretrained() 方法以加载模型权重。
  • question_encoder (PreTrainedModel, optional) — 负责将问题编码为用于检索的隐藏状态的模型。
  • generator (PreTrainedModel, optional) — 负责根据检索到的文档生成文本的模型。
  • retriever (RagRetriever, optional) — 负责根据编码的问题从知识库中检索文档的组件。

RAG-token 模型实现。它在前向传递中执行 RAG-token 特定的边际化。

此模型继承自 PreTrainedModel。查看其父类文档,了解库为所有模型实现的通用方法(例如下载或保存、调整输入嵌入大小、修剪头等)。

此模型也是一个 PyTorch torch.nn.Module 子类。像普通的 PyTorch Module 一样使用它,并参考 PyTorch 文档了解一般用法和行为的所有相关信息。

forward


( input_ids: torch.LongTensor | None = None attention_mask: torch.FloatTensor | None = None encoder_outputs: tuple[tuple[torch.Tensor]] | None = None decoder_input_ids: torch.LongTensor | None = None decoder_attention_mask: torch.BoolTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None context_input_ids: torch.LongTensor | None = None context_attention_mask: torch.LongTensor | None = None doc_scores: torch.FloatTensor | None = None use_cache: bool | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None output_retrieved: bool | None = None do_marginalize: bool | None = None reduce_loss: bool | None = None labels: torch.LongTensor | None = None n_docs: int | None = None **kwargs ) transformers.models.rag.modeling_rag.RetrievAugLMMarginOutputtuple(torch.FloatTensor)

参数

  • input_ids (torch.LongTensor, shape (batch_size, sequence_length)) — 词汇表中输入序列标记的索引。用于初始化模型的 RagConfig 指定了要使用的生成器,同时也指定了兼容的生成器分词器。请使用该分词器类来获取索引。

    什么是输入 ID?

  • attention_mask (torch.FloatTensor, shape (batch_size, sequence_length), optional) — 用于避免对填充标记索引执行注意力的掩码。掩码值选择在 [0, 1] 中:

    • 1 表示未掩码的标记,
    • 0 表示已掩码的标记。

    什么是注意力掩码?

  • encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — 由(generator_enc_last_hidden_state、可选的 generator_enc_hidden_states、可选的 generator_enc_attentions)组成的元组。generator_enc_last_hidden_state 的形状为 (batch_size, n_docs * sequence_length, hidden_size),是生成器编码器最后一层输出的隐藏状态序列。

    由 RagModel 在解码过程中使用。

  • decoder_input_ids (torch.LongTensor, shape (batch_size, target_sequence_length), optional) — 词汇表中解码器输入序列标记的索引,为生成任务提供。默认为 None,请按照您与 RAG 实例搭配使用的生成器模型的说明进行构建。

    可以使用 AutoTokenizer 获取索引。有关详细信息,请参阅 PreTrainedTokenizer.encode() 和 PreTrainedTokenizer.__call__()。

    什么是解码器输入 ID?

  • decoder_attention_mask (torch.BoolTensor, shape (batch_size, target_sequence_length), optional) — 默认行为:生成一个忽略 decoder_input_ids 中填充标记的张量。因果掩码也将默认使用。
  • past_key_values (~cache_utils.Cache, optional) — 可以用于加速顺序解码的预计算隐藏状态(自注意力块和交叉注意力块中的键值)。这通常由模型在解码的上一阶段返回的 past_key_values 组成,当 use_cache=Trueconfig.use_cache=True 时。

    只允许 Cache 实例作为输入,请参阅我们的 kv cache 指南。如果不传递 past_key_values,默认将初始化 DynamicCache

    模型将输出与输入相同的缓存格式。

    如果使用 past_key_values,用户可以选择只输入最后的、尚未将其过去键值状态提供给此模型的 input_ids,形状为 (batch_size, unprocessed_length),而不必输入形状为 (batch_size, sequence_length) 的全部 input_ids。

  • context_input_ids (torch.LongTensor, shape (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 中后处理的输入 ID。如果模型没有初始化 retriever,则必须将 context_input_ids 传递到前向传递。context_input_ids__call__() 返回。
  • context_attention_mask (torch.LongTensor, shape (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 中后处理的注意力掩码。如果模型未初始化 retriever,则必须将 context_attention_mask 传递到前向传递。context_attention_mask 由 __call__() 返回。
  • doc_scores (torch.FloatTensor, shape (batch_size, config.n_docs)) — 检索到的文档嵌入(请参阅 retrieved_doc_embeds)与 question_encoder_last_hidden_state 之间的分数。如果模型没有初始化 retriever,则必须将 doc_scores 传递到前向传递。doc_scores 可以通过 question_encoder_last_hidden_stateretrieved_doc_embeds 计算,请参阅示例以获取更多信息。
  • use_cache (bool, optional) — 如果设置为 True,则返回 past_key_values 键值状态,并可用于加速解码(参见 past_key_values)。
  • output_attentions (bool, optional) — 是否返回所有注意力层的注意力张量。有关更多详细信息,请参阅返回张量下的 attentions
  • output_hidden_states (bool, optional) — 是否返回所有层的隐藏状态。有关更多详细信息,请参阅返回张量下的 hidden_states
  • output_retrieved (bool, optional) — 是否返回 retrieved_doc_embedsretrieved_doc_idscontext_input_idscontext_attention_mask。有关更多详细信息,请参阅返回张量。
  • do_marginalize (bool, optional) — 如果为 True,则通过使用 torch.nn.functional.log_softmax 对所有文档进行边际化以获得 logits。
  • reduce_loss (bool, optional) — 仅当传递了 labels 时才相关。如果为 True,则使用 torch.Tensor.sum 操作对 NLL 损失进行缩减(求和)。
  • labels (torch.LongTensor, shape (batch_size, sequence_length), optional) — 用于计算掩码语言建模损失的标签。索引应在 [0, ..., config.vocab_size] 范围内或为 -100(请参阅 input_ids 的文档字符串)。索引设置为 -100 的标记将被忽略(掩码),仅为标签在 [0, ..., config.vocab_size] 范围内的标记计算损失。
  • n_docs (int, optional) — 要检索的文档数量。
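与 RAG-sequence 不同,RAG-token 在每个解码步对文档进行边际化(对应上面的 do_marginalize):先对每篇文档的下一 token logits 取 log_softmax,加上文档的 log 后验,再在文档维度上做 logsumexp。下面是一个纯 Python 的最小草图(marginalize_step 为假设的示意函数,非库实现):

```python
import math

def log_softmax(xs):
    m = max(xs)
    lse = math.log(sum(math.exp(x - m) for x in xs)) + m
    return [x - lse for x in xs]

def marginalize_step(doc_scores, per_doc_token_logits):
    """doc_scores: 每篇文档的检索得分;per_doc_token_logits[z][v]: 文档 z 条件下 token v 的 logit。"""
    doc_logprobs = log_softmax(doc_scores)                               # log p(z|x)
    token_logprobs = [log_softmax(row) for row in per_doc_token_logits]  # log p(v|z, x)
    out = []
    for v in range(len(per_doc_token_logits[0])):
        terms = [dl + row[v] for dl, row in zip(doc_logprobs, token_logprobs)]
        m = max(terms)
        out.append(math.log(sum(math.exp(t - m) for t in terms)) + m)   # 对文档维度做 logsumexp
    return out

# 两篇等权文档、词表大小为 2 的玩具示例
probs = marginalize_step([0.0, 0.0], [[0.0, 0.0], [0.0, math.log(3.0)]])
print([round(math.exp(p), 3) for p in probs])  # [0.375, 0.625]
```

库实现用 torch.nn.functional.log_softmax 与张量化的 logsumexp 完成同样的计算。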

返回

transformers.models.rag.modeling_rag.RetrievAugLMMarginOutputtuple(torch.FloatTensor)

一个 transformers.models.rag.modeling_rag.RetrievAugLMMarginOutput 或一个 torch.FloatTensor 元组(如果传递了 return_dict=False 或当 config.return_dict=False 时),具体取决于配置(RagConfig)和输入。

  • loss (torch.FloatTensor,形状为 (1,)可选,当提供 labels 时返回) — 语言建模损失。

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — 语言建模头的预测分数。该分数可能已对所有文档进行了边际化。

  • doc_scores (torch.FloatTensor of shape (batch_size, config.n_docs)) — 在每个检索到的文档嵌入(请参阅 retrieved_doc_embeds)和 question_encoder_last_hidden_state 之间的分数。

  • past_key_values (Cache, optional, 当传递 use_cache=True 或当 config.use_cache=True 时返回) — 它是 Cache 实例。更多详情,请参阅我们的 kv cache 指南

    包含解码器的预计算隐藏状态(注意力块中的键和值),可用于(请参阅 past_key_values 输入)加速顺序解码。

  • retrieved_doc_embeds (torch.FloatTensor of shape (batch_size, config.n_docs, hidden_size), optional, returned when output_retrieved=True) — 由检索器检索到的嵌入文档。与 question_encoder_last_hidden_state 一起用于计算 doc_scores

  • retrieved_doc_ids (torch.LongTensor of shape (batch_size, config.n_docs), optional, returned when output_retrieved=True) — 由检索器检索到的嵌入文档的索引。

  • context_input_ids (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — 由检索器从检索到的文档和问题编码器 input_ids 后处理的输入 ID。

  • context_attention_mask (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — 由检索器从检索到的文档和问题编码器 input_ids 后处理的 attention mask。

  • question_encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — 问题编码器最后一层输出的隐藏状态序列,即模型的池化输出。

  • question_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — 形状为 (batch_size, sequence_length, hidden_size)torch.FloatTensor 元组(一个用于嵌入的输出,一个用于每个层的输出)。

    问题编码器的隐藏状态,在每个层的输出以及初始嵌入输出。

  • question_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — 形状为 (batch_size, num_heads, sequence_length, sequence_length)torch.FloatTensor 元组(一个用于每个层)。

    问题编码器的 attention 权重,在 attention softmax 之后,用于在自注意力头中计算加权平均。

  • generator_enc_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — 模型生成器编码器的最后一层输出处的隐藏状态序列。

  • generator_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — 形状为 (batch_size, sequence_length, hidden_size)torch.FloatTensor 元组(一个用于嵌入的输出,一个用于每个层的输出)。

    生成器编码器的隐藏状态,在每个层的输出以及初始嵌入输出。

  • generator_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — 形状为 (batch_size, num_heads, sequence_length, sequence_length)torch.FloatTensor 元组(一个用于每个层)。

    生成器编码器的 attention 权重,在 attention softmax 之后,用于在自注意力头中计算加权平均。

  • generator_dec_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — 形状为 (batch_size, sequence_length, hidden_size)torch.FloatTensor 元组(一个用于嵌入的输出,一个用于每个层的输出)。

    生成器解码器的隐藏状态,在每个层的输出以及初始嵌入输出。

  • generator_dec_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — 形状为 (batch_size, num_heads, sequence_length, sequence_length)torch.FloatTensor 元组(一个用于每个层)。

    生成器解码器的 attention 权重,在 attention softmax 之后,用于在自注意力头中计算加权平均。

  • generator_cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — 形状为 (batch_size, num_heads, sequence_length, sequence_length)torch.FloatTensor 元组(一个用于每个层)。

    生成器解码器的交叉 attention 权重,在 attention softmax 之后,用于在交叉注意力头中计算加权平均。

transformers.RagTokenForGenerationforward 方法,覆盖了 __call__ 特殊方法。

虽然 forward pass 的实现需要在此函数中定义,但之后应调用 Module 实例而非此函数,因为前者会负责运行预处理和后处理步骤,而后者会静默地忽略它们。

示例

>>> from transformers import AutoTokenizer, RagRetriever, RagTokenForGeneration
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-nq")
>>> retriever = RagRetriever.from_pretrained(
...     "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
... )
>>> # initialize with RagRetriever to do everything in one forward call
>>> model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

>>> inputs = tokenizer("How many people live in Paris?", return_tensors="pt")
>>> targets = tokenizer(text_target="In Paris, there are 10 million people.", return_tensors="pt")
>>> input_ids = inputs["input_ids"]
>>> labels = targets["input_ids"]
>>> outputs = model(input_ids=input_ids, labels=labels)

>>> # or use retriever separately
>>> model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq")
>>> # 1. Encode
>>> question_hidden_states = model.question_encoder(input_ids)[0]
>>> # 2. Retrieve
>>> docs_dict = retriever(input_ids.numpy(), question_hidden_states.detach().numpy(), return_tensors="pt")
>>> doc_scores = torch.bmm(
...     question_hidden_states.unsqueeze(1), docs_dict["retrieved_doc_embeds"].float().transpose(1, 2)
... ).squeeze(1)
>>> # 3. Forward to generator
>>> outputs = model(
...     context_input_ids=docs_dict["context_input_ids"],
...     context_attention_mask=docs_dict["context_attention_mask"],
...     doc_scores=doc_scores,
...     decoder_input_ids=labels,
... )

>>> # or directly generate
>>> generated = model.generate(
...     context_input_ids=docs_dict["context_input_ids"],
...     context_attention_mask=docs_dict["context_attention_mask"],
...     doc_scores=doc_scores,
... )
>>> generated_string = tokenizer.batch_decode(generated, skip_special_tokens=True)

generate


( input_ids: torch.LongTensor | None = None attention_mask: torch.LongTensor | None = None context_input_ids: torch.LongTensor | None = None context_attention_mask: torch.LongTensor | None = None doc_scores: torch.FloatTensor | None = None n_docs: int | None = None generation_config: transformers.generation.configuration_utils.GenerationConfig | None = None prefix_allowed_tokens_fn: collections.abc.Callable[[int, torch.Tensor], list[int]] | None = None logits_processor: transformers.generation.logits_process.LogitsProcessorList | None = [] stopping_criteria: transformers.generation.stopping_criteria.StoppingCriteriaList | None = [] **kwargs ) torch.LongTensor of shape (batch_size * num_return_sequences, sequence_length)

参数

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — 作为生成提示的序列。如果未传递 input_ids,则必须提供 context_input_ids。
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — 用于避免对填充标记索引执行注意力的掩码。掩码值选择在 [0, 1] 中:

    • 1 表示未掩码的标记,
    • 0 表示已掩码的标记。

    什么是注意力掩码?

  • context_input_ids (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 中后处理的输入 ID。

    如果模型未用 retriever 初始化,则必须向前向传递提供 context_input_ids。context_input_ids 由 __call__() 返回。

  • context_attention_mask (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 中后处理的注意力掩码。

    如果模型未用 retriever 初始化,则必须向前向传递提供 context_attention_mask。context_attention_mask 由 __call__() 返回。

  • doc_scores (torch.FloatTensor of shape (batch_size, config.n_docs)) — 检索到的文档嵌入(请参阅 retrieved_doc_embeds)与 question_encoder_last_hidden_state 之间的分数。

    如果模型未用 retriever 初始化,则必须向前向传递提供 doc_scores。doc_scores 由 __call__() 返回。

  • n_docs (int, optional, 默认为 config.n_docs) — 要检索的文档数量和/或要生成答案的文档数量。
  • generation_config (~generation.GenerationConfig, optional) — 用作本次生成调用基础参数化的生成配置。传递给 generate 的、与 generation_config 属性匹配的 **kwargs 将覆盖这些属性。如果未提供 generation_config,将使用默认配置,其加载优先级如下:1)来自 generation_config.json 模型文件(如果存在);2)来自模型配置。请注意,未指定的参数将继承 GenerationConfig 的默认值,应查阅其文档来对生成进行参数化。
  • prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], list[int]], optional) — 如果提供,该函数会在每一步将束搜索约束为仅允许的标记;如果未提供,则不应用任何约束。该函数接受 2 个参数:inputs_ids 和批次 ID batch_id,并且必须返回一个列表,其中包含在给定先前生成的标记 inputs_ids 和批次 ID batch_id 的条件下,下一生成步所允许的标记。此参数对于以前缀为条件的受约束生成很有用,如 Autoregressive Entity Retrieval 中所述。
  • logits_processor (LogitsProcessorList, optional) — 自定义 logits 处理器,用于补充由参数和模型配置构建的默认 logits 处理器。如果传入的处理器已由参数或模型配置创建,则会抛出错误。
  • stopping_criteria (StoppingCriteriaList, optional) — 自定义停止条件,用于补充由参数和模型配置构建的默认停止条件。如果传入的停止条件已由参数或模型配置创建,则会抛出错误。
  • kwargs (dict[str, Any], optional) — generation_config 的临时参数化,和/或将被转发给模型 forward 函数的其他模型特定 kwargs。

返回

torch.LongTensor, shape (batch_size * num_return_sequences, sequence_length)

生成的序列。第二个维度(序列长度)等于 max_length,或者如果所有批次由于 eos_token_id 而提前完成,则会更短。

实现 RAG token 解码。
