This model was released on 2020-05-22 and added to Hugging Face Transformers on 2020-11-16.
RAG
Retrieval-Augmented Generation (RAG) combines a pretrained language model (parametric memory) with an external data source accessed through a pretrained neural retriever (non-parametric memory). At inference time, RAG retrieves relevant passages and conditions generation on them. This typically makes answers more factual, and it lets you update the model's knowledge by swapping the index instead of retraining the whole model.
You can find all the original RAG checkpoints under the AI at Meta organization.

This model was contributed by ola13.

Click on the RAG models in the sidebar for more examples of how to apply RAG to different language tasks.

The example below demonstrates how to generate text with RagSequenceForGeneration.
import torch
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base", dataset="wiki_dpr", index_name="compressed"
)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq",
    retriever=retriever,
    dtype="auto",
    attn_implementation="flash_attention_2",
)

input_dict = tokenizer.prepare_seq2seq_batch("How many people live in Paris?", return_tensors="pt")
generated = model.generate(input_ids=input_dict["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])

Quantization reduces the memory footprint by storing weights at lower precision. See the Quantization overview for the supported backends. The example below uses bitsandbytes to quantize the weights to 4 bits.
import torch
from transformers import BitsAndBytesConfig, RagTokenizer, RagRetriever, RagSequenceForGeneration

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base", dataset="wiki_dpr", index_name="compressed"
)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq",
    retriever=retriever,
    quantization_config=bnb,  # quantizes generator weights
    device_map="auto",
)

input_dict = tokenizer.prepare_seq2seq_batch("How many people live in Paris?", return_tensors="pt")
generated = model.generate(input_ids=input_dict["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])

RagConfig
class transformers.RagConfig
< source >( vocab_size = None is_encoder_decoder = True prefix = None bos_token_id = None pad_token_id = None eos_token_id = None decoder_start_token_id = None title_sep = ' / ' doc_sep = ' // ' n_docs = 5 max_combined_length = 300 retrieval_vector_size = 768 retrieval_batch_size = 8 dataset = 'wiki_dpr' dataset_split = 'train' index_name = 'compressed' index_path = None passages_path = None use_dummy_dataset = False reduce_loss = False label_smoothing = 0.0 do_deduplication = True exclude_bos_score = False do_marginalize = False output_retrieved = False use_cache = True dataset_revision = None **kwargs )
Parameters

- title_sep (str, optional, defaults to " / ") — Separator inserted between the title and the text of the retrieved document when calling RagRetriever.
- doc_sep (str, optional, defaults to " // ") — Separator inserted between the text of the retrieved document and the original input when calling RagRetriever.
- n_docs (int, optional, defaults to 5) — Number of documents to retrieve.
- max_combined_length (int, optional, defaults to 300) — Maximum length of the contextualized input returned by the __call__() method of RagRetriever.
- retrieval_vector_size (int, optional, defaults to 768) — Dimensionality of the document embeddings indexed by RagRetriever.
- retrieval_batch_size (int, optional, defaults to 8) — Retrieval batch size, defined as the number of queries issued concurrently to the faiss index wrapped by RagRetriever.
- dataset (str, optional, defaults to "wiki_dpr") — Identifier of the indexed dataset in HuggingFace Datasets (use datasets.list_datasets() to list all available datasets and IDs).
- dataset_split (str, optional, defaults to "train") — Which split of the dataset to load.
- index_name (str, optional, defaults to "compressed") — The index name of the index associated with the dataset. One of "legacy", "exact" and "compressed".
- index_path (str, optional) — Path to the serialized faiss index on disk.
- passages_path (str, optional) — Path to text passages compatible with the faiss index. Required if using LegacyIndex.
- use_dummy_dataset (bool, optional, defaults to False) — Whether to load a "dummy" variant of the dataset specified by dataset.
- label_smoothing (float, optional, defaults to 0.0) — Only relevant if return_loss is set to True. Controls the epsilon parameter of label smoothing in the loss computation. If set to 0, no label smoothing is performed.
- do_marginalize (bool, optional, defaults to False) — If True, the logits are marginalized over all documents using torch.nn.functional.log_softmax.
- reduce_loss (bool, optional, defaults to False) — Whether to reduce the NLL loss using the torch.Tensor.sum operation.
- do_deduplication (bool, optional, defaults to True) — Whether to deduplicate the generations from different context documents for a given input. Has to be set to False if used while training with a distributed backend.
- exclude_bos_score (bool, optional, defaults to False) — Whether to ignore the BOS token when computing the loss.
- output_retrieved (bool, optional, defaults to False) — If set to True, retrieved_doc_embeds, retrieved_doc_ids, context_input_ids and context_attention_mask are returned. See the returned tensors for more detail.
- use_cache (bool, optional, defaults to True) — Whether the model should return the last key/value attentions (not used by all models).
RagConfig stores the configuration of a RagModel. The configuration object inherits from PreTrainedConfig and can be used to control the model outputs. Read the documentation of PreTrainedConfig for more information.
from_question_encoder_generator_configs
< source >( question_encoder_config: PreTrainedConfig generator_config: PreTrainedConfig **kwargs ) → EncoderDecoderConfig

Instantiate an EncoderDecoderConfig (or a derived class) from a pretrained encoder model configuration and a decoder model configuration.
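As a sketch of how this helper can be used, the default DPR and BART configurations below stand in for any compatible question encoder / generator pair; any extra keyword arguments become RAG-level settings:

```python
from transformers import BartConfig, DPRConfig, RagConfig

# Default sub-configs; in practice these would come from real checkpoints.
question_encoder_config = DPRConfig()
generator_config = BartConfig()

# Combine them into a single RagConfig; n_docs here is a RAG-level kwarg.
config = RagConfig.from_question_encoder_generator_configs(
    question_encoder_config, generator_config, n_docs=5, index_name="compressed"
)
```

The resulting config nests the two sub-configs under config.question_encoder and config.generator.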
RagTokenizer
Rag specific outputs
class transformers.models.rag.modeling_rag.RetrievAugLMMarginOutput
< source >( loss: torch.FloatTensor | None = None logits: torch.FloatTensor | None = None doc_scores: torch.FloatTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None retrieved_doc_embeds: torch.FloatTensor | None = None retrieved_doc_ids: torch.LongTensor | None = None context_input_ids: torch.LongTensor | None = None context_attention_mask: torch.LongTensor | None = None question_encoder_last_hidden_state: torch.FloatTensor | None = None question_enc_hidden_states: tuple[torch.FloatTensor, ...] | None = None question_enc_attentions: tuple[torch.FloatTensor, ...] | None = None generator_enc_last_hidden_state: torch.FloatTensor | None = None generator_enc_hidden_states: tuple[torch.FloatTensor, ...] | None = None generator_enc_attentions: tuple[torch.FloatTensor, ...] | None = None generator_dec_hidden_states: tuple[torch.FloatTensor, ...] | None = None generator_dec_attentions: tuple[torch.FloatTensor, ...] | None = None generator_cross_attentions: tuple[torch.FloatTensor, ...] | None = None )
Parameters

- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss.
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head. The score is possibly marginalized over all documents for each vocabulary token.
- doc_scores (torch.FloatTensor of shape (batch_size, config.n_docs)) — Score between each retrieved document embedding (see retrieved_doc_embeds) and question_encoder_last_hidden_state.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A Cache instance; for more details, see the kv cache guide. Contains precomputed hidden states (keys and values in the attention blocks) of the decoder that can be used (see the past_key_values input) to speed up sequential decoding.
- retrieved_doc_embeds (torch.FloatTensor of shape (batch_size, config.n_docs, hidden_size), optional, returned when output_retrieved=True) — Embedded documents retrieved by the retriever. Used with question_encoder_last_hidden_state to compute the doc_scores.
- retrieved_doc_ids (torch.LongTensor of shape (batch_size, config.n_docs), optional, returned when output_retrieved=True) — The indexes of the embedded documents retrieved by the retriever.
- context_input_ids (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Input IDs post-processed from the retrieved documents and the question encoder input_ids by the retriever.
- context_attention_mask (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Attention mask post-processed from the retrieved documents and the question encoder input_ids by the retriever.
- question_encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the question encoder pooled output of the model.
- question_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the question encoder at the output of each layer plus the initial embedding outputs.
- question_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the question encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- generator_enc_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the generator encoder of the model.
- generator_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the generator encoder at the output of each layer plus the initial embedding outputs.
- generator_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the generator encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- generator_dec_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the generator decoder at the output of each layer plus the initial embedding outputs.
- generator_dec_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the generator decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- generator_cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross-attention weights of the generator decoder, after the attention softmax, used to compute the weighted average in the cross-attention heads.

Base class for retriever-augmented marginalized model outputs.
class transformers.models.rag.modeling_rag.RetrievAugLMOutput
< source >( logits: torch.FloatTensor | None = None doc_scores: torch.FloatTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None retrieved_doc_embeds: torch.FloatTensor | None = None retrieved_doc_ids: torch.LongTensor | None = None context_input_ids: torch.LongTensor | None = None context_attention_mask: torch.LongTensor | None = None question_encoder_last_hidden_state: torch.FloatTensor | None = None question_enc_hidden_states: tuple[torch.FloatTensor, ...] | None = None question_enc_attentions: tuple[torch.FloatTensor, ...] | None = None generator_enc_last_hidden_state: torch.FloatTensor | None = None generator_enc_hidden_states: tuple[torch.FloatTensor, ...] | None = None generator_enc_attentions: tuple[torch.FloatTensor, ...] | None = None generator_dec_hidden_states: tuple[torch.FloatTensor, ...] | None = None generator_dec_attentions: tuple[torch.FloatTensor, ...] | None = None generator_cross_attentions: tuple[torch.FloatTensor, ...] | None = None )
Parameters

- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head. The score is possibly marginalized over all documents for each vocabulary token.
- doc_scores (torch.FloatTensor of shape (batch_size, config.n_docs)) — Score between each retrieved document embedding (see retrieved_doc_embeds) and question_encoder_last_hidden_state.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A Cache instance; for more details, see the kv cache guide. Contains precomputed hidden states (keys and values in the attention blocks) of the decoder that can be used (see the past_key_values input) to speed up sequential decoding.
- retrieved_doc_embeds (torch.FloatTensor of shape (batch_size, config.n_docs, hidden_size), optional, returned when output_retrieved=True) — Embedded documents retrieved by the retriever. Used with question_encoder_last_hidden_state to compute the doc_scores.
- retrieved_doc_ids (torch.LongTensor of shape (batch_size, config.n_docs), optional, returned when output_retrieved=True) — The indexes of the embedded documents retrieved by the retriever.
- context_input_ids (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Input IDs post-processed from the retrieved documents and the question encoder input_ids by the retriever.
- context_attention_mask (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Attention mask post-processed from the retrieved documents and the question encoder input_ids by the retriever.
- question_encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the question encoder pooled output of the model.
- question_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the question encoder at the output of each layer plus the initial embedding outputs.
- question_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the question encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- generator_enc_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the generator encoder of the model.
- generator_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the generator encoder at the output of each layer plus the initial embedding outputs.
- generator_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the generator encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- generator_dec_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the generator decoder at the output of each layer plus the initial embedding outputs.
- generator_dec_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the generator decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- generator_cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross-attention weights of the generator decoder, after the attention softmax, used to compute the weighted average in the cross-attention heads.
RagRetriever
class transformers.RagRetriever
< source >( config question_encoder_tokenizer generator_tokenizer index = None init_retrieval = True )
Parameters

- config (RagConfig) — The configuration of the RAG model this retriever is used with. Contains parameters indicating which Index to build. You can load your own custom dataset with config.index_name="custom", or use a canonical one (default) from the datasets library with config.index_name="wiki_dpr", for example.
- question_encoder_tokenizer (PreTrainedTokenizer) — The tokenizer that was used to tokenize the question. It is used to decode the question and then use the generator_tokenizer.
- generator_tokenizer (PreTrainedTokenizer) — The tokenizer used for the generator part of the RagModel.
- index (Index, optional, defaults to the one defined by the configuration) — If specified, use this index instead of the one built from the configuration.

Retriever used to get documents from vector queries. It retrieves the document embeddings as well as the document contents, and formats them to be used with a RagModel.
Examples
>>> # To load the default "wiki_dpr" dataset with 21M passages from wikipedia (index name is 'compressed' or 'exact')
>>> from transformers import RagRetriever
>>> retriever = RagRetriever.from_pretrained(
... "facebook/dpr-ctx_encoder-single-nq-base", dataset="wiki_dpr", index_name="compressed"
... )
>>> # To load your own indexed dataset built with the datasets library. More info on how to build the indexed dataset in examples/rag/use_own_knowledge_dataset.py
>>> from transformers import RagRetriever
>>> dataset = (
... ...
... ) # dataset must be a datasets.Datasets object with columns "title", "text" and "embeddings", and it must have a supported index (e.g., Faiss or other index types depending on your setup)
>>> retriever = RagRetriever.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base", indexed_dataset=dataset)
>>> # To load your own indexed dataset built with the datasets library that was saved on disk. More info in examples/rag/use_own_knowledge_dataset.py
>>> from transformers import RagRetriever
>>> dataset_path = "path/to/my/dataset" # dataset saved via *dataset.save_to_disk(...)*
>>> index_path = "path/to/my/index" # index saved via *dataset.get_index("embeddings").save(...)*
>>> retriever = RagRetriever.from_pretrained(
... "facebook/dpr-ctx_encoder-single-nq-base",
... index_name="custom",
... passages_path=dataset_path,
... index_path=index_path,
... )
>>> # To load the legacy index built originally for Rag's paper
>>> from transformers import RagRetriever
>>> retriever = RagRetriever.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base", index_name="legacy")

init_retrieval

Retriever initialization function. It loads the index into memory.
postprocess_docs
< source >( docs input_strings prefix n_docs return_tensors = None ) → tuple(tensors)
Postprocessing the retrieved docs and combining them with input_strings.
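As a rough illustration of the combined format, here is a simplified sketch (not the library implementation) of how one retrieved document is merged with the original input, assuming the default title_sep and doc_sep values from RagConfig:

```python
# Hypothetical helper mimicking the document/input combination step.
def combine(doc_title: str, doc_text: str, input_string: str,
            title_sep: str = " / ", doc_sep: str = " // ") -> str:
    # title, then title_sep, then document text, then doc_sep, then the original input
    return doc_title + title_sep + doc_text + doc_sep + input_string

combined = combine("Paris", "Paris is the capital of France.",
                   "How many people live in Paris?")
print(combined)
# Paris / Paris is the capital of France. // How many people live in Paris?
```

This combined string is what gets tokenized into context_input_ids, which is why its length is capped by max_combined_length.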
retrieve
< source >( question_hidden_states: ndarray n_docs: int ) → tuple[np.ndarray, np.ndarray, list[dict]]
Parameters

- question_hidden_states (np.ndarray of shape (batch_size, vector_size)) — A batch of query vectors to retrieve with.
- n_docs (int) — The number of documents retrieved per query.

Returns

tuple[np.ndarray, np.ndarray, list[dict]]

A tuple with the following objects:

- retrieved_doc_embeds (np.ndarray of shape (batch_size, n_docs, dim)) — The retrieval embeddings of the retrieved documents per query.
- doc_ids (np.ndarray of shape (batch_size, n_docs)) — The IDs of the documents in the index.
- doc_dicts (list[dict]) — The retrieved_doc_embeds examples per query.

Retrieves documents for the specified question_hidden_states.
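The shapes involved can be sketched with dummy numpy arrays. This is a toy sketch whose names mirror the values above; in practice the query vectors come from the DPR question encoder and the document embeddings from the faiss index:

```python
import numpy as np

batch_size, n_docs, dim = 2, 5, 768

# Stand-ins for the input and first output of retriever.retrieve(...).
question_hidden_states = np.random.randn(batch_size, dim).astype("float32")
retrieved_doc_embeds = np.random.randn(batch_size, n_docs, dim).astype("float32")

# Document scores are inner products between each query vector and its
# retrieved document embeddings; this is how doc_scores is derived downstream.
doc_scores = np.einsum("bd,bnd->bn", question_hidden_states, retrieved_doc_embeds)
assert doc_scores.shape == (batch_size, n_docs)
```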
RagModel
class transformers.RagModel
< source >( config: transformers.configuration_utils.PreTrainedConfig | None = None question_encoder: transformers.modeling_utils.PreTrainedModel | None = None generator: transformers.modeling_utils.PreTrainedModel | None = None retriever: transformers.models.rag.retrieval_rag.RagRetriever | None = None **kwargs )
Parameters

- config (PreTrainedConfig, optional) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- question_encoder (PreTrainedModel, optional) — The model used to encode the question into hidden states for retrieval.
- generator (PreTrainedModel, optional) — The model that generates text conditioned on the retrieved documents.
- retriever (RagRetriever, optional) — The component in charge of retrieving documents from a knowledge base given the encoded question.

The bare RAG model outputting raw hidden states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None encoder_outputs: tuple[tuple[torch.FloatTensor]] | None = None decoder_input_ids: torch.LongTensor | None = None decoder_attention_mask: torch.BoolTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None doc_scores: torch.FloatTensor | None = None context_input_ids: torch.LongTensor | None = None context_attention_mask: torch.LongTensor | None = None use_cache: bool | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None output_retrieved: bool | None = None n_docs: int | None = None **kwargs ) → transformers.models.rag.modeling_rag.RetrievAugLMOutput or tuple(torch.FloatTensor)
Parameters

- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. The RagConfig used to initialize the model specifies which generator to use and also specifies a compatible generator tokenizer. Use that tokenizer class to obtain the indices.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of (generator_enc_last_hidden_state, optional: generator_enc_hidden_states, optional: generator_enc_attentions). generator_enc_last_hidden_state of shape (batch_size, n_docs * sequence_length, hidden_size) is a sequence of hidden states at the output of the last layer of the generator's encoder. Used by the RagModel during decoding.
- decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary, provided for generation tasks. None by default; construct them as appropriate for the generator model you are using with your RAG instance. Indices can be obtained with AutoTokenizer; see PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- decoder_attention_mask (torch.BoolTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. A causal mask will also be used by default.
- past_key_values (~cache_utils.Cache, optional) — Precomputed hidden states (keys and values in the self-attention and cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at an earlier stage of decoding, when use_cache=True or config.use_cache=True. Only Cache instances are allowed as input; see the kv cache guide. If no past_key_values are passed, a DynamicCache is initialized by default, and the model outputs the same cache format as the input. If past_key_values are used, the user is expected to input only the unprocessed input_ids (those whose past key value states were not given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- doc_scores (torch.FloatTensor of shape (batch_size, config.n_docs)) — Score between each retrieved document embedding (see retrieved_doc_embeds) and question_encoder_last_hidden_state. If the model is not initialized with a retriever, doc_scores has to be provided to the forward pass. doc_scores can be computed from question_encoder_last_hidden_state and retrieved_doc_embeds; see the examples for more information.
- context_input_ids (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Input IDs post-processed from the retrieved documents and the question encoder input_ids by the retriever. If the model is not initialized with a retriever, context_input_ids has to be provided to the forward pass; they are returned by __call__().
- context_attention_mask (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Attention mask post-processed from the retrieved documents and the question encoder input_ids by the retriever. If the model is not initialized with a retriever, context_attention_mask has to be provided to the forward pass; it is returned by __call__().
- use_cache (bool, optional) — If set to True, past_key_values key/value states are returned and can be used to speed up decoding.
- output_attentions (bool, optional) — Whether to return the attention tensors of all attention layers. See attentions under the returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether to return the hidden states of all layers. See hidden_states under the returned tensors for more detail.
- output_retrieved (bool, optional) — Whether to return retrieved_doc_embeds, retrieved_doc_ids, context_input_ids and context_attention_mask. See the returned tensors for more detail.
- n_docs (int, optional) — The number of documents to retrieve.
Returns

transformers.models.rag.modeling_rag.RetrievAugLMOutput or tuple(torch.FloatTensor)

A transformers.models.rag.modeling_rag.RetrievAugLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False), comprising various elements depending on the configuration (RagConfig) and inputs.

- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head. The score is possibly marginalized over all documents for each vocabulary token.
- doc_scores (torch.FloatTensor of shape (batch_size, config.n_docs)) — Score between each retrieved document embedding (see retrieved_doc_embeds) and question_encoder_last_hidden_state.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A Cache instance; for more details, see the kv cache guide. Contains precomputed hidden states (keys and values in the attention blocks) of the decoder that can be used (see the past_key_values input) to speed up sequential decoding.
- retrieved_doc_embeds (torch.FloatTensor of shape (batch_size, config.n_docs, hidden_size), optional, returned when output_retrieved=True) — Embedded documents retrieved by the retriever. Used with question_encoder_last_hidden_state to compute the doc_scores.
- retrieved_doc_ids (torch.LongTensor of shape (batch_size, config.n_docs), optional, returned when output_retrieved=True) — The indexes of the embedded documents retrieved by the retriever.
- context_input_ids (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Input IDs post-processed from the retrieved documents and the question encoder input_ids by the retriever.
- context_attention_mask (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, returned when output_retrieved=True) — Attention mask post-processed from the retrieved documents and the question encoder input_ids by the retriever.
- question_encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the question encoder pooled output of the model.
- question_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the question encoder at the output of each layer plus the initial embedding outputs.
- question_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the question encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- generator_enc_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden states at the output of the last layer of the generator encoder of the model.
- generator_enc_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the generator encoder at the output of each layer plus the initial embedding outputs.
- generator_enc_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the generator encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- generator_dec_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the generator decoder at the output of each layer plus the initial embedding outputs.
- generator_dec_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the generator decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- generator_cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross-attention weights of the generator decoder, after the attention softmax, used to compute the weighted average in the cross-attention heads.
The RagModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example
>>> from transformers import AutoTokenizer, RagRetriever, RagModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-base")
>>> retriever = RagRetriever.from_pretrained(
... "facebook/rag-token-base", index_name="exact", use_dummy_dataset=True
... )
>>> # initialize with RagRetriever to do everything in one forward call
>>> model = RagModel.from_pretrained("facebook/rag-token-base", retriever=retriever)
>>> inputs = tokenizer("How many people live in Paris?", return_tensors="pt")
>>> outputs = model(input_ids=inputs["input_ids"])

RagSequenceForGeneration
class transformers.RagSequenceForGeneration
< source >( config: transformers.configuration_utils.PreTrainedConfig | None = None question_encoder: transformers.modeling_utils.PreTrainedModel | None = None generator: transformers.modeling_utils.PreTrainedModel | None = None retriever: transformers.models.rag.retrieval_rag.RagRetriever | None = None **kwargs )
Parameters

- config (PreTrainedConfig, optional) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- question_encoder (PreTrainedModel, optional) — The model used to encode the question into hidden states for retrieval.
- generator (PreTrainedModel, optional) — The model that generates text conditioned on the retrieved documents.
- retriever (RagRetriever, optional) — The component in charge of retrieving documents from a knowledge base given the encoded question.

The RAG-sequence model implementation. It performs RAG-sequence specific marginalization in the forward pass.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
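Numerically, the RAG-sequence marginalization scores each candidate answer once per retrieved document and combines the per-document sequence log-likelihoods with the log-softmaxed document scores via logsumexp. The toy numpy sketch below illustrates the idea; it is not the model code, and the shapes and values are illustrative:

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log-sum-exp reduction.
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

n_docs, seq_len = 3, 4
doc_scores = np.array([2.0, 1.0, 0.5])                    # unnormalized retrieval scores
token_logprobs = np.log(np.full((n_docs, seq_len), 0.5))  # per-doc log p(y_t | x, z, y_<t)

log_p_doc = doc_scores - logsumexp(doc_scores, axis=0)    # log-softmax over documents
seq_logprob = token_logprobs.sum(axis=1)                  # sequence log-likelihood per document
log_p_y = logsumexp(log_p_doc + seq_logprob, axis=0)      # marginalize over documents
```

Because every token probability is 0.5 here, each per-document sequence probability is 0.5 ** 4 = 0.0625, and the marginal probability is 0.0625 regardless of the document weights.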
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None encoder_outputs: tuple[tuple[torch.Tensor]] | None = None decoder_input_ids: torch.LongTensor | None = None decoder_attention_mask: torch.BoolTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None context_input_ids: torch.LongTensor | None = None context_attention_mask: torch.LongTensor | None = None doc_scores: torch.FloatTensor | None = None use_cache: bool | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None output_retrieved: bool | None = None exclude_bos_score: bool | None = None reduce_loss: bool | None = None labels: torch.LongTensor | None = None n_docs: int | None = None **kwargs ) → transformers.models.rag.modeling_rag.RetrievAugLMMarginOutput or tuple(torch.FloatTensor)
参数
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — 词汇表中输入序列 token 的索引。用于初始化模型的 RagConfig 指定要使用的生成器,同时也指定一个兼容的生成器分词器;使用该分词器类来获取索引。
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — 避免在 padding token 索引上执行 attention 的掩码。掩码值选择在 [0, 1] 中:1 表示未被掩码的 token,0 表示被掩码的 token。
- encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — 由 (generator_enc_last_hidden_state, optional: generator_enc_hidden_states, optional: generator_enc_attentions) 组成的元组。形状为 (batch_size, n_docs * sequence_length, hidden_size) 的 generator_enc_last_hidden_state 是生成器编码器最后一层输出的隐藏状态序列。由 RagModel 在解码过程中使用。
- decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — 词汇表中解码器输入序列 token 的索引,为生成任务提供。默认值为 None,请按您与 RAG 实例一起使用的生成器模型的说明进行构建。可以使用 AutoTokenizer 获取索引。有关详细信息,请参阅 PreTrainedTokenizer.encode() 和 PreTrainedTokenizer.call()。
- decoder_attention_mask (torch.BoolTensor of shape (batch_size, target_sequence_length), optional) — 默认行为:生成一个忽略 decoder_input_ids 中 pad token 的张量。默认也会使用因果掩码。
- past_key_values (~cache_utils.Cache, optional) — 预计算的隐藏状态(自注意力块和交叉注意力块中的键和值),可用于加速顺序解码。这通常由模型在解码早期阶段、当 use_cache=True 或 config.use_cache=True 时返回的 past_key_values 组成。只允许 Cache 实例作为输入,请参阅我们的 kv cache 指南。如果不传递 past_key_values,默认将初始化 DynamicCache。模型将输出与输入相同的缓存格式。如果使用 past_key_values,用户只需输入未处理的 input_ids(即尚未将其过去键值状态传递给该模型的 input_ids),形状为 (batch_size, unprocessed_length),而不是形状为 (batch_size, sequence_length) 的所有 input_ids。
- context_input_ids (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 后处理得到的输入 ID。如果模型未用 retriever 初始化,则必须在 forward pass 中提供 context_input_ids。context_input_ids 由 __call__() 返回。
- context_attention_mask (torch.LongTensor of shape (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 后处理得到的 attention mask。如果模型未用 retriever 初始化,则必须在 forward pass 中提供 context_attention_mask。context_attention_mask 由 __call__() 返回。
- doc_scores (torch.FloatTensor of shape (batch_size, config.n_docs)) — 每个检索到的文档嵌入(请参阅 retrieved_doc_embeds)与 question_encoder_last_hidden_state 之间的分数。如果模型未用 retriever 初始化,则必须在 forward pass 中提供 doc_scores。doc_scores 可以通过 question_encoder_last_hidden_state 和 retrieved_doc_embeds 计算,请参阅示例了解更多信息。
- use_cache (bool, optional) — 如果设置为 True,则返回 past_key_values 键值状态,可用于加速解码(参见 past_key_values)。
- output_attentions (bool, optional) — 是否返回所有注意力层的注意力张量。有关更多详细信息,请参阅返回张量下的 attentions。
- output_hidden_states (bool, optional) — 是否返回所有层的隐藏状态。有关更多详细信息,请参阅返回张量下的 hidden_states。
- output_retrieved (bool, optional) — 是否返回 retrieved_doc_embeds、retrieved_doc_ids、context_input_ids 和 context_attention_mask。有关更多详细信息,请参阅返回张量。
- exclude_bos_score (bool, optional) — 仅当传递了 labels 时才相关。如果为 True,则在计算损失时忽略 BOS token 的分数。
- reduce_loss (bool, optional) — 仅当传递了 labels 时才相关。如果为 True,则使用 torch.Tensor.sum 操作来归约 NLL 损失。
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — 用于计算掩码语言建模损失的标签。索引应在 [0, ..., config.vocab_size] 范围内或为 -100(请参阅 input_ids 的文档字符串)。索引设置为 -100 的 token 将被忽略(掩码),仅为标签在 [0, ..., config.vocab_size] 范围内的 token 计算损失。
- n_docs (int, optional) — 要检索的文档数量。
返回
transformers.models.rag.modeling_rag.RetrievAugLMMarginOutput 或 tuple(torch.FloatTensor)
一个 transformers.models.rag.modeling_rag.RetrievAugLMMarginOutput 或一个 torch.FloatTensor 元组(如果传递了 return_dict=False 或当 config.return_dict=False 时),具体取决于配置(RagConfig)和输入。
- loss (torch.FloatTensor, 形状为 (1,), optional, 当提供 labels 时返回) — 语言建模损失。
- logits (torch.FloatTensor, 形状为 (batch_size, sequence_length, config.vocab_size)) — 语言建模头的预测分数。该分数可能已对所有文档进行了边际化。
- doc_scores (torch.FloatTensor, 形状为 (batch_size, config.n_docs)) — 每个检索到的文档嵌入(请参阅 retrieved_doc_embeds)与 question_encoder_last_hidden_state 之间的分数。
- past_key_values (Cache, optional, 当传递 use_cache=True 或 config.use_cache=True 时返回) — Cache 实例,更多详情请参阅我们的 kv cache 指南。包含解码器的预计算隐藏状态(注意力块中的键和值),可用于加速顺序解码(参见 past_key_values 输入)。
- retrieved_doc_embeds (torch.FloatTensor, 形状为 (batch_size, config.n_docs, hidden_size), optional, 当 output_retrieved=True 时返回) — 由检索器检索到的文档嵌入。与 question_encoder_last_hidden_state 一起用于计算 doc_scores。
- retrieved_doc_ids (torch.LongTensor, 形状为 (batch_size, config.n_docs), optional, 当 output_retrieved=True 时返回) — 由检索器检索到的文档的索引。
- context_input_ids (torch.LongTensor, 形状为 (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 后处理得到的输入 ID。
- context_attention_mask (torch.LongTensor, 形状为 (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 后处理得到的 attention mask。
- question_encoder_last_hidden_state (torch.FloatTensor, 形状为 (batch_size, sequence_length, hidden_size), optional) — 问题编码器最后一层输出的隐藏状态序列(模型的池化输出)。
- question_enc_hidden_states (tuple(torch.FloatTensor), optional, 当传递 output_hidden_states=True 或 config.output_hidden_states=True 时返回) — 形状为 (batch_size, sequence_length, hidden_size) 的 torch.FloatTensor 元组(一个用于嵌入输出,一个用于每层输出)。问题编码器在每层输出处以及初始嵌入输出处的隐藏状态。
- question_enc_attentions (tuple(torch.FloatTensor), optional, 当传递 output_attentions=True 或 config.output_attentions=True 时返回) — 形状为 (batch_size, num_heads, sequence_length, sequence_length) 的 torch.FloatTensor 元组(每层一个)。问题编码器在 attention softmax 之后的 attention 权重,用于在自注意力头中计算加权平均。
- generator_enc_last_hidden_state (torch.FloatTensor, 形状为 (batch_size, sequence_length, hidden_size), optional) — 生成器编码器最后一层输出的隐藏状态序列。
- generator_enc_hidden_states (tuple(torch.FloatTensor), optional, 当传递 output_hidden_states=True 或 config.output_hidden_states=True 时返回) — 形状为 (batch_size, sequence_length, hidden_size) 的 torch.FloatTensor 元组(一个用于嵌入输出,一个用于每层输出)。生成器编码器在每层输出处以及初始嵌入输出处的隐藏状态。
- generator_enc_attentions (tuple(torch.FloatTensor), optional, 当传递 output_attentions=True 或 config.output_attentions=True 时返回) — 形状为 (batch_size, num_heads, sequence_length, sequence_length) 的 torch.FloatTensor 元组(每层一个)。生成器编码器在 attention softmax 之后的 attention 权重,用于在自注意力头中计算加权平均。
- generator_dec_hidden_states (tuple(torch.FloatTensor), optional, 当传递 output_hidden_states=True 或 config.output_hidden_states=True 时返回) — 形状为 (batch_size, sequence_length, hidden_size) 的 torch.FloatTensor 元组(一个用于嵌入输出,一个用于每层输出)。生成器解码器在每层输出处以及初始嵌入输出处的隐藏状态。
- generator_dec_attentions (tuple(torch.FloatTensor), optional, 当传递 output_attentions=True 或 config.output_attentions=True 时返回) — 形状为 (batch_size, num_heads, sequence_length, sequence_length) 的 torch.FloatTensor 元组(每层一个)。生成器解码器在 attention softmax 之后的 attention 权重,用于在自注意力头中计算加权平均。
- generator_cross_attentions (tuple(torch.FloatTensor), optional, 当传递 output_attentions=True 或 config.output_attentions=True 时返回) — 形状为 (batch_size, num_heads, sequence_length, sequence_length) 的 torch.FloatTensor 元组(每层一个)。生成器解码器在 attention softmax 之后的交叉 attention 权重,用于在交叉注意力头中计算加权平均。
transformers.RagSequenceForGeneration 的 forward 方法,覆盖了 __call__ 特殊方法。
虽然 forward pass 的计算流程需要在此函数中定义,但之后应调用 Module 实例而非直接调用此函数,因为前者会负责运行预处理和后处理步骤,而后者会静默地忽略它们。
示例
>>> from transformers import AutoTokenizer, RagRetriever, RagSequenceForGeneration
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/rag-sequence-nq")
>>> retriever = RagRetriever.from_pretrained(
... "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
... )
>>> # initialize with RagRetriever to do everything in one forward call
>>> model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)
>>> inputs = tokenizer("How many people live in Paris?", return_tensors="pt")
>>> targets = tokenizer(text_target="In Paris, there are 10 million people.", return_tensors="pt")
>>> input_ids = inputs["input_ids"]
>>> labels = targets["input_ids"]
>>> outputs = model(input_ids=input_ids, labels=labels)
>>> # or use retriever separately
>>> model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", use_dummy_dataset=True)
>>> # 1. Encode
>>> question_hidden_states = model.question_encoder(input_ids)[0]
>>> # 2. Retrieve
>>> docs_dict = retriever(input_ids.numpy(), question_hidden_states.detach().numpy(), return_tensors="pt")
>>> doc_scores = torch.bmm(
... question_hidden_states.unsqueeze(1), docs_dict["retrieved_doc_embeds"].float().transpose(1, 2)
... ).squeeze(1)
>>> # 3. Forward to generator
>>> outputs = model(
... context_input_ids=docs_dict["context_input_ids"],
... context_attention_mask=docs_dict["context_attention_mask"],
... doc_scores=doc_scores,
... decoder_input_ids=labels,
... )
生成
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.LongTensor | None = None context_input_ids: torch.LongTensor | None = None context_attention_mask: torch.LongTensor | None = None doc_scores: torch.FloatTensor | None = None do_deduplication: bool | None = None num_return_sequences: int | None = None num_beams: int | None = None n_docs: int | None = None **model_kwargs ) → torch.LongTensor, shape (batch_size * num_return_sequences, sequence_length)
参数
- input_ids (torch.LongTensor, 形状为 (batch_size, sequence_length), optional) — 作为生成任务提示的序列。如果未传递 input_ids,则必须提供 context_input_ids。
- attention_mask (torch.Tensor, 形状为 (batch_size, sequence_length), optional) — 用于避免对填充标记索引执行注意力的掩码。掩码值选择在 [0, 1] 中:1 表示未掩码的标记,0 表示已掩码的标记。
- context_input_ids (torch.LongTensor, 形状为 (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 后处理得到的输入 ID。
- context_attention_mask (torch.LongTensor, 形状为 (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 后处理得到的注意力掩码。如果模型未用 retriever 初始化,或未给出 input_ids,则必须向前向传递提供 context_input_ids 和 context_attention_mask。它们由 __call__() 返回。
- doc_scores (torch.FloatTensor, 形状为 (batch_size, config.n_docs)) — 检索到的文档嵌入(请参阅 retrieved_doc_embeds)与 question_encoder_last_hidden_state 之间的分数。如果模型未用 retriever 初始化,或未给出 input_ids,则必须向前向传递提供 doc_scores。doc_scores 由 __call__() 返回。
- do_deduplication (bool, optional) — 是否对来自同一输入、基于不同上下文文档的生成结果进行去重。在使用分布式后端进行训练时,必须设置为 False。
- num_return_sequences (int, optional, 默认为 1) — 批次中每个元素独立计算的返回序列数。请注意,这不是我们传递给 generator 的 [generate()](/docs/transformers/v5.1.0/en/main_classes/text_generation#transformers.GenerationMixin.generate) 函数的值;在那里我们会将 num_return_sequences 设置为 num_beams。
- num_beams (int, optional, 默认为 1) — 用于束搜索的束数。1 表示不进行束搜索。
- n_docs (int, optional, 默认为 config.n_docs) — 要检索的文档数量和/或要为其生成答案的文档数量。
- kwargs (dict[str, Any], optional) — 额外的 kwargs 将被传递给 generate()。
返回
torch.LongTensor, shape (batch_size * num_return_sequences, sequence_length)
生成的序列。第二个维度(序列长度)等于 max_length,或者如果所有批次由于 eos_token_id 而提前完成,则会更短。
实现了 RAG 序列“详尽”解码。有关如何设置其他生成输入参数的信息,请参阅 generate() 文档。
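“详尽”解码的大致思路是:先在每个检索文档的条件下分别生成候选答案,再对每个候选在所有文档上重新打分并做边际化,最后选取边际似然最高的候选。下面是候选选择这一步的最小示意(分数均为随机占位值,仅说明思路,并非库的内部实现):

```python
import torch

# 假设:4 个候选序列、5 个检索文档
n_candidates, n_docs = 4, 5
# 每个候选序列在每个文档条件下的对数似然(占位值)
cand_ll = torch.randn(n_candidates, n_docs)
# log_softmax 后的文档分数,作为文档先验的对数概率
doc_logprobs = torch.log_softmax(torch.randn(n_docs), dim=-1)

# 对文档维度做 logsumexp 边际化,再取边际似然最高的候选
marginal = torch.logsumexp(cand_ll + doc_logprobs, dim=-1)  # (n_candidates,)
best = int(marginal.argmax())
```

实际实现中,候选序列来自对每个文档条件下的束搜索,`num_return_sequences` 个最佳候选按上述边际分数排序后返回。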
RagTokenForGeneration
class transformers.RagTokenForGeneration
< source >( config: transformers.configuration_utils.PreTrainedConfig | None = None question_encoder: transformers.modeling_utils.PreTrainedModel | None = None generator: transformers.modeling_utils.PreTrainedModel | None = None retriever: transformers.models.rag.retrieval_rag.RagRetriever | None = None **kwargs )
参数
- config (PreTrainedConfig, optional) — 模型的配置类,包含模型的所有参数。使用 config 文件初始化不会加载与模型相关的权重,只会加载配置。请查看 from_pretrained() 方法以加载模型权重。
- question_encoder (
PreTrainedModel, optional) — 负责将问题编码为用于检索的隐藏状态的模型。 - generator (
PreTrainedModel, optional) — 负责根据检索到的文档生成文本的模型。 - retriever (
RagRetriever, optional) — 负责根据编码的问题从知识库中检索文档的组件。
RAG-token 模型实现。它在前向传递中执行 RAG-token 特定的边际化。
此模型继承自 PreTrainedModel。查看其父类文档,了解库为所有模型实现的通用方法(例如下载或保存、调整输入嵌入大小、修剪头等)。
此模型也是一个 PyTorch torch.nn.Module 子类。像普通的 PyTorch Module 一样使用它,并参考 PyTorch 文档了解一般用法和行为的所有相关信息。
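与 RAG-sequence 在序列级做边际化不同,RAG-token 在每个解码步上对文档维度做边际化。下面是一个最小示意(张量形状与数值均为假设的占位值,并非库的内部实现):

```python
import torch
import torch.nn.functional as F

# 假设的形状,仅用于演示(非库内部实现)
batch_size, n_docs, tgt_len, vocab = 2, 5, 7, 100
seq_logits = torch.randn(batch_size * n_docs, tgt_len, vocab)  # 每个 (样本, 文档) 对的生成器 logits
doc_scores = torch.randn(batch_size, n_docs)                   # 检索到的文档分数

# 每个文档条件下的逐 token 对数概率
log_probs = F.log_softmax(seq_logits, dim=-1).view(batch_size, n_docs, tgt_len, vocab)
# 文档先验的对数概率,广播到 (batch, n_docs, 1, 1)
doc_logprobs = F.log_softmax(doc_scores, dim=-1)[:, :, None, None]
# RAG-token:在每个解码步上对文档维度做 logsumexp 边际化
marginalized = torch.logsumexp(log_probs + doc_logprobs, dim=1)  # (batch, tgt_len, vocab)
```

边际化后的每个位置仍是词汇表上的合法分布(概率和为 1),因此可以直接用于标准的逐 token 束搜索,这正是 `do_marginalize=True` 时 forward 返回的 logits 的含义。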
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.FloatTensor | None = None encoder_outputs: tuple[tuple[torch.Tensor]] | None = None decoder_input_ids: torch.LongTensor | None = None decoder_attention_mask: torch.BoolTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None context_input_ids: torch.LongTensor | None = None context_attention_mask: torch.LongTensor | None = None doc_scores: torch.FloatTensor | None = None use_cache: bool | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None output_retrieved: bool | None = None do_marginalize: bool | None = None reduce_loss: bool | None = None labels: torch.LongTensor | None = None n_docs: int | None = None **kwargs ) → transformers.models.rag.modeling_rag.RetrievAugLMMarginOutput 或 tuple(torch.FloatTensor)
参数
- input_ids (torch.LongTensor, 形状为 (batch_size, sequence_length)) — 词汇表中输入序列标记的索引。用于初始化模型的 RagConfig 指定要使用的生成器,同时也指定一个兼容的生成器分词器;使用该分词器类来获取索引。
- attention_mask (torch.FloatTensor, 形状为 (batch_size, sequence_length), optional) — 用于避免对填充标记索引执行注意力的掩码。掩码值选择在 [0, 1] 中:1 表示未掩码的标记,0 表示已掩码的标记。
- encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — 由 (generator_enc_last_hidden_state, 可选: generator_enc_hidden_states, 可选: generator_enc_attentions) 组成的元组。形状为 (batch_size, n_docs * sequence_length, hidden_size) 的 generator_enc_last_hidden_state 是生成器编码器最后一层的隐藏状态序列。在解码过程中由 RagModel 使用。
- decoder_input_ids (torch.LongTensor, 形状为 (batch_size, target_sequence_length), optional) — 词汇表中解码器输入序列标记的索引,为生成任务提供。默认值为 None,请按您与 RAG 实例一起使用的生成器模型的说明进行构建。可以使用 AutoTokenizer 获取索引。有关详细信息,请参阅 PreTrainedTokenizer.encode() 和 PreTrainedTokenizer.call()。
- decoder_attention_mask (torch.BoolTensor, 形状为 (batch_size, target_sequence_length), optional) — 默认行为:生成一个忽略 decoder_input_ids 中填充标记的张量。默认也会使用因果掩码。
- past_key_values (~cache_utils.Cache, optional) — 预计算的隐藏状态(自注意力块和交叉注意力块中的键和值),可用于加速顺序解码。这通常由模型在解码早期阶段、当 use_cache=True 或 config.use_cache=True 时返回的 past_key_values 组成。只允许 Cache 实例作为输入,请参阅我们的 kv cache 指南。如果不传递 past_key_values,默认将初始化 DynamicCache。模型将输出与输入相同的缓存格式。如果使用 past_key_values,用户只需输入未处理的 input_ids(即尚未将其过去键值状态传递给此模型的 input_ids),形状为 (batch_size, unprocessed_length),而不是形状为 (batch_size, sequence_length) 的所有 input_ids。
- context_input_ids (torch.LongTensor, 形状为 (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 后处理得到的输入 ID。如果模型未用 retriever 初始化,则必须在前向传递中提供 context_input_ids。context_input_ids 由 __call__() 返回。
- context_attention_mask (torch.LongTensor, 形状为 (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 后处理得到的注意力掩码。如果模型未用 retriever 初始化,则必须在前向传递中提供 context_attention_mask。context_attention_mask 由 __call__() 返回。
- doc_scores (torch.FloatTensor, 形状为 (batch_size, config.n_docs)) — 每个检索到的文档嵌入(请参阅 retrieved_doc_embeds)与 question_encoder_last_hidden_state 之间的分数。如果模型未用 retriever 初始化,则必须在前向传递中提供 doc_scores。doc_scores 可以通过 question_encoder_last_hidden_state 和 retrieved_doc_embeds 计算,请参阅示例以获取更多信息。
- use_cache (bool, optional) — 如果设置为 True,则返回 past_key_values 键值状态,可用于加速解码(参见 past_key_values)。
- output_attentions (bool, optional) — 是否返回所有注意力层的注意力张量。有关更多详细信息,请参阅返回张量下的 attentions。
- output_hidden_states (bool, optional) — 是否返回所有层的隐藏状态。有关更多详细信息,请参阅返回张量下的 hidden_states。
- output_retrieved (bool, optional) — 是否返回 retrieved_doc_embeds、retrieved_doc_ids、context_input_ids 和 context_attention_mask。有关更多详细信息,请参阅返回张量。
- do_marginalize (bool, optional) — 如果为 True,则使用 torch.nn.functional.log_softmax 对所有文档进行边际化以获得 logits。
- reduce_loss (bool, optional) — 仅当传递了 labels 时才相关。如果为 True,则使用 torch.Tensor.sum 操作来归约 NLL 损失。
- labels (torch.LongTensor, 形状为 (batch_size, sequence_length), optional) — 用于计算掩码语言建模损失的标签。索引应在 [0, ..., config.vocab_size] 范围内或为 -100(请参阅 input_ids 的文档字符串)。索引设置为 -100 的标记将被忽略(掩码),仅为标签在 [0, ..., config.vocab_size] 范围内的标记计算损失。
- n_docs (int, optional) — 要检索的文档数量。
返回
transformers.models.rag.modeling_rag.RetrievAugLMMarginOutput 或 tuple(torch.FloatTensor)
一个 transformers.models.rag.modeling_rag.RetrievAugLMMarginOutput 或一个 torch.FloatTensor 元组(如果传递了 return_dict=False 或当 config.return_dict=False 时),具体取决于配置(RagConfig)和输入。
- loss (torch.FloatTensor, 形状为 (1,), optional, 当提供 labels 时返回) — 语言建模损失。
- logits (torch.FloatTensor, 形状为 (batch_size, sequence_length, config.vocab_size)) — 语言建模头的预测分数。该分数可能已对所有文档进行了边际化。
- doc_scores (torch.FloatTensor, 形状为 (batch_size, config.n_docs)) — 每个检索到的文档嵌入(请参阅 retrieved_doc_embeds)与 question_encoder_last_hidden_state 之间的分数。
- past_key_values (Cache, optional, 当传递 use_cache=True 或 config.use_cache=True 时返回) — Cache 实例,更多详情请参阅我们的 kv cache 指南。包含解码器的预计算隐藏状态(注意力块中的键和值),可用于加速顺序解码(参见 past_key_values 输入)。
- retrieved_doc_embeds (torch.FloatTensor, 形状为 (batch_size, config.n_docs, hidden_size), optional, 当 output_retrieved=True 时返回) — 由检索器检索到的文档嵌入。与 question_encoder_last_hidden_state 一起用于计算 doc_scores。
- retrieved_doc_ids (torch.LongTensor, 形状为 (batch_size, config.n_docs), optional, 当 output_retrieved=True 时返回) — 由检索器检索到的文档的索引。
- context_input_ids (torch.LongTensor, 形状为 (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 后处理得到的输入 ID。
- context_attention_mask (torch.LongTensor, 形状为 (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 后处理得到的 attention mask。
- question_encoder_last_hidden_state (torch.FloatTensor, 形状为 (batch_size, sequence_length, hidden_size), optional) — 问题编码器最后一层输出的隐藏状态序列(模型的池化输出)。
- question_enc_hidden_states (tuple(torch.FloatTensor), optional, 当传递 output_hidden_states=True 或 config.output_hidden_states=True 时返回) — 形状为 (batch_size, sequence_length, hidden_size) 的 torch.FloatTensor 元组(一个用于嵌入输出,一个用于每层输出)。问题编码器在每层输出处以及初始嵌入输出处的隐藏状态。
- question_enc_attentions (tuple(torch.FloatTensor), optional, 当传递 output_attentions=True 或 config.output_attentions=True 时返回) — 形状为 (batch_size, num_heads, sequence_length, sequence_length) 的 torch.FloatTensor 元组(每层一个)。问题编码器在 attention softmax 之后的 attention 权重,用于在自注意力头中计算加权平均。
- generator_enc_last_hidden_state (torch.FloatTensor, 形状为 (batch_size, sequence_length, hidden_size), optional) — 生成器编码器最后一层输出的隐藏状态序列。
- generator_enc_hidden_states (tuple(torch.FloatTensor), optional, 当传递 output_hidden_states=True 或 config.output_hidden_states=True 时返回) — 形状为 (batch_size, sequence_length, hidden_size) 的 torch.FloatTensor 元组(一个用于嵌入输出,一个用于每层输出)。生成器编码器在每层输出处以及初始嵌入输出处的隐藏状态。
- generator_enc_attentions (tuple(torch.FloatTensor), optional, 当传递 output_attentions=True 或 config.output_attentions=True 时返回) — 形状为 (batch_size, num_heads, sequence_length, sequence_length) 的 torch.FloatTensor 元组(每层一个)。生成器编码器在 attention softmax 之后的 attention 权重,用于在自注意力头中计算加权平均。
- generator_dec_hidden_states (tuple(torch.FloatTensor), optional, 当传递 output_hidden_states=True 或 config.output_hidden_states=True 时返回) — 形状为 (batch_size, sequence_length, hidden_size) 的 torch.FloatTensor 元组(一个用于嵌入输出,一个用于每层输出)。生成器解码器在每层输出处以及初始嵌入输出处的隐藏状态。
- generator_dec_attentions (tuple(torch.FloatTensor), optional, 当传递 output_attentions=True 或 config.output_attentions=True 时返回) — 形状为 (batch_size, num_heads, sequence_length, sequence_length) 的 torch.FloatTensor 元组(每层一个)。生成器解码器在 attention softmax 之后的 attention 权重,用于在自注意力头中计算加权平均。
- generator_cross_attentions (tuple(torch.FloatTensor), optional, 当传递 output_attentions=True 或 config.output_attentions=True 时返回) — 形状为 (batch_size, num_heads, sequence_length, sequence_length) 的 torch.FloatTensor 元组(每层一个)。生成器解码器在 attention softmax 之后的交叉 attention 权重,用于在交叉注意力头中计算加权平均。
transformers.RagTokenForGeneration 的 forward 方法,覆盖了 __call__ 特殊方法。
虽然 forward pass 的计算流程需要在此函数中定义,但之后应调用 Module 实例而非直接调用此函数,因为前者会负责运行预处理和后处理步骤,而后者会静默地忽略它们。
示例
>>> from transformers import AutoTokenizer, RagRetriever, RagTokenForGeneration
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-nq")
>>> retriever = RagRetriever.from_pretrained(
... "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
... )
>>> # initialize with RagRetriever to do everything in one forward call
>>> model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)
>>> inputs = tokenizer("How many people live in Paris?", return_tensors="pt")
>>> targets = tokenizer(text_target="In Paris, there are 10 million people.", return_tensors="pt")
>>> input_ids = inputs["input_ids"]
>>> labels = targets["input_ids"]
>>> outputs = model(input_ids=input_ids, labels=labels)
>>> # or use retriever separately
>>> model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", use_dummy_dataset=True)
>>> # 1. Encode
>>> question_hidden_states = model.question_encoder(input_ids)[0]
>>> # 2. Retrieve
>>> docs_dict = retriever(input_ids.numpy(), question_hidden_states.detach().numpy(), return_tensors="pt")
>>> doc_scores = torch.bmm(
... question_hidden_states.unsqueeze(1), docs_dict["retrieved_doc_embeds"].float().transpose(1, 2)
... ).squeeze(1)
>>> # 3. Forward to generator
>>> outputs = model(
... context_input_ids=docs_dict["context_input_ids"],
... context_attention_mask=docs_dict["context_attention_mask"],
... doc_scores=doc_scores,
... decoder_input_ids=labels,
... )
>>> # or directly generate
>>> generated = model.generate(
... context_input_ids=docs_dict["context_input_ids"],
... context_attention_mask=docs_dict["context_attention_mask"],
... doc_scores=doc_scores,
... )
>>> generated_string = tokenizer.batch_decode(generated, skip_special_tokens=True)
生成
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.LongTensor | None = None context_input_ids: torch.LongTensor | None = None context_attention_mask: torch.LongTensor | None = None doc_scores: torch.FloatTensor | None = None n_docs: int | None = None generation_config: transformers.generation.configuration_utils.GenerationConfig | None = None prefix_allowed_tokens_fn: collections.abc.Callable[[int, torch.Tensor], list[int]] | None = None logits_processor: transformers.generation.logits_process.LogitsProcessorList | None = [] stopping_criteria: transformers.generation.stopping_criteria.StoppingCriteriaList | None = [] **kwargs ) → torch.LongTensor of shape (batch_size * num_return_sequences, sequence_length)
参数
- input_ids (torch.LongTensor, 形状为 (batch_size, sequence_length), optional) — 作为生成任务提示的序列。如果未传递 input_ids,则必须提供 context_input_ids。
- attention_mask (torch.Tensor, 形状为 (batch_size, sequence_length), optional) — 用于避免对填充标记索引执行注意力的掩码。掩码值选择在 [0, 1] 中:1 表示未掩码的标记,0 表示已掩码的标记。
- context_input_ids (torch.LongTensor, 形状为 (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 后处理得到的输入 ID。如果模型未用 retriever 初始化,则必须在前向传递中提供 context_input_ids。context_input_ids 由 __call__() 返回。
- context_attention_mask (torch.LongTensor, 形状为 (batch_size * config.n_docs, config.max_combined_length), optional, 当 output_retrieved=True 时返回) — 由检索器从检索到的文档和问题编码器 input_ids 后处理得到的注意力掩码。如果模型未用 retriever 初始化,则必须在前向传递中提供 context_attention_mask。context_attention_mask 由 __call__() 返回。
- doc_scores (torch.FloatTensor, 形状为 (batch_size, config.n_docs)) — 每个检索到的文档嵌入(请参阅 retrieved_doc_embeds)与 question_encoder_last_hidden_state 之间的分数。如果模型未用 retriever 初始化,则必须在前向传递中提供 doc_scores。doc_scores 由 __call__() 返回。
- n_docs (int, optional, 默认为 config.n_docs) — 要检索的文档数量和/或要为其生成答案的文档数量。
- generation_config (~generation.GenerationConfig, optional) — 用作本次生成调用基础参数化的生成配置。传递给 generate 且与 generation_config 属性匹配的 **kwargs 将覆盖这些属性。如果未提供 generation_config,将使用默认配置,其加载优先级如下:1)来自 generation_config.json 模型文件(如果存在);2)来自模型配置。请注意,未指定的参数将继承 GenerationConfig 的默认值,应查阅其文档以对生成进行参数化。
- prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], list[int]], optional) — 如果提供,该函数会在每一步将束搜索限制为仅允许的 token;如果不提供,则不施加约束。该函数接受 2 个参数:inputs_ids 和批次 ID batch_id,并且必须返回一个列表,其中包含以先前生成的 token inputs_ids 和批次 ID batch_id 为条件的下一生成步所允许的 token。该参数对于以前缀为条件的约束生成很有用,如 Autoregressive Entity Retrieval 中所述。
- logits_processor (LogitsProcessorList, optional) — 自定义 logits 处理器,用于补充由参数和模型配置构建的默认 logits 处理器。如果传入的 logits 处理器已由参数或模型配置创建,则会抛出错误。
- stopping_criteria (StoppingCriteriaList, optional) — 自定义停止标准,用于补充由参数和模型配置构建的默认停止标准。如果传入的停止标准已由参数或模型配置创建,则会抛出错误。
- kwargs (dict[str, Any], optional) — 对 generation_config 的临时参数化和/或将被转发到模型 forward 函数的其他模型特定 kwargs。
返回
torch.LongTensor, shape (batch_size * num_return_sequences, sequence_length)
生成的序列。第二个维度(sequence_length)等于 max_length;如果所有批次都因 eos_token_id 提前结束,则会更短。
实现 RAG token 解码。