Byte Latent Transformer (BLT)
This model was released on 2024-12-13 and added to Hugging Face Transformers on 2025-09-19.
Overview
The BLT model was proposed in Byte Latent Transformer: Patches Scale Better Than Tokens by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer. BLT is a byte-level LLM that achieves performance comparable to tokenization-based LLMs through entropy-based dynamic patching.
The abstract from the paper is the following:
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale, with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP-controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements in reasoning and long-tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
Usage tips
Dual-model architecture: BLT consists of two separately trained models
- Patcher (entropy model): a smaller transformer that predicts byte-level entropy, used to determine patch boundaries and segment the input.
- Main transformer: the primary model that processes the patches, composed of a Local Encoder, a Global Transformer, and a Local Decoder.
Dynamic patching: the model uses entropy-based dynamic patching, where
- high-entropy regions (complex data) get shorter patches and more computational attention
- low-entropy regions (predictable data) get longer patches for efficiency
- this lets the model allocate compute where it is needed most
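The boundary rule described above can be sketched in a few lines of plain Python. This is a minimal illustration of the idea, not the library's actual patcher; the 1.34 threshold mirrors the `patching_threshold` default documented below, and the toy entropy values are made up:

```python
import math

def byte_entropy(probs):
    # Shannon entropy (in nats) of a next-byte probability distribution,
    # as produced by the small entropy model for each position.
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_patches(entropies, threshold=1.34):
    # Start a new patch whenever the next-byte entropy crosses the
    # threshold, so hard-to-predict regions end up in shorter patches.
    lengths, current = [], 0
    for h in entropies:
        if h > threshold and current > 0:
            lengths.append(current)
            current = 0
        current += 1
    if current:
        lengths.append(current)
    return lengths

# Predictable runs (low entropy) form long patches; entropy spikes open new ones.
print(entropy_patches([0.2, 0.3, 2.0, 0.1, 0.2, 1.9, 0.4]))  # [2, 3, 2]
```

The patch lengths always sum to the number of bytes, which is the invariant the model relies on when regrouping byte embeddings into patches.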
Local Encoder: encodes byte sequences into patch embeddings using cross-attention
Global Transformer: processes patch-level representations with full attention
Local Decoder: generates output by cross-attending back to the original byte sequence
Byte-level tokenizer: unlike conventional tokenizers that use a learned vocabulary, BLT's tokenizer simply converts text to UTF-8 bytes and maps each byte to a token id. No vocabulary is required.
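A byte-level tokenizer of this kind can be sketched in pure Python. This is a hedged illustration, not the library's implementation: the offset of 4 reserved special-token ids is an assumption suggested by the config's vocab_size of 260 (256 byte values + 4 specials), not a detail confirmed here:

```python
OFFSET = 4  # assumed number of reserved special-token ids (260 - 256)

def bytes_to_ids(text: str) -> list[int]:
    # No learned vocabulary: UTF-8 encode and shift past the special ids.
    return [b + OFFSET for b in text.encode("utf-8")]

def ids_to_text(ids: list[int]) -> str:
    # Inverse mapping: shift back and decode the raw bytes.
    return bytes(i - OFFSET for i in ids).decode("utf-8")

print(bytes_to_ids("hi"))                   # [108, 109]
print(ids_to_text(bytes_to_ids("héllo")))   # héllo
```

Note that multi-byte UTF-8 characters map to multiple token ids, which is why byte-level sequences are longer than tokenized ones and why dynamic patching matters for efficiency.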
The model can be loaded as follows:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("itazap/blt-1b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "itazap/blt-1b-hf",
    device_map="auto",
)

prompt = "my name is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

NUM_TOKENS_TO_GENERATE = 20  # generation budget; pick any value
generated_ids = model.generate(
    **inputs, max_new_tokens=NUM_TOKENS_TO_GENERATE, do_sample=False, use_cache=False
)
print(tokenizer.decode(generated_ids[0]))

BltConfig
class transformers.BltConfig
< source >( vocab_size: int | None = 260 max_position_embeddings: int | None = 4096 patch_in_forward: bool | None = True patch_size: int | None = 4 patching_mode: str | None = 'entropy' patching_threshold: float | None = 1.335442066192627 patching_batch_size: int | None = 1 max_patch_length: int | None = None cross_attn_k: int | None = 2 encoder_hash_byte_group_size: int | None = None encoder_hash_byte_group_vocab: int | None = 500002 encoder_hash_byte_group_nb_functions: int | None = 1 patcher_config: dict | None = None encoder_config: dict | None = None decoder_config: dict | None = None global_config: dict | None = None tie_word_embeddings: bool | None = False pad_token_id: int | None = None bos_token_id: int | None = None eos_token_id: int | None = None initializer_range: float | None = 0.02 rope_parameters: transformers.modeling_rope_utils.RopeParameters | dict[str, transformers.modeling_rope_utils.RopeParameters] | None = None **kwargs )
Parameters
- vocab_size (int, optional, defaults to 260) — Vocabulary size of the Blt model. Defines the number of different tokens that can be represented by the input_ids passed when calling BltModel.
- max_position_embeddings (int, optional, defaults to 4096) — The maximum sequence length that this model might ever be used with.
- patch_in_forward (bool, optional, defaults to True) — Whether to perform patching during the forward pass.
- patch_size (int, optional, defaults to 4) — The size of the patches used in the patching mechanism.
- patching_mode (str, optional, defaults to "entropy") — The mode used for patching, e.g. entropy-based patching.
- patching_threshold (float, optional, defaults to 1.34) — The entropy threshold used to decide where to place patch boundaries.
- patching_batch_size (int, optional, defaults to 1) — The batch size used during patching.
- max_patch_length (int, optional) — The maximum length of the patches that can be generated.
- cross_attn_k (int, optional, defaults to 2) — The number of cross-attention heads used in the model.
- encoder_hash_byte_group_size (list, optional) — A list of byte group sizes used in the encoder hash functions.
- encoder_hash_byte_group_vocab (int, optional, defaults to 500002) — The vocabulary size for the encoder hash byte groups.
- encoder_hash_byte_group_nb_functions (int, optional, defaults to 1) — The number of hash functions used for encoder byte grouping.
- patcher_config (BltPatcherConfig, optional) — Configuration for the patcher component of the model.
- encoder_config (BltLocalEncoderConfig, optional) — Configuration for the local encoder component of the model.
- decoder_config (BltLocalDecoderConfig, optional) — Configuration for the local decoder component of the model.
- global_config (BltGlobalTransformerConfig, optional) — Configuration for the global transformer component of the model.
- tie_word_embeddings (bool, optional, defaults to False) — Whether to tie the word embeddings.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- rope_parameters (RopeParameters, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and, if you want to use RoPE with a longer max_position_embeddings, optional parameters used for scaling.
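As an illustration of the rope_parameters dictionary described above, a minimal value might look like the following. The key names besides rope_theta are assumptions for illustration, not defaults confirmed by this document:

```python
# Hypothetical RoPE parameter dictionary. Per the docs above, "rope_theta"
# is the one required entry; scaling-related entries are optional and their
# exact names should be checked against the RopeParameters definition.
rope_parameters = {
    "rope_theta": 10000.0,  # base frequency for the rotary embeddings
}
print(sorted(rope_parameters))  # ['rope_theta']
```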
This is the configuration class to store the configuration of a BltModel. It is used to instantiate a Blt model according to the specified arguments, defining the model architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
>>> from transformers import BltModel, BltConfig
>>> # Initializing a Blt configuration
>>> configuration = BltConfig()
>>> # Initializing a model from the configuration
>>> model = BltModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config

Checkpoint: facebook/blt
forward
< source >( input_ids: torch.LongTensor | None = None patch_lengths: torch.Tensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None use_cache: bool | None = None cache_position: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] )
BltForCausalLM
class transformers.BltForCausalLM
< source >( config: BltConfig )
Parameters
- config (BltConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The BLT text model with a language modeling head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None cross_attention_states: torch.LongTensor | None = None cross_attention_mask: torch.LongTensor | None = None full_text_row_masked_out_mask: tuple[torch.Tensor, torch.Tensor] | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None labels: torch.LongTensor | None = None use_cache: bool | None = None cache_position: torch.LongTensor | None = None logits_to_keep: int | torch.Tensor = 0 **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1].
- cross_attention_states (torch.FloatTensor, optional) — Output of the vision model, used for cross-attention. This tensor contains the processed image features that the language model will attend to.
- cross_attention_mask (torch.Tensor of shape (batch_size, seq_length, max_num_images, max_num_tiles), optional) — Cross-attention mask to control the interaction between text tokens and image tiles. This 4D tensor defines which image tiles each text token should attend to. For each text token (in seq_length):
  - 1 indicates the token should attend to the corresponding image tile
  - 0 indicates the token should not attend to the corresponding image tile
- full_text_row_masked_out_mask (tuple[torch.Tensor, torch.Tensor], optional) — A tuple containing two tensors that mask out rows in the cross-attention mechanism:
  - The first tensor has shape (batch_size, 1, seq_length, 1) and contains values of 0 or 1. A value of 0 indicates that the corresponding text token's entire row in the cross-attention matrix should be masked out (all image tokens ignored).
  - The second tensor has the same shape and is used internally to apply the masking during the forward pass of cross-attention layers. This mask is derived from the cross_attention_mask and is used to handle cases where a text token should not attend to any image token.
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- logits_to_keep (int or torch.Tensor, optional, defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
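The labels convention described above (a shift-by-one next-token loss in which -100 positions are ignored) can be sketched without any framework code. This is an illustrative re-implementation of the idea, not the library's internal loss:

```python
import math

def causal_lm_loss(logits, labels, ignore_index=-100):
    # Logits at position i are scored against the label at position i + 1;
    # positions whose shifted label equals ignore_index contribute nothing.
    shift_logits = logits[:-1]
    shift_labels = labels[1:]
    total, count = 0.0, 0
    for row, y in zip(shift_logits, shift_labels):
        if y == ignore_index:
            continue
        log_z = math.log(sum(math.exp(v) for v in row))  # log-sum-exp
        total += log_z - row[y]  # negative log-likelihood of the target
        count += 1
    return total / max(count, 1)

# Uniform logits over 2 classes: the loss is ln(2) per scored position.
print(causal_lm_loss([[0.0, 0.0], [0.0, 0.0]], [0, 1]))  # 0.6931...
```

Masking every shifted label with -100 yields a loss of 0.0, which is why padding and prompt tokens are conventionally set to -100 during fine-tuning.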
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BltConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The BltForCausalLM forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example
>>> from transformers import AutoTokenizer, BltForCausalLM
>>> model = BltForCausalLM.from_pretrained("itazap/blt-1b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("itazap/blt-1b-hf")
>>> prompt = "If I had to write a haiku, it would be:"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=40, do_sample=True, temperature=0.6)
>>> result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
>>> print(result)
If I had to write a haiku, it would be: "Snowflakes gently fall" - simple, yet peaceful.
I love the idea of snowflakes gently falling, each one