Transformers documentation
This model was released on 2024-08-23 and added to Hugging Face Transformers on 2024-09-20.
GraniteMoe
Overview
The GraniteMoe model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.
PowerMoE-3B is a 3B-parameter sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token and is trained on a mix of open-source and proprietary datasets. PowerMoE-3B shows promising results compared to other dense models with twice as many activated parameters across various benchmarks, including natural language multiple choice, code generation, and math reasoning.
The abstract from the paper is the following:
Finding the optimal learning rate for language model pretraining is a challenging task. This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters, but also because it is prohibitively expensive to perform a hyperparameter search for large language models with billions or trillions of parameters. Recent studies propose using small proxy models and a small corpus to perform hyperparameter searches and transferring the optimal parameters to large models and a large corpus. While the zero-shot transferability of model-size-related hyperparameters, like depth and width, has been proven both theoretically and empirically, the zero-shot transfer from a small corpus to a large corpus is underexplored. In this paper, we study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler. After thousands of small experiments, we found a power-law relationship between variables and demonstrated its transferability across model sizes. Based on this observation, we propose a new learning rate scheduler, the Power scheduler, which is agnostic to the number of training tokens and batch size. Experiments show that combining the Power scheduler with Maximum Update Parameterization (muP) can consistently achieve impressive performance with one set of hyperparameters, regardless of the number of training tokens, batch size, model size, and even model architecture. Our 3B dense and MoE models trained with the Power scheduler achieve performance comparable to state-of-the-art small language models. We open-source these pretrained models.
Tips
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "ibm/PowerMoE-3b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()
# change input text as desired
prompt = "Write a code to find the maximum value in a list of numbers."
# tokenize the text
input_tokens = tokenizer(prompt, return_tensors="pt")
# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=100)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# loop over the batch to print, in this example the batch size is 1
for i in output:
    print(i)

This model was contributed by mayank-mishra.
GraniteMoeConfig
class transformers.GraniteMoeConfig
< source >( vocab_size: int | None = 32000 hidden_size: int | None = 4096 intermediate_size: int | None = 11008 num_hidden_layers: int | None = 32 num_attention_heads: int | None = 32 num_key_value_heads: int | None = None hidden_act: str | None = 'silu' max_position_embeddings: int | None = 2048 initializer_range: float | None = 0.02 rms_norm_eps: int | None = 1e-06 use_cache: bool | None = True pad_token_id: int | None = None bos_token_id: int | None = 1 eos_token_id: int | None = 2 tie_word_embeddings: bool | None = False rope_parameters: transformers.modeling_rope_utils.RopeParameters | dict[str, transformers.modeling_rope_utils.RopeParameters] | None = None attention_bias: bool | None = False attention_dropout: float | None = 0.0 embedding_multiplier: float | None = 1.0 logits_scaling: float | None = 1.0 residual_multiplier: float | None = 1.0 attention_multiplier: float | None = 1.0 num_local_experts: int | None = 8 num_experts_per_tok: int | None = 2 output_router_logits: bool | None = False router_aux_loss_coef: float | None = 0.001 **kwargs )
Parameters
- vocab_size (int, optional, defaults to 32000) — Vocabulary size of the GraniteMoe model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling GraniteMoeModel.
- hidden_size (int, optional, defaults to 4096) — Dimension of the hidden representations.
- intermediate_size (int, optional, defaults to 11008) — Dimension of the MLP representations.
- num_hidden_layers (int, optional, defaults to 32) — Number of hidden layers in the Transformer decoder.
- num_attention_heads (int, optional, defaults to 32) — Number of attention heads for each attention layer in the Transformer decoder.
- num_key_value_heads (int, optional) — Number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper. If not specified, defaults to num_attention_heads.
- hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
- max_position_embeddings (int, optional, defaults to 2048) — The maximum sequence length that this model might ever be used with.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- rms_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the rms normalization layers.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
- pad_token_id (int, optional) — Padding token id.
- bos_token_id (int, optional, defaults to 1) — Beginning of stream token id.
- eos_token_id (int, optional, defaults to 2) — End of stream token id.
- tie_word_embeddings (bool, optional, defaults to False) — Whether to tie the word embeddings.
- rope_parameters (RopeParameters, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dict should contain a value for rope_theta and, optionally, scaling parameters if you want to use RoPE with a longer max_position_embeddings.
- attention_bias (bool, optional, defaults to False) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- embedding_multiplier (float, optional, defaults to 1.0) — Multiplier applied to the input embeddings.
- logits_scaling (float, optional, defaults to 1.0) — Divisor applied to the output logits.
- residual_multiplier (float, optional, defaults to 1.0) — Multiplier applied to the residual connections.
- attention_multiplier (float, optional, defaults to 1.0) — Multiplier applied to the attention scores.
- num_local_experts (int, optional, defaults to 8) — Total number of experts.
- num_experts_per_tok (int, optional, defaults to 2) — Number of experts routed to per token.
- output_router_logits (bool, optional, defaults to False) — Whether or not the router logits should be returned by the model. Enabling this will also allow the model to output the auxiliary loss.
- router_aux_loss_coef (float, optional, defaults to 0.001) — The coefficient for the router auxiliary loss.
This is the configuration class to store the configuration of a GraniteMoeModel. It is used to instantiate a GraniteMoe model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of GraniteMoe-3B.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation of PreTrainedConfig for more information.
>>> from transformers import GraniteMoeModel, GraniteMoeConfig
>>> # Initializing a GraniteMoe granitemoe-3b style configuration
>>> configuration = GraniteMoeConfig()
>>> # Initializing a model from the granitemoe-3b style configuration
>>> model = GraniteMoeModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config

GraniteMoeModel
class transformers.GraniteMoeModel
< source >( config: GraniteMoeConfig )
Parameters
- config (GraniteMoeConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare GraniteMoe Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None use_cache: bool | None = None cache_position: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.MoeModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
Returns
transformers.modeling_outputs.MoeModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.MoeModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False), comprising various elements depending on the configuration (GraniteMoeConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- router_logits (tuple(torch.FloatTensor), optional, returned when output_router_probs=True and config.add_router_probs=True is passed or when config.output_router_probs=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts). Raw router logits (post-softmax) computed by the MoE routers; these terms are used to compute the auxiliary loss for Mixture of Experts models.
The GraniteMoeModel forward method, overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.
GraniteMoeForCausalLM
class transformers.GraniteMoeForCausalLM
< source >( config: GraniteMoeConfig )
Parameters
- config (GraniteMoeConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The GraniteMoe Model for causal language modeling.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None labels: torch.LongTensor | None = None output_router_logits: bool | None = None cache_position: torch.LongTensor | None = None logits_to_keep: int | torch.Tensor = 0 **kwargs ) → transformers.modeling_outputs.MoeCausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- output_router_logits (bool, optional) — Whether or not to return the logits of all the routers. They are useful for computing the router loss, and should not be returned during inference.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- logits_to_keep (Union[int, torch.Tensor], optional, defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, it must be 1D, corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
Returns
transformers.modeling_outputs.MoeCausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.MoeCausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False), comprising various elements depending on the configuration (GraniteMoeConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- aux_loss (torch.FloatTensor, optional, returned when labels is provided) — Auxiliary loss for the sparse modules.
- router_logits (tuple(torch.FloatTensor), optional, returned when output_router_probs=True and config.add_router_probs=True is passed or when config.output_router_probs=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts). Raw router logits (post-softmax) computed by the MoE routers; these terms are used to compute the auxiliary loss for Mixture of Experts models.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The GraniteMoeForCausalLM forward method, overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example
>>> from transformers import AutoTokenizer, GraniteMoeForCausalLM
>>> model = GraniteMoeForCausalLM.from_pretrained("ibm/PowerMoE-3b")
>>> tokenizer = AutoTokenizer.from_pretrained("ibm/PowerMoE-3b")
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."