This model was released on 2024-02-01 and added to Hugging Face Transformers on 2024-04-17.
OLMo
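OLMo is a 7B-parameter dense language model. It uses SwiGLU activations, non-parametric layer normalization, rotary positional embeddings, and a BPE tokenizer that masks personally identifiable information. It was pretrained on Dolma, a dataset of 3T tokens. OLMo was released to provide full transparency into the model weights, training data, training code, and evaluation code, in order to enable more research on language models.
You can find all the original OLMo checkpoints under the OLMo collection.
This model was contributed by shanearora.
Click on the OLMo models in the right sidebar for more examples of how to apply OLMo to different language tasks.
The example below demonstrates how to generate text with the Pipeline or AutoModel classes.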
import torch
from transformers import pipeline
pipe = pipeline(
task="text-generation",
model="allenai/OLMo-7B-hf",
dtype=torch.float16,
device=0,
)
result = pipe("Plants create energy through a process known as")
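print(result)
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.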
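As a rough, back-of-the-envelope illustration of why lower precision helps, the sketch below estimates the memory needed just to hold the weights of a 7-billion-parameter model at a few bit widths. The helper name `weight_memory_gib` is hypothetical and the parameter count is rounded. The figures ignore activations, the KV cache, and the overhead of quantization constants, so real usage will be higher.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate memory (in GiB) needed to store only the weights."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1024**3


# Rounded parameter count, an assumption made for illustration only.
NUM_PARAMS = 7_000_000_000

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight storage to roughly a quarter.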
The example below uses bitsandbytes to quantize only the weights to 4 bits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(
"allenai/OLMo-7B-hf",
attn_implementation="sdpa",
dtype=torch.float16,
device_map="auto",
quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-hf")
inputs = tokenizer("Bitcoin is", return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
output = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output[0]))
OlmoConfig
class transformers.OlmoConfig
< source >( vocab_size: int | None = 50304 hidden_size: int | None = 4096 intermediate_size: int | None = 11008 num_hidden_layers: int | None = 32 num_attention_heads: int | None = 32 num_key_value_heads: int | None = None hidden_act: str | None = 'silu' max_position_embeddings: int | None = 2048 initializer_range: float | None = 0.02 use_cache: bool | None = True pad_token_id: int | None = 1 bos_token_id: int | None = None eos_token_id: int | None = 50279 tie_word_embeddings: int | None = False rope_parameters: transformers.modeling_rope_utils.RopeParameters | dict[str, transformers.modeling_rope_utils.RopeParameters] | None = None attention_bias: bool | None = False attention_dropout: float | None = 0.0 clip_qkv: bool | None = None **kwargs )
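Parameters
- vocab_size (int, optional, defaults to 50304) — Vocabulary size of the OLMo model. Defines the number of different tokens that can be represented by the input_ids passed when calling OlmoModel.
- hidden_size (int, optional, defaults to 4096) — Dimension of the hidden representations.
- intermediate_size (int, optional, defaults to 11008) — Dimension of the MLP representations.
- num_hidden_layers (int, optional, defaults to 32) — Number of hidden layers in the Transformer decoder.
- num_attention_heads (int, optional, defaults to 32) — Number of attention heads for each attention layer in the Transformer decoder.
- num_key_value_heads (int, optional) — Number of key_value heads used to implement Grouped Query Attention (GQA). If num_key_value_heads=num_attention_heads, the model uses Multi Head Attention (MHA); if num_key_value_heads=1, it uses Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, see the Grouped Query Attention paper. If not specified, defaults to num_attention_heads.
- hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
- max_position_embeddings (int, optional, defaults to 2048) — The maximum sequence length that this model might ever be used with.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated normal initializer used to initialize all weight matrices.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
- pad_token_id (int, optional, defaults to 1) — Padding token ID.
- bos_token_id (int, optional) — Beginning-of-stream token ID.
- eos_token_id (int, optional, defaults to 50279) — End-of-stream token ID.
- tie_word_embeddings (bool, optional, defaults to False) — Whether to tie the word embeddings.
- rope_parameters (RopeParameters, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. It should contain a value for rope_theta and, optionally, the parameters used for scaling if you want to use RoPE with a longer max_position_embeddings.
- attention_bias (bool, optional, defaults to False) — Whether to use a bias in the query, key, value, and output projection layers during self-attention.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- clip_qkv (float, optional) — If not None, elements of the query, key, and value attention states are clipped so that their absolute value does not exceed this value.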
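The mean-pooling step mentioned for num_key_value_heads can be sketched with plain Python lists. This is a toy illustration, not the conversion code used by the library: the helper name `mean_pool_kv_heads` is hypothetical, each inner list stands in for one head's projection weights, and heads are assumed to be grouped consecutively.

```python
def mean_pool_kv_heads(heads, num_key_value_heads):
    """Average consecutive groups of per-head vectors into fewer heads.

    `heads` is a list of equal-length lists, one per original head.
    """
    num_heads = len(heads)
    if num_heads % num_key_value_heads != 0:
        raise ValueError("number of heads must be divisible by num_key_value_heads")
    group_size = num_heads // num_key_value_heads
    pooled = []
    for g in range(num_key_value_heads):
        group = heads[g * group_size:(g + 1) * group_size]
        # Element-wise mean across the heads in this group.
        pooled.append([sum(col) / group_size for col in zip(*group)])
    return pooled


# Four toy key heads with two values each, pooled down to two key/value heads.
key_heads = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
pooled = mean_pool_kv_heads(key_heads, num_key_value_heads=2)
print(pooled)  # [[2.0, 3.0], [20.0, 30.0]]
```

With num_key_value_heads equal to the number of heads the input comes back unchanged (the MHA case), and with num_key_value_heads=1 all heads collapse into one (the MQA case).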
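This is the configuration class to store the configuration of an OlmoModel. It is used to instantiate an OLMo model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of allenai/OLMo-7B-hf.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.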
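To make the effect of the clip_qkv parameter concrete, the following sketch clamps a row of toy values to the range [-clip_qkv, clip_qkv]. The helper name `clip_values` is hypothetical and operates on a Python list; in the model the same idea is applied to the query, key, and value tensors.

```python
def clip_values(values, clip_qkv):
    """Clamp each element to [-clip_qkv, clip_qkv]; a no-op when clip_qkv is None."""
    if clip_qkv is None:
        return list(values)
    return [max(-clip_qkv, min(clip_qkv, v)) for v in values]


query_row = [-12.5, -0.3, 4.0, 9.1]
print(clip_values(query_row, clip_qkv=8.0))   # [-8.0, -0.3, 4.0, 8.0]
print(clip_values(query_row, clip_qkv=None))  # [-12.5, -0.3, 4.0, 9.1]
```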
>>> from transformers import OlmoModel, OlmoConfig
>>> # Initializing an OLMo 7B style configuration
>>> configuration = OlmoConfig()
>>> # Initializing a model from the OLMo 7B style configuration
>>> model = OlmoModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
OlmoModel
class transformers.OlmoModel
< source >( config: OlmoConfig )
Parameters
- config (OlmoConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Olmo Model outputting raw hidden-states without any specific head on top.
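This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.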
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None cache_position: torch.LongTensor | None = None use_cache: bool | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.
  Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default.
  The model will output the same cache format that is fed as input.
  If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
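The relationship between past_key_values, the unprocessed input_ids, and cache_position described above can be pictured with plain Python lists. This is a conceptual sketch only: `ToyCache` and `step` are hypothetical names, and the real Cache classes store key/value tensors per layer rather than a token counter.

```python
class ToyCache:
    """Stand-in for a KV cache that only remembers how many tokens it has seen."""

    def __init__(self):
        self.seen_tokens = 0


def step(new_input_ids, cache):
    """Return the cache positions of the unprocessed tokens and update the cache."""
    start = cache.seen_tokens
    cache_position = list(range(start, start + len(new_input_ids)))
    cache.seen_tokens += len(new_input_ids)
    return cache_position


cache = ToyCache()
prompt_ids = [101, 7, 42, 9]    # first call: the whole prompt is unprocessed
print(step(prompt_ids, cache))  # [0, 1, 2, 3]
print(step([55], cache))        # later calls pass only the new token: [4]
print(cache.seen_tokens)        # 5
```

The point of the sketch is that after the first call only the new token is passed in, while the positions keep counting from where the cache left off.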
Returns
transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (OlmoConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
  If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A Cache instance. For more details, see our kv cache guide.
  Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
  Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The OlmoModel forward method, overrides the __call__ special method.
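Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.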
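The difference between calling the instance and calling forward directly can be imitated in a few lines of plain Python. This is a simplified analogy, not PyTorch's actual implementation: `ToyModule` and its `pre_hooks` list are hypothetical stand-ins for the hook machinery that runs inside a Module's call.

```python
class ToyModule:
    """Minimal imitation of the call protocol: hooks run in __call__, not in forward."""

    def __init__(self):
        self.pre_hooks = []

    def forward(self, x):
        return x * 2

    def __call__(self, x):
        # Pre-processing steps registered on the instance run first.
        for hook in self.pre_hooks:
            x = hook(x)
        return self.forward(x)


module = ToyModule()
module.pre_hooks.append(lambda x: x + 1)

print(module(3))          # hook applied, then forward: (3 + 1) * 2 = 8
print(module.forward(3))  # hook silently skipped: 3 * 2 = 6
```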
OlmoForCausalLM
class transformers.OlmoForCausalLM
< source >( config )
Parameters
- config (OlmoConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Olmo Model for causal language modeling.
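This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.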
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.Tensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None labels: torch.LongTensor | None = None use_cache: bool | None = None cache_position: torch.LongTensor | None = None logits_to_keep: int | torch.Tensor = 0 **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.
  Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default.
  The model will output the same cache format that is fed as input.
  If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- logits_to_keep (Union[int, torch.Tensor], optional, defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only the last token's logits are needed for generation, and calculating them only for that token saves memory, which becomes significant for long sequences or large vocabulary sizes. If a torch.Tensor, it must be 1D and correspond to the indices to keep in the sequence length dimension. This is useful when using a packed tensor format (single dimension for batch and sequence length).
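The selection rule for logits_to_keep can be sketched on a plain Python list. This is an illustration of the behavior described above rather than the library's code: `select_positions` is a hypothetical helper, and a list of indices stands in for the 1D tensor case.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Pick the sequence positions for which logits would be computed.

    `hidden_states` is a list with one entry per sequence position.
    """
    if isinstance(logits_to_keep, int):
        # 0 keeps everything; k > 0 keeps only the last k positions.
        return hidden_states if logits_to_keep == 0 else hidden_states[-logits_to_keep:]
    # Otherwise treat the argument as a 1D collection of explicit indices.
    return [hidden_states[i] for i in logits_to_keep]


positions = ["h0", "h1", "h2", "h3", "h4"]
print(select_positions(positions))             # all five positions
print(select_positions(positions, 1))          # ['h4'], enough for next-token generation
print(select_positions(positions, [0, 2, 4]))  # ['h0', 'h2', 'h4']
```

With the default vocab_size of 50304 and a 2048-token sequence, full logits hold 2048 × 50304 (about 103 million) values per sequence, compared with 50304 values when only the last position is kept.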
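Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (OlmoConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — A Cache instance. For more details, see our kv cache guide.
  Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
  Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.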
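The OlmoForCausalLM forward method, overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example: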
>>> from transformers import AutoTokenizer, OlmoForCausalLM
>>> model = OlmoForCausalLM.from_pretrained("allenai/OLMo-7B-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-hf")
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> # The exact continuation depends on the checkpoint and library version; the output below is illustrative.
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."