DeepSeek-V3

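Overview

The DeepSeek-V3 model was proposed in the DeepSeek-V3 Technical Report by the DeepSeek-AI team.

The abstract from the paper is the following: We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.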
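As a quick illustration of what "671B total, 37B activated per token" means for a Mixture-of-Experts model, the arithmetic below computes the share of parameters that take part in processing a single token. It uses only the two figures quoted in the abstract; the variable names are illustrative.

```python
# Figures quoted in the abstract, in billions of parameters.
TOTAL_PARAMS_B = 671
ACTIVE_PARAMS_B = 37

# Share of the model's parameters used for any single token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"Active per token: {active_fraction:.1%} of all parameters")
```

In other words, only about one parameter in eighteen is used for any given token, which is where the inference savings of the MoE design come from.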

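Limitations and call for contribution!

We are very happy to make this code community-driven, and would love to see how you can best optimize the following:

- The current implementation uses a "naive" attention computation (so it is not really MLA).
- The current implementation loops over the experts. This should be replaced; a suggested starting point is `get_packed_weights` from `integrations/tensor_parallel`.
- The current implementation uses the EleutherAI formula for RoPE; using the original formula would be more efficient (it should still follow our API).
- Static cache is not supported (this should be just a generation config issue or a config shape issue).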
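To make the "loops over the experts" limitation concrete, here is a minimal pure-Python sketch of that pattern. It is not the library's code: the function name `moe_forward_loop`, the scalar "experts", and the toy routing tables are all hypothetical, chosen only to show why the work grows with the number of experts.

```python
def moe_forward_loop(tokens, topk_ids, topk_weights, experts):
    """Naive MoE forward pass: visit every expert and process the tokens routed to it."""
    outputs = [0.0] * len(tokens)
    for expert_id, expert in enumerate(experts):  # one pass per expert
        for t, (ids, weights) in enumerate(zip(topk_ids, topk_weights)):
            for slot, chosen in enumerate(ids):
                if chosen == expert_id:
                    # Accumulate this expert's weighted contribution for token t.
                    outputs[t] += weights[slot] * expert(tokens[t])
    return outputs


# Toy setup: four "experts" that simply scale their input (hypothetical).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [1.0, 10.0]
topk_ids = [[0, 2], [1, 3]]                  # experts chosen for each token
topk_weights = [[0.5, 0.5], [0.25, 0.75]]    # routing weights for each choice

out = moe_forward_loop(tokens, topk_ids, topk_weights, experts)
print(out)
```

With 256 routed experts the outer loop runs 256 times per layer even though each token only uses a handful of them, which is why a batched or packed-weight formulation is preferable.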

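Usage tips

The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and a multi-token prediction training objective. The model was pre-trained on 14.8 trillion tokens and went through Supervised Fine-Tuning and Reinforcement Learning stages, and can be used for a variety of language tasks.

You can run the model in `FP8` automatically; 2 nodes of 8 H100 GPUs each should be more than enough!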
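A back-of-the-envelope estimate shows why 16 GPUs leave headroom in FP8. The sketch below counts weights only (no KV cache, activations, or framework overhead) and assumes 80 GB of memory per H100, which is an assumption about the hardware variant rather than something stated here.

```python
TOTAL_PARAMS = 671 * 10**9       # total parameter count from the paper abstract
BYTES_PER_PARAM_FP8 = 1          # FP8 stores one byte per parameter
GPU_MEMORY_GB = 80               # assumption: 80 GB H100 variant
NUM_GPUS = 2 * 8                 # 2 nodes of 8 GPUs each

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM_FP8 / 1e9
cluster_gb = NUM_GPUS * GPU_MEMORY_GB
headroom_gb = cluster_gb - weights_gb

print(f"weights: {weights_gb:.0f} GB, cluster: {cluster_gb} GB, headroom: {headroom_gb:.0f} GB")
```

Under these assumptions the weights occupy roughly half of the aggregate GPU memory, leaving the rest for the KV cache and activations.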

# `run_deepseek_r1.py`
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(30)

tokenizer = AutoTokenizer.from_pretrained("deepseek-r1")

chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

# Shard the model across all visible devices and load the weights in bfloat16.
model = AutoModelForCausalLM.from_pretrained("deepseek-r1", device_map="auto", torch_dtype=torch.bfloat16)
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Time generation and decoding only (model loading is excluded).
start = time.time()
outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs))
print(time.time() - start)

This generated:

<｜Assistant｜><think>
Okay, the user wants to demonstrate how chat templating works. Let me break down what that means. Chat templating is about structuring the conversation data, especially for models that need specific input formats. Maybe they're referring to something like how messages are formatted with roles (user, assistant, system) in APIs like OpenAI.

First, I should explain what chat templating is. It's the process of formatting conversation data into a structured format that the model can understand. This usually includes roles and content. For example, user messages, assistant responses, and system messages each have their own role tags.

They might want an example. Let me think of a simple conversation. The user says "Hello, how are you?" and the assistant responds "I'm doing great. How can I help you today?" Then the user follows up with wanting to show off chat templating. So the example should include the history and the new message.

In some frameworks, like Hugging Face's Transformers, chat templates are applied using Jinja2 templates. The template might look something like combining system messages, then looping through user and assistant messages with appropriate tags. For instance, using {% for message in messages %} and assigning roles like <|user|>, <|assistant|>, etc.

I should structure the example with the messages array, showing each role and content. Then apply a hypothetical template to convert that into a formatted string the model uses. Also, mention that different models have different templating requirements, like using special tokens or varying role labels.

Wait, the user mentioned "chat templating" in the context of showing off. Maybe they want a practical example they can present. So providing a code snippet or a structured data example would be helpful. Let me outline a typical messages array and then the templated output.

Also, it's important to note that proper templating ensures the model knows the conversation flow, which is crucial for generating coherent responses. Maybe include a note about why it's important, like maintaining context and role-specific processing.

Let me check if there are any common mistakes or things to avoid. For example, not closing tags properly, or mismatching roles. But maybe that's too detailed unless the user asks. Focus on the positive example first.

Putting it all together, the response should have an example messages array, the applied template, and the final formatted string. Maybe use angle brackets or special tokens as placeholders. Also, mention that this helps in training or fine-tuning models with structured data.

I think that's a solid approach. Let me structure it step by step to make it clear.
</think>

Chat templating is a way to structure conversation data (e.g., user/assistant interactions) into a format that language models understand. This is especially important for models trained to handle multi-turn dialogues, where the input must explicitly separate roles (user, assistant, system, etc.) and messages. Let’s break this down with an example!

---

### **Step 1: Raw Conversation History**
Suppose we have this conversation:
- **User**: "Hello, how are you?"
- **Assistant**: "I'm doing great. How can I help you today?"
- **User**: "I'd like to show off how chat templating works!"

---

### **Step 2: Structured Messages**
In frameworks like Hugging Face Transformers or OpenAI, conversations are often formatted as a list of dictionaries with `role` and `content`:
```python
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
```

---

### **Step 3: Apply a Chat Template**
A **chat template** converts this structured data into a single string formatted for the model. For example, using a Jinja-style template (common in Hugging Face):

```jinja
{% for message in messages %}
    {% if message['role'] == 'user' %}
        <|user|>{{ message['content'] }}<|end|>
    {% elif message['role'] == 'assistant' %}
        <|assistant|>{{ message['content'] }}<|end|>
    {% endif %}
{% endfor %}
<|assistant|>
```

---

### **Step 4: Final Templated Output**
Applying the template to our `messages` list would produce:
```text
<|user|>Hello, how are you?<|end|>
<|assistant|>I'm doing great. How can I help you today?<|end|>
<|user|>I'd like to show off how chat templating works!<|end|>
<|assistant|>
```

This tells the model:  
1. The conversation history (user/assistant turns).  
2. The model’s turn to generate a response (`<|assistant|>` at the end).  

---

### **Key Notes**:
- **Role Separation**: Tags like `<|user|>` and `<|assistant|>` help the model distinguish speakers.
- **Special Tokens**: Models often use unique tokens (e.g., `<|end|>`) to mark message boundaries.
- **Flexibility**: Templates vary by model (e.g., OpenAI uses `{"role": "user", "content": "..."}` instead of tags).

---

### **Why This Matters**:
- **Consistency**: Ensures the model understands dialogue structure.
- **Context Preservation**: Maintains the flow of multi-turn conversations.
- **Alignment**: Matches the format the model was trained on for better performance.

Want to dive deeper or see a specific framework’s implementation (e.g., OpenAI, Llama, Mistral)? Let me know! 😊<｜end▁of▁sentence｜>

Use the following command to run it:

torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0|1 --rdzv-id an_id --rdzv-backend c10d --rdzv-endpoint master_addr:master_port run_deepseek_r1.py
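The `--node_rank=0|1` notation above is read here as "run the command once per node, with rank 0 on the first node and rank 1 on the second"; that reading is an assumption. The sketch below spells out the two resulting commands. The helper `torchrun_commands` is hypothetical, and `an_id` and `master_addr:master_port` are left as the same placeholders used above.

```python
import shlex


def torchrun_commands(script="run_deepseek_r1.py", nnodes=2, nproc_per_node=8,
                      rdzv_id="an_id", rdzv_endpoint="master_addr:master_port"):
    """Return one launch command per node; only --node_rank differs between them."""
    commands = []
    for node_rank in range(nnodes):
        args = [
            "torchrun",
            f"--nproc_per_node={nproc_per_node}",
            f"--nnodes={nnodes}",
            f"--node_rank={node_rank}",
            "--rdzv-id", rdzv_id,
            "--rdzv-backend", "c10d",
            "--rdzv-endpoint", rdzv_endpoint,
            script,
        ]
        commands.append(shlex.join(args))
    return commands


commands = torchrun_commands()
for command in commands:
    print(command)
```

Each printed line is meant to be run on its own node, with the placeholders replaced by real values.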

If you see the following error:

[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error:
[rank0]: Bootstrap : no socket interface found

it means NCCL was probably not loaded.

DeepseekV3Config

class transformers.DeepseekV3Config

( vocab_size = 129280 hidden_size = 7168 intermediate_size = 18432 moe_intermediate_size = 2048 num_hidden_layers = 61 num_attention_heads = 128 num_key_value_heads = 128 n_shared_experts = 1 n_routed_experts = 256 routed_scaling_factor = 2.5 kv_lora_rank = 512 q_lora_rank = 1536 qk_rope_head_dim = 64 v_head_dim = 128 qk_nope_head_dim = 128 n_group = 8 topk_group = 4 num_experts_per_tok = 8 first_k_dense_replace = 3 norm_topk_prob = True hidden_act = 'silu' max_position_embeddings = 4096 initializer_range = 0.02 rms_norm_eps = 1e-06 use_cache = True pad_token_id = None bos_token_id = 0 eos_token_id = 1 pretraining_tp = 1 tie_word_embeddings = False rope_theta = 10000.0 rope_scaling = None rope_interleave = True attention_bias = False attention_dropout = 0.0 **kwargs )

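Parameters

vocab_size (int, optional, defaults to 129280) — Vocabulary size of the DeepSeek-V3 model. Defines the number of different tokens that can be represented by the `input_ids` passed when calling DeepseekV3Model.
hidden_size (int, optional, defaults to 7168) — Dimension of the hidden representations.
intermediate_size (int, optional, defaults to 18432) — Dimension of the MLP representations.
moe_intermediate_size (int, optional, defaults to 2048) — Dimension of the MoE representations.
num_hidden_layers (int, optional, defaults to 61) — Number of hidden layers in the Transformer decoder.
num_attention_heads (int, optional, defaults to 128) — Number of attention heads for each attention layer in the Transformer decoder.
num_key_value_heads (int, optional, defaults to 128) — The number of key/value heads used to implement Grouped Query Attention. If `num_key_value_heads=num_attention_heads`, the model will use Multi-Head Attention (MHA); if `num_key_value_heads=1`, the model will use Multi-Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out [this paper](https://huggingface.co/papers/2305.13245). If not specified, defaults to `num_attention_heads`.
n_shared_experts (int, optional, defaults to 1) — Number of shared experts.
n_routed_experts (int, optional, defaults to 256) — Number of routed experts.
routed_scaling_factor (float, optional, defaults to 2.5) — Scaling factor for the routed experts.
kv_lora_rank (int, optional, defaults to 512) — Rank of the LoRA matrices for the key and value projections.
q_lora_rank (int, optional, defaults to 1536) — Rank of the LoRA matrices for the query projections.
qk_rope_head_dim (int, optional, defaults to 64) — Dimension of the query/key heads that use rotary position embeddings.
v_head_dim (int, optional, defaults to 128) — Dimension of the value heads.
qk_nope_head_dim (int, optional, defaults to 128) — Dimension of the query/key heads that do not use rotary position embeddings.
n_group (int, optional, defaults to 8) — Number of groups for the routed experts.
topk_group (int, optional, defaults to 4) — Number of groups selected for each token (ensuring that the experts selected for each token come only from within `topk_group` groups).
num_experts_per_tok (int, optional, defaults to 8) — Number of selected experts; None means a dense model.
first_k_dense_replace (int, optional, defaults to 3) — Number of dense layers in the shallow layers (embed->dense->dense->…->dense->moe->moe…->lm_head); the first k layers after the embedding are dense.
norm_topk_prob (bool, optional, defaults to True) — Whether to normalize the weights of the routed experts.
hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
max_position_embeddings (int, optional, defaults to 4096) — The maximum sequence length that this model might ever be used with.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
rms_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the RMS normalization layers.
use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if `config.is_decoder=True`.
pad_token_id (int, optional) — Padding token id.
bos_token_id (int, optional, defaults to 0) — Beginning-of-sequence token id.
eos_token_id (int, optional, defaults to 1) — End-of-sequence token id.
pretraining_tp (int, optional, defaults to 1) — Experimental feature. Tensor parallelism rank used during pretraining. Please refer to this document to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to this issue.
tie_word_embeddings (bool, optional, defaults to False) — Whether to tie the word embedding weights.
rope_theta (float, optional, defaults to 10000.0) — The base period of the RoPE embeddings.
rope_scaling (Dict, optional) — Dictionary containing the scaling configuration for the RoPE embeddings. Two scaling strategies are currently supported: `linear` and `dynamic`. Their scaling factor must be a float greater than 1. The expected format is `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update `max_position_embeddings` to the expected new maximum.
rope_interleave (bool, optional, defaults to True) — Whether to interleave the rotary position embeddings.
attention_bias (bool, optional, defaults to False) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.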
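The routing parameters above (`n_group`, `topk_group`, `num_experts_per_tok`, `norm_topk_prob`, `routed_scaling_factor`) interact in a way that is easier to see in code. The following is a simplified pure-Python sketch of group-limited top-k routing for a single token, not the library's implementation. The function name `route_token` is hypothetical, scoring each group by the sum of its two best experts is an assumption, and any bias correction used for load balancing is omitted.

```python
import random


def route_token(scores, n_group=8, topk_group=4, top_k=8,
                norm_topk_prob=True, routed_scaling_factor=2.5):
    """Pick `top_k` experts for one token, restricted to the best `topk_group` groups."""
    n_experts = len(scores)
    group_size = n_experts // n_group

    # Score each group (assumption: sum of its two highest expert scores).
    group_scores = []
    for g in range(n_group):
        members = sorted(scores[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(members[:2]))

    # Keep only the best `topk_group` groups.
    kept_groups = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]
    kept = set(kept_groups)

    # Choose the top-k experts among the experts of the kept groups.
    candidates = [e for e in range(n_experts) if e // group_size in kept]
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]

    weights = [scores[e] for e in chosen]
    if norm_topk_prob:
        total = sum(weights)
        weights = [w / total for w in weights]  # normalize so the weights sum to 1
    weights = [w * routed_scaling_factor for w in weights]
    return chosen, weights, sorted(kept_groups)


random.seed(0)
scores = [random.random() for _ in range(256)]  # stand-in for per-expert affinity scores
chosen, weights, kept_groups = route_token(scores)
print("groups kept:", kept_groups)
print("experts chosen:", chosen)
```

With the defaults, each token can only reach experts in 4 of the 8 groups (128 of the 256 routed experts), and the 8 selected weights sum to `routed_scaling_factor` after normalization.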

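This is the configuration class to store the configuration of a DeepseekV3Model. It is used to instantiate a DeepSeek model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of DeepSeek-V3, e.g. bzantium/tiny-deepseek-v3. Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.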

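>>> from transformers import DeepseekV3Model, DeepseekV3Config

>>> # Initializing a Deepseek-V3 style configuration
>>> configuration = DeepseekV3Config()

>>> # Initializing a model (with random weights) from the configuration.
>>> # The default configuration describes the full-size model, so this needs a very large amount of memory.
>>> model = DeepseekV3Model(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config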
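A few quantities follow directly from the default values in the signature above. The sketch below works on a plain dictionary copy of those defaults, so it runs without `transformers`. The rule that layer `i` is an MoE layer when `i >= first_k_dense_replace`, and the reading that the full query/key head dimension is the sum of the RoPE and non-RoPE parts, are both interpretations of the parameter descriptions.

```python
# Plain-Python copy of selected defaults from the DeepseekV3Config signature.
defaults = {
    "num_hidden_layers": 61,
    "first_k_dense_replace": 3,
    "qk_nope_head_dim": 128,
    "qk_rope_head_dim": 64,
    "num_attention_heads": 128,
    "num_key_value_heads": 128,
    "n_routed_experts": 256,
    "n_group": 8,
}

# Layer layout: the first k layers are dense, the remaining ones are MoE (assumed rule).
layer_types = [
    "dense" if i < defaults["first_k_dense_replace"] else "moe"
    for i in range(defaults["num_hidden_layers"])
]

# Full query/key head dimension: non-RoPE part plus RoPE part (assumed reading).
qk_head_dim = defaults["qk_nope_head_dim"] + defaults["qk_rope_head_dim"]

# Attention variant implied by the head counts, as described for num_key_value_heads.
if defaults["num_key_value_heads"] == defaults["num_attention_heads"]:
    attention_variant = "MHA"
elif defaults["num_key_value_heads"] == 1:
    attention_variant = "MQA"
else:
    attention_variant = "GQA"

experts_per_group = defaults["n_routed_experts"] // defaults["n_group"]

print(layer_types.count("dense"), "dense layers,", layer_types.count("moe"), "MoE layers")
print("qk_head_dim:", qk_head_dim, "| attention:", attention_variant,
      "| experts per group:", experts_per_group)
```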

DeepseekV3Model

class transformers.DeepseekV3Model

( config: DeepseekV3Config )

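Parameters

config (DeepseekV3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Deepseek V3 model outputting raw hidden states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.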

forward

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None **flash_attn_kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)

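Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

What are position IDs?
past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.

Two formats are allowed:
- a Cache instance, see our kv cache guide;
- a tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the legacy cache format will be returned.

If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
use_cache (bool, optional) — If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See `attentions` under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Unlike `position_ids`, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.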
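The relationship between `attention_mask` and `position_ids` for a padded batch can be sketched without any tensors. The helper below is hypothetical and uses nested lists; the pad token id of 0 is made up for the example, and filling padded positions with the dummy value 1 mirrors a convention often seen in generation code rather than a requirement stated above.

```python
def build_mask_and_positions(batch, pad_id):
    """Return (attention_mask, position_ids) for a batch of equally long token-id lists."""
    masks, positions = [], []
    for seq in batch:
        mask = [0 if tok == pad_id else 1 for tok in seq]  # 1 = attend, 0 = padding
        pos, seen = [], 0
        for m in mask:
            seen += m
            # Real tokens count up from 0; padded slots get a dummy position of 1.
            pos.append(seen - 1 if m else 1)
        masks.append(mask)
        positions.append(pos)
    return masks, positions


# Left-padded batch; pad_id=0 is an assumption for this example.
batch = [[0, 0, 5, 6, 7], [8, 9, 10, 11, 12]]
attention_mask, position_ids = build_mask_and_positions(batch, pad_id=0)
print(attention_mask)
print(position_ids)
```

The first real token of every sequence gets position 0 regardless of how much padding precedes it, which is what keeps position embeddings aligned across a padded batch.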

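transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (DeepseekV3Config) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden states at the output of the last layer of the model.

If past_key_values is used, only the last hidden state of the sequences, of shape (batch_size, 1, hidden_size), is output.
past_key_values (Cache, optional, returned when `use_cache=True` is passed or when `config.use_cache=True`) — A Cache instance. For more details, see our kv cache guide.

Contains pre-computed hidden states (keys and values in the self-attention blocks and, if `config.is_encoder_decoder=True`, in the cross-attention blocks) that can be used (see the `past_key_values` input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The DeepseekV3Model forward method overrides the `__call__` special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.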

DeepseekV3ForCausalLM

class transformers.DeepseekV3ForCausalLM

( config )

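Parameters

config (DeepseekV3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Deepseek V3 model for causal language modeling.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.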

forward

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **kwargs: typing_extensions.Unpack[transformers.models.deepseek_v3.modeling_deepseek_v3.KwargsForCausalLM] ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

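Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].

What are position IDs?
past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.

Two formats are allowed:
- a Cache instance, see our kv cache guide;
- a tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the legacy cache format will be returned.

If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see the `input_ids` docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
use_cache (bool, optional) — If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See `attentions` under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Unlike `position_ids`, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all `input_ids` (special case). Only the last token's logits are needed for generation, and calculating them only for that token can save memory, which becomes significant for long sequences or a large vocabulary size. If a `torch.Tensor`, it must be 1D, corresponding to the indices to keep in the sequence length dimension. This is useful when using a packed tensor format (a single dimension for batch and sequence length).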
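The `logits_to_keep` semantics described above, including why `0` means "keep everything", can be illustrated on a plain list standing in for the sequence dimension. This is a sketch under that simplification: the helper `select_positions` is hypothetical, and a list of indices plays the role of the 1D tensor.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Return the sequence positions for which logits would be computed."""
    if isinstance(logits_to_keep, int):
        # slice(-0, None) equals slice(0, None), so 0 keeps every position,
        # while a positive k keeps only the last k positions.
        return hidden_states[slice(-logits_to_keep, None)]
    # Otherwise treat the argument as explicit indices along the sequence dimension.
    return [hidden_states[i] for i in logits_to_keep]


sequence = ["h0", "h1", "h2", "h3", "h4"]  # stand-ins for per-position hidden states
print(select_positions(sequence, 0))       # all positions
print(select_positions(sequence, 1))       # last position only, enough for generation
print(select_positions(sequence, [0, 2]))  # explicit indices
```

During generation only the final position feeds the next-token prediction, so keeping a single position avoids materializing a `(sequence_length, vocab_size)` logits matrix.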

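transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (DeepseekV3Config) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (Cache, optional, returned when `use_cache=True` is passed or when `config.use_cache=True`) — A Cache instance. For more details, see our kv cache guide.

Contains pre-computed hidden states (keys and values in the self-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.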
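The `loss` entry above is a next-token prediction loss in which labels equal to -100 are ignored. The pure-Python sketch below shows one way such a loss can be computed for a single sequence. It assumes that labels are shifted inside the loss (position `t` predicts `labels[t + 1]`), which is typical for causal language models but is not stated explicitly here; the function name is hypothetical.

```python
import math


def causal_lm_loss(logits, labels, ignore_index=-100):
    """Mean cross-entropy over next-token targets, skipping ignored labels."""
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]            # position t predicts the next token
        if target == ignore_index:
            continue                      # ignored labels contribute nothing
        row = logits[t]
        peak = max(row)                   # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]      # negative log-probability of the target
        count += 1
    return total / count if count else float("nan")


# Toy example: vocabulary of 4, uniform logits at every position.
logits = [[0.0, 0.0, 0.0, 0.0]] * 3
labels = [-100, 2, 1]                     # -100 entries would be skipped as targets
loss = causal_lm_loss(logits, labels)
print(loss)
```

With uniform logits over four tokens, every counted target has probability 1/4, so the loss equals `log(4)`.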

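The DeepseekV3ForCausalLM forward method overrides the `__call__` special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: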

>>> from transformers import AutoTokenizer, DeepseekV3ForCausalLM

>>> model = DeepseekV3ForCausalLM.from_pretrained("meta-deepseek_v3/DeepseekV3-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-deepseek_v3/DeepseekV3-2-7b-hf")

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
