Sparse Mixture of Experts Language Model from Scratch: Extending makeMoE with Expert Capacity

Community article, published March 18, 2024

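My previous blog post walked through an end-to-end implementation of a sparse mixture of experts language model, "makeMoE" (inspired by Andrej Karpathy's makemore and nanoGPT), and it received a lot of attention from the community (https://huggingface.co/blog/AviSoori1x/makemoe-from-scratch). Recently, x.ai open-sourced Grok-1, another sparse MoE LLM, which prompted me to improve makeMoE by adding a feature I had originally left out: expert capacity.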

The following GitHub repo contains the end-to-end implementation, including expert capacity: https://github.com/AviSoori1x/makeMoE

Why is expert capacity so important?

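Pretraining a sparse mixture of experts language model, or any large language model, usually involves multiple GPUs and often many machines. How training is parallelized across these hardware resources is critical for balancing the computational load. If a particular expert or group of experts becomes overly favored, reflecting a bias toward exploitation over exploration, the result can be not only degraded model performance but also an unbalanced computational load across the cluster.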
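To make the imbalance concrete, here is a small sketch that counts how many tokens each expert receives under a hypothetical top-1 routing. The assignment list and the `expert_load` helper are made up for illustration and are not part of makeMoE.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Return the number of tokens routed to each expert, indexed by expert id."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]

# Hypothetical routing of 8 tokens across 4 experts: expert 0 is heavily favored.
assignments = [0, 0, 0, 2, 0, 1, 0, 0]
load = expert_load(assignments, num_experts=4)
print(load)  # [6, 1, 1, 0]
```

If each expert lived on its own device, the device hosting expert 0 would do six times the work of its neighbors while the one hosting expert 3 sat idle.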

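The Switch Transformer implementation uses expert capacity to get around this problem. Expert capacity sets a cap on the number of tokens each expert processes during training or inference. It is defined from the number of tokens in the batch and the number of available experts, usually adjusted by a capacity factor. This factor allows flexibility in the allocation: it provides a buffer that accommodates variation in the data distribution and ensures that no single expert becomes a bottleneck through overload. This matters a great deal when training these large models for weeks or even months, during which hardware failures are common.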

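Expert capacity is usually computed as follows:

Expert Capacity = (Tokens per Batch / Number of Experts) × Capacity Factor, where:

Tokens per Batch is the total number of tokens in the batch to be processed. Number of Experts is the total number of experts in the MoE layer available to process the data. Capacity Factor is a multiplier applied to the base capacity (Tokens per Batch divided by Number of Experts). A capacity factor greater than 1 lets each expert take on a buffer beyond its even share, to accommodate imbalance in token assignment. Typical values are in the range 1 to 1.25.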
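As a quick worked example of the formula, the sketch below evaluates it for illustrative numbers. The `expert_capacity` helper and the values 1024 and 8 are assumptions chosen for the example, not settings from makeMoE.

```python
def expert_capacity(tokens_per_batch, num_experts, capacity_factor=1.0):
    """Maximum number of tokens a single expert may process in one batch."""
    return int((tokens_per_batch / num_experts) * capacity_factor)

# Illustrative numbers: 1024 tokens spread over 8 experts.
print(expert_capacity(1024, 8, 1.0))   # 128
print(expert_capacity(1024, 8, 1.25))  # 160
```

With a capacity factor of 1.0 each expert can take exactly its even share of 128 tokens; raising the factor to 1.25 leaves room for 32 extra tokens per expert before any are dropped.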

The following code block makes slight adjustments to implement a simple version of expert capacity.

import torch
import torch.nn as nn

# NoisyTopkRouter and Expert are the modules from the original makeMoE implementation.
class SparseMoE(nn.Module):
    def __init__(self, n_embed, num_experts, top_k, capacity_factor=1.0):
        super(SparseMoE, self).__init__()
        self.router = NoisyTopkRouter(n_embed, num_experts, top_k)
        self.experts = nn.ModuleList([Expert(n_embed) for _ in range(num_experts)])
        self.top_k = top_k
        self.capacity_factor = capacity_factor
        self.num_experts = num_experts

    def forward(self, x):
        # Assuming x has shape [batch_size, seq_len, n_embd]
        batch_size, seq_len, _ = x.shape
        gating_output, indices = self.router(x)
        final_output = torch.zeros_like(x)

        # Flatten the batch and sequence dimensions to treat each token independently
        flat_x = x.view(-1, x.size(-1))  # Now shape [batch_size * seq_len, n_embd]
        flat_gating_output = gating_output.view(-1, gating_output.size(-1))

        tokens_per_batch = batch_size * seq_len * self.top_k
        expert_capacity = int((tokens_per_batch / self.num_experts) * self.capacity_factor)

        updates = torch.zeros_like(flat_x)

        for i, expert in enumerate(self.experts):
            expert_mask = (indices == i).any(dim=-1)
            flat_mask = expert_mask.view(-1)
            selected_indices = torch.nonzero(flat_mask).squeeze(-1)

            limited_indices = selected_indices[:expert_capacity] if selected_indices.numel() > expert_capacity else selected_indices
            if limited_indices.numel() > 0:
                expert_input = flat_x[limited_indices]
                expert_output = expert(expert_input)

                gating_scores = flat_gating_output[limited_indices, i].unsqueeze(1)
                weighted_output = expert_output * gating_scores

                updates.index_add_(0, limited_indices, weighted_output)

        # Reshape updates to match the original dimensions of x
        final_output += updates.view(batch_size, seq_len, -1)

        return final_output

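A lot of tensor shape manipulation is needed to make sure the shapes line up (as is common in implementations like this), but the most important parts of the implementation fit in just a few lines of code. Let's zoom in on those.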

First, let's look at the expert capacity calculation.

expert_capacity = int((tokens_per_batch / self.num_experts) * self.capacity_factor)

This is very straightforward. It is placed inside the forward pass to handle the case where dynamic batch sizes are used.
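To illustrate why the calculation sits in the forward pass, here is a minimal sketch that recomputes the capacity for batches of different sizes. The helper name `capacity_for_batch` and the specific sizes are hypothetical; the arithmetic mirrors the forward pass above, including the `top_k` multiplier.

```python
def capacity_for_batch(batch_size, seq_len, top_k, num_experts, capacity_factor):
    # Each token is routed to top_k experts, so the number of assignments
    # is batch_size * seq_len * top_k, as in the forward pass above.
    tokens_per_batch = batch_size * seq_len * top_k
    return int((tokens_per_batch / num_experts) * capacity_factor)

# The capacity follows the batch size instead of being fixed at construction time.
for batch_size in (4, 16, 3):
    cap = capacity_for_batch(batch_size, seq_len=8, top_k=2,
                             num_experts=8, capacity_factor=1.25)
    print(batch_size, cap)
# 4 10
# 16 40
# 3 7
```

Note that `int()` truncates toward zero: for the batch of 3 sequences the unrounded capacity is 7.5, which becomes 7.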

The next important lines of code are:

limited_indices = selected_indices[:expert_capacity] if selected_indices.numel() > expert_capacity else selected_indices
if limited_indices.numel() > 0:
    # Remaining logic to process and accumulate weighted expert outputs for selected tokens.

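The `selected_indices` tensor identifies the tokens assigned to the i-th expert. If the total number of tokens assigned to that expert exceeds its capacity, the tensor is truncated to match the expert's maximum processing capacity. Otherwise, it is used as-is for the subsequent computation.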
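The truncation rule can be sketched with plain Python lists standing in for the index tensor. The `limit_to_capacity` helper and the index values are hypothetical and only illustrate the slicing behavior.

```python
def limit_to_capacity(selected_indices, expert_capacity):
    """Keep at most expert_capacity token indices, in their original order."""
    if len(selected_indices) > expert_capacity:
        return selected_indices[:expert_capacity]
    return selected_indices

# Flattened positions of the tokens routed to one expert.
selected = [0, 3, 4, 7, 9]
print(limit_to_capacity(selected, 3))  # [0, 3, 4]
print(limit_to_capacity(selected, 8))  # [0, 3, 4, 7, 9]
```

Because the indices come out in ascending order, the slice keeps the earliest positions in the flattened batch and drops the later ones. The choice depends on position, not on how strongly the router preferred this expert for each token.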

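These computations pass each selected token through the expert to get its output, then apply the corresponding gating value to obtain a weighted output. This weighted output is incrementally added into the final output tensor, which forms the overall output of the MoE layer.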
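Here is a toy, pure-Python sketch of that accumulation, using scalars in place of embedding vectors. Everything in it (the `moe_combine` helper, the two lambda experts, the routes and gate values) is made up for illustration; it is not the makeMoE code, only a simplified model of the same loop.

```python
def moe_combine(tokens, routes, gates, experts, expert_capacity):
    """Accumulate gate-weighted expert outputs, respecting a per-expert capacity.

    tokens:  list of scalar token values (stand-ins for embedding vectors)
    routes:  routes[t] is the list of expert ids chosen for token t
    gates:   gates[t][e] is the gating weight of expert e for token t
    experts: list of callables, one per expert
    """
    updates = [0.0] * len(tokens)
    for e, expert in enumerate(experts):
        selected = [t for t, chosen in enumerate(routes) if e in chosen]
        limited = selected[:expert_capacity]  # tokens past the capacity are dropped
        for t in limited:
            updates[t] += gates[t][e] * expert(tokens[t])
    return updates

experts = [lambda v: 2.0 * v, lambda v: v + 1.0]
tokens = [1.0, 2.0, 3.0]
routes = [[0], [0], [0, 1]]
gates = [{0: 1.0}, {0: 1.0}, {0: 0.5, 1: 0.5}]

print(moe_combine(tokens, routes, gates, experts, expert_capacity=3))  # [2.0, 4.0, 5.0]
print(moe_combine(tokens, routes, gates, experts, expert_capacity=2))  # [2.0, 4.0, 2.0]
```

With a capacity of 2, expert 0 is full by the time it reaches the third token, so that token only receives the contribution from expert 1. In this sketch the remaining gate weight is not renormalized, so the dropped contribution is simply missing from the output.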

The Jupyter notebook containing the implementation is here: https://github.com/AviSoori1x/makeMoE/blob/main/makeMoE_from_Scratch_with_Expert_Capacity.ipynb

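This approach to managing expert capacity is relatively basic. More advanced strategies have been explored in the literature, such as those discussed in Google's Switch Transformer paper, available here: https://arxiv.org/abs/2101.03961. Although the approach presented here simplifies capacity handling, it provides an intuitive introduction to the concept and makes makeMoE more complete!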
