Making LLMs Smaller Without Breaking Them: A GLU-Aware Pruning Approach

Community article, published November 24, 2024

Pere Martra

oopere

TL;DR

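Pruning is a key technique for creating small language models, but a successful pruning process requires understanding the structure of the target model.

This article shows how to prune MLP layers that follow a Gated Linear Unit (GLU) structure, which is the case for many current models such as LLaMA 3.2, Gemma, Mistral, and QWen.

By preserving the GLU structure during pruning, you can significantly reduce model size while keeping output generation coherent and retaining surprisingly high accuracy on tasks such as BoolQ.

Explore the notebook, experiment with the pruned models, create your own, and share your feedback!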

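Introduction.

As large language models keep growing to achieve greater capabilities, the need for smaller, more efficient versions has become increasingly pressing. Shrinking a model without losing its core functionality, however, is a delicate balancing act. Techniques such as quantization and pruning are commonly used to reduce model size, while methods such as knowledge distillation or transfer learning help retain or recover capabilities lost along the way.

Among these techniques, pruning is one of the most effective strategies for reducing model size. Unlike quantization, which simplifies the numerical representation, pruning removes specific parts of the model, such as neurons or entire layers. This effectiveness comes at a cost: pruning is hard to apply correctly. You need to identify which part of the model to prune, and you also have to choose carefully which elements to remove so that the impact on the model's capabilities is as small as possible.

This article focuses on structured width pruning, in which selected neurons are removed, and shows how to apply it effectively to MLP layers with a Gated Linear Unit (GLU) structure. By following the steps outlined, you will see how pruning can significantly reduce model size while preserving the model's ability to generate coherent output and to perform well on key benchmarks.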

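What is pruning, and how does it affect the model?

As explained above, pruning removes the parts of the model that are believed to contribute least to the final output. By carefully selecting these less critical components, pruning aims to produce an efficient model with fewer parameters and lower computational requirements, without sacrificing its core capabilities.

The main challenge in pruning is deciding which parts of the model to remove. Not all parts of a model affect performance equally; each one has its own role.

To illustrate this, let's examine the structure of the model used in this article: LLaMA 3.2-1B.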

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)

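Looking at the structure, we can identify three main blocks as pruning targets: the embeddings, the self-attention mechanism, and the MLP layers. To decide which of these blocks the pruning process should focus on, it is essential to understand both the potential benefits and the likely impact on the model.

The first step is to assess how much of the model each of these parts accounts for, which tells us how much size reduction is possible.

Parameter distribution analysis.

Embeddings and output layer (embed_tokens, lm_head): 128256 × 2048 ≈ 262M parameters each, about 524M for the two.
Self-attention mechanism (self_attn): 16 layers, each with four projection sublayers. Each layer holds about 2048 × (2048 + 512 + 512 + 2048) ≈ 10.5M parameters. Across 16 layers: 10.5 × 16 ≈ 168M parameters.
MLP layers (mlp): also 16 layers; because they follow the GLU structure, each contains gate_proj, up_proj, and down_proj. Each layer holds about 2048 × 8192 + 2048 × 8192 + 8192 × 2048 ≈ 50.3M parameters. Across 16 layers: 50.3 × 16 ≈ 805M parameters.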
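
The arithmetic above can be reproduced in a few lines of Python. This is only a rough sketch built from the shapes printed in the model structure: the constant and function names are made up for illustration, the small RMSNorm weights are ignored, and embed_tokens and lm_head are counted separately, as in the list.

```python
# Shapes read from the printed LLaMA 3.2-1B structure.
VOCAB, HIDDEN, KV, INTERMEDIATE, LAYERS = 128256, 2048, 512, 8192, 16


def count_parameters():
    """Approximate parameter count per block (norm weights ignored)."""
    embeddings = 2 * VOCAB * HIDDEN  # embed_tokens + lm_head
    attention = LAYERS * HIDDEN * (HIDDEN + KV + KV + HIDDEN)  # q, k, v, o projections
    mlp = LAYERS * 3 * HIDDEN * INTERMEDIATE  # gate_proj, up_proj, down_proj
    return {"embeddings": embeddings, "attention": attention, "mlp": mlp}


counts = count_parameters()
total = sum(counts.values())
for name, value in counts.items():
    print(f"{name:<11}{value / 1e6:8.1f}M  {100 * value / total:5.1f}%")
```

Counted this way, the MLP layers come out at a little over half of the total.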

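Impact analysis.

As we have seen, the MLP layers account for more than 50% of the model's size, which makes them a clear candidate for pruning. Before making that decision, however, it is essential to understand what each part contributes to the model's behavior.

The embedding layer converts the input into dense vector representations that the model can process effectively. Pruning it can cause the model to lose its ability to understand certain words, or at least reduce its ability to create vectors that correctly capture the semantic meaning of the input. If you want to create a highly specialized model that uses only a very specific part of its input vocabulary, for example a model for financial or medical analysis, pruning this layer may be an option.

The attention mechanism lets the model focus on the most relevant parts of the input sequence when processing each token. It computes a weighted importance score between every pair of tokens in the input sequence, which allows the model to capture context and concentrate on relevant information. Pruning this part can reduce the model's ability to perform tasks that require a broad understanding of the input context, such as text summarization or translation. It also affects the coherence of the generated text.

The MLP layers accompany the attention mechanism and, through a series of expansions and contractions of the data, improve the model's ability to understand complex patterns. Pruning this part can limit the model's response to unseen data or to tasks not covered during training. In other words, it reduces the model's ability to generalize and to respond coherently to unfamiliar inputs.

Once you have decided which part of the model to target, the next step is to choose between width pruning (removing individual neurons) and depth pruning (removing entire layers). As you can see, pruning a model is a fairly complex process involving many decisions. You have to evaluate not only the capabilities of the resulting model but also how well it can be trained. These models are meant to be fine-tuned, usually for a specific task, so that they can perform the task they were created for more effectively and efficiently than the base model.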

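Characteristics of Gated Linear Units

The Gated Linear Unit (GLU) architecture is common in modern neural networks, including LLaMA and similar large language models. GLU introduces an element-wise gating mechanism that lets the model selectively filter and control the flow of information. The architecture consists of coupled layers, typically gate_proj, up_proj, and down_proj (as in the model structure shown above), which work together to expand and contract the data.

This mechanism allows the model to handle more complex patterns while remaining efficient. It also means that the layers in a GLU structure are tightly coupled, so pruning them requires careful consideration.

Any operation on one layer, such as removing neurons, must be mirrored in its paired layers. For example, if a neuron is removed from gate_proj, the same neuron must also be removed from up_proj, and the size of down_proj must be adjusted accordingly. Most importantly, when computing neuron importance to decide which neurons to keep, you need to evaluate each pair of neurons together.

Breaking the balance between these layers can degrade performance or even make the model fail completely, even when only a small fraction of the neurons is removed.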
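
To make the coupling concrete, here is a toy sketch of a GLU block written with plain Python lists instead of tensors, so it runs without PyTorch. It follows the down_proj(SiLU(gate_proj(x)) * up_proj(x)) pattern used by LlamaMLP; the helper names and the tiny weights are invented for illustration. It shows that removing the same neuron from all three matrices gives exactly the output of the full block with that neuron silenced.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(weight, x):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def glu_mlp(x, gate_w, up_w, down_w):
    """down_proj(SiLU(gate_proj(x)) * up_proj(x))."""
    gate = [silu(g) for g in matvec(gate_w, x)]
    up = matvec(up_w, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(down_w, hidden)


def prune(gate_w, up_w, down_w, keep):
    """Keep the same intermediate neurons in all three matrices."""
    return (
        [gate_w[i] for i in keep],                   # rows of gate_proj
        [up_w[i] for i in keep],                     # rows of up_proj
        [[row[i] for i in keep] for row in down_w],  # columns of down_proj
    )


# Toy sizes: hidden size 2, intermediate size 3.
x = [1.0, -2.0]
gate_w = [[0.5, -1.0], [2.0, 0.25], [-0.75, 1.5]]
up_w = [[1.0, 0.5], [-0.5, 2.0], [0.25, -1.0]]
down_w = [[1.0, -2.0, 0.5], [0.5, 1.5, -1.0]]

keep = [0, 2]  # drop intermediate neuron 1
pruned_out = glu_mlp(x, *prune(gate_w, up_w, down_w, keep))

# Reference: the full-size block with neuron 1 zeroed out in down_proj.
masked_down = [[w if i in keep else 0.0 for i, w in enumerate(row)] for row in down_w]
masked_out = glu_mlp(x, gate_w, up_w, masked_down)

matches = all(math.isclose(a, b) for a, b in zip(pruned_out, masked_out))
print(matches)
```

If only gate_proj were pruned, the element-wise product with up_proj would no longer line up, which is the kind of mismatch the paragraph above warns about.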

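Pruning a Llama 3.2 model (GLU).

This example uses a Llama model for the demonstration, but the code has also been tested successfully on Gemma and QWen models.

You can find the complete code in my GitHub repository. In this article I show only the code related to the pruning process and omit some supporting functions. The notebook also includes code for evaluating the model and uploading it to the Hugging Face Hub.

The first thing I do with the original model is run a short prompt in memory and save the result. This gives me an easy, visual, and quick way to check whether a model produced by the pruning process is still coherent or has lost its ability to generate intelligible text. I can assure you that on my first attempt, which did not respect the model's GLU structure, the generated text left no doubt that the pruning process was fundamentally flawed. The original prompt was: "Paris is ..." Let's look at the original model's response and compare it with the response from my first pruning attempt.

Base model

"Paris is the capital of France and one of the most visited cities in the world. It is a city of art, culture, fashion, and gastronomy. The city has a long history and is home to many famous landmarks, including the Eiffel Tower."

First attempt, pruning only 20%

"Paris is the capital of France. This is the the the the main area. This is the the the the the the the the the the the the the the the city."

Clearly something went wrong in the first attempt. It may seem trivial, but an empirical check like this can save you a considerable amount of time.

Implementation details.

Let's start with the function that computes neuron importance, which ultimately decides which neurons stay in the model and which are removed.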

import torch
from torch import nn

def compute_neuron_pair_importance(gate_weight, up_weight):
  """
  Compute neuron pair importance scores (Maximum Absolute Weight).

  Args:
  - gate_weight: Weight matrix from the gate_proj layer.
  - up_weight: Weight matrix from the up_proj layer.

  Returns:
  - importance_scores: Importance scores for each neuron pair.
  """

  # For each neuron (row), add its largest weight to the absolute value of its smallest weight.
  gate_max_abs = torch.max(gate_weight, dim=1).values + torch.abs(torch.min(gate_weight, dim=1).values)
  up_max_abs = torch.max(up_weight, dim=1).values + torch.abs(torch.min(up_weight, dim=1).values)
  # Combine both layers into a single score per neuron pair.
  importance_scores = gate_max_abs + up_max_abs
  return importance_scores

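The function receives the weights of the gate_proj and up_proj layers which, as explained above, work as a pair. The importance of a neuron therefore has to be computed jointly.

The calculation is straightforward: for each neuron it adds the largest weight to the absolute value of the smallest weight. Both positive and negative values are taken into account because, in theory, neurons with the most extreme values have a greater effect on the model's output, since they change the values passing through them more strongly.

Here I have to thank Mariusz Kurman for his contribution of including the minimum values in the calculation. The method works without them, but including them improved the results.

Importance is computed separately for each layer, but the function returns the combined value.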
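
A tiny worked example may help. The sketch below mirrors the same per-row formula with plain Python lists standing in for the weight tensors; the function name and the toy weights are invented for illustration and are not part of the notebook.

```python
def pair_importance(gate_rows, up_rows):
    """One score per neuron pair: (max + |min|) of the gate row plus the same for the up row."""
    def spread(row):
        return max(row) + abs(min(row))

    return [spread(g) + spread(u) for g, u in zip(gate_rows, up_rows)]


# Two neurons, each with three input weights.
gate = [[0.5, -2.0, 1.0], [0.25, 0.5, -0.25]]
up = [[1.5, 0.0, -0.5], [0.5, -0.25, 0.125]]

scores = pair_importance(gate, up)
print(scores)  # prints [5.0, 1.5]
```

Neuron 0 scores (1.0 + 2.0) + (1.5 + 0.5) = 5.0 and neuron 1 scores (0.5 + 0.25) + (0.5 + 0.25) = 1.5, so neuron 1 would be the first to go.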

The next function creates the new layers and merges them into the model, replacing the original ones.

# Prunes a specific percentage of neurons from the MLP (feed-forward layers).
def prune_neuron_pairs(mlp, prune_percent):
    """
    Reduces the dimensions of the **gate_proj**,**up_proj**, **down_proj**
    layers removing the least important neurons.

    Args:
    - mlp: Layers to prune.
    - prune_percent: Percentage of neurons to prune.

    Returns:
    - new_gate_proj, new_up_proj, new_down_proj:  New pruned layers.
    - k: New intermediate size.

    """
    # Extract the weights from the MLP layers
    #  these weights are used to calculate each neuron's
    #  importance score in the next step.
    gate_weight = mlp.gate_proj.weight.data.float()
    up_weight = mlp.up_proj.weight.data.float()

    # Compute importance scores. Neurons with higher importance scores
    # are considered more important and less likely to be pruned.
    importance_scores = compute_neuron_pair_importance(gate_weight, up_weight)

    #Store the original number of neurons in the intermediate layer.
    original_intermediate_size = gate_weight.size(0)
    #Computes the number of neurons to prune.
    num_neuron_pairs_to_prune = min(int(prune_percent * original_intermediate_size), original_intermediate_size - 1)
    #Calculate the number of neurons to keep. The new intermediate size.
    k = original_intermediate_size - num_neuron_pairs_to_prune

    #Just check that there is no big error calculating k. We can't prune all the neurons.
    if k <= 0:
        raise ValueError(f"Invalid number of neuron pairs to keep: {k}. Adjust the prune_percent.")

    _, indices_to_keep = torch.topk(importance_scores, k, largest=True, sorted=True)
    indices_to_keep = indices_to_keep.sort().values

    # Create the new layers. `device` is assumed to be defined earlier in the notebook (not shown here).
    new_gate_proj = nn.Linear(mlp.gate_proj.in_features, k, bias=False).to(device)
    new_up_proj = nn.Linear(mlp.up_proj.in_features, k, bias=False).to(device)
    new_down_proj = nn.Linear(k, mlp.down_proj.out_features, bias=False).to(device)

    #copy weights to the new layers.
    new_gate_proj.weight.data = mlp.gate_proj.weight.data[indices_to_keep, :]
    new_up_proj.weight.data = mlp.up_proj.weight.data[indices_to_keep, :]
    new_down_proj.weight.data = mlp.down_proj.weight.data[:, indices_to_keep]

    #return new layers and intermediate size.
    return new_gate_proj, new_up_proj, new_down_proj, k

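This function is a bit more involved. It receives the MLP block of a layer and the pruning percentage to apply. By calling compute_neuron_pair_importance, it determines which neurons to keep.

Let's break it down step by step.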

  # Extract the weights from the MLP layers
  #  these weights are used to calculate each neuron's
  #  importance score in the next step.
  gate_weight = mlp.gate_proj.weight.data.float()
  up_weight = mlp.up_proj.weight.data.float()

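These two lines retrieve the weights of the current layer's gate_proj and up_proj, converted to float32 for the score calculation.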

importance_scores = compute_neuron_pair_importance(gate_weight, up_weight)

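This yields a tensor with one importance score per neuron pair. The scores reflect how much each neuron contributes to the final output and indicate which neurons should be kept.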

 #Store the original number of neurons in the intermediate layer.
 original_intermediate_size = gate_weight.size(0)
 #Computes the number of neurons to prune.
 num_neuron_pairs_to_prune = min(int(prune_percent * original_intermediate_size), original_intermediate_size - 1)
 #Calculate the number of neurons to keep. The new intermediate size.
 k = original_intermediate_size - num_neuron_pairs_to_prune

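The number of neurons to keep is computed from the given pruning percentage and the original size of the layer. Because gate_proj and up_proj have the same size, reading the size from one of them is enough. The min(...) call guarantees that at least one neuron always remains. The result, k, is the new intermediate size.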
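
As a quick check of that arithmetic, the sketch below applies the same formula to the 8192-neuron intermediate layer of LLaMA 3.2-1B. The helper name is made up for illustration. Note that the formula multiplies directly, so prune_percent is expected as a fraction such as 0.2 rather than 20.

```python
def new_intermediate_size(original_size, prune_percent):
    """Return (neurons_to_prune, neurons_to_keep) using the same formula as prune_neuron_pairs."""
    to_prune = min(int(prune_percent * original_size), original_size - 1)
    return to_prune, original_size - to_prune


for percent in (0.2, 0.4, 0.6, 1.0):
    pruned, kept = new_intermediate_size(8192, percent)
    print(f"prune_percent={percent}: prune {pruned}, keep {kept}")
```

Because int() truncates, 20% of 8192 (1638.4) removes 1638 neurons and keeps 6554. Even with prune_percent=1.0 one neuron survives, thanks to the min(...) guard.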

 # Select the neurons to keep by obtaining their indices.
 _, indices_to_keep = torch.topk(importance_scores, k, largest=True, sorted=True)
 indices_to_keep = indices_to_keep.sort().values

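These lines are the crux. torch.topk retrieves the k neurons with the highest importance scores, sorted from most to least important. Because the indices come back ordered by descending score, sort is then used to put them in ascending order, which is what we need: the surviving neurons keep their original relative order in the weight matrices.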
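
The selection step can be illustrated without tensors. The following is a plain-Python stand-in for the topk-then-sort combination, with an invented function name and toy scores; it is meant only to show the effect of the two calls, not to replace them.

```python
def top_k_indices(scores, k):
    """Indices of the k highest scores, returned in ascending index order."""
    # Rank positions by score, highest first (the order topk would report).
    by_score = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the best k, then restore their original positional order.
    return sorted(by_score[:k])


scores = [2.5, 0.1, 3.2, 0.9, 1.7]
kept = top_k_indices(scores, 3)
print(kept)  # prints [0, 2, 4]
```

Ranked by score, the three best neurons are 2, 0, and 4; after sorting they are indexed as 0, 2, 4, the order in which they appear in the original layer.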

With the indices computed, the new layers are created.

 #create the new layers
 new_gate_proj = nn.Linear(mlp.gate_proj.in_features, k, bias=False).to(device)
 new_up_proj = nn.Linear(mlp.up_proj.in_features, k, bias=False).to(device)
 new_down_proj = nn.Linear(k, mlp.down_proj.out_features, bias=False).to(device)

 #copy weights to the new layers.
 new_gate_proj.weight.data = mlp.gate_proj.weight.data[indices_to_keep, :]
 new_up_proj.weight.data = mlp.up_proj.weight.data[indices_to_keep, :]
 new_down_proj.weight.data = mlp.down_proj.weight.data[:, indices_to_keep]

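First, three new layers are created with dimensions adjusted to the selected indices. In `new_gate_proj` and `new_up_proj` the input dimension stays the same and the output dimension shrinks. In `new_down_proj` it is the other way around: the input dimension is adjusted and the output dimension stays the same.

The new layers start out with default-initialized weights. The last lines overwrite them with the relevant weights from the original layers, so that only the weights of the selected neurons are kept.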

    #return new layers and intermediate size.
    return new_gate_proj, new_up_proj, new_down_proj, k

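Finally, the new layers are returned, together with the new intermediate size.

Now let's look at the function that iterates over all the layers and builds the modified model.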

#Iterates through the model layers and applies pruning.
def update_model(model, prune_percent):
   """
   Modifies each MLP layer in the model to retain only the most
   important neurons, creating a smaller version of each pruned layer.

   Args:
   - model: Model to prune.
   - prune_percent: Percentage of neurons to prune.

   Returns:
   - model: New pruned model.
   """
   new_intermediate_size = None

   #loop for each model layer.
   for idx, layer in enumerate(model.model.layers):
       # Since each layer is a LlamaDecoderLayer, it contains multiple components:
       # attention, MLP and layer norms. We target the MLP component
       # by accessing layer.mlp.
       mlp = layer.mlp

       # Call prune_neuron_pairs with the MLP and receive the pruned layers.
       new_gate_proj, new_up_proj, new_down_proj, new_size = prune_neuron_pairs(mlp, prune_percent)

       # Replace the original layers with the pruned layers.
       mlp.gate_proj = new_gate_proj
       mlp.up_proj = new_up_proj
       mlp.down_proj = new_down_proj

       #new_intermediate_size only needs to be set once
       if new_intermediate_size is None:
           new_intermediate_size = new_size

   #Update the model config file.
   model.config.intermediate_size = new_intermediate_size

   return model

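Arguably, this function is simple. It takes the model and the pruning percentage as input. It iterates over every layer of the model and extracts the MLP block from each one. It then calls prune_neuron_pairs and replaces the model's layers with the ones that function returns.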

       # Call prune_neuron_pairs with the MLP and receive the pruned layers.
       new_gate_proj, new_up_proj, new_down_proj, new_size = prune_neuron_pairs(mlp, prune_percent)

       # Replace the original layers with the pruned layers.
       mlp.gate_proj = new_gate_proj
       mlp.up_proj = new_up_proj
       mlp.down_proj = new_down_proj

Finally, it also updates a value in the model configuration: intermediate_size, which is set to new_intermediate_size.

   #Update the model config file.
   model.config.intermediate_size = new_intermediate_size

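If the configuration is not updated, the model cannot be used after it has been saved, whether on Hugging Face or locally. Many libraries, including Hugging Face Transformers, rely on model.config to interpret the model's architecture. If the configuration does not match the actual structure, fine-tuning or inference through those libraries may fail.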
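
One way to catch this kind of mismatch before saving is a small consistency check. The sketch below is a hypothetical helper, not part of the notebook: it only reads the attributes used elsewhere in this article (model.config.intermediate_size, model.model.layers, and the in_features/out_features of the three projections), and it is exercised here with stand-in objects rather than a real model.

```python
from types import SimpleNamespace


def check_intermediate_size(model):
    """Return the indices of decoder layers whose MLP shapes disagree with model.config."""
    expected = model.config.intermediate_size
    mismatched = []
    for idx, layer in enumerate(model.model.layers):
        mlp = layer.mlp
        sizes = (mlp.gate_proj.out_features, mlp.up_proj.out_features, mlp.down_proj.in_features)
        if any(size != expected for size in sizes):
            mismatched.append(idx)
    return mismatched


# Stand-in objects that only mimic the attributes read above (not a real model).
def fake_linear(in_features, out_features):
    return SimpleNamespace(in_features=in_features, out_features=out_features)


def fake_model(config_size, layer_size, num_layers=2):
    layers = [
        SimpleNamespace(mlp=SimpleNamespace(
            gate_proj=fake_linear(2048, layer_size),
            up_proj=fake_linear(2048, layer_size),
            down_proj=fake_linear(layer_size, 2048),
        ))
        for _ in range(num_layers)
    ]
    return SimpleNamespace(
        config=SimpleNamespace(intermediate_size=config_size),
        model=SimpleNamespace(layers=layers),
    )


stale = check_intermediate_size(fake_model(config_size=8192, layer_size=6554))
updated = check_intermediate_size(fake_model(config_size=6554, layer_size=6554))
print(stale, updated)  # prints [0, 1] []
```

An empty list means every MLP block agrees with the configuration; any index in the list points at a layer that would not match the saved config.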

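Analysis of the results.

With this code I created several models, which are available on the Hugging Face Hub.

They include:

Three models derived from Llama-3.2-1b, with 20%, 40%, and 60% of the neurons in their MLP layers pruned.
One model based on Gemma-2-2B, pruned by 40%.

You can download these models and, besides using them, study their architecture and how it differs from the original models they are based on.

Let's analyze how the architecture changes after applying 20% pruning to the Llama3.2-1b model.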

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=6554, bias=False)
          (up_proj): Linear(in_features=2048, out_features=6554, bias=False)
          (down_proj): Linear(in_features=6554, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)

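The structure of the model is unchanged except for the intermediate size in the MLP blocks. As you can see, the output features of gate_proj and up_proj dropped from 8192 to 6554, and down_proj changed by the same amount, but in its input features.

This matches exactly what the code does: it modifies these layers while keeping the neurons that matter most for the model's performance. Keeping 80% of 8192 would give 6553.6 neurons. The code truncates the number to prune, int(0.2 × 8192) = 1638, and keeps 8192 − 1638 = 6554, which confirms that the right proportion of neurons was pruned.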
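
For a sense of scale, the sketch below estimates how many MLP parameters the 20% pruning removes, using only the shapes from the two printed structures (hidden size 2048, 16 layers, intermediate size 8192 before and 6554 after). The names are made up for illustration.

```python
HIDDEN, LAYERS = 2048, 16


def mlp_parameters(intermediate_size):
    """Parameters in gate_proj, up_proj and down_proj across all decoder layers."""
    return LAYERS * 3 * HIDDEN * intermediate_size


before = mlp_parameters(8192)
after = mlp_parameters(6554)
saved = before - after
print(f"MLP parameters: {before / 1e6:.1f}M -> {after / 1e6:.1f}M "
      f"({saved / 1e6:.1f}M fewer, {saved / before:.1%} of the MLP block)")
```

This counts only the three projections in each MLP block; the embeddings and attention layers are untouched by this kind of pruning.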

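Now let's see how the pruned model responds to the test prompt: Paris is the capital of France. It is also one of the most beautiful cities in the world. There is so much to see and do in Paris that it is impossible to cover it all in one day. However, there are some things you

The response is not identical to the original model's, but it remains coherent. This suggests that the model retains most of its capabilities and, more importantly, that it may be able to recover whatever was lost through processes such as knowledge distillation or fine-tuning.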

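Beyond this empirical check, I also evaluated the models on some of the most common benchmarks. Let's analyze how different degrees of pruning affect the model's performance.

The results show that the effect of pruning is somewhat asymmetric. On the task evaluated by BoolQ there is no significant degradation: only about a 2% drop for the model that lost 40% of the neurons in its MLP layers.

The impact on the Lambada test, by contrast, is substantial, with accuracy dropping by more than 50%. This suggests that the model retains most of its comprehension ability but struggles on tests that require more open-ended generation.

BoolQ simply gives the model a passage and a question that must be answered with yes or no. It is a task focused on measuring the model's ability to understand relationships within the input text.

Lambada, on the other hand, asks the model to guess the last word of a passage, a demanding task in which that final word tests the model's ability at complex language modeling.

These results are consistent with the role of the MLP layers that were pruned.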

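Conclusions.

The pruning process was a success. This way of handling GLU layers lets us prune the model while retaining most of its capabilities, significantly reducing its size and resource consumption.

It is worth noting that the test results were obtained before any capability-recovery process, such as knowledge distillation or fine-tuning, which is usually applied after pruning a model.

Future work.

There are many pruning techniques worth exploring. Perhaps the most immediate is depth pruning, which removes the layers that contribute least to the model's performance.

Another important line of research is to apply knowledge distillation to these pruned models and evaluate whether they retain the ability to learn new tasks. This could bring their performance closer to that of the base model, especially on the benchmarks where the pruned models show the largest losses.

Developing smaller, more efficient models remains an attractive field, particularly for companies that want to deploy LLM capabilities without heavy infrastructure requirements. This work lays a foundation for further research into making these powerful models easier to access and deploy.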
