Adding BetterTransformer support for new architectures

Want to add a new model to Better Transformer, the fast path of the PyTorch Transformer API? Follow this guide!

Models that should be supported

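In theory, any model that has a Transformer encoder layer, similar to the classic encoder described in the "Attention Is All You Need" paper, should be supported. More specifically, a model whose encoder block contains a multi-head attention module (with pre- or post-attention layer norm) should be convertible to its BetterTransformer equivalent. The model must meet all of the following conditions: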

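Uses a classic multi-head attention module (DeBERTa, for example, is not supported)
Uses either the gelu or the relu activation function
Has an even number of attention heads
Does not use any attention bias (T5, for example, uses attention bias and is therefore not supported)
Uses the same eps for the first and second layer norms of each layer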
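
The conditions that can be read off a model configuration could be turned into a quick pre-check. The following is only a sketch: `ToyConfig`, its field names and `unsupported_reasons` are hypothetical stand-ins chosen for illustration and are not part of optimum or transformers.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a model config; the field names are assumptions."""
    hidden_act: str
    num_attention_heads: int
    uses_attention_bias: bool
    norm1_eps: float
    norm2_eps: float


def unsupported_reasons(config: ToyConfig) -> list:
    """Return the conditions from this guide that the given config violates."""
    reasons = []
    if config.hidden_act not in ("gelu", "relu"):
        reasons.append(f"activation {config.hidden_act!r} is not gelu or relu")
    if config.num_attention_heads % 2 != 0:
        reasons.append("odd number of attention heads")
    if config.uses_attention_bias:
        reasons.append("attention bias is used")
    if config.norm1_eps != config.norm2_eps:
        reasons.append("layer norm eps values differ")
    return reasons


bert_like = ToyConfig("gelu", 12, False, 1e-12, 1e-12)
t5_like = ToyConfig("relu", 8, True, 1e-6, 1e-6)
print(unsupported_reasons(bert_like))
print(unsupported_reasons(t5_like))
```

Whether the attention module is a "classic" multi-head attention cannot be read from fields like these and still requires a look at the modeling code.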

How to convert a model into its BetterTransformer format?

Step 1: Identifying the source layer to change

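First, go to optimum/bettertransformer/__init__.py and look for the dictionary BetterTransformerManager.MODEL_MAPPING. It maps each model type to a Tuple[str, BetterTransformerBaseLayer], made up of the name of the nn.Module that can be converted to its BetterTransformer equivalent and the corresponding BetterTransformer layer class.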

Let's go through it step by step for Bert. First, we need to identify the layers that need to be replaced:

>>> from transformers import AutoModel

>>> model = AutoModel.from_pretrained("bert-base-uncased")
>>> print(model)
...
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (11): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)

You can clearly see that the layers to replace are the BertLayer modules, since each one contains a whole encoder layer.
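
Instead of reading the printed model by eye, the candidate layers can also be located by class name. The sketch below uses toy modules so that it runs without downloading a checkpoint; `ToyLayer`, `ToyModel` and `find_modules_by_class_name` are hypothetical names, and only `nn.Module.named_modules` is a real PyTorch API.

```python
import torch.nn as nn


class ToyLayer(nn.Module):
    """Stand-in for an encoder layer such as BertLayer."""

    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(4, 4)


class ToyModel(nn.Module):
    """Stand-in for a model with a stack of encoder layers and a pooler."""

    def __init__(self):
        super().__init__()
        self.layer = nn.ModuleList([ToyLayer() for _ in range(2)])
        self.pooler = nn.Linear(4, 4)


def find_modules_by_class_name(model: nn.Module, class_name: str) -> list:
    """Return the qualified names of all submodules whose class is `class_name`."""
    return [
        name
        for name, module in model.named_modules()
        if module.__class__.__name__ == class_name
    ]


print(find_modules_by_class_name(ToyModel(), "ToyLayer"))
```

On a real Bert model, the same helper called with "BertLayer" should list one entry per encoder layer.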

Step 2: Building the xxxLayerBetterTransformer module

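Check that the identified module is not already copied from another module: inspect the source code in transformers and check that the class definition does not start with # Copied from .... If it does not, create a class in bettertransformer/models/encoder_model.py. Start with the following lines: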

import torch
import torch.nn as nn

from ..base import BetterTransformerBaseLayer


class BertLayerBetterTransformer(BetterTransformerBaseLayer):
    def __init__(self, bert_layer, config):
        ...

Now make sure to fill in all the necessary attributes. The list of attributes is:

in_proj_weight
in_proj_bias
out_proj_weight
out_proj_bias
linear1_weight
linear1_bias
linear2_weight
linear2_bias
norm1_eps
norm1_weight
norm1_bias
norm2_weight
norm2_bias
num_heads
embed_dim

Note that these attributes correspond to all the components needed to run a Transformer encoder module; see Figure 1 of the "Attention Is All You Need" paper.

While filling in these attributes, note that the query, key and value layers sometimes need to be "contiguified"; check the modeling_encoder.py file to understand more.
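
As a minimal sketch of what this can look like, assuming it means stacking the separate query, key and value projections into the single in_proj_weight and in_proj_bias from the attribute list: the layers below are toy stand-ins, not the actual Bert modules, and the check at the end only confirms that the fused projection reproduces the separate ones.

```python
import torch
import torch.nn as nn

hidden_size = 8  # toy size for illustration; bert-base-uncased uses 768

# Stand-ins for the query, key and value projections of a self-attention module.
query = nn.Linear(hidden_size, hidden_size)
key = nn.Linear(hidden_size, hidden_size)
value = nn.Linear(hidden_size, hidden_size)

# Stack the three projections along the output dimension, in q, k, v order.
with torch.no_grad():
    in_proj_weight = nn.Parameter(torch.cat([query.weight, key.weight, value.weight]))
    in_proj_bias = nn.Parameter(torch.cat([query.bias, key.bias, value.bias]))

print(tuple(in_proj_weight.shape))  # (3 * hidden_size, hidden_size)
print(tuple(in_proj_bias.shape))    # (3 * hidden_size,)

# One fused projection, split in three, matches the separate projections.
x = torch.randn(2, 5, hidden_size)
q, k, v = nn.functional.linear(x, in_proj_weight, in_proj_bias).chunk(3, dim=-1)
print(torch.allclose(q, query(x), atol=1e-5))
```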

Also make sure to add the following lines:

self.is_last_layer = False
self.validate_bettertransformer()

Step 3: Building the forward pass

Start the forward pass with the line super().forward_checker(). This is required so that the parent class can run all of its safety checks first.

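After the first forward pass, the hidden states need to be nested using the attention mask. Once they are nested, the attention mask is no longer needed and can be set to None. This is how the forward pass is built for Bert. These lines should stay very similar across models, although the shape of the attention mask sometimes differs between models.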

super().forward_checker()

if hidden_states.is_nested:
    attention_mask = None

if attention_mask is not None:
    # attention mask comes in with values 0 and -inf. we convert to torch.nn.TransformerEncoder style bool mask
    # 0->false->keep this token -inf->true->mask this token
    attention_mask = attention_mask.bool()
    attention_mask = torch.reshape(attention_mask, (attention_mask.shape[0], attention_mask.shape[-1]))
    seqlen = attention_mask.shape[1]
    lengths = torch.sum(~attention_mask, 1)
    if not all([l == seqlen for l in lengths]):
        hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mask)
    attention_mask = None
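
To see what the mask conversion does, here is a small standalone sketch on a toy batch. The mask shape (batch, 1, 1, seq_len) and the variable names are assumptions for illustration. The sketch stops before the call to torch._nested_tensor_from_mask, which is a private PyTorch function.

```python
import torch

batch_size, seq_len = 2, 4

# Additive mask as described in the comments above: 0 keeps a token, -inf masks it.
additive_mask = torch.zeros(batch_size, 1, 1, seq_len)
additive_mask[1, 0, 0, 2:] = float("-inf")  # second sequence has 2 padding tokens

# Nonzero values (-inf) become True, meaning "mask this token".
bool_mask = additive_mask.bool()
bool_mask = torch.reshape(bool_mask, (bool_mask.shape[0], bool_mask.shape[-1]))

# Number of real (unmasked) tokens in each sequence.
lengths = torch.sum(~bool_mask, 1)

# Nesting is only needed when at least one sequence is shorter than seq_len.
needs_nesting = not all(int(length) == seq_len for length in lengths)

print(bool_mask.tolist())
print(lengths.tolist())
print(needs_nesting)
```

When every sequence already has the full length, `needs_nesting` is False and the hidden states can stay a regular padded tensor.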

Once hidden_states is nested, call torch._transformer_encoder_layer_fwd with the right arguments, as follows:

hidden_states = torch._transformer_encoder_layer_fwd(
    hidden_states,
    self.embed_dim,
    self.num_heads,
    self.in_proj_weight,
    self.in_proj_bias,
    self.out_proj_weight,
    self.out_proj_bias,
    self.use_gelu,
    self.norm_first,
    self.norm1_eps,
    self.norm1_weight,
    self.norm1_bias,
    self.norm2_weight,
    self.norm2_bias,
    self.linear1_weight,
    self.linear1_bias,
    self.linear2_weight,
    self.linear2_bias,
    attention_mask,
)

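In the last layer, it is important to "un-nest" the hidden states so that they can be processed by the following modules. This is done with these lines: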

if hidden_states.is_nested and self.is_last_layer:
    hidden_states = hidden_states.to_padded_tensor(0.0)
return (hidden_states,)
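
The effect of to_padded_tensor can be checked on a toy nested tensor. This sketch assumes a PyTorch version that provides torch.nested.nested_tensor (a prototype API that may emit a warning); the shapes are arbitrary.

```python
import torch

hidden_size = 4

# Two sequences of different lengths (3 and 2 tokens) stored as one nested tensor.
nested = torch.nested.nested_tensor(
    [torch.ones(3, hidden_size), torch.ones(2, hidden_size)]
)
print(nested.is_nested)

# Un-nest: pad every sequence to the longest one, filling with 0.0.
padded = nested.to_padded_tensor(0.0)
print(tuple(padded.shape))  # (batch, longest sequence, hidden_size)
print(padded[1, 2].tolist())  # the padded position of the shorter sequence
```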

Also make sure to return a tuple, to follow the transformers convention.

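The best way to reproduce this for your own model is to take inspiration from the provided modeling scripts. And of course, if you open an issue or a Pull Request on optimum, we will be happy to help you convert your model!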

Step 4: Sanity check!

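As a last step, make sure to update the BetterTransformerManager.MODEL_MAPPING dictionary in optimum/bettertransformer/__init__.py with the correct names, and you should be ready to convert your model. For Bert, for example, that would be: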

MODEL_MAPPING = {
  ...
  "bert": ("BertLayer", BertLayerBetterTransformer),
  ...
}
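
Before trying a conversion, the shape of a new entry can be checked with a few lines of plain Python. This is a sketch only: the two classes below are empty, hypothetical stand-ins for the real optimum classes, and `check_mapping` is not an optimum function.

```python
class BetterTransformerBaseLayer:
    """Hypothetical stand-in for the real base class, for illustration only."""


class BertLayerBetterTransformer(BetterTransformerBaseLayer):
    """Stand-in for the class built in Step 2."""


# Same shape as the entry above: model type -> (module name, layer class).
MODEL_MAPPING = {
    "bert": ("BertLayer", BertLayerBetterTransformer),
}


def check_mapping(mapping: dict) -> None:
    """Raise TypeError if an entry is not a (str, base-layer subclass) pair."""
    for model_type, (module_name, layer_class) in mapping.items():
        if not isinstance(module_name, str):
            raise TypeError(f"{model_type}: module name must be a str")
        if not (isinstance(layer_class, type)
                and issubclass(layer_class, BetterTransformerBaseLayer)):
            raise TypeError(f"{model_type}: layer class must subclass the base layer")


check_mapping(MODEL_MAPPING)
module_name, layer_class = MODEL_MAPPING["bert"]
print(module_name, layer_class.__name__)
```

A common mistake this would catch is registering the module class itself instead of its name as the first element of the tuple.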

Try it out with the conversion method presented in the tutorials section!
