DistilBERT

Overview

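The DistilBERT model was proposed in the blog post "Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT" and in the paper "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than google-bert/bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.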
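The 40% figure can be sanity-checked with simple arithmetic on the two architectures. The sketch below counts parameters from hyperparameters that are not stated on this page; they are assumed from the published bert-base-uncased and distilbert-base-uncased configurations (vocabulary size 30522, hidden size 768, feed-forward size 3072, 512 positions, 12 versus 6 layers, and a pooler plus segment embeddings only on the BERT side). The helper name count_encoder_params is hypothetical.

```python
def count_encoder_params(vocab_size, hidden, ffn, max_positions, num_layers,
                         type_vocab_size=0, with_pooler=False):
    """Count weights and biases of a BERT-style encoder (no task head)."""
    # Embeddings: word + position (+ optional segment) tables, then one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden + 2 * hidden
    # Self-attention: Q, K, V and output projections, each hidden x hidden plus bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward block: hidden -> ffn -> hidden, with biases.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per layer, each with a scale and a shift vector.
    layer = attention + feed_forward + 2 * (2 * hidden)
    # Optional pooler: one dense layer on top of the first token.
    pooler = hidden * hidden + hidden if with_pooler else 0
    return embeddings + num_layers * layer + pooler


# Assumed hyperparameters (not taken from this page).
bert = count_encoder_params(30522, 768, 3072, 512, 12, type_vocab_size=2, with_pooler=True)
distilbert = count_encoder_params(30522, 768, 3072, 512, 6)
reduction = 1 - distilbert / bert

print(f"BERT base: {bert:,}  DistilBERT: {distilbert:,}  reduction: {reduction:.1%}")
```

Under these assumptions the count comes out to roughly 66M versus 109M parameters, a reduction of about 39%, in line with the 40% quoted above.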

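The abstract from the paper is the following:

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

This model was contributed by victorsanh. The JAX version of this model was contributed by kamalkraj. The original code can be found here.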

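Usage tips

- DistilBERT doesn't have token_type_ids, so you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or [SEP]).
- DistilBERT doesn't have an option to select the input positions (position_ids input). This could be added if necessary; let us know if you need this option.
- Same as BERT but smaller. Trained by distillation of the pretrained BERT model, meaning it has been trained to predict the same probabilities as the larger model. The actual objective is a combination of:
  - finding the same probabilities as the teacher model
  - predicting the masked tokens correctly (but with no next-sentence prediction objective)
  - a cosine similarity between the hidden states of the student and the teacher model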
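The three-part objective listed above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' training script: the function name triple_loss, the temperature, and the three weights are hypothetical, the student and teacher are assumed to share the same hidden size, and the distillation and cosine terms are computed over all positions for simplicity.

```python
import torch
import torch.nn.functional as F


def triple_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                mlm_labels, temperature=2.0, w_distill=1.0, w_mlm=1.0, w_cos=1.0):
    """Hypothetical combination of distillation, masked-LM and cosine losses."""
    vocab = student_logits.size(-1)
    s_logits = student_logits.reshape(-1, vocab)
    t_logits = teacher_logits.reshape(-1, vocab)

    # 1) Distillation: match the teacher's temperature-softened probabilities.
    loss_distill = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # 2) Masked language modeling: cross-entropy on masked positions only
    #    (positions labeled -100 are ignored).
    loss_mlm = F.cross_entropy(s_logits, mlm_labels.reshape(-1), ignore_index=-100)

    # 3) Cosine loss: align student and teacher hidden-state directions.
    dim = student_hidden.size(-1)
    s_h = student_hidden.reshape(-1, dim)
    t_h = teacher_hidden.reshape(-1, dim)
    target = torch.ones(s_h.size(0), device=s_h.device)  # +1 means "make similar"
    loss_cos = F.cosine_embedding_loss(s_h, t_h, target)

    return w_distill * loss_distill + w_mlm * loss_mlm + w_cos * loss_cos


# Toy tensors standing in for real model outputs.
torch.manual_seed(0)
batch, seq_len, vocab, dim = 2, 5, 11, 8
student_logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
teacher_logits = torch.randn(batch, seq_len, vocab)
student_hidden = torch.randn(batch, seq_len, dim, requires_grad=True)
teacher_hidden = torch.randn(batch, seq_len, dim)
mlm_labels = torch.full((batch, seq_len), -100, dtype=torch.long)
mlm_labels[:, 1] = torch.tensor([3, 7])  # one masked position per sequence

loss = triple_loss(student_logits, teacher_logits, student_hidden, teacher_hidden, mlm_labels)
loss.backward()
print(loss.item())
```

Only the student tensors receive gradients here, which mirrors the idea that the teacher stays fixed while the student is trained.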

Using Scaled Dot Product Attention (SDPA)

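PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.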
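As a rough illustration of what the operator computes, the sketch below compares torch.nn.functional.scaled_dot_product_attention with a hand-written "eager" attention, softmax(QK^T / sqrt(d)) V, on random tensors. The shapes are arbitrary assumptions chosen for the example, and no attention mask or dropout is used.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Arbitrary toy shapes: (batch, heads, sequence length, head dimension).
q = torch.randn(2, 4, 8, 16)
k = torch.randn(2, 4, 8, 16)
v = torch.randn(2, 4, 8, 16)

# Eager attention: explicit scores, softmax over keys, weighted sum of values.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
eager_out = torch.softmax(scores, dim=-1) @ v

# Native operator: same math, with the backend chosen by PyTorch.
sdpa_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(eager_out, sdpa_out, atol=1e-4))
```

The two outputs should agree up to floating-point tolerance; the practical difference is that the native operator can dispatch to fused kernels.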

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA.

import torch
from transformers import DistilBertModel

# Load in half precision and explicitly request the SDPA attention implementation
model = DistilBertModel.from_pretrained("distilbert-base-uncased", torch_dtype=torch.float16, attn_implementation="sdpa")

For the best speedups, we recommend loading the model in half precision (e.g. torch.float16 or torch.bfloat16).

On a local benchmark (NVIDIA GeForce RTX 2060-8GB, PyTorch 2.3.1, OS Ubuntu 20.04) with float16 and the distilbert-base-uncased model with a MaskedLM head, we saw the following speedups during training and inference.

Training

num_training_steps	batch_size	seq_len	is cuda	Time per batch (eager - s)	Time per batch (sdpa - s)	Speedup (%)	Eager peak mem (MB)	sdpa peak mem (MB)	Mem saving (%)
100	1	128	False	0.010	0.008	28.870	397.038	399.629	-0.649
100	1	256	False	0.011	0.009	20.681	412.505	412.606	-0.025
100	2	128	False	0.011	0.009	23.741	412.213	412.606	-0.095
100	2	256	False	0.015	0.013	16.502	427.491	425.787	0.400
100	4	128	False	0.015	0.013	13.828	427.491	425.787	0.400
100	4	256	False	0.025	0.022	12.882	594.156	502.745	18.182
100	8	128	False	0.023	0.022	8.010	545.922	502.745	8.588
100	8	256	False	0.046	0.041	12.763	983.450	798.480	23.165

Inference

num_batches	batch_size	seq_len	is cuda	is half	use mask	Per token latency eager (ms)	Per token latency SDPA (ms)	Speedup (%)	Mem eager (MB)	Mem SDPA (MB)	Mem saved (%)
50	2	64	True	True	True	0.032	0.025	28.192	154.532	155.531	-0.642
50	2	128	True	True	True	0.033	0.025	32.636	157.286	157.482	-0.125
50	4	64	True	True	True	0.032	0.026	24.783	157.023	157.449	-0.271
50	4	128	True	True	True	0.034	0.028	19.299	162.794	162.269	0.323
50	8	64	True	True	True	0.035	0.028	25.105	160.958	162.204	-0.768
50	8	128	True	True	True	0.052	0.046	12.375	173.155	171.844	0.763
50	16	64	True	True	True	0.051	0.045	12.882	172.106	171.713	0.229
50	16	128	True	True	True	0.096	0.081	18.524	191.257	191.517	-0.136
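The percentage columns in the two tables can be related back to the raw measurements. The sketch below is an inference from the reported numbers rather than the benchmark script itself: the memory-saving column matches (eager - sdpa) / sdpa, and the speedup column appears consistent with eager / sdpa - 1, although it cannot be reproduced exactly because the times shown are rounded to three decimals. The function names are hypothetical.

```python
def speedup_pct(eager_time, sdpa_time):
    """Relative speedup of SDPA over eager, in percent."""
    return (eager_time / sdpa_time - 1.0) * 100.0


def memory_saving_pct(eager_mem, sdpa_mem):
    """Peak-memory saving of SDPA relative to its own footprint, in percent."""
    return (eager_mem - sdpa_mem) / sdpa_mem * 100.0


# Training row with batch_size=4, seq_len=256 (table reports 18.182).
print(round(memory_saving_pct(594.156, 502.745), 3))

# Inference row with batch_size=2, seq_len=128 (table reports 32.636,
# computed from unrounded timings).
print(round(speedup_pct(0.033, 0.025), 1))
```

Negative memory-saving values in the tables therefore mean SDPA used slightly more peak memory than the eager implementation for that configuration.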

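Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DistilBERT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

Text Classification

A blog post on getting started with sentiment analysis using Python with DistilBERT.
A blog post on how to train DistilBERT with Blurr for sequence classification.
A blog post on how to use Ray to tune DistilBERT hyperparameters.
A blog post on how to train DistilBERT with Hugging Face and Amazon SageMaker.
A notebook on how to fine-tune DistilBERT for multi-label classification. 🌎
A notebook on how to fine-tune DistilBERT for multiclass classification with PyTorch. 🌎
A notebook on how to fine-tune DistilBERT for text classification in TensorFlow. 🌎
DistilBertForSequenceClassification is supported by this example script and notebook.
TFDistilBertForSequenceClassification is supported by this example script and notebook.
FlaxDistilBertForSequenceClassification is supported by this example script and notebook.
Text classification task guide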

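Token Classification

DistilBertForTokenClassification is supported by this example script and notebook.
TFDistilBertForTokenClassification is supported by this example script and notebook.
FlaxDistilBertForTokenClassification is supported by this example script.
Token classification chapter of the 🤗 Hugging Face Course.
Token classification task guide

Fill-Mask

DistilBertForMaskedLM is supported by this example script and notebook.
TFDistilBertForMaskedLM is supported by this example script and notebook.
FlaxDistilBertForMaskedLM is supported by this example script and notebook.
Masked language modeling chapter of the 🤗 Hugging Face Course.
Masked language modeling task guide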

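Question Answering

DistilBertForQuestionAnswering is supported by this example script and notebook.
TFDistilBertForQuestionAnswering is supported by this example script and notebook.
FlaxDistilBertForQuestionAnswering is supported by this example script.
Question answering chapter of the 🤗 Hugging Face Course.
Question answering task guide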

Multiple choice

DistilBertForMultipleChoice is supported by this example script and notebook.
TFDistilBertForMultipleChoice is supported by this example script and notebook.
Multiple choice task guide

⚗️ Optimization

A blog post on how to quantize DistilBERT with 🤗 Optimum and Intel.
A blog post on how to optimize Transformers for GPUs with 🤗 Optimum.
A blog post on optimizing Transformers with Hugging Face Optimum.

⚡️ Inference

A blog post on how to accelerate BERT inference with Hugging Face Transformers and AWS Inferentia using DistilBERT.
A blog post on serverless inference with Hugging Face's Transformers, DistilBERT and Amazon SageMaker.

🚀 Deploy

A blog post on how to deploy DistilBERT on Google Cloud.
A blog post on how to deploy DistilBERT with Amazon SageMaker.
A blog post on how to deploy BERT with Hugging Face Transformers, Amazon SageMaker and a Terraform module.

Combining DistilBERT and Flash Attention 2

First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.

pip install -U flash-attn --no-build-isolation

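Make also sure that your hardware is compatible with Flash Attention 2. Read more about it in the official documentation of the flash-attn repository. Make also sure to load your model in half precision (e.g. torch.float16).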
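The compatibility requirements above can be checked programmatically before loading the model. The sketch below is only an illustration: the helper name pick_attn_implementation is hypothetical, and the compute-capability threshold of 8.0 (Ampere-class GPUs or newer) is an assumption that should be confirmed against the flash-attn documentation. When the checks fail it falls back to "sdpa".

```python
import importlib.util

import torch


def pick_attn_implementation():
    """Return "flash_attention_2" when it looks usable, otherwise "sdpa"."""
    # The flash-attn package must be importable.
    has_package = importlib.util.find_spec("flash_attn") is not None
    if has_package and torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability()
        # Assumption: Flash Attention 2 needs compute capability >= 8.0.
        if major >= 8:
            return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as the attn_implementation argument of from_pretrained().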

To load and run a model using Flash Attention 2, refer to the snippet below:

>>> import torch
>>> from transformers import AutoTokenizer, AutoModel

>>> device = "cuda" # the device to load the model onto

>>> tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')
>>> model = AutoModel.from_pretrained("distilbert/distilbert-base-uncased", torch_dtype=torch.float16, attn_implementation="flash_attention_2")

>>> text = "Replace me by any text you'd like."

>>> encoded_input = tokenizer(text, return_tensors='pt').to(device)
>>> model.to(device)

>>> output = model(**encoded_input)


DistilBertConfig

class transformers.DistilBertConfig

DistilBertTokenizer

class transformers.DistilBertTokenizer

build_inputs_with_special_tokens

convert_tokens_to_string

create_token_type_ids_from_sequences

get_special_tokens_mask

DistilBertTokenizerFast

class transformers.DistilBertTokenizerFast

build_inputs_with_special_tokens

create_token_type_ids_from_sequences

DistilBertModel

class transformers.DistilBertModel

forward

DistilBertForMaskedLM

class transformers.DistilBertForMaskedLM

forward

DistilBertForSequenceClassification

class transformers.DistilBertForSequenceClassification

forward

DistilBertForMultipleChoice

class transformers.DistilBertForMultipleChoice

forward

DistilBertForTokenClassification

class transformers.DistilBertForTokenClassification

forward

DistilBertForQuestionAnswering

class transformers.DistilBertForQuestionAnswering

forward

TFDistilBertModel

class transformers.TFDistilBertModel

call

TFDistilBertForMaskedLM

class transformers.TFDistilBertForMaskedLM

call

TFDistilBertForSequenceClassification

class transformers.TFDistilBertForSequenceClassification

call

TFDistilBertForMultipleChoice

class transformers.TFDistilBertForMultipleChoice

call

TFDistilBertForTokenClassification

class transformers.TFDistilBertForTokenClassification

call

TFDistilBertForQuestionAnswering

class transformers.TFDistilBertForQuestionAnswering

call

FlaxDistilBertModel

class transformers.FlaxDistilBertModel

__call__

FlaxDistilBertForMaskedLM

class transformers.FlaxDistilBertForMaskedLM

__call__

FlaxDistilBertForSequenceClassification

class transformers.FlaxDistilBertForSequenceClassification

__call__

FlaxDistilBertForMultipleChoice

class transformers.FlaxDistilBertForMultipleChoice

__call__

FlaxDistilBertForTokenClassification

class transformers.FlaxDistilBertForTokenClassification

__call__

FlaxDistilBertForQuestionAnswering

class transformers.FlaxDistilBertForQuestionAnswering

__call__
