
torchao

Open In Colab: Torchao Demo

torchao is a PyTorch architecture optimization library that supports custom high-performance data types, quantization, and sparsity. It composes with native PyTorch features such as torch.compile for faster inference and training.

Refer to the table below for an overview of additional torchao features.

Feature | Description
Quantization-Aware Training (QAT) | Train quantized models with minimal accuracy loss (see the QAT README)
Float8 training | High-throughput training in the float8 format (see torchtitan and the Accelerate docs)
Sparsity support | Semi-structured (2:4) sparsity for faster inference (see the Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity blog post)
Optimizer quantization | Reduce optimizer-state memory with 4-bit and 8-bit variants of Adam (see the sketch after this table)
KV cache quantization | Long-context inference with lower memory (see KV cache quantization)
Custom kernel support | Use your own torch.compile-compatible ops
FSDP2 | Composes with FSDP2 for training
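
To give a flavor of one of the rows above, the sketch below uses torchao's 8-bit Adam variant as a drop-in optimizer. The import path is an assumption based on recent torchao releases; older versions expose these optimizers under torchao.prototype.low_bit_optim instead.

# Hedged sketch of optimizer quantization: an 8-bit Adam variant as a drop-in
# replacement for torch.optim.AdamW. The import path assumes a recent torchao release
# (older versions: from torchao.prototype.low_bit_optim import AdamW8bit).
import torch
from torchao.optim import AdamW8bit

model = torch.nn.Linear(1024, 1024, device="cuda")
optimizer = AdamW8bit(model.parameters(), lr=1e-4)  # optimizer states are stored in 8 bits

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()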

For more details about the library, check out the torchao README.md.

torchao supports the following quantization techniques:

  • A16W8 Float8 dynamic quantization
  • A16W8 Float8 weight-only quantization
  • A8W8 Int8 dynamic quantization
  • A16W8 Int8 weight-only quantization
  • A16W4 Int4 weight-only quantization
  • A16W4 Int4 weight-only quantization + 2:4 sparsity
  • Autoquantization

torchao also supports module-level configuration through a dictionary that maps a module's fully qualified name to its quantization config. This makes it possible to skip quantization for certain layers and to use different quantization configs for different modules (see the Per-module quantization section below).

Check the table below to see whether your hardware is compatible.

Component | Compatibility
CUDA versions | ✅ cu118, cu126, cu128
CPU | ✅ change device_map="cpu" (see the examples below)

Install torchao from PyPI or the PyTorch index with the following commands.

PyPI
# Updating 🤗 Transformers to the latest version, as the example script below uses the new auto compilation
# Stable release from PyPI which will default to CUDA 12.6
pip install --upgrade torchao transformers

If your torchao version is below 0.10.0, you need to upgrade it; see the deprecation notice for more details.
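
A quick way to check the installed version before upgrading (a minimal sketch using only the standard library):

import importlib.metadata

# if this prints something below 0.10.0, run `pip install --upgrade torchao`
print(importlib.metadata.version("torchao"))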

Quantization examples

torchao provides a variety of quantization configs. Each config can be further customized with parameters such as group_size, scheme, and layout to optimize for specific hardware and model architectures.

For a complete list of available configs, see the quantization API documentation.

You can either choose the quantization type and settings manually or let torchao select the quantization type automatically.

Create a TorchAoConfig and specify the quantization type and the group_size of the weights to quantize (group_size applies only to int8 weight-only and int4 weight-only). Set cache_implementation to "static" to automatically torch.compile the forward method.

The examples below show the recommended quantization methods for different hardware (e.g., A100 GPU, H100 GPU, CPU).

H100 GPU

float8-dynamic-and-weight-only
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, Float8WeightOnlyConfig

quant_config = Float8DynamicActivationFloat8WeightConfig()
# or float8 weight only quantization
# quant_config = Float8WeightOnlyConfig()
quantization_config = TorchAoConfig(quant_type=quant_config)

# Load and quantize the model
quantized_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# auto-compile the quantized model with `cache_implementation="static"` to get speed up
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
int4-weight-only-24sparse
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int4WeightOnlyConfig
from torchao.dtypes import MarlinSparseLayout

quant_config = Int4WeightOnlyConfig(layout=MarlinSparseLayout())
quantization_config = TorchAoConfig(quant_type=quant_config)

# Load and quantize the model with sparsity. A sparse checkpoint is needed to accelerate without accuracy loss
quantized_model = AutoModelForCausalLM.from_pretrained(
    "RedHatAI/Sparse-Llama-3.1-8B-2of4",
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quantization_config
)

tokenizer = AutoTokenizer.from_pretrained("RedHatAI/Sparse-Llama-3.1-8B-2of4")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# auto-compile the quantized model with `cache_implementation="static"` to get speed up
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))

A100 GPU

int8-dynamic-and-weight-only
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8DynamicActivationInt8WeightConfig, Int8WeightOnlyConfig

quant_config = Int8DynamicActivationInt8WeightConfig()
# or int8 weight only quantization
# quant_config = Int8WeightOnlyConfig()
quantization_config = TorchAoConfig(quant_type=quant_config)

# Load and quantize the model
quantized_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# auto-compile the quantized model with `cache_implementation="static"` to get speed up
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
int4-weight-only-24sparse
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int4WeightOnlyConfig
from torchao.dtypes import MarlinSparseLayout

quant_config = Int4WeightOnlyConfig(layout=MarlinSparseLayout())
quantization_config = TorchAoConfig(quant_type=quant_config)

# Load and quantize the model with sparsity. A sparse checkpoint is needed to accelerate without accuracy loss
quantized_model = AutoModelForCausalLM.from_pretrained(
    "RedHatAI/Sparse-Llama-3.1-8B-2of4",
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quantization_config
)

tokenizer = AutoTokenizer.from_pretrained("RedHatAI/Sparse-Llama-3.1-8B-2of4")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# auto-compile the quantized model with `cache_implementation="static"` to get speed up
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))

CPU

int8-dynamic-and-weight-only
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8DynamicActivationInt8WeightConfig, Int8WeightOnlyConfig

quant_config = Int8DynamicActivationInt8WeightConfig()
# quant_config = Int8WeightOnlyConfig()
quantization_config = TorchAoConfig(quant_type=quant_config)

# Load and quantize the model
quantized_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="cpu",
    quantization_config=quantization_config
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt")

# auto-compile the quantized model with `cache_implementation="static"` to get speed up
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))

Per-module quantization

1. Skip quantization for certain layers

With ModuleFqnToConfig we can specify a default configuration for all layers while skipping quantization for certain layers.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"

from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig
config = Int4WeightOnlyConfig(group_size=128)

# set default to int4 (for linears), and skip quantizing `model.layers.0.self_attn.q_proj`
quant_config = ModuleFqnToConfig({"_default": config, "model.layers.0.self_attn.q_proj": None})
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
# lm_head is not quantized and model.layers.0.self_attn.q_proj is not quantized
print("quantized model:", quantized_model)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Manual Testing
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

2. Quantize different layers with different quantization configs

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "facebook/opt-125m"

from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig, Int8DynamicActivationInt4WeightConfig, IntxWeightOnlyConfig, PerAxis, MappingType

weight_dtype = torch.int8
granularity = PerAxis(0)
mapping_type = MappingType.ASYMMETRIC
embedding_config = IntxWeightOnlyConfig(
    weight_dtype=weight_dtype,
    granularity=granularity,
    mapping_type=mapping_type,
)
linear_config = Int8DynamicActivationInt4WeightConfig(group_size=128)
quant_config = ModuleFqnToConfig({"_default": linear_config, "model.decoder.embed_tokens": embedding_config, "model.decoder.embed_positions": None})
# set `include_embedding` to True in order to include embedding in quantization
# when `include_embedding` is True, we'll remove input embedding from `modules_not_to_convert` as well
quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cpu", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
print("quantized model:", quantized_model)
# make sure embedding is quantized
print("embed_tokens weight:", quantized_model.model.decoder.embed_tokens.weight)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Manual Testing
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt").to("cpu")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128, cache_implementation="static")
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Autoquant

If you want the quantization type to be chosen automatically for quantizable layers (nn.Linear), you can use the autoquant API.

The autoquant API automatically chooses a quantization type by micro-benchmarking on the input type and shape and compiling a single linear layer.

Note: autoquant is currently only supported on GPUs.

Create a TorchAoConfig and set it to "autoquant". Set cache_implementation to "static" to automatically torch.compile the forward method. Finally, call finalize_autoquant on the quantized model to finalize the quantization and log the input shapes.

import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer

quantization_config = TorchAoConfig("autoquant", min_sqnr=None)
quantized_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# auto-compile the quantized model with `cache_implementation="static"` to get speed up
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
# explicitly call `finalize_autoquant` (may be refactored and removed in the future)
quantized_model.finalize_autoquant()
print(tokenizer.decode(output[0], skip_special_tokens=True))

Serialization

torchao implements torch.Tensor subclasses for maximum flexibility in supporting new quantized torch.Tensor formats. Safetensors serialization and deserialization does not work with torchao.

To avoid arbitrary user-code execution, torchao sets weights_only=True in torch.load to ensure that only tensors are loaded. Any known user functions can be whitelisted with add_safe_globals.
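
As a hedged illustration of this allow-listing mechanism, the snippet below registers a hypothetical user function with torch.serialization.add_safe_globals before calling torch.load with weights_only=True; the function name and checkpoint path are placeholders, not part of the torchao integration.

import torch

# Hypothetical user function that a saved checkpoint references (placeholder).
def my_custom_scale_fn(x):
    return x * 0.5

# Allow-list it so torch.load(..., weights_only=True) can resolve the reference safely.
torch.serialization.add_safe_globals([my_custom_scale_fn])
state_dict = torch.load("quantized_checkpoint.pt", weights_only=True)  # placeholder path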

Save locally
# don't serialize model with Safetensors
output_dir = "llama3-8b-int4wo-128"
quantized_model.save_pretrained("llama3-8b-int4wo-128", safe_serialization=False)

Loading quantized models

How a quantized model is loaded depends on the quantization scheme. For schemes such as int8 and float8, you can quantize the model on any device and also load it on any device. The example below quantizes a model on the CPU and then loads it on CUDA.

import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8WeightOnlyConfig

quant_config = Int8WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)

# Load and quantize the model
quantized_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="cpu",
    quantization_config=quantization_config
)
# save the quantized model
output_dir = "llama-3.1-8b-torchao-int8-cuda"
quantized_model.save_pretrained(output_dir, safe_serialization=False)

# reload the quantized model
reloaded_model = AutoModelForCausalLM.from_pretrained(
    output_dir,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

output = reloaded_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))

For int4, the model can only be loaded on the same type of device it was quantized on, because the layout is device-specific. The example below quantizes and loads a model on the CPU.

import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int4WeightOnlyConfig
from torchao.dtypes import Int4CPULayout

quant_config = Int4WeightOnlyConfig(group_size=128, layout=Int4CPULayout())
quantization_config = TorchAoConfig(quant_type=quant_config)

# Load and quantize the model
quantized_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="cpu",
    quantization_config=quantization_config
)
# save the quantized model
output_dir = "llama-3.1-8b-torchao-int4-cpu"
quantized_model.save_pretrained(output_dir, safe_serialization=False)

# reload the quantized model
reloaded_model = AutoModelForCausalLM.from_pretrained(
    output_dir,
    device_map="cpu",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt")

output = reloaded_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))

⚠️ Deprecation notice

Starting with version 0.10.0, the string-based quantization config API (for example, TorchAoConfig("int4_weight_only", group_size=128)) is deprecated and will be removed in a future release.

Use the new AOBaseConfig-based approach instead:

# Old way (deprecated)
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)

# New way (recommended)
from torchao.quantization import Int4WeightOnlyConfig
quant_config = Int4WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)

The new API offers greater flexibility, better type safety, and full access to the features available in torchao.

Migration guide

Here is how to migrate from the commonly used string identifiers to their AOBaseConfig equivalents:

Old string API | New AOBaseConfig API
"int4_weight_only" | Int4WeightOnlyConfig()
"int8_weight_only" | Int8WeightOnlyConfig()
"int8_dynamic_activation_int8_weight" | Int8DynamicActivationInt8WeightConfig()

All config objects accept parameters for customization (e.g., group_size, scheme, layout).
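
For example (the parameter value here is arbitrary, chosen only for illustration):

from torchao.quantization import Int8DynamicActivationInt4WeightConfig

# a smaller group size gives finer-grained quantization at the cost of more scale/zero-point overhead
quant_config = Int8DynamicActivationInt4WeightConfig(group_size=64)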

Resources

To get a better sense of the expected performance, check out the benchmarks for various models on the CUDA and XPU backends. You can also run the code below to benchmark a model yourself.

import torch
from typing import Callable

from torch._inductor.utils import do_bench_using_profiling
from transformers import AutoModelForCausalLM

def benchmark_fn(func: Callable, *args, **kwargs) -> float:
    """Thin wrapper around do_bench_using_profiling"""
    no_args = lambda: func(*args, **kwargs)
    time = do_bench_using_profiling(no_args)
    return time * 1e3

# `quantized_model` and `input_ids` are reused from the quantization examples above;
# `model_name` should be the checkpoint you quantized (placeholder below)
model_name = "meta-llama/Llama-3.1-8B-Instruct"
MAX_NEW_TOKENS = 1000
print("int4wo-128 model:", benchmark_fn(quantized_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS, cache_implementation="static"))

bf16_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
output = bf16_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") # auto-compile
print("bf16 model:", benchmark_fn(bf16_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS, cache_implementation="static"))

For best performance, you can apply the recommended settings by calling torchao.quantization.utils.recommended_inductor_config_setter().
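
A brief usage sketch, assuming the quantized_model and input_ids from the examples above:

from torchao.quantization.utils import recommended_inductor_config_setter

recommended_inductor_config_setter()  # applies torchao's recommended torch.compile/inductor settings
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")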

For more examples and documentation, refer to the other available quantization techniques.

Issues

If you run into any issues with the Transformers integration, open an issue in the Transformers repository. For issues directly related to torchao, open an issue in the torchao repository.
