Compressed Tensors
The compressed-tensors library provides a versatile and efficient way to store and manage compressed model checkpoints. It supports various quantization and sparsity schemes, making it a unified format for handling different model optimizations such as GPTQ, AWQ, SmoothQuant, INT8, FP8, SparseGPT, and more.
Some of the supported formats include:
- dense
- int-quantized: INT8 quantized models - sample model/config
- float-quantized: FP8 quantized models; currently E4M3 is supported - sample model/config
- pack-quantized: INT4 or INT8 weight-quantized models, packed into INT32. For INT4, the weights have an INT4 range but are stored as INT8 and then packed into INT32 (see the packing sketch after this list) - sample model/config
Compressed models can be easily created using llm-compressor. Alternatively, models can be created independently and serialized with a compressed-tensors config.
To find existing models on the Hugging Face Model Hub, search for the compressed-tensors tag.
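The same search can be done programmatically; the sketch below assumes the huggingface_hub client is installed and that the tag can be passed as a filter:
from huggingface_hub import HfApi

# List a few Hub models carrying the "compressed-tensors" tag.
api = HfApi()
for model in api.list_models(filter="compressed-tensors", limit=5):
    print(model.id)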
Features:
- Weight and activation precisions: FP8, INT4, INT8 (arbitrary precision is allowed for INT when doing Q/DQ)
- Quantization scale and zero-point strategies: tensor, channel, group, block, token (a combined example follows this list)
- Dynamic per-token activation quantization (or any static strategy)
- Sparsity can be composed with quantization for extreme compression
- Supports quantization of arbitrary modules, not just Linear modules
- Targeted support or ignoring of modules by name or class
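As a sketch of how these options can be combined, the dictionary below expresses a hypothetical config group with group-wise INT4 weights and dynamic per-token INT8 activations; only num_bits, strategy, type, and targets appear in the real example config later on this page, and the remaining field names (group_size, symmetric, dynamic) are assumptions:
# Hypothetical config group; field names beyond num_bits/strategy/type/targets are assumptions.
config_group = {
    "targets": ["Linear"],
    "weights": {
        "num_bits": 4,
        "type": "int",
        "strategy": "group",   # one scale per group of columns
        "group_size": 128,     # assumed field name
        "symmetric": True,     # assumed field name
    },
    "input_activations": {
        "num_bits": 8,
        "type": "int",
        "strategy": "token",   # one scale per token
        "dynamic": True,       # assumed field name: scales computed at runtime
    },
}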
Installation
It is recommended to install stable releases of compressed-tensors from PyPI.
pip install compressed-tensors
Developers who want to experiment with the latest features can also install the package from source.
git clone https://github.com/neuralmagic/compressed-tensors
cd compressed-tensors
pip install -e .
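A minimal sanity check that the install (stable or editable) was picked up is to print the installed version:
from importlib.metadata import version

# Print the installed compressed-tensors version to confirm the install worked.
print(version("compressed-tensors"))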
Quickstart Model Load
Quantized models can be easily loaded for inference as shown below. Currently, only models that have already been quantized can be loaded. To quantize a model into the compressed-tensors format, see llm-compressor.
from transformers import AutoModelForCausalLM
# Load the model in compressed-tensors format
ct_model = AutoModelForCausalLM.from_pretrained("nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf")
# Measure memory usage
mem_params = sum([param.nelement()*param.element_size() for param in ct_model.parameters()])
print(f"{mem/2**30:.4f} GB")
# 8.4575 GB
We can see that the compressed-tensors FP8 checkpoint of Llama 3.1 8B loads for inference using half the memory of the unquantized reference checkpoint.
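As a rough cross-check (a sketch; it assumes access to the gated meta-llama/Meta-Llama-3.1-8B-Instruct repository and enough memory to load it), measuring the BF16 reference weights with the same formula should come out to roughly twice the FP8 footprint:
import torch
from transformers import AutoModelForCausalLM

# Load the unquantized BF16 reference checkpoint and measure its parameter memory.
ref_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)
mem_params = sum(p.nelement() * p.element_size() for p in ref_model.parameters())
print(f"{mem_params/2**30:.4f} GB")  # expected to be roughly 2x the FP8 number (~15 GB)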
Sample Use Case - Load and run an FP8 model
from transformers import AutoModelForCausalLM, AutoTokenizer
prompt = [
"Hello, my name is",
"The capital of France is",
"The future of AI is"
]
model_name = "nm-testing/Meta-Llama-3-8B-Instruct-fp8-hf_compat"
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Pad the batch since the prompts have different lengths; fall back to EOS if no pad token is set.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(prompt, return_tensors="pt", padding=True)
generated_ids = quantized_model.generate(**inputs, max_length=50, do_sample=False)
outputs = tokenizer.batch_decode(generated_ids)
print(outputs)
"""
['<|begin_of_text|>Hello, my name is [Name]. I am a [Your Profession/Student] and I am here to learn about the [Course/Program] at [University/Institution]. I am excited to be here and I am looking forward to', '<|begin_of_text|>The capital of France is Paris, which is located in the north-central part of the country. Paris is the most populous city in France and is known for its stunning architecture, art museums, fashion, and romantic atmosphere. The city is home to', "<|begin_of_text|>The future of AI is here, and it's already changing the way we live and work. From virtual assistants to self-driving cars, AI is transforming industries and revolutionizing the way we interact with technology. But what does the future of AI hold"]
"""
The above shows a quick example of running generation with a compressed-tensors model. Currently, models cannot be saved after loading.
Deep dive into a compressed-tensors model checkpoint
In this example we will look at how the compressed-tensors model nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf is defined through its configuration entries and how this translates into the loaded model representation.
First, let's look at the model's quantization_config. At a glance it looks complex because of the large number of entries, but this is because compressed-tensors is a format that allows for flexible expression both during and after model compression. In practice, for checkpoint loading and inference the config can be simplified to exclude all default or empty entries, so we will do that here to focus on what compression is actually being represented.
"quantization_config": {
"config_groups": {
"group_0": {
"input_activations": {
"num_bits": 8,
"strategy": "tensor",
"type": "float"
},
"targets": ["Linear"],
"weights": {
"num_bits": 8,
"strategy": "tensor",
"type": "float"
}
}
},
"format": "naive-quantized",
"ignore": ["lm_head"],
"quant_method": "compressed-tensors",
"quantization_status": "frozen"
},
We can see from the config above that it specifies one config group, which quantizes weights and activations to FP8 with a static per-tensor strategy. It is also worth noting that the ignore list contains an entry to skip quantization of the lm_head module, so that module should be left untouched in the checkpoint.
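You can inspect these entries without loading any weights; the sketch below assumes that AutoConfig exposes the raw quantization_config dictionary from config.json:
from transformers import AutoConfig

# Read the quantization_config straight from the Hub config, without loading weights.
config = AutoConfig.from_pretrained("nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf")
print(config.quantization_config["config_groups"]["group_0"])
print(config.quantization_config["ignore"])  # ['lm_head']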
To see what this config produces in practice, we can simply use the safetensors viewer on the model card to see the quantized weights, input_scale, and weight_scale for all of the Linear modules in the first model layer (and likewise for the rest of the layers).
Tensors | Shape | Precision |
---|---|---|
model.layers.0.input_layernorm.weight | [4 096] | BF16 |
model.layers.0.mlp.down_proj.input_scale | [1] | BF16 |
model.layers.0.mlp.down_proj.weight | [4 096, 14 336] | F8_E4M3 |
model.layers.0.mlp.down_proj.weight_scale | [1] | BF16 |
model.layers.0.mlp.gate_proj.input_scale | [1] | BF16 |
model.layers.0.mlp.gate_proj.weight | [14 336, 4 096] | F8_E4M3 |
model.layers.0.mlp.gate_proj.weight_scale | [1] | BF16 |
model.layers.0.mlp.up_proj.input_scale | [1] | BF16 |
model.layers.0.mlp.up_proj.weight | [14 336, 4 096] | F8_E4M3 |
model.layers.0.mlp.up_proj.weight_scale | [1] | BF16 |
model.layers.0.post_attention_layernorm.weight | [4 096] | BF16 |
model.layers.0.self_attn.k_proj.input_scale | [1] | BF16 |
model.layers.0.self_attn.k_proj.weight | [1 024, 4 096] | F8_E4M3 |
model.layers.0.self_attn.k_proj.weight_scale | [1] | BF16 |
model.layers.0.self_attn.o_proj.input_scale | [1] | BF16 |
model.layers.0.self_attn.o_proj.weight | [4 096, 4 096] | F8_E4M3 |
model.layers.0.self_attn.o_proj.weight_scale | [1] | BF16 |
model.layers.0.self_attn.q_proj.input_scale | [1] | BF16 |
model.layers.0.self_attn.q_proj.weight | [4 096, 4 096] | F8_E4M3 |
model.layers.0.self_attn.q_proj.weight_scale | [1] | BF16 |
model.layers.0.self_attn.v_proj.input_scale | [1] | BF16 |
model.layers.0.self_attn.v_proj.weight | [1 024, 4 096] | F8_E4M3 |
model.layers.0.self_attn.v_proj.weight_scale | [1] | BF16 |
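The same tensor names, shapes, and dtypes can be read locally from a checkpoint shard; in the sketch below the shard filename is an assumption and may differ for this repository (check the safetensors index file on the Hub):
from huggingface_hub import hf_hub_download
from safetensors import safe_open

# Download one checkpoint shard (hypothetical filename) and inspect a few tensors.
path = hf_hub_download(
    "nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf",
    filename="model-00001-of-00002.safetensors",
)
with safe_open(path, framework="pt") as f:
    for name in sorted(f.keys())[:6]:
        t = f.get_tensor(name)  # loading F8_E4M3 tensors requires a torch build with float8 support
        print(name, tuple(t.shape), t.dtype)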
When we load the model with the compressed-tensors HFQuantizer integration, we can see that all of the Linear modules specified in the quantization config have been replaced by CompressedLinear modules that manage the compressed weights and the forward pass for inference. Note that the lm_head mentioned earlier in the ignore list is still kept as an unquantized Linear module.
from transformers import AutoModelForCausalLM
ct_model = AutoModelForCausalLM.from_pretrained("nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf")
print(ct_model)
"""
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 4096)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaSdpaAttention(
(q_proj): CompressedLinear(
in_features=4096, out_features=4096, bias=False
(input_observer): MovingAverageMinMaxObserver()
(weight_observer): MovingAverageMinMaxObserver()
)
(k_proj): CompressedLinear(
in_features=4096, out_features=1024, bias=False
(input_observer): MovingAverageMinMaxObserver()
(weight_observer): MovingAverageMinMaxObserver()
)
(v_proj): CompressedLinear(
in_features=4096, out_features=1024, bias=False
(input_observer): MovingAverageMinMaxObserver()
(weight_observer): MovingAverageMinMaxObserver()
)
(o_proj): CompressedLinear(
in_features=4096, out_features=4096, bias=False
(input_observer): MovingAverageMinMaxObserver()
(weight_observer): MovingAverageMinMaxObserver()
)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): CompressedLinear(
in_features=4096, out_features=14336, bias=False
(input_observer): MovingAverageMinMaxObserver()
(weight_observer): MovingAverageMinMaxObserver()
)
(up_proj): CompressedLinear(
in_features=4096, out_features=14336, bias=False
(input_observer): MovingAverageMinMaxObserver()
(weight_observer): MovingAverageMinMaxObserver()
)
(down_proj): CompressedLinear(
in_features=14336, out_features=4096, bias=False
(input_observer): MovingAverageMinMaxObserver()
(weight_observer): MovingAverageMinMaxObserver()
)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((4096,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)
"""