8-bit quantization
LLM.int8() is a quantization method that doesn't degrade performance, which makes large model inference more accessible. The key is to extract the outliers from the inputs and weights and multiply them in 16-bit. All other values are quantized to Int8 and multiplied in 8-bit, then dequantized back to 16-bit. The outputs of the 16-bit and 8-bit multiplications are combined to produce the final output.
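To make the decomposition concrete, the following is a minimal, self-contained sketch of the idea in plain PyTorch. It is only an illustration, not the actual bitsandbytes kernels; the function name, the 6.0 outlier threshold, and the simple absmax quantization are assumptions made for this example.

import torch

def int8_matmul_sketch(X, W, threshold=6.0):
    # X: (tokens, in_features) activations, W: (in_features, out_features) weights, both fp16.
    # Columns of X holding at least one outlier (|value| >= threshold) stay in 16-bit.
    outlier_cols = (X.abs() >= threshold).any(dim=0)

    # 16-bit path: outlier columns of X times the matching rows of W
    # (computed in float here; the real kernels run this in fp16 on the GPU).
    out_16 = X[:, outlier_cols].float() @ W[outlier_cols, :].float()

    # 8-bit path: absmax-quantize the remaining values to the int8 range ...
    X_rest, W_rest = X[:, ~outlier_cols].float(), W[~outlier_cols, :].float()
    sx = X_rest.abs().amax(dim=1, keepdim=True) / 127.0  # per-row scale (activations)
    sw = W_rest.abs().amax(dim=0, keepdim=True) / 127.0  # per-column scale (weights)
    Xq = (X_rest / sx).round().clamp(-127, 127)
    Wq = (W_rest / sw).round().clamp(-127, 127)

    # ... multiply the quantized values (simulating the int8 matmul),
    # then dequantize the result back to 16-bit with the scales.
    out_8 = (Xq @ Wq) * (sx * sw)

    # Combine both paths for the final output.
    return (out_16 + out_8).half()

X, W = torch.randn(4, 64).half(), torch.randn(64, 64).half()
print(int8_matmul_sketch(X, W).shape)  # torch.Size([4, 64])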
Linear8bitLt
class bitsandbytes.nn.Linear8bitLt
< source >( input_features: int output_features: int bias = True has_fp16_weights = True memory_efficient_backward = False threshold = 0.0 index = None device = None )
This class is the base module for the LLM.int8() algorithm. To read more about it, have a look at the paper.
In order to quantize a linear layer, first load the original fp16/bf16 weights into the Linear8bitLt module, then call int8_module.to("cuda") to quantize the fp16 weights.
Example
import torch
import torch.nn as nn
import bitsandbytes as bnb
from bitsandbytes.nn import Linear8bitLt
fp16_model = nn.Sequential(
nn.Linear(64, 64),
nn.Linear(64, 64)
)
int8_model = nn.Sequential(
Linear8bitLt(64, 64, has_fp16_weights=False),
Linear8bitLt(64, 64, has_fp16_weights=False)
)
int8_model.load_state_dict(fp16_model.state_dict())
int8_model = int8_model.to(0) # Quantization happens here
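Once int8_model has been moved to the GPU, the weights are stored in int8 and the module can be called like any other nn.Module. A minimal usage sketch, assuming a CUDA device is available and feeding fp16 inputs on the same device:

x = torch.randn(1, 64, dtype=torch.float16, device=0)
with torch.no_grad():
    out = int8_model(x)  # runs the LLM.int8() matmul under the hood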
__init__
< source >( input_features: int output_features: int bias = True has_fp16_weights = True memory_efficient_backward = False threshold = 0.0 index = None device = None )
Initialize Linear8bitLt class.
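For illustration, the module can also be constructed directly with explicit arguments; the threshold of 6.0 below is only a commonly used outlier threshold, not a required value.

import bitsandbytes as bnb

layer = bnb.nn.Linear8bitLt(
    64,                      # input_features
    64,                      # output_features
    bias=True,
    has_fp16_weights=False,  # keep quantized int8 weights (inference mode)
    threshold=6.0,           # magnitude above which values are treated as outliers
)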
Int8Params
class bitsandbytes.nn.Int8Params
< source >( data = None requires_grad = True has_fp16_weights = False CB = None SCB = None )
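Int8Params is the parameter class Linear8bitLt uses to hold the quantized weight data (CB) and the corresponding scaling constants (SCB) once the module is moved to the GPU. A rough inspection sketch continuing the example above (requires CUDA; the exact attribute layout can vary between bitsandbytes versions):

w = int8_model[0].weight  # an Int8Params instance after int8_model.to(0)
print(type(w))            # bitsandbytes.nn.Int8Params
print(w.CB.dtype)         # torch.int8 quantized weight data
print(w.SCB.shape)        # scaling constants used for dequantization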