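Quantization

Quantization techniques reduce memory and compute costs by representing weights and activations with lower-precision data types, such as 8-bit integers (int8). This lets you load larger models that would normally not fit into memory, and it speeds up inference. Transformers supports the AWQ and GPTQ quantization algorithms, and it supports 8-bit and 4-bit quantization with bitsandbytes.

Quantization techniques that are not supported in Transformers can be added with the HfQuantizer class.

Learn how to quantize models in the Quantization guide.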
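As a rough illustration of how a lower-precision type can stand in for floating-point weights, here is a minimal sketch of absmax int8 quantization in plain Python. It is not the scheme used by any particular backend listed on this page; the function names `quantize_absmax_int8` and `dequantize` are hypothetical.

```python
def quantize_absmax_int8(values):
    """Map floats to int8 codes with one shared scale (absmax scheme)."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the int8 codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each code fits in one byte instead of the four bytes of a float32 value, which is where the memory saving comes from; the price is the rounding error measured by `max_error`.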

QuantoConfig

class transformers.QuantoConfig

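( weights = 'int8' activations = None modules_to_not_convert: typing.Optional[list] = None **kwargs )

Parameters

weights (str, optional, defaults to "int8") — The target dtype for the weights after quantization. Supported values are ("float8", "int8", "int4", "int2").
activations (str, optional) — The target dtype for the activations after quantization. Supported values are (None, "int8", "float8").
modules_to_not_convert (list, optional, defaults to None) — The list of modules to leave unquantized. This is useful for models that explicitly require some modules to stay in their original precision (e.g. Whisper encoder, Llava encoder, Mixtral gate layers).

This is a wrapper class for all the attributes and features that you can adjust on a model loaded using quanto.

post_init

( )

Safety checker that verifies the arguments are correct.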
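The kind of check that post_init describes can be sketched in plain Python from the supported values listed above. This is an illustration only and not the library's implementation; `check_quanto_args` and the two constants are hypothetical names.

```python
# Supported values as listed in the parameter descriptions above.
ACCEPTED_WEIGHTS = ("float8", "int8", "int4", "int2")
ACCEPTED_ACTIVATIONS = (None, "int8", "float8")


def check_quanto_args(weights="int8", activations=None):
    """Raise ValueError if a value falls outside the documented sets."""
    if weights not in ACCEPTED_WEIGHTS:
        raise ValueError(
            f"weights must be one of {ACCEPTED_WEIGHTS}, got {weights!r}"
        )
    if activations not in ACCEPTED_ACTIVATIONS:
        raise ValueError(
            f"activations must be one of {ACCEPTED_ACTIVATIONS}, got {activations!r}"
        )


# Valid combinations pass silently.
check_quanto_args()
check_quanto_args(weights="int4", activations="float8")
```

Failing early at configuration time is cheaper than discovering an unsupported dtype after the model weights have started loading.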

AqlmConfig

class transformers.AqlmConfig

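( in_group_size: int = 8 out_group_size: int = 1 num_codebooks: int = 1 nbits_per_codebook: int = 16 linear_weights_not_to_quantize: typing.Optional[list[str]] = None **kwargs )

Parameters

in_group_size (int, optional, defaults to 8) — The group size along the input dimension.
out_group_size (int, optional, defaults to 1) — The group size along the output dimension. It is recommended to always use 1.
num_codebooks (int, optional, defaults to 1) — Number of codebooks for the Additive Quantization procedure.
nbits_per_codebook (int, optional, defaults to 16) — Number of bits encoding a single codebook vector. The codebook size is 2**nbits_per_codebook.
linear_weights_not_to_quantize (Optional[list[str]], optional) — List of full paths of nn.Linear weight parameters that should not be quantized.
kwargs (dict[str, Any], optional) — Additional parameters used to initialize the configuration object.

This is a wrapper class for the aqlm parameters.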
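To see how these parameters interact, the sketch below computes the codebook size and the number of code bits spent per weight. It assumes each group of in_group_size * out_group_size weights is encoded by one index per codebook, and it ignores the storage for the codebooks themselves and for any scales; the helper names are hypothetical.

```python
def codebook_entries(nbits_per_codebook=16):
    """Number of vectors in one codebook, 2**nbits_per_codebook."""
    return 2 ** nbits_per_codebook


def code_bits_per_weight(in_group_size=8, out_group_size=1,
                         num_codebooks=1, nbits_per_codebook=16):
    """Index bits per weight, excluding codebook and scale overhead."""
    weights_per_group = in_group_size * out_group_size
    return num_codebooks * nbits_per_codebook / weights_per_group


default_entries = codebook_entries()
default_bits = code_bits_per_weight()
# Two smaller codebooks can reach the same index budget per weight.
two_small_codebooks = code_bits_per_weight(num_codebooks=2, nbits_per_codebook=8)
```

With the default arguments, one 16-bit index covers a group of 8 weights, so the indices cost 2 bits per weight.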

post_init

( )

Safety checker that verifies the arguments are correct - it also replaces some NoneType arguments with their default values.

VptqConfig

class transformers.VptqConfig

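( enable_proxy_error: bool = False config_for_layers: dict = {} shared_layer_config: dict = {} modules_to_not_convert: typing.Optional[list] = None **kwargs )

Parameters

enable_proxy_error (bool, optional, defaults to False) — Whether to compute the proxy error for each layer.
config_for_layers (Dict, optional, defaults to {}) — Quantization parameters for each layer.
shared_layer_config (Dict, optional, defaults to {}) — Quantization parameters shared across layers.
modules_to_not_convert (list, optional, defaults to None) — The list of modules to leave unquantized. This is useful for models that explicitly require some modules to stay in their original precision (e.g. Whisper encoder, Llava encoder, Mixtral gate layers).
kwargs (dict[str, Any], optional) — Additional parameters used to initialize the configuration object.

This is a wrapper class for the vptq parameters.

post_init

( )

Safety checker that verifies the arguments are correct.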

AwqConfig

class transformers.AwqConfig

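( bits: int = 4 group_size: int = 128 zero_point: bool = True version: AWQLinearVersion = <AWQLinearVersion.GEMM: 'gemm'> backend: AwqBackendPackingMethod = <AwqBackendPackingMethod.AUTOAWQ: 'autoawq'> do_fuse: typing.Optional[bool] = None fuse_max_seq_len: typing.Optional[int] = None modules_to_fuse: typing.Optional[dict] = None modules_to_not_convert: typing.Optional[list] = None exllama_config: typing.Optional[dict[str, int]] = None **kwargs )

Parameters

bits (int, optional, defaults to 4) — The number of bits to quantize to.
group_size (int, optional, defaults to 128) — The group size to use for quantization. The recommended value is 128, and -1 uses per-column quantization.
zero_point (bool, optional, defaults to True) — Whether to use zero-point quantization.
version (AWQLinearVersion, optional, defaults to AWQLinearVersion.GEMM) — The version of the quantization algorithm to use. GEMM is better for large batch sizes (e.g. >= 8); otherwise GEMV is better (e.g. < 8). GEMM models are compatible with Exllama kernels.
backend (AwqBackendPackingMethod, optional, defaults to AwqBackendPackingMethod.AUTOAWQ) — The quantization backend. Some models might be quantized using the llm-awq backend. This is useful for users who quantize their own models with the llm-awq library.
do_fuse (bool, optional, defaults to False) — Whether to fuse the attention and mlp layers together for faster inference.
fuse_max_seq_len (int, optional) — The maximum sequence length to generate when using fusing.
modules_to_fuse (dict, optional, defaults to None) — Overrides the natively supported fusing scheme with the one specified by the user.
modules_to_not_convert (list, optional, defaults to None) — The list of modules to leave unquantized. This is useful for models that explicitly require some modules to stay in their original precision (e.g. Whisper encoder, Llava encoder, Mixtral gate layers). Note that you cannot quantize directly with transformers; refer to the AutoAWQ documentation to quantize HF models.
exllama_config (dict[str, Any], optional) — You can specify the version of the exllama kernel through the version key, the maximum sequence length through the max_input_len key, and the maximum batch size through the max_batch_size key. Defaults to {"version": 2, "max_input_len": 2048, "max_batch_size": 8} if unset.

This is a wrapper class for all the attributes and features that you can adjust on a model loaded using auto-awq library awq quantization, relying on the auto_awq backend.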
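The bits, group_size and zero_point parameters describe a group-wise scheme that can be sketched in plain Python. The code below is a simplified illustration and not the AWQ algorithm itself (it leaves out the activation-aware scaling entirely). The helper names are hypothetical, and treating group_size = -1 as a single group spanning the whole row is an interpretation made for this sketch.

```python
def quantize_group(values, bits=4):
    """Zero-point (asymmetric) quantization of one group to unsigned codes."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    # A constant group has no range; fall back to a unit step.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    codes = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map unsigned codes back to approximate floats."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, group_size=128, bits=4):
    """Quantize a row of weights group by group, one scale per group."""
    if group_size == -1:
        group_size = len(row)  # assumption: one group covers the whole row
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


row = [0.0, 0.1, 0.2, 0.3, -1.0, 0.0, 1.0, 2.0]
groups = quantize_row(row, group_size=4)
```

Smaller groups let each scale follow the local range of the weights more closely, at the cost of storing more scales and zero points.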

post_init

( )

Safety checker that verifies the arguments are correct.

EetqConfig

class transformers.EetqConfig

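( weights: str = 'int8' modules_to_not_convert: typing.Optional[list] = None **kwargs )

Parameters

weights (str, optional, defaults to "int8") — The target dtype for the weights. The only supported value is "int8".
modules_to_not_convert (list, optional, defaults to None) — The list of modules to leave unquantized. This is useful for models that explicitly require some modules to stay in their original precision.

This is a wrapper class for all the attributes and features that you can adjust on a model loaded using eetq.

post_init

( )

Safety checker that verifies the arguments are correct.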

GPTQConfig

class transformers.GPTQConfig

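( bits: int tokenizer: typing.Any = None dataset: typing.Union[list[str], str, NoneType] = None group_size: int = 128 damp_percent: float = 0.1 desc_act: bool = False sym: bool = True true_sequential: bool = True checkpoint_format: str = 'gptq' meta: typing.Optional[dict[str, typing.Any]] = None backend: typing.Optional[str] = None use_cuda_fp16: bool = False model_seqlen: typing.Optional[int] = None block_name_to_quantize: typing.Optional[str] = None module_name_preceding_first_block: typing.Optional[list[str]] = None batch_size: int = 1 pad_token_id: typing.Optional[int] = None use_exllama: typing.Optional[bool] = None max_input_length: typing.Optional[int] = None exllama_config: typing.Optional[dict[str, typing.Any]] = None cache_block_outputs: bool = True modules_in_block_to_quantize: typing.Optional[list[list[str]]] = None **kwargs )

Parameters

bits (int) — The number of bits to quantize to. Supported values are (2, 3, 4, 8).
tokenizer (str or PreTrainedTokenizerBase, optional) — The tokenizer used to process the dataset. You can pass either:
- A custom tokenizer object.
- A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co.
- A path to a directory containing the vocabulary files required by the tokenizer, for instance saved using the save_pretrained() method, e.g. ./my_model_directory/.
dataset (Union[list[str]], optional) — The dataset used for quantization. You can provide your own dataset as a list of strings, or use one of the original datasets from the GPTQ paper: ['wikitext2', 'c4', 'c4-new'].
group_size (int, optional, defaults to 128) — The group size to use for quantization. The recommended value is 128, and -1 uses per-column quantization.
damp_percent (float, optional, defaults to 0.1) — The percent of the average Hessian diagonal to use for dampening. The recommended value is 0.1.
desc_act (bool, optional, defaults to False) — Whether to quantize columns in order of decreasing activation size. Setting it to False can significantly speed up inference, but the perplexity may become slightly worse. Also known as act-order.
sym (bool, optional, defaults to True) — Whether to use symmetric quantization.
true_sequential (bool, optional, defaults to True) — Whether to perform sequential quantization even within a single Transformer block. Instead of quantizing the entire block at once, quantization is performed layer by layer. As a result, each layer is quantized using inputs that have passed through the previously quantized layers.
checkpoint_format (str, optional, defaults to "gptq") — The GPTQ weight format. gptq (v1) is supported by both gptqmodel and auto-gptq. gptq_v2 is supported by gptqmodel only.
meta (dict[str, any], optional) — Properties that do not directly affect quantization or quantized inference, such as tooling:version, are stored in meta, e.g. meta.quantizer: ["optimum:version", "gptqmodel:version"].
backend (str, optional) — Controls which gptq kernel is used. For gptqmodel, valid values include auto, auto_trainable and more. For auto-gptq, the only valid values are None and auto_trainable. For the gptqmodel backends, see https://github.com/ModelCloud/GPTQModel/blob/main/gptqmodel/utils/backend.py
use_cuda_fp16 (bool, optional, defaults to False) — Whether to use the optimized CUDA kernel for fp16 models. The model must be in fp16. Supported by auto-gptq only.
model_seqlen (int, optional) — The maximum sequence length that the model can take.
block_name_to_quantize (str, optional) — The name of the transformers block to quantize. If None, the block name is inferred using common patterns (e.g. model.layers).
module_name_preceding_first_block (list[str], optional) — The layers that precede the first Transformer block.
batch_size (int, optional, defaults to 1) — The batch size used when processing the dataset.
pad_token_id (int, optional) — The pad token id. Needed to prepare the dataset when batch_size > 1.
use_exllama (bool, optional) — Whether to use the exllama backend. Defaults to True if unset. Only works with bits = 4.
max_input_length (int, optional) — The maximum input length. This is needed to initialize a buffer that depends on the maximum expected input length. It is specific to the exllama backend with act-order.
exllama_config (dict[str, Any], optional) — The exllama config. You can specify the version of the exllama kernel through the version key. Defaults to {"version": 1} if unset.
cache_block_outputs (bool, optional, defaults to True) — Whether to cache block outputs so they can be reused as inputs for the succeeding block.
modules_in_block_to_quantize (list[list[str]], optional) — A list of lists of module names to quantize in the specified block. This argument is useful to exclude certain linear modules from being quantized. The block to quantize can be specified by setting block_name_to_quantize. Each list is quantized sequentially. If not set, all linear layers are quantized. Example: modules_in_block_to_quantize =[["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"], ["self_attn.o_proj"]]. In this example, the q, k and v layers are quantized first and at the same time, since they are independent. Then self_attn.o_proj is quantized once the q, k and v layers have been quantized. This gives better results because it reflects the real input that self_attn.o_proj will receive once the model is quantized.

This is a wrapper class for all the attributes and features that you can use with a model loaded through the `optimum` API for gptq quantization, relying on the auto_gptq backend.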
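A typical way to use this class is to pass it as quantization_config when loading a model, as in the sketch below. The wrapper function `build_gptq_model` is a hypothetical helper, and the imports are deferred so that the snippet can be read and imported without the heavy dependencies. Calling it is assumed to require transformers, optimum and a GPTQ backend to be installed, and in practice a GPU.

```python
def build_gptq_model(model_id, bits=4, dataset="c4"):
    """Quantize a causal language model with GPTQ while loading it."""
    # Deferred imports: only needed when the function is actually called.
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    # The tokenizer is used to process the calibration dataset.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    gptq_config = GPTQConfig(bits=bits, dataset=dataset, tokenizer=tokenizer)

    # Calibration and quantization happen during loading.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=gptq_config,
    )
    return model, tokenizer
```

The dataset argument here uses one of the named datasets from the parameter list; a list of strings can be passed instead when you want to calibrate on your own text.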

from_dict_optimum

( config_dict )

Get a compatible class from an optimum gptq config dict.

post_init

( )

Safety checker that verifies the arguments are correct.

to_dict_optimum

( )

Get a dict compatible with the optimum gptq config.

BitsAndBytesConfig

class transformers.BitsAndBytesConfig

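( load_in_8bit = False load_in_4bit = False llm_int8_threshold = 6.0 llm_int8_skip_modules = None llm_int8_enable_fp32_cpu_offload = False llm_int8_has_fp16_weight = False bnb_4bit_compute_dtype = None bnb_4bit_quant_type = 'fp4' bnb_4bit_use_double_quant = False bnb_4bit_quant_storage = None **kwargs )

Parameters

load_in_8bit (bool, optional, defaults to False) — This flag enables 8-bit quantization with LLM.int8().
load_in_4bit (bool, optional, defaults to False) — This flag enables 4-bit quantization by replacing the Linear layers with FP4/NF4 layers from bitsandbytes.
llm_int8_threshold (float, optional, defaults to 6.0) — This corresponds to the outlier threshold for outlier detection described in the paper `LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale`: https://huggingface.co/papers/2208.07339. Any hidden state value above this threshold is considered an outlier, and the operation on those values is done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but large models have some exceptional systematic outliers that are distributed very differently. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude around 5, but beyond that there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).
llm_int8_skip_modules (list[str], optional) — An explicit list of the modules that should not be converted to 8-bit. This is useful for models such as Jukebox that have several heads in different places, not necessarily at the last position. For example, for CausalLM models the last lm_head is kept in its original dtype.
llm_int8_enable_fp32_cpu_offload (bool, optional, defaults to False) — This flag is for advanced use cases and for users who are aware of this feature. Use it if you want to split your model into different parts, running some parts in int8 on GPU and some parts in fp32 on CPU. This is useful for offloading large models such as google/flan-t5-xxl. Note that the int8 operations are not run on CPU.
llm_int8_has_fp16_weight (bool, optional, defaults to False) — This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning, as the weights do not have to be converted back and forth for the backward pass.
bnb_4bit_compute_dtype (torch.dtype or str, optional, defaults to torch.float32) — This sets the computational type, which might be different from the input type. For example, inputs might be fp32, but computation can be set to bf16 for speedups.
bnb_4bit_quant_type (str, optional, defaults to "fp4") — This sets the quantization data type in the bnb.nn.Linear4Bit layers. The options are the FP4 and NF4 data types, specified by fp4 or nf4.
bnb_4bit_use_double_quant (bool, optional, defaults to False) — This flag is used for nested quantization, where the quantization constants from the first quantization are quantized again.
bnb_4bit_quant_storage (torch.dtype or str, optional, defaults to torch.uint8) — This sets the storage type used to pack the quantized 4-bit parameters.
kwargs (dict[str, Any], optional) — Additional parameters used to initialize the configuration object.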
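The threshold test behind llm_int8_threshold can be illustrated with a few lines of plain Python. This is a per-value simplification for illustration only: the method described in the paper separates whole feature dimensions that contain outliers, and `split_outliers` is a hypothetical helper rather than a bitsandbytes function.

```python
def split_outliers(hidden_states, threshold=6.0):
    """Partition values into (regular, outliers) lists of (index, value)."""
    regular, outliers = [], []
    for index, value in enumerate(hidden_states):
        # Values whose magnitude exceeds the threshold would be kept in
        # higher precision; everything else is a candidate for int8.
        if abs(value) > threshold:
            outliers.append((index, value))
        else:
            regular.append((index, value))
    return regular, outliers


hidden = [0.5, -3.2, 6.0, 7.5, -42.0]
regular, outliers = split_outliers(hidden)
```

Lowering the threshold moves more values into the higher-precision path, which trades some of the memory and speed benefit for stability.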

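This is a wrapper class for all the attributes and features that you can use with a model loaded using bitsandbytes.

This replaces load_in_8bit or load_in_4bit, so the two options are mutually exclusive.

Currently only LLM.int8(), FP4 and NF4 quantization are supported. If more methods are added to bitsandbytes, more arguments will be added to this class.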
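A common 4-bit setup combines NF4, nested quantization and a bf16 compute dtype, as sketched below. The wrapper `load_4bit_model` is a hypothetical helper with deferred imports; calling it is assumed to require torch, transformers and bitsandbytes to be installed, along with supported hardware.

```python
def load_4bit_model(model_id):
    """Load a causal language model with 4-bit NF4 weights."""
    # Deferred imports: only needed when the function is actually called.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # do not combine with load_in_8bit
        bnb_4bit_quant_type="nf4",              # "fp4" is the default
        bnb_4bit_use_double_quant=True,         # quantize the quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,  # compute type for matmuls
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
```

Only one of load_in_4bit and load_in_8bit is set here, in line with the mutual exclusivity noted above.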

is_quantizable

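( )

Returns True if the model is quantizable, False otherwise.

post_init

( )

Safety checker that verifies the arguments are correct - it also replaces some NoneType arguments with their default values.

quantization_method

( )

This method returns the quantization method used for the model. If the model is not quantizable, it returns None.

to_diff_dict

( ) → dict[str, Any]

Returns

dict[str, Any] — Dictionary of all the attributes that make up this configuration instance.

Removes all attributes from the config that correspond to the default config attributes, for better readability, and serializes to a Python dictionary.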
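The idea of serializing only the non-default attributes can be sketched with a small dataclass. `ToyQuantConfig` is a hypothetical stand-in and not a transformers class; the sketch only mirrors the behavior described above.

```python
from dataclasses import asdict, dataclass


@dataclass
class ToyQuantConfig:
    """Hypothetical stand-in for a quantization config."""
    load_in_4bit: bool = False
    quant_type: str = "fp4"
    use_double_quant: bool = False

    def to_diff_dict(self):
        """Keep only the attributes whose values differ from the defaults."""
        defaults = asdict(type(self)())
        return {
            key: value
            for key, value in asdict(self).items()
            if value != defaults[key]
        }


config = ToyQuantConfig(load_in_4bit=True, quant_type="nf4")
diff = config.to_diff_dict()
```

Dropping the defaults keeps a saved config short and makes the settings that were actually changed easy to spot.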

HfQuantizer

class transformers.quantizers.HfQuantizer

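( quantization_config: QuantizationConfigMixin **kwargs )

Abstract class of the HuggingFace quantizer. For now it supports quantizing HF transformers models for inference and/or quantization. This class is used only by transformers.PreTrainedModel.from_pretrained and cannot yet be easily used outside the scope of that method.

Attributes

quantization_config (transformers.utils.quantization_config.QuantizationConfigMixin): The quantization config that defines the quantization parameters of the model to quantize.
modules_to_not_convert (list[str], optional): The list of module names to not convert when quantizing the model.
required_packages (list[str], optional): The list of pip packages that must be installed before using the quantizer.
requires_calibration (bool): Whether the quantization method requires the model to be calibrated before it is used.
requires_parameters_quantization (bool): Whether the quantization method requires creating new parameters. For example, for bitsandbytes a new xxxParameter must be created in order to quantize the model properly.

adjust_max_memory

( max_memory: dict )

Adjusts the max_memory argument of infer_auto_device_map() if extra memory is needed for quantization.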
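One way such an adjustment could look is to scale every per-device budget by a fixed fraction, leaving headroom for buffers created during quantization. The sketch below is an assumption-laden illustration: `reserve_headroom` is a hypothetical helper, the 0.9 factor is chosen for the example, and the budgets are assumed to be integers in bytes.

```python
def reserve_headroom(max_memory, keep_fraction=0.9):
    """Shrink each per-device budget to leave room for quantization buffers."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Assumes every limit is an integer number of bytes.
    return {
        device: int(limit * keep_fraction)
        for device, limit in max_memory.items()
    }


budget = {0: 24 * 1024**3, "cpu": 64 * 1024**3}
adjusted = reserve_headroom(budget)
```

The input dictionary is left untouched, so the original budgets remain available to the caller.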

adjust_target_dtype

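( torch_dtype: torch.dtype )

Parameters

torch_dtype (torch.dtype, optional) — The torch_dtype used to compute the device_map.

Override this method if you want to adjust the target_dtype variable used in from_pretrained to compute the device_map when device_map is a str. For example, for bitsandbytes target_dtype is forced to torch.int8, and for 4-bit quantization a custom enum, accelerate.CustomDtype.int4, is passed.

check_quantized_param

( model: PreTrainedModel param_value: torch.Tensor param_name: str state_dict: dict **kwargs )

Checks whether a loaded state_dict component is part of a quantized parameter and performs some validation. It is only defined for quantization methods with requires_parameters_quantization == True, that is, methods that need to create new parameters for quantization.

create_quantized_param

( *args **kwargs )

Takes the needed components from state_dict and creates a quantized parameter. It is only applicable when requires_parameters_quantization == True.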
