Using quantized models (dtypes)

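Before Transformers.js v3, we used the quantized option to specify whether to use a quantized (q8) or full-precision (fp32) variant of the model, by setting quantized to true or false, respectively. Now, we've added the ability to select from a much larger list with the dtype parameter.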
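To make the old-to-new mapping concrete, here is a minimal sketch of a migration helper. The function name migrateQuantizedOption is hypothetical and not part of Transformers.js; it only rewrites a plain options object, assuming that quantized: true corresponds to "q8" and quantized: false to "fp32", as described above.

```javascript
// Hypothetical helper (not part of Transformers.js): convert v2-style
// options that use `quantized` into v3-style options that use `dtype`.
function migrateQuantizedOption(options = {}) {
  const { quantized, ...rest } = options;
  // An explicit dtype wins over the legacy flag; with no flag, change nothing.
  if (rest.dtype !== undefined || quantized === undefined) return rest;
  // true -> 8-bit quantized weights, false -> full precision.
  return { ...rest, dtype: quantized ? "q8" : "fp32" };
}

console.log(migrateQuantizedOption({ quantized: true }));
console.log(migrateQuantizedOption({ quantized: false, device: "webgpu" }));
```

The returned object can then be passed as the third argument to pipeline() or as the options of from_pretrained().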

The list of available quantizations depends on the model, but some common ones are: full-precision ("fp32"), half-precision ("fp16"), 8-bit ("q8", "int8", "uint8"), and 4-bit ("q4", "bnb4", "q4f16").
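Because the available dtypes differ between models, it can help to state an order of preference and fall back when the first choice is missing. The sketch below assumes the caller already knows which dtypes a model provides; pickDtype and the example list are hypothetical and only illustrate the selection logic.

```javascript
// Hypothetical helper: return the first preferred dtype that the model provides.
function pickDtype(preferred, available) {
  const match = preferred.find((dtype) => available.includes(dtype));
  if (match === undefined) {
    throw new Error(`None of [${preferred.join(", ")}] is available`);
  }
  return match;
}

// Assumed list, for illustration only; check the model repository for the real one.
const availableDtypes = ["fp32", "fp16", "q8", "q4"];
console.log(pickDtype(["q4f16", "q4", "q8"], availableDtypes)); // prints "q4"
```

The result is a plain string, so it can be used directly as the dtype value when loading a model.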

Figure: Available dtypes for a model (e.g., mixedbread-ai/mxbai-embed-xsmall-v1).

Basic usage

Example: Run Qwen2.5-0.5B-Instruct in 4-bit quantization (demo)

import { pipeline } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "onnx-community/Qwen2.5-0.5B-Instruct",
  { dtype: "q4", device: "webgpu" },
);

// Define the list of messages
const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Tell me a funny joke." },
];

// Generate a response
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);

Per-module dtypes

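Some encoder-decoder models, like Whisper or Florence-2, are extremely sensitive to quantization settings, especially those of the encoder. For this reason, we added the ability to select per-module dtypes, which can be done by providing a mapping from module name to dtype.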

Example: Run Florence-2 on WebGPU (demo)

import { Florence2ForConditionalGeneration } from "@huggingface/transformers";

const model = await Florence2ForConditionalGeneration.from_pretrained(
  "onnx-community/Florence-2-base-ft",
  {
    dtype: {
      embed_tokens: "fp16",
      vision_encoder: "fp16",
      encoder_model: "q4",
      decoder_model_merged: "q4",
    },
    device: "webgpu",
  },
);
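A per-module mapping like the one above can also be assembled programmatically, which may be convenient when most modules share one dtype and only a few sensitive ones need higher precision. This is a minimal sketch: buildDtypeMap is a hypothetical helper, and the module names are the ones used in the Florence-2 example; other models will have different module names.

```javascript
// Hypothetical helper: give every module a default dtype, then apply overrides.
function buildDtypeMap(moduleNames, defaultDtype, overrides = {}) {
  const map = {};
  for (const name of moduleNames) {
    // Use the override when one is given, otherwise the default.
    map[name] = overrides[name] ?? defaultDtype;
  }
  return map;
}

// Module names taken from the Florence-2 example; keep the encoders' inputs at fp16.
const florenceDtypes = buildDtypeMap(
  ["embed_tokens", "vision_encoder", "encoder_model", "decoder_model_merged"],
  "q4",
  { embed_tokens: "fp16", vision_encoder: "fp16" },
);
console.log(florenceDtypes);
```

The resulting object has the same shape as the dtype mapping passed to from_pretrained() in the example.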

Figure: Florence-2 running on WebGPU.

See full code example

import {
  Florence2ForConditionalGeneration,
  AutoProcessor,
  AutoTokenizer,
  RawImage,
} from "@huggingface/transformers";

// Load model, processor, and tokenizer
const model_id = "onnx-community/Florence-2-base-ft";
const model = await Florence2ForConditionalGeneration.from_pretrained(
  model_id,
  {
    dtype: {
      embed_tokens: "fp16",
      vision_encoder: "fp16",
      encoder_model: "q4",
      decoder_model_merged: "q4",
    },
    device: "webgpu",
  },
);
const processor = await AutoProcessor.from_pretrained(model_id);
const tokenizer = await AutoTokenizer.from_pretrained(model_id);

// Load image and prepare vision inputs
const url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg";
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);

// Specify task and prepare text inputs
const task = "<MORE_DETAILED_CAPTION>";
const prompts = processor.construct_prompts(task);
const text_inputs = tokenizer(prompts);

// Generate text
const generated_ids = await model.generate({
  ...text_inputs,
  ...vision_inputs,
  max_new_tokens: 100,
});

// Decode generated text
const generated_text = tokenizer.batch_decode(generated_ids, {
  skip_special_tokens: false,
})[0];

// Post-process the generated text
const result = processor.post_process_generation(
  generated_text,
  task,
  image.size,
);
console.log(result);
// { '<MORE_DETAILED_CAPTION>': 'A green car is parked in front of a tan building. The building has a brown door and two brown windows. The car is a two door and the door is closed. The green car has black tires.' }
