Benchmarking Assisted Generation with Gemma 3 and Qwen 2.5: A Code-First Guide
Community Article · Published March 12, 2025
In this blog post, we explore the performance of assisted generation using the newly released Gemma 3 (27B) and Qwen 2.5 (0.5B) models. Assisted generation uses a smaller model to boost the throughput of a larger one. Pretty cool, right? Let's dive into the code and the results.
What is Assisted Generation?
Assisted generation uses a smaller, faster model to draft candidate tokens that the larger model then verifies, improving efficiency without sacrificing output quality. Curious? Check out Hugging Face's detailed explanation.
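To make that concrete, here is a minimal sketch of an assisted generation call in Transformers. The checkpoints, prompt, and generation length are placeholders for illustration; both tokenizers are passed because the two models use different vocabularies:

# Minimal sketch of assisted generation (placeholder checkpoints and prompt).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.gemma3 import Gemma3ForCausalLM

target_ckpt = "google/gemma-3-27b-it"        # large "target" model
draft_ckpt = "Qwen/Qwen2.5-0.5B-Instruct"    # small "assistant" model

target = Gemma3ForCausalLM.from_pretrained(target_ckpt, torch_dtype=torch.bfloat16).to("cuda")
draft = AutoModelForCausalLM.from_pretrained(draft_ckpt, torch_dtype=torch.bfloat16).to("cuda")
target_tok = AutoTokenizer.from_pretrained(target_ckpt)
draft_tok = AutoTokenizer.from_pretrained(draft_ckpt)

inputs = target_tok("Write me a long essay on Deep Learning", return_tensors="pt").to("cuda")

# The assistant drafts candidate tokens; the target verifies them in a single
# forward pass, so greedy decoding matches unassisted generation.
out = target.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=64,
    assistant_model=draft,
    tokenizer=target_tok,            # both tokenizers are required when vocabularies differ
    assistant_tokenizer=draft_tok,
)
print(target_tok.decode(out[0], skip_special_tokens=True))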
Setup
We'll use PyTorch and Hugging Face Transformers to benchmark generation throughput with and without an assistant. Here's the code:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import torch
from torch.utils import benchmark
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.models.gemma3 import Gemma3ForCausalLM
def load_models():
    # Load Gemma 3 (27B)
    large_ckpt = "google/gemma-3-27b-it"
    large_model = Gemma3ForCausalLM.from_pretrained(large_ckpt, torch_dtype=torch.bfloat16).to("cuda")
    large_tokenizer = AutoTokenizer.from_pretrained(large_ckpt)

    # Load Qwen 2.5 (0.5B)
    small_ckpt = "Qwen/Qwen2.5-0.5B-Instruct"
    small_model = AutoModelForCausalLM.from_pretrained(small_ckpt, torch_dtype=torch.bfloat16).to("cuda")
    small_tokenizer = AutoTokenizer.from_pretrained(small_ckpt)

    return large_tokenizer, small_tokenizer, small_model, large_model

def generate_large(large_model, model_inputs):
    # eos_token_id=-1 disables early stopping, so every run generates the full 256 tokens
    large_model.generate(**model_inputs, do_sample=False, max_new_tokens=256, eos_token_id=-1)

def generate_assisted(large_model, small_model, tokenizer, assistant_tokenizer, model_inputs):
    large_model.generate(
        **model_inputs, do_sample=False, max_new_tokens=256, eos_token_id=-1,
        assistant_model=small_model, tokenizer=tokenizer, assistant_tokenizer=assistant_tokenizer
    )

if __name__ == "__main__":
    large_tokenizer, small_tokenizer, small_model, large_model = load_models()

    # Disable caching for fair comparison
    small_model.generation_config.cache_implementation = None
    large_model.generation_config.cache_implementation = None

    # Input prompt
    messages = [{"role": "user", "content": [{"type": "text", "text": "Write me a long essay on Deep Learning"}]}]
    model_inputs = large_tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True,
                                                       return_dict=True, return_tensors="pt").to("cuda")

    # Benchmarking
    results = []
    label = "Generation"

    results.append(benchmark.Timer(
        stmt="generate_large(large_model, model_inputs)",
        setup="from __main__ import generate_large",
        globals={"large_model": large_model, "model_inputs": model_inputs},
        num_threads=torch.get_num_threads(),
        label=label, sub_label="without assistant", description="generation"
    ).blocked_autorange())

    results.append(benchmark.Timer(
        stmt="generate_assisted(large_model, small_model, tokenizer, assistant_tokenizer, model_inputs)",
        setup="from __main__ import generate_assisted",
        globals={"large_model": large_model, "small_model": small_model, "tokenizer": large_tokenizer,
                 "assistant_tokenizer": small_tokenizer, "model_inputs": model_inputs},
        num_threads=torch.get_num_threads(),
        label=label, sub_label="with assistant", description="generation"
    ).blocked_autorange())

    benchmark.Compare(results).print()
Results
Running this code on a CUDA-enabled GPU with 64 threads produces the following output:
[--------------- Generation ----------------]
                          |    generation
64 threads: ---------------------------------
      without assistant   |      23.9
      with assistant      |      20.5

Times are in seconds (s).
The assisted setup (20.5 s) cuts roughly 14% off the wall-clock time of Gemma 3 alone (23.9 s). Not bad for pairing it with a tiny Qwen 2.5!
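As a quick sanity check on that figure, the relative savings and an approximate throughput can be computed directly from the timings above (assuming all 256 new tokens are generated, which eos_token_id=-1 guarantees):

# Quick arithmetic on the reported timings. eos_token_id=-1 forces the full
# 256 new tokens, so wall-clock time maps directly onto throughput.
baseline_s, assisted_s, new_tokens = 23.9, 20.5, 256
print(f"time saved:      {(baseline_s - assisted_s) / baseline_s:.1%}")   # ~14.2%
print(f"throughput gain: {baseline_s / assisted_s - 1:.1%}")              # ~16.6%
print(f"tokens/s: {new_tokens / baseline_s:.1f} -> {new_tokens / assisted_s:.1f}")  # ~10.7 -> ~12.5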