Gemma 3n fully available in the open-source ecosystem!

Published June 26, 2025

Aritra Roy Gosthipaty

Christopher Fleetwood

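Gemma 3n was announced as a *preview* at Google I/O. The on-device community was excited, because this model was designed from the ground up to **run locally** on your own hardware. On top of that, it is natively **multimodal**, accepting image, text, audio, and video inputs 🤯

Today, Gemma 3n is finally available in the most widely used open-source libraries: transformers and timm, MLX, llama.cpp (text inputs), transformers.js, ollama, Google AI Edge, and others.

This post walks through practical snippets that show how to run the model with these libraries, and how easy it is to fine-tune it for other domains.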

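Models released today

Here is the Gemma 3n release collection.

Two model sizes were released today, each in two variants (base and instruct). The model names follow a non-standard nomenclature: they are called `gemma-3n-E2B` and `gemma-3n-E4B`. The `E` preceding the parameter count stands for `Effective`. Their actual parameter counts are `5B` and `8B`, respectively, but thanks to improvements in memory efficiency they need only as much VRAM (GPU memory) as 2B and 4B models.

These models therefore behave like 2B and 4B models in terms of hardware support, while delivering quality beyond the 2B/4B class. The `E2B` model can run in as little as 2GB of GPU RAM, and `E4B` in just 3GB of GPU RAM.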
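To make the gap between raw and effective parameter counts concrete, here is a minimal back-of-the-envelope sketch. It only estimates the memory needed to hold the weights. The post does not say which numeric precision its 2GB and 3GB figures assume, so the bytes-per-parameter values below are assumptions for illustration, and activations and the KV cache are ignored.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to store the weights, in GiB."""
    return num_params * bytes_per_param / 2**30


# Raw vs. effective parameter counts quoted in the post.
MODELS = {
    "gemma-3n-E2B": {"raw": 5e9, "effective": 2e9},
    "gemma-3n-E4B": {"raw": 8e9, "effective": 4e9},
}

# Assumed storage precisions (NOT stated in the post).
PRECISIONS = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def footprint_table() -> dict:
    """Return {model: {precision: (raw_gib, effective_gib)}}."""
    table = {}
    for name, counts in MODELS.items():
        table[name] = {
            precision: (
                weight_memory_gib(counts["raw"], nbytes),
                weight_memory_gib(counts["effective"], nbytes),
            )
            for precision, nbytes in PRECISIONS.items()
        }
    return table


if __name__ == "__main__":
    for model, rows in footprint_table().items():
        for precision, (raw, eff) in rows.items():
            print(f"{model} {precision}: raw {raw:.2f} GiB, effective {eff:.2f} GiB")
```

Whatever precision is used, the accelerator-resident share scales with the effective count rather than the raw one, which is the point of the `E` naming.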

| Size | Base | Instruct |
|------|------|----------|
| 2B | google/gemma-3n-e2b | google/gemma-3n-e2b-it |
| 4B | google/gemma-3n-e4b | google/gemma-3n-e4b-it |

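Details of the models

In addition to the language decoder, Gemma 3n uses an **audio encoder** and a **vision encoder**. We highlight their main features below and describe how they were added to `transformers` and `timm`, since these serve as the reference for other implementations.

Vision encoder (MobileNet-V5). Gemma 3n uses a new version of MobileNet, MobileNet-v5-300, which has been added to the new version of `timm` released today.
- Has 300M parameters.
- Supports resolutions of `256x256`, `512x512`, and `768x768`.
- Reaches 60 FPS on Google Pixel, outperforming ViT Giant while using 3x fewer parameters.
Audio encoder
- Based on the Universal Speech Model (USM).
- Processes audio in `160ms` chunks.
- Supports speech-to-text and translation (for example, English to Spanish/French).
Gemma 3n architecture and language model. The architecture itself has been added to the new version of `transformers` released today. This implementation calls into `timm` for image encoding, so there is a single reference implementation of the MobileNet architecture.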
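To give a feel for what `160ms` chunks mean for the audio encoder described above, here is a minimal sketch that slices a raw waveform into fixed-duration chunks. The 16 kHz sample rate is an assumption (the post does not state it), and the real encoder works on extracted features rather than raw slices, so this only illustrates the chunk arithmetic.

```python
from typing import List, Sequence


def chunk_audio(
    samples: Sequence[float],
    sample_rate: int = 16_000,  # assumed sample rate, not stated in the post
    chunk_ms: int = 160,        # chunk duration quoted in the post
) -> List[Sequence[float]]:
    """Split a waveform into consecutive chunks of `chunk_ms` milliseconds.

    The final chunk may be shorter than the others.
    """
    chunk_len = sample_rate * chunk_ms // 1000  # samples per chunk
    if chunk_len <= 0:
        raise ValueError("chunk length must be at least one sample")
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


if __name__ == "__main__":
    one_second = [0.0] * 16_000  # one second of silence at 16 kHz
    chunks = chunk_audio(one_second)
    print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

At the assumed 16 kHz, one chunk covers 2,560 samples, so one second of audio yields six full chunks plus a shorter remainder.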

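Architecture highlights

MatFormer architecture
- A nested transformer design, similar to Matryoshka embeddings, that allows different subsets of layers to be extracted as if they were standalone models.
- E2B and E4B were trained jointly, with E2B configured as a sub-model of E4B.
- Users can "mix and match" layers depending on their hardware characteristics and memory budget.
Per-Layer Embeddings (PLE): reduce accelerator memory usage by offloading embeddings to the CPU. This is why the E2B model, despite having 5B actual parameters, takes about as much GPU memory as a 2B-parameter model.
KV cache sharing: accelerates long-context processing for audio and video, achieving 2x faster prefill compared to Gemma 3 4B.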
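The nesting idea behind MatFormer can be illustrated with a toy sketch. This is not the Gemma 3n implementation: `ToyLayer`, `extract_submodel`, and the per-layer parameter counts are hypothetical stand-ins, chosen only to show that a sub-model reuses the parent's layers (and therefore its weights) instead of copying them.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ToyLayer:
    """Hypothetical stand-in for one transformer block."""
    name: str
    num_params: int

    def __call__(self, x: float) -> float:
        # Placeholder computation; a real block would apply attention and an FFN.
        return x + 1.0


def extract_submodel(layers: Sequence[ToyLayer], keep: Sequence[int]) -> List[ToyLayer]:
    """Return the kept layers in their original order.

    The returned list references the same layer objects as the parent,
    so no weights are duplicated.
    """
    return [layers[i] for i in sorted(set(keep))]


def forward(layers: Sequence[ToyLayer], x: float) -> float:
    """Run the input through every layer in sequence."""
    for layer in layers:
        x = layer(x)
    return x


def total_params(layers: Sequence[ToyLayer]) -> int:
    return sum(layer.num_params for layer in layers)


# An 8-block "large" model and a 4-block sub-model nested inside it.
full = [ToyLayer(f"block_{i}", 1_000) for i in range(8)]
small = extract_submodel(full, keep=[0, 2, 4, 6])

if __name__ == "__main__":
    print(total_params(full), total_params(small))
    print(small[1] is full[2])  # shared object, not a copy
```

Picking a different `keep` list is the toy equivalent of "mix and match": any subset of blocks that fits the memory budget can be run as its own model.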

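Performance and benchmarks:

LMArena score: E4B is the first sub-10B model to score above 1300.
MMLU scores: Gemma 3n shows competitive performance across sizes (E4B, E2B, and several mix-and-match configurations).
Multilingual support: 140 languages for text and 35 languages for multimodal interactions.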

Demo Space

The easiest way to try the model is through its dedicated Hugging Face Space, where you can experiment with different prompts and modalities.

📱 Space

Inference with transformers

Install the latest versions of timm (for the vision encoder) and transformers to run inference or to fine-tune the model.

pip install -U -q timm
pip install -U -q transformers
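After upgrading, it can help to confirm that the installed versions are recent enough before loading the model. The sketch below uses only the standard library. The minimum versions (`4.53.0` for transformers, `1.0.16` for timm) are assumptions on my part and do not appear in the post, so adjust them to whatever the model card states.

```python
from importlib import metadata
from typing import Optional, Tuple

# Assumed minimum versions; NOT taken from the post.
ASSUMED_MINIMUMS = {"transformers": "4.53.0", "timm": "1.0.16"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric components, e.g. '4.53.0.dev0' -> (4, 53, 0)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(package: str, minimum: str) -> bool:
    """True if `package` is installed at `minimum` or newer."""
    found = installed_version(package)
    return found is not None and version_tuple(found) >= version_tuple(minimum)


if __name__ == "__main__":
    for name, minimum in ASSUMED_MINIMUMS.items():
        status = "ok" if is_at_least(name, minimum) else "missing or too old"
        print(f"{name}: {installed_version(name)} ({status})")
```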

Inference with pipeline

The easiest way to get started with Gemma 3n is the pipeline abstraction in transformers:

import torch
from transformers import pipeline

pipe = pipeline(
   "image-text-to-text",
   model="google/gemma-3n-E4B-it", # "google/gemma-3n-E4B-it"
   device="cuda",
   torch_dtype=torch.bfloat16
)

messages = [
   {
       "role": "user",
       "content": [
           {"type": "image", "url": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
           {"type": "text", "text": "Describe this image"}
       ]
   }
]

output = pipe(text=messages, max_new_tokens=32)
print(output[0]["generated_text"][-1]["content"])

Output

The image shows a futuristic, sleek aircraft soaring through the sky. It's designed with a distinctive, almost alien aesthetic, featuring a wide body and large

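Detailed inference with transformers

Initialize the model and the processor from the Hub, then write a `model_generation` function that processes the prompt and runs inference on the model.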

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

# Run on the GPU when one is available; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "google/gemma-3n-e4b-it" # google/gemma-3n-e2b-it
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id).to(device)

def model_generation(model, messages):
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    input_len = inputs["input_ids"].shape[-1]

    inputs = inputs.to(model.device, dtype=model.dtype)

    with torch.inference_mode():
        generation = model.generate(**inputs, max_new_tokens=32, disable_compile=False)
        generation = generation[:, input_len:]

    decoded = processor.batch_decode(generation, skip_special_tokens=True)
    print(decoded[0])

Since the model accepts every modality as input, here is a brief walkthrough of how to use each of them via transformers.

Text only

# Text Only

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of France?"}
        ]
    }
]
model_generation(model, messages)

Output

The capital of France is **Paris**.

Interleaved with audio

# Interleaved with Audio

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English:"},
            {"type": "audio", "audio": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/speech.wav"},
        ]
    }
]
model_generation(model, messages)

Output

Send a text to Mike. I'll be home late tomorrow.

Interleaved with image/video

Video is supported as a sequence of image frames.

# Interleaved with Image

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]
model_generation(model, messages)

Output

The image shows a futuristic, sleek, white airplane against a backdrop of a clear blue sky transitioning into a cloudy, hazy landscape below. The airplane is tilted at
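Because video is handled as a sequence of image frames, a clip can be turned into a chat message by sampling a few frames and listing them as image entries. The sketch below is one possible way to do that. `sample_frame_indices` and `frames_to_message` are hypothetical helpers, not part of transformers; the frame paths are placeholders; and it assumes the processor accepts several image entries in a single user turn. Extracting the frames from the video file is left to whichever video library you prefer.

```python
from typing import Dict, List, Sequence


def sample_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick up to `num_frames` indices spread evenly across the clip."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    num_frames = min(num_frames, total_frames)
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


def frames_to_message(frame_sources: Sequence[str], prompt: str) -> List[Dict]:
    """Build a single-turn message: one image entry per frame, then the text prompt.

    Uses the same entry layout as the image example in the post:
    {"type": "image", "image": <path or URL>}.
    """
    content = [{"type": "image", "image": src} for src in frame_sources]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    # Placeholder frame files; in practice these would be extracted from a video.
    all_frames = [f"frames/frame_{i:04d}.jpg" for i in range(100)]
    picked = [all_frames[i] for i in sample_frame_indices(len(all_frames), 4)]
    messages = frames_to_message(picked, "Describe what happens in this clip.")
    print(messages)
```

The resulting `messages` list has the same shape as the examples above, so it could be passed to `model_generation(model, messages)`.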

Inference with MLX

Gemma 3n comes with day-zero MLX support across all 3 modalities. Make sure to upgrade your mlx-vlm installation.

pip install -U mlx-vlm

Get started with vision:

python -m mlx_vlm.generate --model google/gemma-3n-E4B-it --max-tokens 100 --temperature 0.5 --prompt "Describe this image in detail." --image https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg

And with audio:

python -m mlx_vlm.generate --model google/gemma-3n-E4B-it --max-tokens 100 --temperature 0.0 --prompt "Transcribe the following speech segment in English:" --audio https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/audio-samples/jfk.wav
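If you want to drive the same command line from a script, a small wrapper can assemble the arguments. This sketch only uses the flags that appear in the two commands above (`--model`, `--max-tokens`, `--temperature`, `--prompt`, `--image`, `--audio`); `build_mlx_vlm_command` is a hypothetical helper, and it deliberately does not call into the mlx-vlm Python API.

```python
import sys
from typing import List, Optional


def build_mlx_vlm_command(
    model: str,
    prompt: str,
    image: Optional[str] = None,
    audio: Optional[str] = None,
    max_tokens: int = 100,
    temperature: float = 0.0,
) -> List[str]:
    """Assemble the argv list for `python -m mlx_vlm.generate`."""
    cmd = [
        sys.executable, "-m", "mlx_vlm.generate",
        "--model", model,
        "--max-tokens", str(max_tokens),
        "--temperature", str(temperature),
        "--prompt", prompt,
    ]
    if image:
        cmd += ["--image", image]
    if audio:
        cmd += ["--audio", audio]
    return cmd


if __name__ == "__main__":
    cmd = build_mlx_vlm_command(
        model="google/gemma-3n-E4B-it",
        prompt="Describe this image in detail.",
        image="https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg",
        temperature=0.5,
    )
    print(" ".join(cmd))
    # To actually run it (requires mlx-vlm on Apple silicon):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain spaces or punctuation.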

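Inference with llama.cpp

Besides MLX, Gemma 3n (text only) works out of the box with llama.cpp. Make sure to install llama.cpp/Ollama from source.

See the llama.cpp installation instructions here: https://github.com/ggml-org/llama.cpp/blob/master/docs/install.md

You can run it as follows: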

llama-server -hf ggml-org/gemma-3n-E4B-it-GGUF:Q8_0
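Once `llama-server` is running, it can be queried over HTTP through its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library. It assumes the server listens on its default address `http://localhost:8080`; if you started it with a different host or port, pass `base_url` accordingly. `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8080",  # assumed default llama-server address
    max_tokens: int = 64,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, **kwargs) -> str:
    """Send the prompt to a running llama-server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires llama-server to be running):
#   print(ask("What is the capital of France?"))
```

Since this build of Gemma 3n in llama.cpp is text only, the request carries plain text content and no image or audio parts.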

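Inference with Transformers.js and ONNXRuntime

Finally, we are also releasing ONNX weights for the gemma-3n-E2B-it model variant, enabling flexible deployment across different runtimes and platforms. For JavaScript developers, Gemma3n has been integrated into Transformers.js and is available as of version 3.6.0.

For more information on how to run the model with these libraries, check the usage section of the model card.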

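Fine-tune in a free Google Colab

Given the size of the model, it is convenient to fine-tune it for specific downstream tasks across modalities. To make this easier, we have created a simple notebook that lets you experiment on a free Google Colab!

We also provide a dedicated notebook for fine-tuning on audio tasks, so you can easily adapt the model to your own speech datasets and benchmarks!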

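Hugging Face Gemma Recipes

With this release, we are also introducing the Hugging Face Gemma Recipes repository. There you will find `notebooks` and `scripts` for running and fine-tuning the models.

We would love for you to use the Gemma family of models and add more recipes to it! Feel free to open Issues and create Pull Requests in the repository.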

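Conclusion

We are always excited to host Google and their Gemma family of models. We hope the community will come together and make the most of them. Multimodal, small, and highly capable: this makes for a great model release!

If you want to discuss the models in more detail, start a discussion right below this blog post. We will be more than happy to help!

A huge thanks to Arthur, Cyril, Raushan, Lysandre, and everyone at Hugging Face who took care of the integration and made it available to the community!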

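Community

mrdbourke

Jun 26

Sensational release!

Will the MobileNet-v5 weights be added to https://huggingface.co/timm?

Are there any results available on the performance of this model (just the vision encoder part)?

Thank you for all the effort.

rishiraj

Jun 27

Great write-up. For anyone who wants a deeper look at the architecture, read https://huggingface.co/blog/rishiraj/matformer-in-gemma-3n

evo42

Jun 27

Thanks for the day-zero MLX support 🙏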

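jobesu

Jun 27 (edited Jun 28)

Thanks a lot for this release!
The post mentions that the model reaches "60 FPS on Google Pixel", so that is with images as input. If I am not mistaken, on a Google Pixel the model runs on the Google Tensor G4 chip. To run the model on a Qualcomm chip (for example, the QCS8550), my understanding is that we should use the llama.cpp library, but it does not seem to provide the vision encoder (the post says "llama.cpp (text inputs)"). Is my understanding correct? Or is the ONNX Runtime version the best option on Qualcomm? Basically, my question is: what is the best way to run Gemma on-device with multimodal support?
Thanks.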
