Building AI from Scratch: Train Your First Language Model in 5 Minutes with Google's Gemma
Community Article · Published July 31, 2025
Have you ever felt that AI like ChatGPT is pure magic? What if you could build a small language model of your own from scratch, right now, and prove that it isn't magic, but technology you can understand?
In this guide, we'll do exactly that. We'll "borrow" a "brain" (its tokenizer) from Google's powerful Gemma model and connect it to a brand-new "baby" model that we create ourselves. Then we'll teach it just two sentences and watch it learn to "speak". By the end, you will have trained your first model and seen it work with your own eyes.
This is the ultimate "version 0.1" project, designed to take you from zero to your first real success in AI.
Prerequisites
- Python and pip: Python should already be installed on your system.
- A local Gemma model folder: you need the gemma-3-1b-it-qat-q4_0-unquantized folder. We only need its config and tokenizer files, not the huge model weights.
- Hugging Face Transformers: the magic wand of our project. Install it from your terminal:
pip install transformers torch accelerate
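If you'd like to double-check your setup before moving on, here is a minimal, optional sanity check. It assumes the model folder sits next to your script and contains the usual Hugging Face tokenizer files; adjust the path if yours lives elsewhere.

# Optional sanity check: confirm the library imports and the tokenizer files are in place.
import os

import transformers

model_dir = "./gemma-3-1b-it-qat-q4_0-unquantized"  # adjust if your folder is elsewhere

print("transformers version:", transformers.__version__)
print("model folder found:  ", os.path.isdir(model_dir))
print("tokenizer files found:",
      os.path.isdir(model_dir)
      and any(name.startswith("tokenizer") for name in os.listdir(model_dir)))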
The Blueprint: Our "Frankenstein" AI
Our strategy is simple but clever:
- Borrow the brain (the tokenizer): A large model like Gemma has already spent thousands of hours learning how to read text and break it into numbers (tokens). We'll borrow that pre-trained knowledge so our model doesn't have to learn English from scratch (a tiny tokenizer sketch follows this list).
- Build the body (the model): We'll define a brand-new, extremely tiny model architecture. It is just an empty shell whose weights are initialized to random noise. It knows nothing.
- The lesson (training): We'll show our new model the same two sentences over and over until it starts to recognize the patterns.
- The test (the result): We'll give it the start of a sentence and see whether it can complete it correctly.
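To make the "borrowed brain" concrete, here is a tiny sketch of what the pre-trained tokenizer gives us: it already knows how to turn English text into token IDs and back. The local path is the same one used later in the full script; point it at wherever your folder lives.

from transformers import AutoTokenizer

# Load only the tokenizer from the local Gemma folder; no model weights are touched.
tokenizer = AutoTokenizer.from_pretrained("./gemma-3-1b-it-qat-q4_0-unquantized")

ids = tokenizer("The first sentence is about machine learning.")["input_ids"]
print(ids)                    # a short list of integer token IDs
print(tokenizer.decode(ids))  # reconstructs the original sentence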
The Code: Your Complete Training Script
Create a file named train_tiny_model.py and paste the full code block below into it. The code includes plenty of Pydoc-style comments explaining each part.
# -*- coding: utf-8 -*-
"""A complete script to train a tiny Gemma model from scratch.
This script demonstrates the full pipeline of:
1. Loading a pre-trained tokenizer from a local model directory.
2. Preparing a very small, custom dataset in English.
3. Defining a new, miniature Gemma model architecture with random weights.
4. Training the new model on the custom dataset using the Hugging Face Trainer.
5. Saving the final trained model and testing its text generation capabilities.
"""
import torch
from transformers import (
    AutoConfig,
    AutoTokenizer,
    GemmaConfig,
    GemmaForCausalLM,
    Trainer,
    TrainingArguments,
    pipeline,
)
# ==============================================================================
# STEP 1: CONFIGURE AND LOAD THE PRE-TRAINED TOKENIZER
# ==============================================================================
# This path points to your local, unquantized model directory.
# The tokenizer from this model will be used, but not the model weights.
local_model_path = "./gemma-3-1b-it-qat-q4_0-unquantized"
print(f"Step 1: Loading tokenizer from local path '{local_model_path}'...")
try:
    tokenizer = AutoTokenizer.from_pretrained(local_model_path)
    print("Tokenizer loaded successfully!")
except Exception as e:
    print(f"Error: Could not load tokenizer. Ensure the path is correct: {e}")
    exit()

# Gemma models often don't have a pad_token set by default. We'll set it
# to the end-of-sentence token, which is a standard practice.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print("Set model's pad_token to be the eos_token.")
# ==============================================================================
# STEP 2: PREPARE THE CUSTOM DATASET
# ==============================================================================
print("\nStep 2: Preparing the custom dataset...")
# Our entire training dataset consists of just two English sentences.
sentences = [
    "The first sentence is about machine learning.",
    "The second sentence is about natural language processing.",
]
# Tokenize the sentences, converting text into numerical IDs that the model
# can understand.
inputs = tokenizer(
    sentences,
    padding=True,         # Pad sentences to the same length within the batch.
    truncation=True,      # Truncate sentences if they are too long.
    return_tensors="pt",  # Return PyTorch tensors.
)
print("Dataset tokenized. Input tensor shape:", inputs['input_ids'].shape)
# ==============================================================================
# STEP 3: DEFINE A NEW, TINY MODEL ARCHITECTURE
# ==============================================================================
print("\nStep 3: Defining a new, tiny model architecture...")
# First, load the original configuration to get essential parameters
# like vocab_size, which MUST match the tokenizer.
base_config = AutoConfig.from_pretrained(local_model_path)
# Now, define the configuration for our new, very small model.
# This is the "from scratch" part, as we are defining a new structure
# without loading any pre-trained weights.
small_config = GemmaConfig(
    hidden_size=128,                    # Drastically reduced hidden layer size.
    intermediate_size=512,              # Drastically reduced feed-forward layer size.
    num_hidden_layers=2,                # Only 2 layers instead of many more.
    num_attention_heads=4,              # Number of query heads.
    num_key_value_heads=4,              # Number of key/value heads (must be present for Gemma).
    max_position_embeddings=1024,       # Maximum sequence length the model can handle.
    vocab_size=base_config.vocab_size,  # CRITICAL: Must match the tokenizer.
    pad_token_id=tokenizer.pad_token_id,
)
# Instantiate a new model from our tiny configuration.
# Its weights will be randomly initialized.
small_model = GemmaForCausalLM(small_config)
print("Tiny model created successfully!")
print(f"Model parameter count: {small_model.num_parameters():,}")
# ==============================================================================
# STEP 4: CONFIGURE AND RUN THE TRAINING
# ==============================================================================
print("\nStep 4: Configuring and starting the training...")
class SimpleDataset(torch.utils.data.Dataset):
    """A simple Dataset class compatible with the Hugging Face Trainer.

    This class wraps the tokenized inputs dictionary and makes it accessible
    by index. For causal language modeling, it sets the 'labels' to be the
    same as the 'input_ids'.

    Attributes:
        encodings (dict): A dictionary from the tokenizer containing tensors
            like 'input_ids' and 'attention_mask'.
    """

    def __init__(self, encodings):
        """Initializes the SimpleDataset.

        Args:
            encodings (dict): The tokenized inputs from a Hugging Face tokenizer.
        """
        self.encodings = encodings

    def __getitem__(self, idx):
        """Retrieves an item (a dictionary of tensors) by index."""
        # For language modeling, the model learns to predict the next token,
        # so the `labels` are typically the `input_ids` themselves.
        item = {key: val[idx].clone() for key, val in self.encodings.items()}
        item['labels'] = item['input_ids']
        return item

    def __len__(self):
        """Returns the total number of samples in the dataset."""
        return len(self.encodings['input_ids'])
train_dataset = SimpleDataset(inputs)
# Define the arguments that control the training process.
training_args = TrainingArguments(
    output_dir="./gemma_tiny_model_output_en",  # Directory for checkpoints and logs.
    num_train_epochs=100,           # Train for 100 epochs to ensure the model memorizes the data.
    per_device_train_batch_size=1,  # Process one sentence at a time.
    logging_steps=10,               # Log training loss every 10 steps.
    save_strategy="no",             # Do not save checkpoints during training.
    report_to="none",               # Disable integrations like W&B.
)
# Initialize the Trainer, which handles the entire training loop.
trainer = Trainer(
    model=small_model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
print("Training complete!")
# ==============================================================================
# STEP 5: SAVE THE FINAL MODEL AND TEST IT
# ==============================================================================
print("\nStep 5: Saving the final model and testing...")
# Define the path to save the final, trained model and tokenizer.
final_model_path = "./my_first_tiny_gemma_en"
trainer.save_model(final_model_path)
tokenizer.save_pretrained(final_model_path)
print(f"Model saved to: {final_model_path}")
print("\n--- Testing Generation ---")
# The prompt should match the beginning of one of our training sentences.
prompt = "The first sentence is"
# Create a text-generation pipeline with our newly trained model.
generator = pipeline(
    'text-generation',
    model=final_model_path,
    tokenizer=final_model_path,
    device=0 if torch.cuda.is_available() else -1,  # Use GPU if available.
)
# Generate text. The model should overfit and complete the sentence it saw.
outputs = generator(prompt, max_new_tokens=15)
print(f"\nInput Prompt: '{prompt}'")
print(f"Model Generation: '{outputs[0]['generated_text']}'")
print("\n--- Experiment Successful ---")
print("You have successfully trained and tested a miniature language model from scratch!")
The Moment of Truth: Run the Script
Now open your terminal, navigate to the directory where you saved train_tiny_model.py, and run it:
python train_tiny_model.py
Understanding Your Success: The Output
You will see a lot of text, but let's focus on the most important parts. Your output should look almost exactly like this:
Step 4: Configuring and starting the training...
{'loss': 10.4814, ...}
{'loss': 9.3495, ...}
...
{'loss': 5.8222, ...}
{'train_runtime': 3.6616, ... 'train_loss': 6.8524, 'epoch': 100.0}
Training complete!
Step 5: Saving the final model and testing...
Model saved to: ./my_first_tiny_gemma_en
--- Testing Generation ---
Input Prompt: 'The first sentence is'
Model Generation: 'The first sentence is about machine learning............'
--- Experiment Successful ---
You have successfully trained and tested a miniature language model from scratch!
Let's break down why this is a huge success:
- The loss went down: Look at the {'loss': ...} lines. The number starts high (around 10.4) and ends much lower (around 5.8). Loss is a measure of the model's error, so a falling loss means the model really is learning (a small perplexity sketch follows this list).
- It completed your sentence! This is the magical moment.
  - You prompted it with: 'The first sentence is'
  - It generated: 'The first sentence is about machine learning............'
  - It worked! It recalled the sentence it was trained on perfectly. The random, meaningless model you created a few minutes ago has learned to associate your prompt with the correct completion. The trailing dots (.) are simply the model running out of things to say, because its knowledge is extremely limited, which is exactly what we expected.
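If you are curious what those loss numbers mean, a standard way to read a cross-entropy loss is to exponentiate it into a perplexity, roughly "how many tokens the model is hesitating between" at each step. This little sketch just plugs in the example values from the log above; your exact numbers will differ.

import math

# Convert the example cross-entropy losses from the log above into perplexities.
for loss in (10.4814, 5.8222):
    print(f"loss {loss:.4f}  ->  perplexity ~{math.exp(loss):,.0f}")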
Conclusion and What's Next?
Congratulations! You have officially built and trained your first language model. You have demystified the process and taken a huge step. You now know that training an AI comes down to:
- Defining a structure.
- Showing it data.
- Minimizing its error.
- Testing its knowledge.
Now go and experiment! Try the following challenges (a minimal sketch of the edits appears after the list):
- Change the prompt in the script to "The second sentence is" and see whether it generates the correct response about natural language processing.
- Add a third or fourth sentence to the sentences list, then retrain the model.
- Change num_train_epochs to 500 and see whether the final loss drops even lower.
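Here is a minimal sketch of the edits these challenges call for, reusing the variable names from train_tiny_model.py. The third sentence is just a made-up example; swap in anything you like.

from transformers import TrainingArguments

# Challenge 1: in the testing section, change the prompt.
prompt = "The second sentence is"

# Challenge 2: add a third (or fourth) sentence to the training data, then retrain.
sentences = [
    "The first sentence is about machine learning.",
    "The second sentence is about natural language processing.",
    "The third sentence is about computer vision.",  # hypothetical new example
]

# Challenge 3: train for longer and watch whether the final loss drops further.
training_args = TrainingArguments(
    output_dir="./gemma_tiny_model_output_en",
    num_train_epochs=500,           # was 100
    per_device_train_batch_size=1,
    logging_steps=10,
    save_strategy="no",
    report_to="none",
)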
Welcome to the world of AI building. Your journey from 0 to 0.1 is complete. The road to 1.0 is yours to travel!