Fine-tune SmolLM on domain-specific synthetic data generated by an LLM

Community article, published January 3, 2025

davidberenstein1957

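Yes, smaller models can beat GPT4-like models on domain-specific tasks, but don't expect miracles. When comparing small and large models, weigh all the costs and benefits, such as the difference in performance and the value of having a private, local model and data that you own and can use however you like.

Hugging Face's SmolLM models are fast and capable. With checkpoints at 135M, 360M, and 1.7B parameters, they are a good choice when you need a small, fast model. Because SmolLM is a general-purpose model, it can be fine-tuned on domain-specific data.

A lack of domain-specific datasets is a common problem for smaller, specialized models, because it is hard to find a dataset that is both representative and diverse enough for a specific task. We address this by generating a synthetic dataset from an LLM with the synthetic-data-generator, which is available as a Hugging Face Space or on GitHub.

In this example, we fine-tune a SmolLM2 model on a synthetic dataset generated from meta-llama/Meta-Llama-3.1-8B-Instruct with the synthetic-data-generator.

Install the dependencies

We install the basic dependencies for fine-tuning with trl. The synthetic dataset itself is generated through the Synthetic Data Generator UI.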

!pip install transformers datasets trl torch
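Since APIs in these libraries change between releases, it can help to record which versions ended up installed before running the rest of the notebook. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is hypothetical and not part of any of these packages.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    found = installed_versions(["transformers", "datasets", "trl", "torch"])
    for name, version in found.items():
        print(f"{name}: {version or 'not installed'}")
```

Keeping this output next to your results makes it easier to reproduce a run later.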

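The problem

Reasoning data has proven to be a game changer for the performance of generative models. Reasoning is great, but it also makes the model more "chatty" during token generation, which makes it slower and more expensive to run. We therefore want a model that can reason without being overly chatty. To get there, we generate a concise reasoning dataset and fine-tune a SmolLM2 model on it.

Let's generate some data

We head over to the Hugging Face Space to generate the data. This takes three steps: 1) come up with a dataset description, 2) iterate on the task configuration, and 3) generate the data and push it to Hugging Face. A more detailed walkthrough can be found in this blog post.

For this example, we generate 5,000 single-turn chat examples, all at a temperature of 1. After some iteration, we arrived at the following system prompt: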

You are an AI assistant who provides brief and to-the-point responses with logical step-by-step reasoning. Your purpose is to offer straightforward explanations and answers so that you can get to the heart of the issue. Respond with extremely concise, direct justifications and evidence-based conclusions. User questions are direct and concise.
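To make the generation settings concrete, here is a minimal sketch of what a single request with this system prompt and a temperature of 1 could look like. It is an illustration only and not how the Synthetic Data Generator works internally. The helper names `build_messages` and `generate_completion` are hypothetical, `max_tokens=256` is an assumed value, and the call relies on `InferenceClient.chat_completion` from `huggingface_hub`, which needs network access and a model that is available through the inference API.

```python
SYSTEM_PROMPT = (
    "You are an AI assistant who provides brief and to-the-point responses "
    "with logical step-by-step reasoning. Your purpose is to offer "
    "straightforward explanations and answers so that you can get to the "
    "heart of the issue. Respond with extremely concise, direct "
    "justifications and evidence-based conclusions. User questions are "
    "direct and concise."
)


def build_messages(question, system_prompt=SYSTEM_PROMPT):
    """Build a single-turn chat payload: one system message, one user message."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question.strip()},
    ]


def generate_completion(question, model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"):
    """Request one completion (assumption: the model is served via the inference API)."""
    # Imported lazily so build_messages works without huggingface_hub installed.
    from huggingface_hub import InferenceClient

    client = InferenceClient(model=model_id)
    response = client.chat_completion(
        messages=build_messages(question),
        temperature=1.0,  # the article generates all examples at temperature 1
        max_tokens=256,  # assumed cap, not stated in the article
    )
    return response.choices[0].message.content
```

Calling `generate_completion("What is the primary function of mitochondria within a cell?")` would return one candidate completion; the Space automates this kind of request at the scale of thousands of examples.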

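We press the "Push to Hub" button and wait for the data to be generated. This takes a few hours and yields a dataset of 5,000 examples, the maximum we can generate in a single run. You can scale beyond this by deploying a private instance of the Synthetic Data Generator.

The data is also pushed to Argilla, so we recommend inspecting and validating it before fine-tuning the actual model. We applied some basic filtering and transformations to make the data more suitable for fine-tuning.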
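The article does not say which filters and transformations were applied, so the following is only a sketch of what "basic filtering" might look like for a concise-reasoning dataset. The column names `prompt` and `completion` match the ones used later when the dataset is formatted; the helper names and the 150-word threshold are assumptions chosen for illustration.

```python
def keep_example(example, max_completion_words=150):
    """Keep rows with a non-empty prompt and a non-empty, reasonably short completion."""
    prompt = (example.get("prompt") or "").strip()
    completion = (example.get("completion") or "").strip()
    if not prompt or not completion:
        return False
    # Whitespace-separated words are a rough proxy for response length.
    return len(completion.split()) <= max_completion_words


def clean_example(example):
    """Strip surrounding whitespace from both text fields."""
    return {
        **example,
        "prompt": example["prompt"].strip(),
        "completion": example["completion"].strip(),
    }


rows = [
    {"prompt": " What is ATP? ", "completion": " The cell's main energy carrier. "},
    {"prompt": "Why is the sky blue?", "completion": ""},
    {"prompt": "Define entropy.", "completion": "word " * 500},
]
filtered = [clean_example(row) for row in rows if keep_example(row)]
```

With a Hugging Face `Dataset`, the same functions could be passed to `ds.filter(keep_example)` and `ds.map(clean_example)`.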

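Fine-tune the model

We use TRL to fine-tune the model. It is part of the Hugging Face ecosystem and works directly on datasets produced by the Synthetic Data Generator, without any data transformations.

Load the model

We first load the model and tokenizer and set up the chat format.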

# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch
import os

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Load the model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-360M"
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Set up the chat format
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)
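To see what the chat format does to a conversation, here is a hand-written approximation of a ChatML-style rendering, assuming `setup_chat_format` is left at its default ChatML format. The real template lives on the tokenizer as a Jinja template, so `tokenizer.apply_chat_template` remains the authoritative renderer; `render_chatml` is a hypothetical helper for illustration only.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Approximate ChatML rendering: each turn is wrapped in <|im_start|> ... <|im_end|>."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        text += "<|im_start|>assistant\n"
    return text


example = render_chatml(
    [{"role": "user", "content": "What is the primary function of mitochondria?"}],
    add_generation_prompt=True,
)
print(example)
```

Because the base model was not trained with these special tokens, they only become meaningful to it after fine-tuning on text rendered this way.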

Test the base model

We first test the base model on a sample prompt to see how it performs on the task before any training.

from transformers import pipeline
# Let's test the base model before training
prompt = "What is the primary function of mitochondria within a cell?"

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)
pipe(prompt, max_new_tokens=100)
# [{'generated_text': 'What is the primary function of mitochondria within a cell?\n\nThe function of the mitochondria is to produce energy for the cell through a process called cellular respiration.'}]

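Load the dataset

For fine-tuning, we load the dataset and format each example with the chat template. We use the synthetic-concise-reasoning-sft-filtered dataset generated in the previous step.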

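from datasets import load_dataset

# Load the synthetic dataset generated in the previous step
ds = load_dataset("argilla/synthetic-concise-reasoning-sft-filtered")
def tokenize_function(example):
    # Render one prompt/completion pair as chat text (tokenize=False returns a string, not token IDs)
    messages = [
        {"role": "user", "content": example["prompt"].strip()},
        {"role": "assistant", "content": example["completion"].strip()},
    ]
    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)
    return example
# map() is not batched by default, so the function is called once per row
ds = ds.map(tokenize_function)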

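Fine-tune the model

We now fine-tune the model with the SFTTrainer from the trl library, using a batch size of 4 and a learning rate of 5e-5. We also set the use_mps_device flag so that the MPS device is used when available.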

os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

# Configure the SFTTrainer
sft_config = SFTConfig(
    output_dir="./sft_output",
    num_train_epochs=1,
    per_device_train_batch_size=4,  # Set according to your GPU memory capacity
    learning_rate=5e-5,  # Common starting point for fine-tuning
    logging_steps=100,  # Frequency of logging training metrics
    use_mps_device=(device == "mps"),  # Only enable on Apple MPS devices
    hub_model_id="argilla/SmolLM2-360M-synthetic-concise-reasoning",  # Set a unique name for your model
    push_to_hub=True,
)

# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=ds["train"],
    tokenizer=tokenizer,  # Newer TRL releases rename this argument to processing_class
)
trainer.train()
# {'loss': 1.4498, 'grad_norm': 2.3919131755828857, 'learning_rate': 4e-05, 'epoch': 0.1}
# {'loss': 1.362, 'grad_norm': 1.6650595664978027, 'learning_rate': 3e-05, 'epoch': 0.19}
# {'loss': 1.3778, 'grad_norm': 1.4778285026550293, 'learning_rate': 2e-05, 'epoch': 0.29}
# {'loss': 1.3735, 'grad_norm': 2.1424977779388428, 'learning_rate': 1e-05, 'epoch': 0.39}
# {'loss': 1.3512, 'grad_norm': 2.3498542308807373, 'learning_rate': 0.0, 'epoch': 0.48}
# {'train_runtime': 1911.514, 'train_samples_per_second': 1.046, 'train_steps_per_second': 0.262, 'train_loss': 1.3828572998046875, 'epoch': 0.48}
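The logged learning rates (4e-05, 3e-05, ..., 0.0 at every 100 steps) are consistent with a linear decay from 5e-5 to zero over about 500 optimizer steps with no warmup. The total of 500 steps is inferred from the log rather than stated in the article: there are five log lines at `logging_steps=100`, and `train_runtime` times `train_steps_per_second` is roughly 500. The sketch below reproduces that schedule under those assumptions; `linear_decay_lr` is a hypothetical helper, not a library function.

```python
def linear_decay_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup followed by linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 500  # inferred from the log, not stated in the article
BASE_LR = 5e-5  # learning_rate from the SFTConfig above

for step in (100, 200, 300, 400, 500):
    print(step, linear_decay_lr(step, TOTAL_STEPS, BASE_LR))

# Samples seen: 500 steps at a per-device batch size of 4 on one device
print("samples seen:", TOTAL_STEPS * 4)
```

Note that the log ends at epoch 0.48 even though the config sets `num_train_epochs=1`; the article does not explain the difference, so treat the step count here as an inference.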

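We did not use a dedicated validation set in this example. The training loss drops from about 1.45 to about 1.35 over the logged steps, which suggests the model is fitting the training data, although that alone does not show that it generalizes. To get a better feel for the model's behavior, we test it again with the same prompt.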
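If you want a held-out check instead of relying on training loss, one option is to set aside a small validation split before training. The sketch below shows the idea in plain Python; the helper name `split_train_validation`, the 10% fraction, and the seed are assumptions for illustration.

```python
import random


def split_train_validation(rows, validation_fraction=0.1, seed=42):
    """Deterministically shuffle row indices and hold out a fraction for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_validation = max(1, int(len(rows) * validation_fraction))
    validation_idx = set(indices[:n_validation])
    train = [row for i, row in enumerate(rows) if i not in validation_idx]
    validation = [row for i, row in enumerate(rows) if i in validation_idx]
    return train, validation


rows = [{"prompt": f"question {i}", "completion": f"answer {i}"} for i in range(100)]
train_rows, validation_rows = split_train_validation(rows)
print(len(train_rows), len(validation_rows))
```

With a Hugging Face `Dataset`, `ds["train"].train_test_split(test_size=0.1, seed=42)` does the equivalent, and the held-out part can be passed to `SFTTrainer` as `eval_dataset`.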

Run inference

We can now run inference with the fine-tuned model.

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe(prompt, max_new_tokens=100)
# [{'generated_text': 'The primary function of mitochondria is to generate energy for the cell. They are organelles found in eukaryotic cells that convert nutrients into ATP (adenosine triphosphate), which is the primary source of energy for cellular processes.'}]

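Conclusion

We fine-tuned a SmolLM2 model on a synthetic dataset generated from a large language model. The model gives a sensible answer on the example prompt, and synthetic data looks like a practical way to produce diverse and representative data for supervised fine-tuning.

In practice, you will likely want to spend more time on data quality and on fine-tuning the model, but the workflow shows that the Synthetic Data Generator is a useful tool for generating synthetic data for a wide range of tasks.

Overall, I think this is pretty good for a couple of hours of generation and fine-tuning on consumer hardware.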

Community

helloansuman

March 14

How can I apply the same process to a question-answering dataset? What is the system prompt format?
