在您自己的数据集上微调 Mistral

社区文章发布于 2024 年 7 月 22 日

步骤 0：安装所需库

步骤 1：加载并格式化您的数据集

步骤 2：设置模型和分词器

步骤 3：设置 PEFT（参数高效微调）

步骤 4：设置训练参数

步骤 5：初始化训练器并微调模型

步骤 6：将适配器和模型合并

步骤 7：将微调后的模型推送到 Hugging Face Hub
步骤 8：被诅咒的孩子

此脚本已弃用！自发布以来，transformers 已进行了多次更新！

本教程将引导您完成使用 Hugging Face Transformers 和 PEFT 库在您自己的数据集上微调 Mistral-7B-Instruct 模型的过程。

步骤 0：安装所需库

!pip install -q datasets accelerate evaluate trl accelerate bitsandbytes peft

步骤 1：加载并格式化您的数据集

我们将定义一个函数来格式化数据集中的提示并加载数据集。

def format_prompts(examples):
    """
    Define the format for your dataset
    This function should return a dictionary with a 'text' key containing the formatted prompts
    """
    pass

from datasets import load_dataset

dataset = load_dataset("your_dataset_name", split="train")
dataset = dataset.map(format_prompts, batched=True)

dataset['text'][2] # Check to see if the fields were formatted correctly

步骤 2：设置模型和分词器

接下来，我们将加载预训练的 Mistral-7B-Instruct 模型和分词器，并设置模型以进行量化和梯度检查点。

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

步骤 3：设置 PEFT（参数高效微调）

我们将使用 PEFT 技术高效地微调模型。这包括设置 LoraConfig 并获取 PEFT 模型。

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)

步骤 4：设置训练参数

我们将定义微调过程的训练参数。

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="your_model_name",
    num_train_epochs=4, # replace this, depending on your dataset
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    optim="sgd"
)

将 "your_model_name" 替换为您微调模型的所需名称。

步骤 5：初始化训练器并微调模型

现在，我们将从 trl 库初始化 SFTTrainer 并训练模型。

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    dataset_text_field='text',
    max_seq_length=1024,
)

trainer.train()

步骤 6：将适配器和模型合并

微调完成后，您可以将模型合并回去。

adapter_model = trainer.model
merged_model = adapter_model.merge_and_unload()

trained_tokenizer = trainer.tokenizer

步骤 7：将微调后的模型推送到 Hugging Face Hub

完成所有这些操作后，您可以选择将微调后的模型推送到 Hugging Face Hub，以便更轻松地共享和部署。

repo_id = "your_repo_name"

merged_model.push_to_hub(repo_id)
trained_tokenizer.push_to_hub(repo_id)

步骤 8：被诅咒的孩子

如果您觉得额外刺激，可以在新脚本中对模型进行反量化。

!pip install accelerate bitsandbytes peft transformers # make sure to install dependencies again

from transformers import AutoModelForCausalLM

model_id = "your_repo_name"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

config = model.config
del config.quantization_config
del config._pre_quantization_dtype
model.config = config

model.dequantize()

model.push_to_hub(model_id) # the tokenizer will stay the same

请注意，Mistral 是一个非常大的模型，执行此操作需要相当多的计算资源。

社区

通过拖放到文本输入框、粘贴或点击此处上传图片、音频和视频。

点击或粘贴此处以上传图片

· 注册或登录发表评论