Fine-tuning a Code LLM on Custom Code with a single GPU
Publicly available code LLMs such as Codex, StarCoder, and Code Llama are great at generating code that adheres to general programming principles and syntax, but they may not align with an organization's internal conventions, or be aware of proprietary libraries.
In this notebook, we'll show how you can fine-tune a code LLM on private code bases to enhance its contextual awareness and improve the model's usefulness to your organization's needs. Since code LLMs are quite large, fine-tuning them in a traditional manner can be resource-draining. Worry not! We will show how you can optimize the fine-tuning so that it fits on a single GPU.
Dataset
For this example, we picked the top 10 Hugging Face public repositories on GitHub. We have excluded non-code files from the data, such as images, audio files, presentations, and so on. For Jupyter notebooks, we've kept only the cells containing code. The resulting code is stored as a dataset that you can find on the Hugging Face Hub under smangrul/hf-stack-v1. It contains the repo id, file path, and file content.
Model
We'll fine-tune bigcode/starcoderbase-1b, which is a 1B parameter model trained on 80+ programming languages. This is a gated model, so if you plan to run this notebook with this exact model, you'll need to gain access to it on the model's page. Log in to your Hugging Face account to do so:
from huggingface_hub import notebook_login
notebook_login()
First, let's install all the necessary libraries. As you can see, in addition to transformers and datasets, we'll be using peft, bitsandbytes, and flash-attn to optimize the training process.
By employing parameter-efficient training techniques, we can run this notebook on a single A100 High-RAM GPU.
!pip install -q transformers datasets peft bitsandbytes flash-attn
Now let's define some variables. Feel free to adjust these values to your liking.
MODEL = "bigcode/starcoderbase-1b" # Model checkpoint on the Hugging Face Hub
DATASET = "smangrul/hf-stack-v1" # Dataset on the Hugging Face Hub
DATA_COLUMN = "content" # Column name containing the code content
SEQ_LENGTH = 2048 # Sequence length
# Training arguments
MAX_STEPS = 2000 # max_steps
BATCH_SIZE = 16 # batch_size
GR_ACC_STEPS = 1 # gradient_accumulation_steps
LR = 5e-4 # learning_rate
LR_SCHEDULER_TYPE = "cosine" # lr_scheduler_type
WEIGHT_DECAY = 0.01 # weight_decay
NUM_WARMUP_STEPS = 30 # num_warmup_steps
EVAL_FREQ = 100 # eval_freq
SAVE_FREQ = 100 # save_freq
LOG_FREQ = 25 # log_freq
OUTPUT_DIR = "peft-starcoder-lora-a100" # output_dir
BF16 = True # bf16
FP16 = False # no_fp16
# FIM transformations arguments
FIM_RATE = 0.5 # fim_rate
FIM_SPM_RATE = 0.5 # fim_spm_rate
# LORA
LORA_R = 8 # lora_r
LORA_ALPHA = 32 # lora_alpha
LORA_DROPOUT = 0.0 # lora_dropout
LORA_TARGET_MODULES = "c_proj,c_attn,q_attn,c_fc,c_proj" # lora_target_modules
# bitsandbytes config
USE_NESTED_QUANT = True # use_nested_quant
BNB_4BIT_COMPUTE_DTYPE = "bfloat16" # bnb_4bit_compute_dtype
SEED = 0
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    logging,
    set_seed,
    BitsAndBytesConfig,
)
set_seed(SEED)
Prepare the data
Begin by loading the data. As the dataset is likely to be quite large, make sure to enable streaming mode. Streaming allows us to load the data progressively as we iterate over the dataset, instead of downloading the whole dataset at once.
We'll reserve the first 4000 examples as the validation set, and everything else will be the training data.
from datasets import load_dataset
import torch
from tqdm import tqdm
dataset = load_dataset(
    DATASET,
    data_dir="data",
    split="train",
    streaming=True,
)
valid_data = dataset.take(4000)
train_data = dataset.skip(4000)
train_data = train_data.shuffle(buffer_size=5000, seed=SEED)
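If you'd like to peek at what a single raw example looks like before we chunk it, the quick, hedged check below streams one record from the validation split and prints its fields. Only the content column is guaranteed by the rest of this notebook; the other field names come from the dataset description above.
# Optional peek at one raw example from the streamed validation split.
sample = next(iter(valid_data))
print(sample.keys())              # expected fields include the code "content" column
print(sample[DATA_COLUMN][:300])  # first few hundred characters of one file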
At this step, the dataset still contains raw data with code of arbitrary length. For training, we need inputs of fixed length. Let's create an Iterable dataset that returns constant-length chunks of tokens from a stream of text files.
First, let's estimate the average number of characters per token in the dataset, which will help us later estimate the number of tokens in the text buffer. By default, we'll only take 400 examples (nb_examples) from the dataset. Using only a subset of the entire dataset will reduce computational cost while still providing a reasonable estimate of the overall character-to-token ratio.
>>> tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
>>> def chars_token_ratio(dataset, tokenizer, data_column, nb_examples=400):
... """
... Estimate the average number of characters per token in the dataset.
... """
... total_characters, total_tokens = 0, 0
... for _, example in tqdm(zip(range(nb_examples), iter(dataset)), total=nb_examples):
... total_characters += len(example[data_column])
... total_tokens += len(tokenizer(example[data_column]).tokens())
... return total_characters / total_tokens
>>> chars_per_token = chars_token_ratio(train_data, tokenizer, DATA_COLUMN)
>>> print(f"The character to token ratio of the dataset is: {chars_per_token:.2f}")
The character to token ratio of the dataset is: 2.43
The character-to-token ratio can also be used as an indicator of the quality of the tokenization. For instance, a character-to-token ratio of 1.0 would mean that each character is represented by its own token, which is not very meaningful and indicates poor tokenization. In standard English text, one token is typically equivalent to approximately four characters, i.e. a character-to-token ratio of around 4.0. We can expect a lower ratio in a code dataset, but generally speaking, a number between 2.0 and 3.5 can be considered good enough.
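As a quick illustration of this metric (not part of the original notebook), the small sketch below computes the same ratio for a plain-English sentence and for a short code snippet using the StarCoder tokenizer loaded above; the exact values will vary, but English prose should land near 4.0 and code somewhat lower.
# Hypothetical sanity check: compare character-to-token ratios for English prose vs. code.
english_text = "The quick brown fox jumps over the lazy dog and keeps running."
code_text = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"

for name, text in [("english", english_text), ("code", code_text)]:
    n_tokens = len(tokenizer(text).tokens())
    print(f"{name}: {len(text) / n_tokens:.2f} characters per token")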
Optional FIM transformations
Autoregressive language models typically generate sequences from left to right. By applying FIM (Fill-in-the-Middle) transformations, the model can also learn to infill text. Check out the "Efficient Training of Language Models to Fill in the Middle" paper to learn more about the technique. We'll define the FIM transformations here and use them when creating the Iterable dataset. However, if you want to omit the transformations, feel free to set fim_rate to 0.
import functools
import numpy as np
# Helper function to get token ids of the special tokens for prefix, suffix and middle for FIM transformations.
@functools.lru_cache(maxsize=None)
def get_fim_token_ids(tokenizer):
    try:
        FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX, FIM_PAD = tokenizer.special_tokens_map["additional_special_tokens"][1:5]
        suffix_tok_id, prefix_tok_id, middle_tok_id, pad_tok_id = (
            tokenizer.vocab[tok] for tok in [FIM_SUFFIX, FIM_PREFIX, FIM_MIDDLE, FIM_PAD]
        )
    except KeyError:
        suffix_tok_id, prefix_tok_id, middle_tok_id, pad_tok_id = None, None, None, None
    return suffix_tok_id, prefix_tok_id, middle_tok_id, pad_tok_id
## Adapted from https://github.com/bigcode-project/Megatron-LM/blob/6c4bf908df8fd86b4977f54bf5b8bd4b521003d1/megatron/data/gpt_dataset.py
def permute(
    sample,
    np_rng,
    suffix_tok_id,
    prefix_tok_id,
    middle_tok_id,
    pad_tok_id,
    fim_rate=0.5,
    fim_spm_rate=0.5,
    truncate_or_pad=False,
):
    """
    Take in a sample (list of tokens) and perform a FIM transformation on it with a probability of fim_rate, using two FIM modes:
    PSM and SPM (with a probability of fim_spm_rate).
    """

    # The if condition will trigger with the probability of fim_rate
    # This means FIM transformations will apply to samples with a probability of fim_rate
    if np_rng.binomial(1, fim_rate):

        # Split the sample into prefix, middle, and suffix, based on randomly generated indices stored in the boundaries list.
        boundaries = list(np_rng.randint(low=0, high=len(sample) + 1, size=2))
        boundaries.sort()

        prefix = np.array(sample[: boundaries[0]], dtype=np.int64)
        middle = np.array(sample[boundaries[0] : boundaries[1]], dtype=np.int64)
        suffix = np.array(sample[boundaries[1] :], dtype=np.int64)

        if truncate_or_pad:
            # calculate the new total length of the sample, taking into account tokens indicating prefix, middle, and suffix
            new_length = suffix.shape[0] + prefix.shape[0] + middle.shape[0] + 3
            diff = new_length - len(sample)

            # truncate or pad if there's a difference in length between the new length and the original
            if diff > 0:
                if suffix.shape[0] <= diff:
                    return sample, np_rng
                suffix = suffix[: suffix.shape[0] - diff]
            elif diff < 0:
                suffix = np.concatenate([suffix, np.full((-1 * diff), pad_tok_id)])

        # With the probability of fim_spm_rate, apply the SPM variant of FIM transformations
        # SPM: suffix, prefix, middle
        if np_rng.binomial(1, fim_spm_rate):
            new_sample = np.concatenate(
                [
                    [prefix_tok_id, suffix_tok_id],
                    suffix,
                    [middle_tok_id],
                    prefix,
                    middle,
                ]
            )
        # Otherwise, apply the PSM variant of FIM transformations
        # PSM: prefix, suffix, middle
        else:
            new_sample = np.concatenate(
                [
                    [prefix_tok_id],
                    prefix,
                    [suffix_tok_id],
                    suffix,
                    [middle_tok_id],
                    middle,
                ]
            )
    else:
        # don't apply FIM transformations
        new_sample = sample

    return list(new_sample), np_rng
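To see what the transformation does, here is a small, hedged usage sketch (not from the original notebook): it runs permute on a toy list of token ids with fim_rate=1.0 so the transformation always fires, then prints the reordered sequence. It assumes the tokenizer exposes the FIM special tokens, which is the case for StarCoder models.
# Hypothetical demo: apply the FIM permutation to a toy sample of token ids.
demo_rng = np.random.RandomState(seed=SEED)
suffix_id, prefix_id, middle_id, pad_id = get_fim_token_ids(tokenizer)

toy_sample = list(range(10))  # stand-in token ids 0..9
permuted, demo_rng = permute(
    toy_sample,
    demo_rng,
    suffix_id,
    prefix_id,
    middle_id,
    pad_id,
    fim_rate=1.0,  # always apply FIM for the demo
    fim_spm_rate=FIM_SPM_RATE,
)
# Depending on the draw, the result is either
# [prefix_tok, suffix_tok, <suffix>, middle_tok, <prefix>, <middle>]  (SPM) or
# [prefix_tok, <prefix>, suffix_tok, <suffix>, middle_tok, <middle>]  (PSM).
print(permuted)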
Let's define the ConstantLengthDataset, an Iterable dataset that will return constant-length chunks of tokens. To do so, we'll read a buffer of text from the original dataset until we hit the size limit, then apply the tokenizer to convert the raw text into tokenized inputs. Optionally, we'll perform FIM transformations on some sequences (the proportion of sequences affected is controlled by fim_rate).
Once defined, we can create instances of the ConstantLengthDataset from both the training and validation data.
from torch.utils.data import IterableDataset
from torch.utils.data.dataloader import DataLoader
import random
# Create an Iterable dataset that returns constant-length chunks of tokens from a stream of text files.
class ConstantLengthDataset(IterableDataset):
    """
    Iterable dataset that returns constant length chunks of tokens from stream of text files.
        Args:
            tokenizer (Tokenizer): The processor used for processing the data.
            dataset (dataset.Dataset): Dataset with text files.
            infinite (bool): If True the iterator is reset after dataset reaches end else stops.
            seq_length (int): Length of token sequences to return.
            num_of_sequences (int): Number of token sequences to keep in buffer.
            chars_per_token (int): Number of characters per token used to estimate number of tokens in text buffer.
            fim_rate (float): Rate (0.0 to 1.0) that sample will be permuted with FIM.
            fim_spm_rate (float): Rate (0.0 to 1.0) of FIM permutations that will use SPM.
            seed (int): Seed for random number generator.
    """

    def __init__(
        self,
        tokenizer,
        dataset,
        infinite=False,
        seq_length=1024,
        num_of_sequences=1024,
        chars_per_token=3.6,
        content_field="content",
        fim_rate=0.5,
        fim_spm_rate=0.5,
        seed=0,
    ):
        self.tokenizer = tokenizer
        self.concat_token_id = tokenizer.eos_token_id
        self.dataset = dataset
        self.seq_length = seq_length
        self.infinite = infinite
        self.current_size = 0
        self.max_buffer_size = seq_length * chars_per_token * num_of_sequences
        self.content_field = content_field
        self.fim_rate = fim_rate
        self.fim_spm_rate = fim_spm_rate
        self.seed = seed

        (
            self.suffix_tok_id,
            self.prefix_tok_id,
            self.middle_tok_id,
            self.pad_tok_id,
        ) = get_fim_token_ids(self.tokenizer)
        if not self.suffix_tok_id and self.fim_rate > 0:
            print("FIM is not supported by tokenizer, disabling FIM")
            self.fim_rate = 0

    def __iter__(self):
        iterator = iter(self.dataset)
        more_examples = True
        np_rng = np.random.RandomState(seed=self.seed)
        while more_examples:
            buffer, buffer_len = [], 0
            while True:
                if buffer_len >= self.max_buffer_size:
                    break
                try:
                    buffer.append(next(iterator)[self.content_field])
                    buffer_len += len(buffer[-1])
                except StopIteration:
                    if self.infinite:
                        iterator = iter(self.dataset)
                    else:
                        more_examples = False
                        break
            tokenized_inputs = self.tokenizer(buffer, truncation=False)["input_ids"]
            all_token_ids = []

            for tokenized_input in tokenized_inputs:
                # optionally do FIM permutations
                if self.fim_rate > 0:
                    tokenized_input, np_rng = permute(
                        tokenized_input,
                        np_rng,
                        self.suffix_tok_id,
                        self.prefix_tok_id,
                        self.middle_tok_id,
                        self.pad_tok_id,
                        fim_rate=self.fim_rate,
                        fim_spm_rate=self.fim_spm_rate,
                        truncate_or_pad=False,
                    )

                all_token_ids.extend(tokenized_input + [self.concat_token_id])
            examples = []
            for i in range(0, len(all_token_ids), self.seq_length):
                input_ids = all_token_ids[i : i + self.seq_length]
                if len(input_ids) == self.seq_length:
                    examples.append(input_ids)
            random.shuffle(examples)
            for example in examples:
                self.current_size += 1
                yield {
                    "input_ids": torch.LongTensor(example),
                    "labels": torch.LongTensor(example),
                }
train_dataset = ConstantLengthDataset(
    tokenizer,
    train_data,
    infinite=True,
    seq_length=SEQ_LENGTH,
    chars_per_token=chars_per_token,
    content_field=DATA_COLUMN,
    fim_rate=FIM_RATE,
    fim_spm_rate=FIM_SPM_RATE,
    seed=SEED,
)
eval_dataset = ConstantLengthDataset(
    tokenizer,
    valid_data,
    infinite=False,
    seq_length=SEQ_LENGTH,
    chars_per_token=chars_per_token,
    content_field=DATA_COLUMN,
    fim_rate=FIM_RATE,
    fim_spm_rate=FIM_SPM_RATE,
    seed=SEED,
)
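Before moving on, it can be reassuring to pull a single chunk from the stream and confirm its shape. This quick check is not part of the original notebook; it just verifies that each example is a SEQ_LENGTH-long tensor of token ids.
# Optional sanity check: fetch one constant-length example from the training stream.
first_example = next(iter(train_dataset))
print(first_example["input_ids"].shape)  # expected: torch.Size([SEQ_LENGTH])
print(first_example["labels"].shape)     # labels mirror input_ids for causal LM training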
Prepare the model
Now that the data is prepared, it's time to load the model! We're going to load a quantized version of the model.
This will allow us to reduce memory usage, as quantization represents data with fewer bits. We'll use the bitsandbytes library to quantize the model, as it has a nice integration with transformers. All we need to do is define a bitsandbytes config, and then use it when loading the model.
There are different variants of 4-bit quantization, but generally we recommend using NF4 quantization for better performance (bnb_4bit_quant_type="nf4").
The bnb_4bit_use_double_quant option adds a second quantization after the first one to save an additional 0.4 bits per parameter.
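As a rough back-of-the-envelope illustration (assumed numbers, not from the original notebook): for a model with about 1.1B parameters, 4-bit weights take roughly a quarter of the fp16 footprint, and the extra 0.4 bits per parameter saved by nested quantization amounts to a few tens of megabytes.
# Rough, assumed arithmetic for the memory impact of 4-bit and nested quantization.
n_params = 1.1e9                       # approximate parameter count of starcoderbase-1b
fp16_gb = n_params * 16 / 8 / 1e9      # 16 bits per weight
int4_gb = n_params * 4 / 8 / 1e9       # 4 bits per weight
double_quant_saving_mb = n_params * 0.4 / 8 / 1e6  # ~0.4 extra bits saved per parameter

print(f"fp16 weights:  ~{fp16_gb:.2f} GB")
print(f"4-bit weights: ~{int4_gb:.2f} GB")
print(f"nested quantization saves: ~{double_quant_saving_mb:.0f} MB")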
To learn more about quantization, check out the "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA" blog post.
Once defined, pass the config to the from_pretrained method to load the quantized version of the model.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from peft.tuners.lora import LoraLayer
load_in_8bit = False

# 4-bit quantization
compute_dtype = getattr(torch, BNB_4BIT_COMPUTE_DTYPE)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=USE_NESTED_QUANT,
)

device_map = {"": 0}

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    load_in_8bit=load_in_8bit,
    quantization_config=bnb_config,
    device_map=device_map,
    use_cache=False,  # We will be using gradient checkpointing
    trust_remote_code=True,
    use_flash_attention_2=True,
)
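Optionally, you can sanity-check how much memory the quantized weights actually occupy; get_memory_footprint() is a standard transformers model method, and the exact number you see will depend on your setup.
# Optional check: report the memory taken by the quantized model's weights and buffers.
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")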
When training a quantized model, you need to call the prepare_model_for_kbit_training() function to preprocess the quantized model for training.
model = prepare_model_for_kbit_training(model)
Now that the quantized model is ready, we can set up a LoRA configuration. LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters.
To train a model using the LoRA technique, we need to wrap the base model as a PeftModel. This involves defining the LoRA configuration with LoraConfig, and wrapping the original model with get_peft_model() using that LoraConfig.
To learn more about LoRA and its parameters, refer to the PEFT documentation.
>>> # Set up lora
>>> peft_config = LoraConfig(
... lora_alpha=LORA_ALPHA,
... lora_dropout=LORA_DROPOUT,
... r=LORA_R,
... bias="none",
... task_type="CAUSAL_LM",
... target_modules=LORA_TARGET_MODULES.split(","),
... )
>>> model = get_peft_model(model, peft_config)
>>> model.print_trainable_parameters()
trainable params: 5,554,176 || all params: 1,142,761,472 || trainable%: 0.4860310866343243
As you can see, by applying the LoRA technique we now need to train less than 1% of the parameters.
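If you are curious where the roughly 5.5M trainable parameters come from, a hedged back-of-the-envelope view: for every targeted linear layer of shape (d_in, d_out), LoRA adds two low-rank matrices with r * (d_in + d_out) parameters in total. The sketch below recomputes the trainable count directly from the wrapped model, which avoids hard-coding any layer shapes.
# Recompute the trainable-parameter count reported by print_trainable_parameters().
# Each LoRA-adapted linear layer of shape (d_in, d_out) contributes r * (d_in + d_out) parameters (matrices A and B).
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")
# Note: summing numel() over all parameters would undercount the base model here,
# because 4-bit weights are stored packed (two values per byte).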
Train the model
Now that we have prepared the data and optimized the model, we are ready to bring everything together to start the training.
To instantiate a Trainer, you need to define the training configuration. The most important part is the TrainingArguments, a class that contains all the attributes used to configure the training.
These are similar to any other kind of model training you may run, so we won't go into detail here.
train_data.start_iteration = 0
training_args = TrainingArguments(
    output_dir=f"Your_HF_username/{OUTPUT_DIR}",
    dataloader_drop_last=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    max_steps=MAX_STEPS,
    eval_steps=EVAL_FREQ,
    save_steps=SAVE_FREQ,
    logging_steps=LOG_FREQ,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LR,
    lr_scheduler_type=LR_SCHEDULER_TYPE,
    warmup_steps=NUM_WARMUP_STEPS,
    gradient_accumulation_steps=GR_ACC_STEPS,
    gradient_checkpointing=True,
    fp16=FP16,
    bf16=BF16,
    weight_decay=WEIGHT_DECAY,
    push_to_hub=True,
    include_tokens_per_second=True,
)
As a final step, instantiate the Trainer and call the train method.
>>> trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
>>> print("Training...")
>>> trainer.train()
Training...
Finally, you can push the fine-tuned model to your Hub repository to share it with your team.
trainer.push_to_hub()
Inference
Once the model is uploaded to the Hub, we can use it for inference. To do so, we first initialize the original base model and its tokenizer. Next, we need to merge the fine-tuned weights with the base model.
from peft import PeftModel
import torch
# load the original model first
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    quantization_config=None,
    device_map=None,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

# merge fine-tuned weights with the base model
peft_model_id = f"Your_HF_username/{OUTPUT_DIR}"
model = PeftModel.from_pretrained(base_model, peft_model_id)
model = model.merge_and_unload()  # returns the base model with the LoRA weights merged in
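If you want to reuse the merged model later without pulling the adapter again, you can optionally save it to disk. This step is not in the original notebook, and the output path below is just an example.
# Optional: persist the merged model and tokenizer for standalone use later.
merged_dir = "starcoderbase-1b-personal-copilot-merged"  # example path
model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)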
Now we can use the merged model for inference. For convenience, we'll define a get_code_completion function - feel free to experiment with the text generation parameters!
def get_code_completion(prefix, suffix):
    text = f"""<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"""
    model.eval()
    outputs = model.generate(
        input_ids=tokenizer(text, return_tensors="pt").input_ids.cuda(),
        max_new_tokens=128,
        temperature=0.2,
        top_k=50,
        top_p=0.95,
        do_sample=True,
        repetition_penalty=1.0,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
Now all we need to do to get a code completion is call the get_code_completion function, passing the first few lines that we want to be completed as the prefix, and an empty string as the suffix.
>>> prefix = """from peft import LoraConfig, TaskType, get_peft_model
... from transformers import AutoModelForCausalLM
... peft_config = LoraConfig(
... """
>>> suffix = """"""
>>> print(get_code_completion(prefix, suffix))
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["q_proj", "v_proj"],
    inference_mode=False,
)
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
As someone who has just used the PEFT library earlier in this notebook, you can see that the generated result for creating a LoraConfig is rather good!
If you go back to the cell where we instantiate the model for inference, and comment out the lines where we merge the fine-tuned weights, you can see what the original model would have generated for the exact same prefix:
>>> prefix = """from peft import LoraConfig, TaskType, get_peft_model
... from transformers import AutoModelForCausalLM
... peft_config = LoraConfig(
... """
>>> suffix = """"""
>>> print(get_code_completion(prefix, suffix))
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

peft_config = LoraConfig(
    model_name_or_path="facebook/wav2vec2-base-960h",
    num_labels=1,
    num_features=1,
    num_hidden_layers=1,
    num_attention_heads=1,
    num_hidden_layers_per_attention_head=1,
    num_attention_heads_per_hidden_layer=1,
    hidden_size=1024,
    hidden_dropout_prob=0.1,
    hidden_act="gelu",
    hidden_act_dropout_prob=0.1,
    hidden
While it is valid Python syntax, you can see that the original model has no understanding of what a LoraConfig should be doing.
To learn how this kind of fine-tuning compares to full fine-tuning, and how to use a model like this as your copilot in VS Code via Inference Endpoints, or locally, check out the "Personal Copilot: Train Your Own Coding Assistant" blog post. This notebook complements the original blog post.