Fine-tuning a Code LLM on Custom Code with a single GPU
Publicly available code LLMs such as Codex, StarCoder, and Code Llama are great at generating code that adheres to general programming principles and syntax, but they may not align with an organization's internal conventions, or be aware of proprietary libraries.
In this notebook, we'll show how you can fine-tune a code LLM on private code bases to enhance its contextual awareness and improve the model's usefulness to your organization's needs. Since code LLMs are quite large, fine-tuning them in a traditional manner can be resource-draining. Worry not! We will show how you can optimize the fine-tuning so that it fits on a single GPU.
Dataset
For this example, we picked the top 10 Hugging Face public repositories on GitHub. We have excluded non-code files from the data, such as images, audio files, presentations, and so on. For Jupyter notebooks, we've kept only the cells containing code. The resulting code is stored as a dataset that you can find on the Hugging Face Hub under smangrul/hf-stack-v1. It contains the repo id, file path, and file content.
Model
We'll fine-tune bigcode/starcoderbase-1b, which is a 1B parameter model trained on 80+ programming languages. This is a gated model, so if you plan to run this notebook with this exact model, you'll need to gain access to it on the model's page. Log in to your Hugging Face account to do so:
from huggingface_hub import notebook_login
notebook_login()
First, let's install all the necessary libraries. As you can see, in addition to transformers and datasets, we'll be using peft, bitsandbytes, and flash-attn to optimize the training process.
By employing parameter-efficient training techniques, we can run this notebook on a single A100 High-RAM GPU.
!pip install -q transformers datasets peft bitsandbytes flash-attn
Now let's define some variables. Feel free to adjust these values to your liking.
MODEL = "bigcode/starcoderbase-1b" # Model checkpoint on the Hugging Face Hub
DATASET = "smangrul/hf-stack-v1" # Dataset on the Hugging Face Hub
DATA_COLUMN = "content" # Column name containing the code content
SEQ_LENGTH = 2048 # Sequence length
# Training arguments
MAX_STEPS = 2000 # max_steps
BATCH_SIZE = 16 # batch_size
GR_ACC_STEPS = 1 # gradient_accumulation_steps
LR = 5e-4 # learning_rate
LR_SCHEDULER_TYPE = "cosine" # lr_scheduler_type
WEIGHT_DECAY = 0.01 # weight_decay
NUM_WARMUP_STEPS = 30 # num_warmup_steps
EVAL_FREQ = 100 # eval_freq
SAVE_FREQ = 100 # save_freq
LOG_FREQ = 25 # log_freq
OUTPUT_DIR = "peft-starcoder-lora-a100" # output_dir
BF16 = True # bf16
FP16 = False # no_fp16
# FIM transformations arguments
FIM_RATE = 0.5 # fim_rate
FIM_SPM_RATE = 0.5 # fim_spm_rate
# LORA
LORA_R = 8 # lora_r
LORA_ALPHA = 32 # lora_alpha
LORA_DROPOUT = 0.0 # lora_dropout
LORA_TARGET_MODULES = "c_proj,c_attn,q_attn,c_fc,c_proj" # lora_target_modules
# bitsandbytes config
USE_NESTED_QUANT = True # use_nested_quant
BNB_4BIT_COMPUTE_DTYPE = "bfloat16" # bnb_4bit_compute_dtype
SEED = 0
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    logging,
    set_seed,
    BitsAndBytesConfig,
)
set_seed(SEED)
Prepare the data
Begin by loading the data. As the dataset is likely to be quite large, make sure to enable streaming mode. Streaming allows us to load the data progressively as we iterate over the dataset, instead of downloading the whole dataset at once.
We'll reserve the first 4000 examples as the validation set, and everything else will be the training data.
from datasets import load_dataset
import torch
from tqdm import tqdm
dataset = load_dataset(
    DATASET,
    data_dir="data",
    split="train",
    streaming=True,
)
valid_data = dataset.take(4000)
train_data = dataset.skip(4000)
train_data = train_data.shuffle(buffer_size=5000, seed=SEED)
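If you'd like to peek at what a single raw example looks like before we chunk it, the quick, hedged check below streams one record from the validation split and prints its fields. Only the content column is guaranteed by the rest of this notebook; the other field names come from the dataset description above.
# Optional peek at one raw example from the streamed validation split.
sample = next(iter(valid_data))
print(sample.keys())              # expected fields include the code "content" column
print(sample[DATA_COLUMN][:300])  # first few hundred characters of one file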
At this step, the dataset still contains raw data with code of arbitrary length. For training, we need inputs of fixed length. Let's create an Iterable dataset that returns constant-length chunks of tokens from a stream of text files.
First, let's estimate the average number of characters per token in the dataset, which will help us later estimate the number of tokens in the text buffer. By default, we'll only take 400 examples (nb_examples) from the dataset. Using only a subset of the entire dataset will reduce computational cost while still providing a reasonable estimate of the overall character-to-token ratio.
>>> tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
>>> def chars_token_ratio(dataset, tokenizer, data_column, nb_examples=400):
... """
... Estimate the average number of characters per token in the dataset.
... """
... total_characters, total_tokens = 0, 0
... for _, example in tqdm(zip(range(nb_examples), iter(dataset)), total=nb_examples):
... total_characters += len(example[data_column])
... total_tokens += len(tokenizer(example[data_column]).tokens())
... return total_characters / total_tokens
>>> chars_per_token = chars_token_ratio(train_data, tokenizer, DATA_COLUMN)
>>> print(f"The character to token ratio of the dataset is: {chars_per_token:.2f}")
The character to token ratio of the dataset is: 2.43
The character-to-token ratio can also be used as an indicator of the quality of the tokenization. For instance, a character-to-token ratio of 1.0 would mean that each character is represented by its own token, which is not very meaningful and indicates poor tokenization. In standard English text, one token is typically equivalent to approximately four characters, i.e. a character-to-token ratio of around 4.0. We can expect a lower ratio in a code dataset, but generally speaking, a number between 2.0 and 3.5 can be considered good enough.
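As a quick illustration of this metric (not part of the original notebook), the small sketch below computes the same ratio for a plain-English sentence and for a short code snippet using the StarCoder tokenizer loaded above; the exact values will vary, but English prose should land near 4.0 and code somewhat lower.
# Hypothetical sanity check: compare character-to-token ratios for English prose vs. code.
english_text = "The quick brown fox jumps over the lazy dog and keeps running."
code_text = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"

for name, text in [("english", english_text), ("code", code_text)]:
    n_tokens = len(tokenizer(text).tokens())
    print(f"{name}: {len(text) / n_tokens:.2f} characters per token")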
Optional FIM transformations
Autoregressive language models typically generate sequences from left to right. By applying FIM (Fill-in-the-Middle) transformations, the model can also learn to infill text. Check out the "Efficient Training of Language Models to Fill in the Middle" paper to learn more about the technique. We'll define the FIM transformations here and use them when creating the Iterable dataset. However, if you want to omit the transformations, feel free to set fim_rate to 0.
import functools
import numpy as np
# Helper function to get token ids of the special tokens for prefix, suffix and middle for FIM transformations.
@functools.lru_cache(maxsize=None)
def get_fim_token_ids(tokenizer):
    try:
        FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX, FIM_PAD = tokenizer.special_tokens_map["additional_special_tokens"][1:5]
        suffix_tok_id, prefix_tok_id, middle_tok_id, pad_tok_id = (
            tokenizer.vocab[tok] for tok in [FIM_SUFFIX, FIM_PREFIX, FIM_MIDDLE, FIM_PAD]
        )
    except KeyError:
        suffix_tok_id, prefix_tok_id, middle_tok_id, pad_tok_id = None, None, None, None
    return suffix_tok_id, prefix_tok_id, middle_tok_id, pad_tok_id
## Adapted from https://github.com/bigcode-project/Megatron-LM/blob/6c4bf908df8fd86b4977f54bf5b8bd4b521003d1/megatron/data/gpt_dataset.py
def permute(
    sample,
    np_rng,
    suffix_tok_id,
    prefix_tok_id,
    middle_tok_id,
    pad_tok_id,
    fim_rate=0.5,
    fim_spm_rate=0.5,
    truncate_or_pad=False,
):
    """
    Take in a sample (list of tokens) and perform a FIM transformation on it with a probability of fim_rate, using two FIM modes:
    PSM and SPM (with a probability of fim_spm_rate).
    """

    # The if condition will trigger with the probability of fim_rate
    # This means FIM transformations will apply to samples with a probability of fim_rate
    if np_rng.binomial(1, fim_rate):

        # Split the sample into prefix, middle, and suffix, based on randomly generated indices stored in the boundaries list.
        boundaries = list(np_rng.randint(low=0, high=len(sample) + 1, size=2))
        boundaries.sort()

        prefix = np.array(sample[: boundaries[0]], dtype=np.int64)
        middle = np.array(sample[boundaries[0] : boundaries[1]], dtype=np.int64)
        suffix = np.array(sample[boundaries[1] :], dtype=np.int64)

        if truncate_or_pad:
            # calculate the new total length of the sample, taking into account tokens indicating prefix, middle, and suffix
            new_length = suffix.shape[0] + prefix.shape[0] + middle.shape[0] + 3
            diff = new_length - len(sample)

            # truncate or pad if there's a difference in length between the new length and the original
            if diff > 0:
                if suffix.shape[0] <= diff:
                    return sample, np_rng
                suffix = suffix[: suffix.shape[0] - diff]
            elif diff < 0:
                suffix = np.concatenate([suffix, np.full((-1 * diff), pad_tok_id)])

        # With the probability of fim_spm_rate, apply the SPM variant of FIM transformations
        # SPM: suffix, prefix, middle
        if np_rng.binomial(1, fim_spm_rate):
            new_sample = np.concatenate(
                [
                    [prefix_tok_id, suffix_tok_id],
                    suffix,
                    [middle_tok_id],
                    prefix,
                    middle,
                ]
            )
        # Otherwise, apply the PSM variant of FIM transformations
        # PSM: prefix, suffix, middle
        else:
            new_sample = np.concatenate(
                [
                    [prefix_tok_id],
                    prefix,
                    [suffix_tok_id],
                    suffix,
                    [middle_tok_id],
                    middle,
                ]
            )
    else:
        # don't apply FIM transformations
        new_sample = sample

    return list(new_sample), np_rng
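To see what the transformation does, here is a small, hedged usage sketch (not from the original notebook): it runs permute on a toy list of token ids with fim_rate=1.0 so the transformation always fires, then prints the reordered sequence. It assumes the tokenizer exposes the FIM special tokens, which is the case for StarCoder models.
# Hypothetical demo: apply the FIM permutation to a toy sample of token ids.
demo_rng = np.random.RandomState(seed=SEED)
suffix_id, prefix_id, middle_id, pad_id = get_fim_token_ids(tokenizer)

toy_sample = list(range(10))  # stand-in token ids 0..9
permuted, demo_rng = permute(
    toy_sample,
    demo_rng,
    suffix_id,
    prefix_id,
    middle_id,
    pad_id,
    fim_rate=1.0,  # always apply FIM for the demo
    fim_spm_rate=FIM_SPM_RATE,
)
# Depending on the draw, the result is either
# [prefix_tok, suffix_tok, <suffix>, middle_tok, <prefix>, <middle>]  (SPM) or
# [prefix_tok, <prefix>, suffix_tok, <suffix>, middle_tok, <middle>]  (PSM).
print(permuted)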
Let's define the ConstantLengthDataset, an Iterable dataset that will return constant-length chunks of tokens. To do so, we'll read a buffer of text from the original dataset until we hit the size limit, then apply the tokenizer to convert the raw text into tokenized inputs. Optionally, we'll perform FIM transformations on some sequences (the proportion of sequences affected is controlled by fim_rate).
Once defined, we can create instances of the ConstantLengthDataset from both the training and validation data.
from torch.utils.data import IterableDataset
from torch.utils.data.dataloader import DataLoader
import random
# Create an Iterable dataset that returns constant-length chunks of tokens from a stream of text files.
class ConstantLengthDataset(IterableDataset):
    """
    Iterable dataset that returns constant length chunks of tokens from stream of text files.
        Args:
            tokenizer (Tokenizer): The processor used for processing the data.
            dataset (dataset.Dataset): Dataset with text files.
            infinite (bool): If True the iterator is reset after dataset reaches end else stops.
            seq_length (int): Length of token sequences to return.
            num_of_sequences (int): Number of token sequences to keep in buffer.
            chars_per_token (int): Number of characters per token used to estimate number of tokens in text buffer.
            fim_rate (float): Rate (0.0 to 1.0) that sample will be permuted with FIM.
            fim_spm_rate (float): Rate (0.0 to 1.0) of FIM permutations that will use SPM.
            seed (int): Seed for random number generator.
    """

    def __init__(
        self,
        tokenizer,
        dataset,
        infinite=False,
        seq_length=1024,
        num_of_sequences=1024,
        chars_per_token=3.6,
        content_field="content",
        fim_rate=0.5,
        fim_spm_rate=0.5,
        seed=0,
    ):
        self.tokenizer = tokenizer
        self.concat_token_id = tokenizer.eos_token_id
        self.dataset = dataset
        self.seq_length = seq_length
        self.infinite = infinite
        self.current_size = 0
        self.max_buffer_size = seq_length * chars_per_token * num_of_sequences
        self.content_field = content_field
        self.fim_rate = fim_rate
        self.fim_spm_rate = fim_spm_rate
        self.seed = seed

        (
            self.suffix_tok_id,
            self.prefix_tok_id,
            self.middle_tok_id,
            self.pad_tok_id,
        ) = get_fim_token_ids(self.tokenizer)
        if not self.suffix_tok_id and self.fim_rate > 0:
            print("FIM is not supported by tokenizer, disabling FIM")
            self.fim_rate = 0

    def __iter__(self):
        iterator = iter(self.dataset)
        more_examples = True
        np_rng = np.random.RandomState(seed=self.seed)
        while more_examples:
            buffer, buffer_len = [], 0
            while True:
                if buffer_len >= self.max_buffer_size:
                    break
                try:
                    buffer.append(next(iterator)[self.content_field])
                    buffer_len += len(buffer[-1])
                except StopIteration:
                    if self.infinite:
                        iterator = iter(self.dataset)
                    else:
                        more_examples = False
                        break
            tokenized_inputs = self.tokenizer(buffer, truncation=False)["input_ids"]
            all_token_ids = []

            for tokenized_input in tokenized_inputs:
                # optionally do FIM permutations
                if self.fim_rate > 0:
                    tokenized_input, np_rng = permute(
                        tokenized_input,
                        np_rng,
                        self.suffix_tok_id,
                        self.prefix_tok_id,
                        self.middle_tok_id,
                        self.pad_tok_id,
                        fim_rate=self.fim_rate,
                        fim_spm_rate=self.fim_spm_rate,
                        truncate_or_pad=False,
                    )

                all_token_ids.extend(tokenized_input + [self.concat_token_id])
            examples = []
            for i in range(0, len(all_token_ids), self.seq_length):
                input_ids = all_token_ids[i : i + self.seq_length]
                if len(input_ids) == self.seq_length:
                    examples.append(input_ids)
            random.shuffle(examples)
            for example in examples:
                self.current_size += 1
                yield {
                    "input_ids": torch.LongTensor(example),
                    "labels": torch.LongTensor(example),
                }
train_dataset = ConstantLengthDataset(
    tokenizer,
    train_data,
    infinite=True,
    seq_length=SEQ_LENGTH,
    chars_per_token=chars_per_token,
    content_field=DATA_COLUMN,
    fim_rate=FIM_RATE,
    fim_spm_rate=FIM_SPM_RATE,
    seed=SEED,
)
eval_dataset = ConstantLengthDataset(
    tokenizer,
    valid_data,
    infinite=False,
    seq_length=SEQ_LENGTH,
    chars_per_token=chars_per_token,
    content_field=DATA_COLUMN,
    fim_rate=FIM_RATE,
    fim_spm_rate=FIM_SPM_RATE,
    seed=SEED,
)
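Before moving on, it can be reassuring to pull a single chunk from the stream and confirm its shape. This quick check is not part of the original notebook; it just verifies that each example is a SEQ_LENGTH-long tensor of token ids.
# Optional sanity check: fetch one constant-length example from the training stream.
first_example = next(iter(train_dataset))
print(first_example["input_ids"].shape)  # expected: torch.Size([SEQ_LENGTH])
print(first_example["labels"].shape)     # labels mirror input_ids for causal LM training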
Prepare the model
Now that the data is prepared, it's time to load the model! We're going to load a quantized version of the model.
This will allow us to reduce memory usage, as quantization represents data with fewer bits. We'll use the bitsandbytes library to quantize the model, as it has a nice integration with transformers. All we need to do is define a bitsandbytes config, and then use it when loading the model.
There are different variants of 4-bit quantization, but generally we recommend using NF4 quantization for better performance (bnb_4bit_quant_type="nf4").
The bnb_4bit_use_double_quant option adds a second quantization after the first one to save an additional 0.4 bits per parameter.
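As a rough back-of-the-envelope illustration (assumed numbers, not from the original notebook): for a model with about 1.1B parameters, 4-bit weights take roughly a quarter of the fp16 footprint, and the extra 0.4 bits per parameter saved by nested quantization amounts to a few tens of megabytes.
# Rough, assumed arithmetic for the memory impact of 4-bit and nested quantization.
n_params = 1.1e9                       # approximate parameter count of starcoderbase-1b
fp16_gb = n_params * 16 / 8 / 1e9      # 16 bits per weight
int4_gb = n_params * 4 / 8 / 1e9       # 4 bits per weight
double_quant_saving_mb = n_params * 0.4 / 8 / 1e6  # ~0.4 extra bits saved per parameter

print(f"fp16 weights:  ~{fp16_gb:.2f} GB")
print(f"4-bit weights: ~{int4_gb:.2f} GB")
print(f"nested quantization saves: ~{double_quant_saving_mb:.0f} MB")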
To learn more about quantization, check out the "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA" blog post.
Once defined, pass the config to the from_pretrained method to load the quantized version of the model.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from peft.tuners.lora import LoraLayer
load_in_8bit = False

# 4-bit quantization
compute_dtype = getattr(torch, BNB_4BIT_COMPUTE_DTYPE)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=USE_NESTED_QUANT,
)

device_map = {"": 0}

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    load_in_8bit=load_in_8bit,
    quantization_config=bnb_config,
    device_map=device_map,
    use_cache=False,  # We will be using gradient checkpointing
    trust_remote_code=True,
    use_flash_attention_2=True,
)
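Optionally, you can sanity-check how much memory the quantized weights actually occupy; get_memory_footprint() is a standard transformers model method, and the exact number you see will depend on your setup.
# Optional check: report the memory taken by the quantized model's weights and buffers.
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")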
When training a quantized model, you need to call the prepare_model_for_kbit_training() function to preprocess the quantized model for training.
model = prepare_model_for_kbit_training(model)
Now that the quantized model is ready, we can set up a LoRA configuration. LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters.
To train a model using the LoRA technique, we need to wrap the base model as a PeftModel. This involves defining the LoRA configuration with LoraConfig, and wrapping the original model with get_peft_model() using that LoraConfig.
To learn more about LoRA and its parameters, refer to the PEFT documentation.
>>> # Set up lora
>>> peft_config = LoraConfig(
... lora_alpha=LORA_ALPHA,
... lora_dropout=LORA_DROPOUT,
... r=LORA_R,
... bias="none",
... task_type="CAUSAL_LM",
... target_modules=LORA_TARGET_MODULES.split(","),
... )
>>> model = get_peft_model(model, peft_config)
>>> model.print_trainable_parameters()
trainable params: 5,554,176 || all params: 1,142,761,472 || trainable%: 0.4860310866343243
As you can see, by applying the LoRA technique we now need to train less than 1% of the parameters.
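If you are curious where the roughly 5.5M trainable parameters come from, a hedged back-of-the-envelope view: for every targeted linear layer of shape (d_in, d_out), LoRA adds two low-rank matrices with r * (d_in + d_out) parameters in total. The sketch below recomputes the trainable count directly from the wrapped model, which avoids hard-coding any layer shapes.
# Recompute the trainable-parameter count reported by print_trainable_parameters().
# Each LoRA-adapted linear layer of shape (d_in, d_out) contributes r * (d_in + d_out) parameters (matrices A and B).
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")
# Note: summing numel() over all parameters would undercount the base model here,
# because 4-bit weights are stored packed (two values per byte).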
Train the model
Now that we have prepared the data and optimized the model, we are ready to bring everything together to start the training.
To instantiate a Trainer, you need to define the training configuration. The most important part is the TrainingArguments, a class that contains all the attributes used to configure the training.
These are similar to any other kind of model training you may run, so we won't go into detail here.
train_data.start_iteration = 0
training_args = TrainingArguments(
    output_dir=f"Your_HF_username/{OUTPUT_DIR}",
    dataloader_drop_last=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    max_steps=MAX_STEPS,
    eval_steps=EVAL_FREQ,
    save_steps=SAVE_FREQ,
    logging_steps=LOG_FREQ,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LR,
    lr_scheduler_type=LR_SCHEDULER_TYPE,
    warmup_steps=NUM_WARMUP_STEPS,
    gradient_accumulation_steps=GR_ACC_STEPS,
    gradient_checkpointing=True,
    fp16=FP16,
    bf16=BF16,
    weight_decay=WEIGHT_DECAY,
    push_to_hub=True,
    include_tokens_per_second=True,
)
As a final step, instantiate the Trainer and call the train method.
>>> trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
>>> print("Training...")
>>> trainer.train()
Training...
Finally, you can push the fine-tuned model to your Hub repository to share it with your team.
trainer.push_to_hub()
Inference
Once the model is uploaded to the Hub, we can use it for inference. To do so, we first initialize the original base model and its tokenizer. Next, we need to merge the fine-tuned weights with the base model.
from peft import PeftModel
import torch
# load the original model first
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    quantization_config=None,
    device_map=None,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

# merge fine-tuned weights with the base model
peft_model_id = f"Your_HF_username/{OUTPUT_DIR}"
model = PeftModel.from_pretrained(base_model, peft_model_id)
model = model.merge_and_unload()  # returns the base model with the LoRA weights merged in
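If you want to reuse the merged model later without pulling the adapter again, you can optionally save it to disk. This step is not in the original notebook, and the output path below is just an example.
# Optional: persist the merged model and tokenizer for standalone use later.
merged_dir = "starcoderbase-1b-personal-copilot-merged"  # example path
model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)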
Now we can use the merged model for inference. For convenience, we'll define a get_code_completion function - feel free to experiment with the text generation parameters!
def get_code_completion(prefix, suffix):
    text = f"""<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"""
    model.eval()
    outputs = model.generate(
        input_ids=tokenizer(text, return_tensors="pt").input_ids.cuda(),
        max_new_tokens=128,
        temperature=0.2,
        top_k=50,
        top_p=0.95,
        do_sample=True,
        repetition_penalty=1.0,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
Now all we need to do to get a code completion is call the get_code_completion function, passing the first few lines that we want to be completed as the prefix, and an empty string as the suffix.
>>> prefix = """from peft import LoraConfig, TaskType, get_peft_model
... from transformers import AutoModelForCausalLM
... peft_config = LoraConfig(
... """
>>> suffix = """"""
>>> print(get_code_completion(prefix, suffix))
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["q_proj", "v_proj"],
    inference_mode=False,
)
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
As someone who has just used the PEFT library earlier in this notebook, you can see that the generated result for creating a LoraConfig is rather good!
If you go back to the cell where we instantiate the model for inference, and comment out the lines where we merge the fine-tuned weights, you can see what the original model would have generated for the exact same prefix:
>>> prefix = """from peft import LoraConfig, TaskType, get_peft_model
... from transformers import AutoModelForCausalLM
... peft_config = LoraConfig(
... """
>>> suffix = """"""
>>> print(get_code_completion(prefix, suffix))
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

peft_config = LoraConfig(
    model_name_or_path="facebook/wav2vec2-base-960h",
    num_labels=1,
    num_features=1,
    num_hidden_layers=1,
    num_attention_heads=1,
    num_hidden_layers_per_attention_head=1,
    num_attention_heads_per_hidden_layer=1,
    hidden_size=1024,
    hidden_dropout_prob=0.1,
    hidden_act="gelu",
    hidden_act_dropout_prob=0.1,
    hidden
While it is valid Python syntax, you can see that the original model has no understanding of what a LoraConfig should be doing.
To learn how this kind of fine-tuning compares to full fine-tuning, and how to use a model like this as your copilot in VS Code via Inference Endpoints, or locally, check out the "Personal Copilot: Train Your Own Coding Assistant" blog post. This notebook complements the original blog post.