Unleashing the Power of Unsloth and QLoRA: Redefining Language Model Fine-Tuning

Introduction
In the fast-moving field of language model optimization, a new force has emerged: Unsloth. Conceived by Daniel and Michael Han, this framework sets out to redefine the fine-tuning landscape. As we look at what it is, what it offers, and why it matters, prepare for a shift in how we approach language model optimization.
Definition:
Unsloth is not just another library; it is a toolkit purpose-built for fine-tuning and training large language models (LLMs). Designed for optimal performance, it introduces techniques that speed up the fine-tuning process, reduce memory consumption, and preserve (or even improve) accuracy.
Advantages of Unsloth:
Speed, reimagined: Unsloth advertises training that is up to 30x faster; the Alpaca benchmark task reportedly drops from roughly 85 hours to about 3 hours. That acceleration reflects its focus on efficiency and productivity.
Memory efficiency: Unsloth promises up to a 60% reduction in memory usage. This not only allows larger batches, it also keeps the fine-tuning process smooth without sacrificing performance.
Accuracy: The authors report 0% accuracy loss, with an optional MAX offering that they claim can improve accuracy by up to 20%. This commitment to preserving, and even raising, accuracy sets Unsloth apart in a crowded field.
Hardware compatibility: Unsloth extends its reach by supporting NVIDIA, Intel, and AMD GPUs. That breadth makes it accessible across a wide range of hardware configurations and a versatile choice for developers on different platforms.
Benefits of Fine-Tuning with Unsloth and QLoRA:
Efficiency unlocked: Reduced upscaling of weights during QLoRA means fewer weight copies and a leaner memory footprint. Combined with the direct use of bfloat16, this lets developers hit their fine-tuning goals faster and with fewer resources.
Innovative attention: Unsloth integrates Flash Attention via xformers and Tri Dao's implementation, which helps optimize Transformer models and keeps attention from becoming the bottleneck.
Causal masking for speed: Training is accelerated by relying on a causal mask rather than a separate attention mask, a rethink of the conventional approach (see the sketch after this list).
Optimized cross-entropy loss: Unsloth does not just fine-tune; it fine-tunes precisely. Its optimized cross-entropy loss computation significantly reduces memory consumption, keeping the process resource-friendly without compromising accuracy.
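To make the causal-mask point concrete, here is a minimal sketch in plain PyTorch (not Unsloth's actual kernels, which go through xformers/Flash Attention): passing is_causal=True to scaled_dot_product_attention gives the same result as building and passing a separate lower-triangular attention mask, without materializing that extra tensor.
# Minimal sketch (plain PyTorch, not Unsloth internals): causal flag vs. explicit attention mask
import torch
import torch.nn.functional as F
batch, heads, seq_len, head_dim = 1, 8, 16, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)
# Explicit mask: an extra (seq_len, seq_len) boolean tensor has to be built and carried around
explicit_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=explicit_mask)
# Causal flag: the same result with no separate mask tensor
out_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(torch.allclose(out_masked, out_causal, atol=1e-6))  # True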
Code Implementation
Let's walk through the code for fine-tuning with Unsloth and QLoRA.
Step 1: Install the Libraries
# Import the PyTorch library
import torch
# Get the major and minor version of the current CUDA device (GPU)
major_version, minor_version = torch.cuda.get_device_capability()
# Apply the following if the GPU has Ampere or Hopper architecture (RTX 30xx, RTX 40xx, A100, H100, L40, etc.)
if major_version >= 8:
    # Install the Unsloth library for Ampere and Hopper architectures from GitHub
    !pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git" -q
# Apply the following for older GPUs (V100, Tesla T4, RTX 20xx, etc.)
else:
    # Install the Unsloth library for older GPUs from GitHub
    !pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git" -q
# Placeholder statement (does nothing)
pass
# Install the Hugging Face Transformers library from GitHub, which allows native 4-bit loading
!pip install "git+https://github.com/huggingface/transformers.git" -q
!pip install trl datasets -q
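If you are not sure which branch applies to your runtime, the short check below (plain PyTorch, not part of Unsloth) prints the GPU name and compute capability so you can see whether the Ampere/Hopper install path (compute capability 8.0 or higher) will be taken.
# Quick sanity check (plain PyTorch): which install branch applies to this GPU?
import torch
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    name = torch.cuda.get_device_name(0)
    extra = "unsloth[colab_ampere]" if major >= 8 else "unsloth[colab]"
    print(f"{name}: compute capability {major}.{minor} -> install {extra}")
else:
    print("No CUDA device detected; this walkthrough assumes an NVIDIA GPU.")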
Step 2: Import the Libraries and Load the Model
# Import FastLanguageModel from the Unsloth library
from unsloth import FastLanguageModel

max_seq_length = 2048  # Can be set arbitrarily; RoPE scaling is supported automatically
dtype = None  # Auto-detected when None; Float16 for Tesla T4/V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4-bit quantization to reduce memory usage; set to False to disable

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # Use "unsloth/mistral-7b" for 16-bit loading
    max_seq_length=max_seq_length,  # Maximum sequence length
    dtype=dtype,  # Data type
    load_in_4bit=load_in_4bit,  # Load the weights in 4 bits
    # token="hf_...",  # Pass a Hugging Face token when using a gated model (e.g., meta-llama/Llama-2-7b-hf)
)
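As a quick, optional sanity check (not part of the original walkthrough), you can generate a short completion with the freshly loaded base model; this gives you a baseline to compare against after fine-tuning. It uses only the standard Transformers generate API.
# Optional: baseline generation with the untouched base model (standard Transformers API)
inputs = tokenizer(["### Instruction:\nName three primary colors.\n\n### Response:\n"], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs)[0])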
Now add LoRA adapters so that only 1-10% of all parameters need to be updated!
model = FastLanguageModel.get_peft_model(
    model,  # The model loaded above
    r=16,  # LoRA rank; any positive number works (8, 16, 32, 64, 128 are common). Smaller values modify fewer parameters.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],  # Modules to which LoRA is applied
    lora_alpha=16,  # LoRA alpha; determines the strength of the applied LoRA update
    lora_dropout=0,  # Dropout rate for LoRA; currently only 0 is supported
    bias="none",  # Bias setting; currently only "none" is supported
    use_gradient_checkpointing=True,  # Use gradient checkpointing to improve memory efficiency
    random_state=3407,  # Seed for random number generation
    max_seq_length=max_seq_length,  # Maximum sequence length
)
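To check the "1-10% of parameters" figure on your own run, a small count like the one below works (plain PyTorch, not an Unsloth-specific API). Note that the 4-bit base weights are stored in packed form, so the total is approximate.
# Count trainable vs. total parameters (plain PyTorch; 4-bit weights are packed, so the total is approximate)
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable_params:,} / {total_params:,} ({100 * trainable_params / total_params:.2f}%)")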
Step 3: Load the Dataset
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
# The string above defines the prompt format for the Alpaca dataset

def formatting_prompts_func(examples):
    # Format each example in the dataset into the Alpaca prompt template
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Combine instruction, input, and output according to the prompt format
        text = alpaca_prompt.format(instruction, input, output)
        texts.append(text)
    # Return a list of formatted texts
    return { "text" : texts, }
# Placeholder (does nothing)
pass

# Import the load_dataset function from the datasets library
from datasets import load_dataset
# Load the training split of the cleaned Alpaca dataset from yahma
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
# Apply formatting_prompts_func to the dataset with batch processing
dataset = dataset.map(formatting_prompts_func, batched=True,)
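Before moving on to training, it is worth printing one formatted record to confirm the template was applied as expected (a quick inspection, not part of the original walkthrough):
# Inspect one formatted training example to confirm the Alpaca template was applied
print(dataset[0]["text"])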
Step 4: Train the Model
# Import SFTTrainer from the TRL library
from trl import SFTTrainer
# Import TrainingArguments from the Transformers library
from transformers import TrainingArguments

# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,  # The model to fine-tune
    train_dataset=dataset,  # The training dataset
    dataset_text_field="text",  # The text field in the dataset
    max_seq_length=max_seq_length,  # Maximum sequence length
    args=TrainingArguments(
        per_device_train_batch_size=2,  # Training batch size per device
        gradient_accumulation_steps=4,  # Number of gradient accumulation steps
        warmup_steps=5,  # Number of warm-up steps
        max_steps=20,  # Maximum number of training steps
        learning_rate=2e-4,  # Learning rate
        fp16=not torch.cuda.is_bf16_supported(),  # Use fp16 when bfloat16 is not supported
        bf16=torch.cuda.is_bf16_supported(),  # Use bfloat16 on GPUs that support it (Ampere+)
        logging_steps=1,  # Log every step
        optim="adamw_8bit",  # 8-bit AdamW optimizer
        weight_decay=0.01,  # Weight decay
        lr_scheduler_type="linear",  # Linear learning rate scheduler
        seed=3407,  # Random seed
        output_dir="outputs",  # Output directory
    ),
)
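One detail worth spelling out: with a per-device batch size of 2 and 4 gradient-accumulation steps, each optimizer update sees an effective batch of 2 * 4 = 8 examples, so the 20 max_steps above touch roughly 160 examples. That makes this a quick demo run rather than a full fine-tune; raise max_steps (or switch to num_train_epochs) for a real training job.
# Effective batch size per optimizer step for the configuration above
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
print(per_device_train_batch_size * gradient_accumulation_steps)  # 8 examples per update; 20 steps is roughly 160 examples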
Step 5: Show Current Memory Stats
# Get the properties of the GPU device at index 0
gpu_stats = torch.cuda.get_device_properties(0)
# Maximum reserved GPU memory so far, in GB, rounded to 3 decimal places
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
# Total GPU memory in GB, rounded to 3 decimal places
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
# Display the GPU name, its total memory, and how much is currently reserved
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
Step 6: Run the Training
trainer_stats = trainer.train()
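After training finishes, you can mirror Step 5 to see how much additional memory the run actually used. This is a sketch in the spirit of the official Unsloth notebooks, using only standard torch.cuda calls plus the start_gpu_memory and max_memory values computed earlier.
# Report memory used by the training run (standard torch.cuda calls; mirrors Step 5)
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"Peak reserved memory = {used_memory} GB ({used_percentage}% of max memory).")
print(f"Peak reserved memory for training = {used_for_training} GB.")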
Step 7: Convert the Model to GGUF
def colab_quantize_to_gguf(save_directory, quantization_method="q4_k_m"):
    # Define a function for conversion to GGUF
    from transformers.models.llama.modeling_llama import logger
    import os

    # Warn that this helper is still in development mode and encourage reporting any issues
    logger.warning_once(
        "Unsloth: `colab_quantize_to_gguf` is still in development mode.\n"\
        "If anything errors or breaks, please file a ticket on Github.\n"\
        "Also, if you used this successfully, please tell us on Discord!"
    )

    # Currently allowed quantization methods, with a description of each
    # From https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html
    ALLOWED_QUANTS = \
    {
        "q2_k"   : "Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.",
        "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
        "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
        "q3_k_s" : "Uses Q3_K for all tensors",
        "q4_0"   : "Original quant method, 4-bit.",
        "q4_1"   : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
        "q4_k_m" : "Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
        "q4_k_s" : "Uses Q4_K for all tensors",
        "q5_0"   : "Higher accuracy, higher resource usage and slower inference.",
        "q5_1"   : "Even higher accuracy, resource usage and slower inference.",
        "q5_k_m" : "Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
        "q5_k_s" : "Uses Q5_K for all tensors",
        "q6_k"   : "Uses Q8_K for all tensors",
        "q8_0"   : "Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.",
    }

    # If the specified quantization method is not allowed, raise an error listing the options
    if quantization_method not in ALLOWED_QUANTS.keys():
        error = f"Unsloth: Quant method = [{quantization_method}] not supported. Choose from below:\n"
        for key, value in ALLOWED_QUANTS.items():
            error += f"[{key}] => {value}\n"
        raise RuntimeError(error)

    # Display information about the conversion process
    print_info = \
        f"==((====))==  Unsloth: Conversion from QLoRA to GGUF information\n"\
        f"   \\\   /|    [0] Installing llama.cpp will take 3 minutes.\n"\
        f"O^O/ \_/ \\    [1] Converting HF to GGUF 16bits will take 3 minutes.\n"\
        f"\        /    [2] Converting GGUF 16bits to q4_k_m will take 20 minutes.\n"\
        f' "-____-"     In total, you will have to wait around 26 minutes.\n'
    print(print_info)

    # If llama.cpp is not present, clone and build it
    if not os.path.exists("llama.cpp"):
        print("Unsloth: [0] Installing llama.cpp. This will take 3 minutes...")
        !git clone https://github.com/ggerganov/llama.cpp
        !cd llama.cpp && make clean && LLAMA_CUBLAS=1 make -j
        !pip install gguf protobuf
        pass

    # Convert the Hugging Face checkpoint to GGUF 16-bit
    print("Unsloth: Starting conversion from HF to GGUF 16bit...")
    # print("Unsloth: [1] Converting HF into GGUF 16bit. This will take 3 minutes...")
    !python llama.cpp/convert.py {save_directory} \
        --outfile {save_directory}-unsloth.gguf \
        --outtype f16

    # Quantize the 16-bit GGUF file with the chosen method
    print("Unsloth: Starting conversion from GGUF 16bit to q4_k_m...")
    # print("Unsloth: [2] Converting GGUF 16bit into q4_k_m. This will take 20 minutes...")
    final_location = f"./{save_directory}-{quantization_method}-unsloth.gguf"
    !./llama.cpp/quantize ./{save_directory}-unsloth.gguf \
        {final_location} {quantization_method}

    # Display the output location of the converted file
    print(f"Unsloth: Output location: {final_location}")
pass
# Import the unsloth_save_model function from the Unsloth library
from unsloth import unsloth_save_model

# unsloth_save_model takes the same arguments as model.save_pretrained
# Save the model and tokenizer as "output_model" without pushing to the Hugging Face Hub
unsloth_save_model(model, tokenizer, "output_model", push_to_hub=False, token=None)

# Convert "output_model" to GGUF format using the "q4_k_m" quantization method
colab_quantize_to_gguf("output_model", quantization_method="q4_k_m")
Conclusion
Our exploration of Unsloth has been a journey to the front lines of advanced language models and AI innovation. From Ampere and Hopper architectures to the craft of Low-Rank Adaptation (LoRA) adapters, we covered data preparation, model training, and memory optimization.
The Alpaca dataset, trained with TRL's SFTTrainer, served as our canvas. Along the way we dug into memory usage, timing statistics, and GGUF conversion.
As the article closes, the Unsloth library stands as proof that engineering and creativity can meet. The final step of our journey, converting the model to GGUF format, highlights just how adaptable these tools are.
This exploration was never just about code; it was about innovation and inspiration. Unsloth invites us to keep pushing the boundaries of the ever-evolving world of language models and AI.
Stay connected and support my work through various platforms:
Medium: You can read my latest articles and insights at https://medium.com/@andysingal
Paypal: Enjoyed the article? Buy me a coffee! https://paypal.me/alphasingal?country.x=US&locale.x=en_US
Requests and questions: If you have a project you would like me to work on, or any questions about the concepts I have explained, please let me know. I am always looking for new ideas for future notebooks and am happy to help with any questions you may have.
Resources