Comparing the Performance of LLMs: A Deep Dive into Roberta, Llama 2, and Mistral for Disaster Tweets Analysis with LoRA

Published November 7, 2023

Guest

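Introduction

In the fast-moving field of Natural Language Processing (NLP), we often need to compare different language models to see which one works best for a specific task. This blog post compares three models: RoBERTa, Mistral-7b, and Llama-2-7b. We use them to tackle a common problem: classifying tweets about disasters. Note that Mistral and Llama 2 are large models with 7 billion parameters. In contrast, RoBERTa-large (355M parameters) is a relatively small model that serves as the baseline for this comparison.

In this blog, we use a PEFT (Parameter-Efficient Fine-Tuning) technique, LoRA (Low-Rank Adaptation of Large Language Models), to fine-tune the pre-trained models on a sequence classification task. LoRA is designed to significantly reduce the number of trainable parameters while maintaining strong downstream task performance.

The main objective of this blog post is to implement LoRA fine-tuning for sequence classification using three pre-trained models from Hugging Face: meta-llama/Llama-2-7b-hf, mistralai/Mistral-7B-v0.1, and roberta-large.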

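Hardware Used

Number of nodes: 1
Number of GPUs per node: 1
GPU type: A6000
GPU memory: 48GB

Goals

Implement fine-tuning of pre-trained LLMs using the LoRA PEFT method.
Learn how to use the Hugging Face APIs (transformers, peft, and datasets).
Set up hyperparameter tuning and experiment logging with Weights & Biases.

Dependencies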

datasets
evaluate
peft
scikit-learn
torch
transformers
wandb

Note: To reproduce the reported results, please check the pinned versions in the wandb reports.

Pre-trained Models

RoBERTa

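RoBERTa (Robustly Optimized BERT Approach) is an advanced variant of the BERT model proposed by the Meta AI research team. BERT is a Transformer-based language model that uses self-attention to build contextual word representations and is trained with a masked language modeling objective. Note that BERT is an encoder-only model used for natural language understanding tasks (such as sequence classification and token classification).

RoBERTa is a popular model to fine-tune and is well suited as the baseline for our experiments. For more information, you can check the Hugging Face model card.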

Llama 2

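Llama 2 (Large Language Model Meta AI) is a family of large language models (LLMs) introduced by Meta AI. The Llama 2 models vary in size, with parameter counts ranging from 7 billion to 70 billion.

Llama 2 is an autoregressive language model based on the Transformer decoder architecture. To generate text, Llama 2 takes a sequence of tokens as input and iteratively predicts the next token, feeding each prediction back into the input. Its architecture differs slightly from models like GPT-3. For example, Llama 2 uses the SwiGLU activation function rather than ReLU or GELU, and rotary positional embeddings instead of absolute learnable positional embeddings.

The recently released Llama 2 introduced architectural refinements to better handle long sequences: the context length is extended to up to 4096 tokens, and the larger variants of the family use grouped-query attention (GQA) for decoding.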

Mistral 7B

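Mistral 7B v0.1, with 7.3 billion parameters, is the first LLM introduced by Mistral AI. The main novel techniques used in the Mistral 7B architecture are:

Sliding Window Attention: replaces full attention (quadratic compute cost) with sliding-window attention, in which each token can attend to at most 4,096 tokens from the previous layer (linear compute cost). This mechanism lets Mistral 7B handle longer sequences, since higher layers can access historical information beyond the 4,096-token window.
Grouped-query Attention: also used in Llama 2, this technique shares each key/value head across a group of query heads, which shrinks the key-value cache kept for previously decoded tokens and speeds up inference.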
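To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is not Mistral's implementation: the helper names sliding_window_mask and max_reach are ours, and the tiny sequence length and window are chosen only so the mask can be printed.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j.

    A token sees itself and the (window - 1) tokens before it,
    never a future token (the mask is still causal).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def max_reach(num_layers, window):
    """Approximate upper bound on how far back information can travel
    once several sliding-window layers are stacked."""
    return num_layers * window


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    # "x" marks an allowed attention position, "." a masked one
    print("".join("x" if allowed else "." for allowed in row))

# Assumed values for illustration: 32 layers, 4096-token window
print(max_reach(num_layers=32, window=4096))
```

Each row of the mask has at most `window` allowed positions, which is why the cost grows linearly with sequence length, while stacking layers still lets distant tokens influence one another indirectly.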

LoRA

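PEFT (Parameter-Efficient Fine-Tuning) is a collection of techniques (p-tuning, prefix-tuning, IA3, Adapters, and LoRA) designed to fine-tune large models using a much smaller set of trainable parameters while preserving performance comparable to full fine-tuning.

LoRA (Low-Rank Adaptation) is a PEFT method that shares similarities with Adapter layers. Its primary goal is to reduce the number of trainable parameters. LoRA learns a low-rank update matrix while keeping the pre-trained weights frozen.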
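The low-rank update can be sketched in a few lines of plain Python. This is an illustration under our own assumptions (toy dimensions, hypothetical helper names matmul, lora_effective_weight, and lora_param_count), not the peft implementation: the frozen weight W is combined with the product of two small trainable matrices B and A, scaled by alpha / r.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A); W itself is left untouched (frozen)."""
    scale = alpha / r
    delta = matmul(B, A)  # shape (d_out x r) @ (r x d_in) = (d_out x d_in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def lora_param_count(d_out, d_in, r):
    """Number of trainable values in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 16  # toy sizes, for illustration only
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, starts at zero

# With B initialised to zero the adapted layer starts out identical to the frozen one
W_eff = lora_effective_weight(W, A, B, alpha, r)
print(W_eff == W)

# Parameter savings for one 4096 x 4096 projection at rank 16
full = 4096 * 4096
lora = lora_param_count(4096, 4096, 16)
print(full, lora, lora / full)
```

For a single 4096 x 4096 projection, a rank-16 adapter trains 131,072 values instead of 16,777,216, which is under 1% of the original matrix.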

Setup

RoBERTa has a maximum sequence length of 512, so we set MAX_LEN=512 for all models to ensure a fair comparison.

MAX_LEN = 512 
roberta_checkpoint = "roberta-large"
mistral_checkpoint = "mistralai/Mistral-7B-v0.1"
llama_checkpoint = "meta-llama/Llama-2-7b-hf"

Data Preparation

Data Loading

We will load the dataset from Hugging Face:

from datasets import load_dataset
dataset = load_dataset("mehdiiraqui/twitter_disaster")

Now, let's split the dataset into training and validation sets, then add the test set:

# Split the dataset into training and validation datasets
data = dataset['train'].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "val"
data['val'] = data.pop("test")
# Add the original test split to the DatasetDict
data['test'] = dataset['test']

Here's an overview of the dataset:

DatasetDict({
    train: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 6090
    })
    val: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 1523
    })
    test: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 3263
    })
})

Let's check the data distribution:

import pandas as pd

data['train'].to_pandas().info()
data['test'].to_pandas().info()

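Train dataset (the output below reports 7,613 rows, which is the full original training split before the 80/20 split into 6,090 training and 1,523 validation rows)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612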
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB

Test dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
 4   target    3263 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 127.6+ KB

Target distribution in the train dataset

target
0    4342
1    3271
Name: count, dtype: int64

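As the classes are not balanced, we compute the positive and negative weights and use them later in the loss calculation:

# Convert once and reuse the class counts for both weights
train_df = data['train'].to_pandas()
class_counts = train_df.target.value_counts()
pos_weights = len(train_df) / (2 * class_counts[1])
neg_weights = len(train_df) / (2 * class_counts[0])

The final weights are: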

POS_WEIGHT, NEG_WEIGHT = (1.1637114032405993, 0.8766697374481806)
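The weighting rule is total / (number of classes * class count), the same "balanced" heuristic used by scikit-learn. As a quick sanity check, the sketch below recomputes the weights from the class counts printed in the target distribution (4,342 negatives and 3,271 positives). The helper name balanced_class_weights is ours.

```python
def balanced_class_weights(class_counts):
    """Weight each class by total / (n_classes * count), so rarer classes weigh more."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count) for label, count in class_counts.items()}


# Class counts taken from the target distribution shown in this post
counts = {0: 4342, 1: 3271}
weights = balanced_class_weights(counts)
print(weights)
```

The positive class is rarer, so it receives a weight above 1, while the negative class receives a weight below 1.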

Then, we compute the maximum length of the text column:

# Number of Characters
max_char = data['train'].to_pandas()['text'].str.len().max()
# Number of Words
max_words = data['train'].to_pandas()['text'].str.split().str.len().max()

The maximum number of characters is 152.
The maximum number of words is 31.

Data Processing

Let's take a look at one example row of the training data:

data['train'][0]

{'id': 5285,
 'keyword': 'fear',
 'location': 'Thibodaux, LA',
 'text': 'my worst fear. https://#/iH8UDz8mq3',
 'target': 0}

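The data comprises a keyword, a location, and the text of the tweet. For the sake of simplicity, we select the text feature as the only input to the LLM.

At this stage, we have prepared the train, validation, and test sets in the Hugging Face format expected by the pre-trained LLMs. The next step is to define the tokenized datasets for training, using the appropriate tokenizer to transform the text feature into two tensors: a sequence of token IDs and an attention mask. As each model has its own tokenizer, we need to define three different datasets.

We start by defining the RoBERTa dataloader.

Load the tokenizer: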

from transformers import AutoTokenizer
roberta_tokenizer = AutoTokenizer.from_pretrained(roberta_checkpoint, add_prefix_space=True)

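Note: The RoBERTa tokenizer has been trained to treat spaces as part of the token. As a result, the first word of a sentence is encoded differently if it is not preceded by a space. To make sure the first word includes a space, we set add_prefix_space=True. To keep the pre-processing consistent across all three models, we also set this parameter to True for Llama 2 and Mistral 7B.

Define the preprocessing function for converting one row of the dataframe: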

def roberta_preprocessing_function(examples):
    return roberta_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)

By applying the preprocessing function to the first example of our training dataset, we get the tokenized input (input_ids) and the attention mask:

roberta_preprocessing_function(data['train'][0])

{'input_ids': [0, 127, 2373, 2490, 4, 1205, 640, 90, 4, 876, 73, 118, 725, 398, 13083, 329, 398, 119, 1343, 246, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Now, let's apply the preprocessing function to the entire dataset:

col_to_delete = ['id', 'keyword','location', 'text']
# Apply the preprocessing function and remove the undesired columns
roberta_tokenized_datasets = data.map(roberta_preprocessing_function, batched=True, remove_columns=col_to_delete)
# Rename the target column to label, following Hugging Face conventions
roberta_tokenized_datasets = roberta_tokenized_datasets.rename_column("target", "label")
# Set to torch format
roberta_tokenized_datasets.set_format("torch")

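Note: we deleted the undesired columns from the data: id, keyword, location, and text. We deleted text because we have already converted it into input IDs and an attention mask.

We can have a look at our tokenized training dataset: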

roberta_tokenized_datasets['train'][0]

{'label': tensor(0),
 'input_ids': tensor([    0,   127,  2373,  2490,     4,  1205,   640,    90,     4,   876,
            73,   118,   725,   398, 13083,   329,   398,   119,  1343,   246,
             2]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}

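To generate training batches, we also need to pad the rows of a given batch to the maximum length found in that batch. For that, we use the DataCollatorWithPadding class: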

# Data collator for padding a batch of examples to the maximum length seen in the batch
from transformers import DataCollatorWithPadding
roberta_data_collator = DataCollatorWithPadding(tokenizer=roberta_tokenizer)
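To show what padding to the longest row in the batch means, here is a simplified plain-Python sketch of the idea. It is not the DataCollatorWithPadding implementation: pad_batch is a hypothetical helper, the token IDs are made up for illustration, and we assume right padding with pad token ID 1.

```python
def pad_batch(features, pad_token_id):
    """Right-pad every example to the longest sequence in this batch only."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padded positions get attention mask 0 so the model ignores them
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


# Two toy examples of different lengths (IDs are illustrative only)
features = [
    {"input_ids": [0, 127, 2373, 2490, 4, 2], "attention_mask": [1] * 6},
    {"input_ids": [0, 127, 2], "attention_mask": [1] * 3},
]
batch = pad_batch(features, pad_token_id=1)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padding length is decided per batch, short tweets are never padded all the way to MAX_LEN, which keeps batches small.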

You can follow the same steps to prepare the data for the Mistral 7B and Llama 2 models.

Note that Llama 2 and Mistral 7B don't have a default pad_token_id, so we use the eos_token_id for padding as well.

Mistral 7B

# Load Mistral 7B Tokenizer
from transformers import AutoTokenizer, DataCollatorWithPadding
mistral_tokenizer = AutoTokenizer.from_pretrained(mistral_checkpoint, add_prefix_space=True)
mistral_tokenizer.pad_token_id = mistral_tokenizer.eos_token_id
mistral_tokenizer.pad_token = mistral_tokenizer.eos_token

def mistral_preprocessing_function(examples):
    return mistral_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)

mistral_tokenized_datasets = data.map(mistral_preprocessing_function, batched=True, remove_columns=col_to_delete)
mistral_tokenized_datasets = mistral_tokenized_datasets.rename_column("target", "label")
mistral_tokenized_datasets.set_format("torch")

# Data collator for padding a batch of examples to the maximum length seen in the batch
mistral_data_collator = DataCollatorWithPadding(tokenizer=mistral_tokenizer)

Llama 2

# Load Llama 2 Tokenizer
from transformers import AutoTokenizer, DataCollatorWithPadding
llama_tokenizer = AutoTokenizer.from_pretrained(llama_checkpoint, add_prefix_space=True)
llama_tokenizer.pad_token_id = llama_tokenizer.eos_token_id
llama_tokenizer.pad_token = llama_tokenizer.eos_token

def llama_preprocessing_function(examples):
    return llama_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)

llama_tokenized_datasets = data.map(llama_preprocessing_function, batched=True, remove_columns=col_to_delete)
llama_tokenized_datasets = llama_tokenized_datasets.rename_column("target", "label")
llama_tokenized_datasets.set_format("torch")

# Data collator for padding a batch of examples to the maximum length seen in the batch
llama_data_collator = DataCollatorWithPadding(tokenizer=llama_tokenizer)

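Now that we have prepared the tokenized datasets, the next section shows how to load the pre-trained LLM checkpoints and how to set up the LoRA weights.

Models

RoBERTa

Load the RoBERTa checkpoint for the classification task

We load the pre-trained RoBERTa model with a sequence classification head using the Hugging Face AutoModelForSequenceClassification class: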

from transformers import AutoModelForSequenceClassification 
roberta_model = AutoModelForSequenceClassification.from_pretrained(roberta_checkpoint, num_labels=2)

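LoRA setup for the RoBERTa classifier

We import the LoRA configuration and set some parameters for the RoBERTa classifier:

TaskType: sequence classification
r (rank): the rank of the decomposition matrices
lora_alpha: the alpha parameter used to scale the learned weights. The LoRA paper advises fixing alpha at 16
lora_dropout: dropout probability of the LoRA layers
bias: whether to add a bias term to the LoRA layers

The code below uses the values recommended by the LoRA paper. Later in this post, we will tune these parameters with wandb.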

from peft import get_peft_model, LoraConfig, TaskType
roberta_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1, bias="none",
)
roberta_model = get_peft_model(roberta_model, roberta_peft_config)
roberta_model.print_trainable_parameters()

We can see that the trainable parameters account for only 0.64% of the RoBERTa model parameters:

trainable params: 2,299,908 || all params: 356,610,052 || trainable%: 0.6449363911929212

Mistral

Load the checkpoint for the classification model

Let's load the pre-trained Mistral 7B model with a sequence classification head:

from transformers import AutoModelForSequenceClassification
import torch
mistral_model =  AutoModelForSequenceClassification.from_pretrained(
  pretrained_model_name_or_path=mistral_checkpoint,
  num_labels=2,
  device_map="auto"
)

For Mistral 7B, we have to add the padding token ID, as it is not defined by default:

mistral_model.config.pad_token_id = mistral_model.config.eos_token_id

LoRA setup for the Mistral 7B classifier

For the Mistral 7B model, we need to specify the target_modules (the query and value projections of the attention modules):

from peft import get_peft_model, LoraConfig, TaskType

mistral_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1, bias="none", 
    target_modules=[
        "q_proj",
        "v_proj",
    ],
)

mistral_model = get_peft_model(mistral_model, mistral_peft_config)
mistral_model.print_trainable_parameters()

The trainable parameters account for only 0.024% of the Mistral model parameters:

trainable params: 1,720,320 || all params: 7,112,380,416 || trainable%: 0.02418768259540745

Llama 2

Load the checkpoint for the classification model

Let's load the pre-trained Llama 2 model with a sequence classification head:

from transformers import AutoModelForSequenceClassification
import torch
llama_model =  AutoModelForSequenceClassification.from_pretrained(
  pretrained_model_name_or_path=llama_checkpoint,
  num_labels=2,
  device_map="auto",
  offload_folder="offload",
  trust_remote_code=True
)

For Llama 2, we have to add the padding token ID, as it is not defined by default:

llama_model.config.pad_token_id = llama_model.config.eos_token_id

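LoRA setup for the Llama 2 classifier

We define LoRA for Llama 2 with the same target modules as for Mistral. Note that the rank and dropout values in this snippet differ from the Mistral example: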

from peft import get_peft_model, LoraConfig, TaskType
llama_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=16, lora_alpha=16, lora_dropout=0.05, bias="none", 
    target_modules=[
        "q_proj",
        "v_proj",  
    ],
)

llama_model = get_peft_model(llama_model, llama_peft_config)
llama_model.print_trainable_parameters()

The trainable parameters account for only 0.12% of the Llama 2 model parameters:

trainable params: 8,404,992 || all params: 6,615,748,608 || trainable%: 0.1270452143516515

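At this point, we have defined the tokenized datasets for training as well as the LLM setups with LoRA layers. The following section introduces how to launch training using the Hugging Face Trainer class.

Setup the trainer

Evaluation Metrics

First, we define the performance metrics used to compare the three models: F1 score, recall, precision, and accuracy.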

import evaluate
import numpy as np

def compute_metrics(eval_pred):
    # All metrics are already predefined in the HF `evaluate` package
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric= evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")

    logits, labels = eval_pred # eval_pred is the tuple of predictions and labels returned by the model
    predictions = np.argmax(logits, axis=-1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    # The trainer is expecting a dictionary where the keys are the metrics names and the values are the scores. 
    return {"precision": precision, "recall": recall, "f1-score": f1, 'accuracy': accuracy}
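For readers who want to see what these four metrics measure, the sketch below recomputes them by hand for the positive class (label 1) on a toy set of predictions. It is only an illustration of the definitions; binary_metrics is a hypothetical helper and is not part of the evaluate package.

```python
def binary_metrics(predictions, labels):
    """Precision, recall, F1 and accuracy for the positive class (label 1)."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    correct = sum(1 for p, y in pairs if p == y)

    # Fall back to 0.0 when a denominator is empty
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1-score": f1, "accuracy": accuracy}


# Toy predictions and labels, for illustration only
toy_predictions = [1, 0, 1, 1, 0, 0]
toy_labels = [1, 0, 0, 1, 1, 0]
toy_scores = binary_metrics(toy_predictions, toy_labels)
print(toy_scores)
```

With imbalanced classes, accuracy alone can look good while recall on the rarer class is poor, which is why the F1 score is the metric we compare models on.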

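Custom Trainer for Weighted Loss

As mentioned earlier in this post, the positive and negative classes are imbalanced. We need to train our models with a weighted cross-entropy loss to account for that. The Trainer class doesn't support providing a custom loss, as it expects to get the loss directly from the model's outputs.

So, we need to define our own WeightedCELossTrainer that overrides the compute_loss method to calculate the weighted cross-entropy loss based on the model's predictions and the input labels: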

import torch
from transformers import Trainer

class WeightedCELossTrainer(Trainer):
    # **kwargs absorbs extra arguments (such as num_items_in_batch) that newer transformers versions pass
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        # Get model's predictions
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Class weights are ordered [negative, positive] to match label ids 0 and 1
        class_weights = torch.tensor([neg_weights, pos_weights], device=logits.device, dtype=logits.dtype)
        # Compute custom loss
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
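To make the weighting explicit, here is a plain-Python sketch of a weighted cross-entropy averaged the way PyTorch's default "mean" reduction does when class weights are given: the weighted per-example losses are divided by the sum of the weights that were used. The helper weighted_cross_entropy and the toy logits are ours, for illustration only.

```python
import math


def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-example cross-entropy, normalised by the sum of the weights used."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp over the class logits
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        nll = log_norm - row[label]  # negative log-probability of the true class
        weighted_sum += class_weights[label] * nll
        weight_total += class_weights[label]
    return weighted_sum / weight_total


# Weights reported earlier in this post, ordered [negative, positive]
NEG_WEIGHT, POS_WEIGHT = 0.8766697374481806, 1.1637114032405993

# Toy logits and labels, for illustration only
toy_logits = [[2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]]
toy_targets = [0, 1, 1]

weighted = weighted_cross_entropy(toy_logits, toy_targets, [NEG_WEIGHT, POS_WEIGHT])
unweighted = weighted_cross_entropy(toy_logits, toy_targets, [1.0, 1.0])
print(weighted, unweighted)
```

In this toy batch the least confident prediction belongs to the positive class, so up-weighting that class raises the loss relative to the unweighted average.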

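Trainer Setup

Let's set up the training arguments and the trainer for the three models.

RoBERTa

The first important step is to move the model to the GPU device for training:

roberta_model = roberta_model.cuda()
# `device` is a property, not a method, so it is read without parentheses
roberta_model.device

It will print the following: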

device(type='cuda', index=0)

Then, we set the training arguments:

from transformers import TrainingArguments

lr = 1e-4
batch_size = 8
num_epochs = 5

training_args = TrainingArguments(
    output_dir="roberta-large-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type= "constant",
    warmup_ratio= 0.1,
    max_grad_norm= 0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    fp16=False,
    gradient_checkpointing=True,
)

Finally, we define the RoBERTa trainer by providing the model, the training arguments, and the tokenized datasets:

roberta_trainer = WeightedCELossTrainer(
    model=roberta_model,
    args=training_args,
    train_dataset=roberta_tokenized_datasets['train'],
    eval_dataset=roberta_tokenized_datasets["val"],
    data_collator=roberta_data_collator,
    compute_metrics=compute_metrics
)

Mistral-7B

Similar to RoBERTa, we initialize the WeightedCELossTrainer as follows:

from transformers import TrainingArguments, Trainer

mistral_model = mistral_model.cuda()

lr = 1e-4
batch_size = 8
num_epochs = 5

training_args = TrainingArguments(
    output_dir="mistral-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type= "constant",
    warmup_ratio= 0.1,
    max_grad_norm= 0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    fp16=True,
    gradient_checkpointing=True,
)


mistral_trainer = WeightedCELossTrainer(
    model=mistral_model,
    args=training_args,
    train_dataset=mistral_tokenized_datasets['train'],
    eval_dataset=mistral_tokenized_datasets["val"],
    data_collator=mistral_data_collator,
    compute_metrics=compute_metrics
)

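Note that we need to enable half-precision training by setting fp16 to True. The main reason is that Mistral 7B is large, and its weights cannot comfortably fit into the memory of a single GPU (48GB) in full float32 precision once the rest of the training state is accounted for.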
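A back-of-the-envelope estimate helps put the memory pressure in perspective. The sketch below only counts the bytes needed to store the weights, using the total parameter count printed earlier for the Mistral classifier. Activations, gradients, optimizer state, and framework overhead come on top, and how much of the saving is realized depends on how the model is loaded and how mixed precision is configured. The helper weight_memory_gb is ours.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Storage needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# "all params" value printed by print_trainable_parameters() for Mistral 7B
mistral_params = 7_112_380_416

fp32_gb = weight_memory_gb(mistral_params, 4)  # float32: 4 bytes per parameter
fp16_gb = weight_memory_gb(mistral_params, 2)  # float16: 2 bytes per parameter
print(round(fp32_gb, 2), round(fp16_gb, 2))
```

The float32 weights alone take up well over half of a 48GB card before any training state is allocated, which leaves limited headroom for batches of activations.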

Llama 2

Similar to Mistral 7B, we define the trainer as follows:

from transformers import TrainingArguments, Trainer

llama_model = llama_model.cuda()

lr = 1e-4
batch_size = 8
num_epochs = 5
training_args = TrainingArguments(
    output_dir="llama-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type= "constant",
    warmup_ratio= 0.1,
    max_grad_norm= 0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    fp16=True,
    gradient_checkpointing=True,
)



llama_trainer = WeightedCELossTrainer(
    model=llama_model,
    args=training_args,
    train_dataset=llama_tokenized_datasets['train'],
    eval_dataset=llama_tokenized_datasets["val"],
    data_collator=llama_data_collator,
    compute_metrics=compute_metrics
)

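Hyperparameter Tuning

We used the Wandb Sweep API to run hyperparameter tuning with a Bayesian search strategy (30 runs). The hyperparameters tuned are the following.

Setting	Distribution	Values
method	-	bayes
metric	-	goal: maximize; name: eval/f1-score
lora_alpha	categorical	16, 32, 64
lora_bias	categorical	None
lora_dropout	uniform	min: 0, max: 0.1
lora_rank	categorical	4, 8, 16, 32
lr	uniform	min: 1e-05, max: 2e-04
max_length	categorical	512

For more information, you can check the Wandb experiment report in the resources section.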
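The search space above could be written as a sweep configuration dictionary along the following lines. This is a sketch, not the exact configuration used for the reported runs: the parameter key names mirror the table, the lora_bias value is written as the string "none" on the assumption that it is passed to LoraConfig(bias=...), and the project name and training function mentioned in the comments are hypothetical.

```python
# Sweep configuration mirroring the table above (a sketch under the stated assumptions)
sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval/f1-score", "goal": "maximize"},
    "parameters": {
        "lora_alpha": {"values": [16, 32, 64]},
        "lora_bias": {"values": ["none"]},  # assumption: forwarded to LoraConfig(bias=...)
        "lora_dropout": {"distribution": "uniform", "min": 0.0, "max": 0.1},
        "lora_rank": {"values": [4, 8, 16, 32]},
        "lr": {"distribution": "uniform", "min": 1e-5, "max": 2e-4},
        "max_length": {"values": [512]},
    },
}

# Launching the sweep would then look roughly like this
# ("disaster-tweets-lora" and train_fn are hypothetical names):
#   import wandb
#   sweep_id = wandb.sweep(sweep_config, project="disaster-tweets-lora")
#   wandb.agent(sweep_id, function=train_fn, count=30)

print(sorted(sweep_config["parameters"]))
```

Inside the training function, each run would read its sampled values from wandb.config and use them to build the LoraConfig and TrainingArguments.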

Results

Model	F1 score	Training time	Memory consumption	Trainable parameters (% of total)
RoBERTa	0.8077	538 seconds	GPU1: 9.1 GB, GPU2: 8.3 GB	0.64%
Mistral 7B	0.7364	2030 seconds	GPU1: 29.6 GB, GPU2: 29.5 GB	0.024%
Llama 2	0.7638	2052 seconds	GPU1: 35 GB, GPU2: 33.9 GB	0.12%

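Conclusion

In this blog post, we compared the performance of three large language models (LLMs), RoBERTa, Mistral 7B, and Llama 2, on disaster tweet classification using LoRA. From the results, we can see that RoBERTa outperforms Mistral 7B and Llama 2 by a clear margin (an F1 score of 0.8077 versus 0.7364 and 0.7638). This raises the question of whether we really need a complex and large LLM for tasks such as short-sequence binary classification.

One takeaway from this study is that the choice of LLM should take into account the specific project requirements, the available resources, and the performance needs.

Also, for relatively simple prediction tasks with short sequences, base models such as RoBERTa remain competitive.

Finally, we showed that the LoRA method can be applied to both encoder (RoBERTa) and decoder (Llama 2 and Mistral 7B) models.

Resources

You can find the code script in the following GitHub project.
You can check the hyperparameter search results in the following Weights & Biases reports.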
