Fine-tune and deploy embedding models with Amazon SageMaker

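Embedding models are crucial for successful RAG applications, but they are often trained on general knowledge, which limits their effectiveness for company- or domain-specific use cases. Customizing embeddings for domain-specific data can significantly improve the retrieval performance of a RAG application. With the new releases of Sentence Transformers 3 and the Hugging Face Embedding Container, fine-tuning and deploying embedding models is easier than ever.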

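In this example, we show how to fine-tune and deploy a custom embedding model on Amazon SageMaker using the new Hugging Face Embedding Container. We will use the Sentence Transformers 3 library to fine-tune a model on a custom dataset and deploy it on Amazon SageMaker for inference. Specifically, we will fine-tune BAAI/bge-base-en-v1.5 for a financial RAG application, using a synthetic dataset built from the 2023_10 NVIDIA SEC Filing.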

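1. Set up the development environment
2. Create and prepare the dataset
3. Fine-tune the embedding model on Amazon SageMaker
4. Deploy and test the fine-tuned embedding model on Amazon SageMaker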

What is new with Sentence Transformers 3?

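Sentence Transformers v3 introduces a new trainer that makes it easier to fine-tune and train embedding models. The update includes enhanced components such as diverse datasets, updated loss functions, and a streamlined training process, which improve the efficiency and flexibility of model development.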

What is the Hugging Face Embedding Container?

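The Hugging Face Embedding Container is a new purpose-built inference container for easily deploying embedding models in a secure and managed environment. The DLC is powered by Text Embedding Inference (TEI), a fast and memory-efficient solution for deploying and serving embedding models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. TEI also implements many additional serving features.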

Note: This blog was created and validated with ml.g5.xlarge for training and ml.c6i.2xlarge for inference.

1. Set up the development environment

Our first step is to install the Hugging Face libraries we need on the client to correctly prepare our dataset and start our training/evaluation jobs.

!pip install transformers "datasets[s3]==2.18.0" "sagemaker>=2.190.0" "huggingface_hub[cli]" --upgrade --quiet

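If you are going to use SageMaker in a local environment, you need access to an IAM role with the permissions SageMaker requires. You can learn more about this in the SageMaker documentation on IAM roles.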

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Create and prepare the dataset

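An embedding dataset typically consists of text pairs (question, answer/context) or triplets that represent relationships or similarities between sentences. The dataset format you choose, or have available, also determines which loss functions you can use. Common formats for embedding datasets are: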

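Positive pair: text pairs of related sentences (query, context | query, answer), suitable for tasks like similarity or semantic search. Example datasets: `sentence-transformers/sentence-compression`, `sentence-transformers/natural-questions`.
Triplets: text triplets consisting of (anchor, positive, negative). Example datasets: `sentence-transformers/quora-duplicates`, `nirantk/triplets`.
Pair with similarity score: sentence pairs with a similarity score indicating how related they are. Example datasets: `sentence-transformers/stsb`, `PhilipMay/stsb_multi_mt`.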

Learn more in the Dataset Overview.
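To make the three formats concrete, here is a minimal sketch of what one record of each might look like, together with a small helper that names loss functions commonly paired with each format. The column names, the placeholder text, and the helper `detect_format` are illustrative assumptions rather than part of this example's scripts; the loss names refer to classes in `sentence_transformers.losses`.

```python
# Illustrative records for the three common formats (placeholder text only).
positive_pair = {"anchor": "<question>", "positive": "<relevant context>"}
triplet = {"anchor": "<question>", "positive": "<relevant>", "negative": "<unrelated>"}
scored_pair = {"sentence1": "<text a>", "sentence2": "<text b>", "score": 0.8}

# Loss functions commonly used with each format (names from sentence_transformers.losses).
LOSSES_BY_FORMAT = {
    "positive_pair": ["MultipleNegativesRankingLoss"],
    "triplet": ["MultipleNegativesRankingLoss", "TripletLoss"],
    "pair_with_score": ["CoSENTLoss", "CosineSimilarityLoss"],
}


def detect_format(record: dict) -> str:
    """Guess the dataset format from the (assumed) column names of one record."""
    keys = set(record)
    if {"anchor", "positive", "negative"} <= keys:
        return "triplet"
    if {"sentence1", "sentence2", "score"} <= keys:
        return "pair_with_score"
    if {"anchor", "positive"} <= keys:
        return "positive_pair"
    raise ValueError(f"Unrecognized columns: {sorted(keys)}")


for record in (positive_pair, triplet, scored_pair):
    fmt = detect_format(record)
    print(fmt, "->", LOSSES_BY_FORMAT[fmt])
```

Our dataset falls into the first category, which is why the columns are renamed to `anchor` and `positive` before training.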

We are going to use philschmid/finanical-rag-embedding-dataset, which contains 7,000 positive text pairs of questions and their corresponding contexts from the 2023_10 NVIDIA SEC Filing.

The dataset has the following format:

{"question": "<question>", "context": "<relevant context to answer>"}
{"question": "<question>", "context": "<relevant context to answer>"}
{"question": "<question>", "context": "<relevant context to answer>"}

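We will use the filesystem integration of the datasets library to upload our dataset to S3. We are using sess.default_bucket(); adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 paths later in our training script.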

from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("philschmid/finanical-rag-embedding-dataset", split="train")
input_path = f"s3://{sess.default_bucket()}/datasets/rag-embedding"

# rename columns
dataset = dataset.rename_column("question", "anchor")
dataset = dataset.rename_column("context", "positive")

# Add an id column to the dataset
dataset = dataset.add_column("id", range(len(dataset)))

# split dataset into a 10% test set
dataset = dataset.train_test_split(test_size=0.1)

# save the train and test splits to S3
dataset["train"].to_json(f"{input_path}/train/dataset.json", orient="records")
train_dataset_s3_path = f"{input_path}/train/dataset.json"
dataset["test"].to_json(f"{input_path}/test/dataset.json", orient="records")
test_dataset_s3_path = f"{input_path}/test/dataset.json"

print("Training data uploaded to:")
print(train_dataset_s3_path)
print(test_dataset_s3_path)
print(
    f"https://s3.console.aws.amazon.com/s3/buckets/{sess.default_bucket()}/?region={sess.boto_region_name}&prefix={input_path.split('/', 3)[-1]}/"
)

3. Fine-tune the embedding model on Amazon SageMaker

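We are now ready to fine-tune our model. We will use the SentenceTransformerTrainer from sentence-transformers. Because it is a subclass of the Trainer from transformers, it makes supervised fine-tuning of open embedding models straightforward. We prepared a script, run_mnr.py, which loads the dataset from disk, prepares the model and tokenizer, and starts the training. The SentenceTransformerTrainer supports: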

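Integrated components: combines datasets, loss functions, and evaluators into a unified training framework.
Flexible data handling: supports various data formats and integrates easily with Hugging Face datasets.
Versatile loss functions: offers a range of loss functions for different training tasks.
Multi-dataset training: facilitates simultaneous training with multiple datasets and different loss functions.
Seamless integration: easy saving, loading, and sharing of models within the Hugging Face ecosystem.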

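To create a SageMaker training job we need a HuggingFace Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure. Amazon SageMaker takes care of starting and managing all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at /opt/ml/input/data. It then starts the training job by running the entry point script.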

Note: If you use a custom training script, make sure to include a requirements.txt in your source_dir. We recommend simply cloning the whole repository.

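Let's first define our training parameters. They are passed as CLI arguments to our training script. We will use the BAAI/bge-base-en-v1.5 model, which was pre-trained on a large corpus of English text. We will combine MultipleNegativesRankingLoss with MatryoshkaLoss. This approach lets us benefit from the efficiency and flexibility of Matryoshka embeddings, so that different embedding dimensions can be used without a significant drop in performance. MultipleNegativesRankingLoss is a great loss function if you only have positive pairs, because it uses the other samples in the batch as negatives, giving each sample n-1 negatives in a batch of size n.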
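To illustrate how in-batch negatives and the Matryoshka objective fit together, here is a minimal pure-Python sketch. It is not the sentence-transformers implementation: the function names, the toy 4-dimensional vectors, and the unweighted sum over dimensions are assumptions for illustration. The idea is a cross-entropy over scaled cosine similarities in which the i-th positive is the target for the i-th anchor, evaluated again on truncated prefixes of the embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and the other n-1 positives in the batch act as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


def matryoshka_mnr_loss(anchors, positives, dims, scale=20.0):
    """Sum the in-batch loss over embeddings truncated to each dimension in dims."""
    return sum(
        mnr_loss([a[:d] for a in anchors], [p[:d] for p in positives], scale)
        for d in dims
    )


# Toy batch of three (anchor, positive) pairs with made-up 4-dimensional embeddings.
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3], [0.5, 0.5, 1.0, 0.0]]
aligned = [[0.9, 0.1, 0.2, 0.0], [0.1, 0.9, 0.0, 0.3], [0.4, 0.6, 0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # every anchor gets a wrong positive

loss_aligned = matryoshka_mnr_loss(anchors, aligned, dims=[4, 2])
loss_shuffled = matryoshka_mnr_loss(anchors, shuffled, dims=[4, 2])
print(f"aligned: {loss_aligned:.3f}  shuffled: {loss_shuffled:.3f}")
```

The loss is low when each anchor is most similar to its own positive at every truncation length, which is what pushes the most important information into the leading dimensions.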

from sagemaker.huggingface import HuggingFace

# define Training Job Name
job_name = "bge-base-exp1"

# define hyperparameters, which are passed into the training job
training_arguments = {
    "model_id": "BAAI/bge-base-en-v1.5",  # model id from the hub
    "train_dataset_path": "/opt/ml/input/data/train/",  # path inside the container where the training data is stored
    "test_dataset_path": "/opt/ml/input/data/test/",  # path inside the container where the test data is stored
    "num_train_epochs": 3,  # number of training epochs
    "learning_rate": 2e-5,  # learning rate
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point="run_mnr.py",  # train script
    source_dir="scripts",  # directory which includes all the files needed for training
    instance_type="ml.g5.xlarge",  # instances type used for the training job
    instance_count=1,  # the number of instances used for training
    max_run=2 * 24 * 60 * 60,  # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name=job_name,  # the name of the training job
    role=role,  # IAM role used in the training job to access AWS resources, e.g. S3
    transformers_version="4.36.0",  # the transformers version used in the training job
    pytorch_version="2.1.0",  # the pytorch_version version used in the training job
    py_version="py310",  # the python version used in the training job
    hyperparameters=training_arguments,
    disable_output_compression=True,  # not compress output to save training time and cost
    environment={
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",  # set env variable to cache models in /tmp
    },
)
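On the script side, the hyperparameters defined above arrive as command-line flags of the form `--key value`. The following is a minimal sketch of how a training script could read them with argparse. It is not the actual run_mnr.py; the function name `parse_args` and the default values are assumptions that mirror the dictionary above.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that the training job passes as CLI flags."""
    parser = argparse.ArgumentParser(description="Embedding fine-tuning arguments (sketch)")
    parser.add_argument("--model_id", type=str, default="BAAI/bge-base-en-v1.5")
    parser.add_argument("--train_dataset_path", type=str, default="/opt/ml/input/data/train/")
    parser.add_argument("--test_dataset_path", type=str, default="/opt/ml/input/data/test/")
    parser.add_argument("--num_train_epochs", type=int, default=3)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    # parse_known_args ignores any extra flags instead of raising an error
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the flags a training job would pass for the hyperparameters above.
args = parse_args(
    ["--model_id", "BAAI/bge-base-en-v1.5", "--num_train_epochs", "3", "--learning_rate", "2e-05"]
)
print(args.model_id, args.num_train_epochs, args.learning_rate)
```

Typed arguments matter here because the values reach the script as strings and need to be converted back to integers and floats.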

We can now start our training job with the .fit() method, passing our S3 paths to the training script.

# define a data input dictionary with our uploaded s3 uris
data = {
    "train": train_dataset_s3_path,
    "test": test_dataset_s3_path,
}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

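In our example, training BGE Base with Flash Attention 2 (SDPA) for 3 epochs on a dataset of 6.3k training samples and 700 evaluation samples took 645 seconds (about 11 minutes) on an ml.g5.xlarge. At the quoted rate of $1.2575/h, that is roughly $0.23 of compute.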
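As a quick sanity check, the compute cost implied by the stated runtime and hourly rate can be worked out directly. This is a minimal sketch that assumes billing is proportional to training time at the quoted rate; it ignores storage, data transfer, and any other charges.

```python
HOURLY_RATE_USD = 1.2575  # ml.g5.xlarge rate quoted above
TRAINING_SECONDS = 645    # runtime reported above


def training_cost(seconds: float, hourly_rate: float) -> float:
    """Compute cost for a job billed proportionally to its runtime."""
    return seconds / 3600 * hourly_rate


cost = training_cost(TRAINING_SECONDS, HOURLY_RATE_USD)
minutes = TRAINING_SECONDS / 60
print(f"runtime: {minutes:.2f} min, estimated compute cost: ${cost:.2f}")
```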

4. Deploy and test the fine-tuned embedding model on Amazon SageMaker

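We will use the Hugging Face Embedding Container, a purpose-built inference container for easily deploying embedding models in a secure and managed environment. The DLC is powered by Text Embedding Inference (TEI), a fast and memory-efficient solution for deploying and serving embedding models.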

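To retrieve the new Hugging Face Embedding Container in Amazon SageMaker, we can use the `get_huggingface_llm_image_uri` method provided by the SageMaker SDK. This method returns the URI of the desired Hugging Face Embedding Container. Note that TEI comes in two different versions, one for CPU and one for GPU, so we create a helper function that retrieves the correct image URI based on the instance type.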

from sagemaker.huggingface import get_huggingface_llm_image_uri


# retrieve the image uri based on instance type
def get_image_uri(instance_type):
    key = (
        "huggingface-tei"
        if instance_type.startswith("ml.g") or instance_type.startswith("ml.p")
        else "huggingface-tei-cpu"
    )
    return get_huggingface_llm_image_uri(key, version="1.4.0")

We can now create a HuggingFaceModel using the container URI and the S3 path to our model. We also need to set our TEI configuration.

from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.c6i.2xlarge"

# create HuggingFaceModel with the image uri
emb_model = HuggingFaceModel(
    role=role,
    image_uri=get_image_uri(instance_type),
    model_data=huggingface_estimator.model_data,
    env={"HF_MODEL_ID": "/opt/ml/model"},  # Path to the model in the container
)

After creating the HuggingFaceModel, we can deploy it to Amazon SageMaker with the deploy method. We will deploy the model on an ml.c6i.2xlarge instance.

# Deploy model to an endpoint
emb = emb_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
)

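SageMaker will now create our endpoint and deploy the model to it. This can take around 5 minutes. Once the endpoint is deployed, we can run inference on it using the predictor's predict method.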

data = {
    "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = emb.predict(data=data)


# print some results
print(f"length of embeddings: {len(res[0])}")
print(f"first 10 elements of embeddings: {res[0][:10]}")

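We trained our model with Matryoshka Loss, which means the semantic meaning is front-loaded into the leading dimensions. To use a smaller Matryoshka dimension, we need to truncate our embeddings manually. Below is an example of how to truncate the embeddings to 256 dimensions, which is 1/3 of the original size of 768. If we check the training logs, the NDCG metric is 0.823 for 768 dimensions and 0.818 for 256 dimensions, meaning we retain more than 99% of the accuracy.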

data = {
    "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = emb.predict(data=data)

# truncate embeddings to matryoshka dimensions
dim = 256
res = res[0][0:dim]

# print some results
print(f"length of embeddings: {len(res)}")
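One detail to keep in mind: a truncated vector generally no longer has unit length, so if the downstream vector store relies on dot products, it can help to re-normalize after truncating. The following is a minimal sketch with made-up 8-dimensional vectors standing in for endpoint output; the helper names `truncate_and_normalize` and `dot` are hypothetical.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit L2 norm."""
    truncated = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        return truncated  # an all-zero vector cannot be rescaled
    return [x / norm for x in truncated]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy vectors standing in for two embeddings returned by the endpoint (made-up values).
query_embedding = [0.5, 0.4, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05]
doc_embedding = [0.45, 0.35, 0.35, 0.25, 0.0, 0.2, 0.1, 0.0]

dim = 4
q = truncate_and_normalize(query_embedding, dim)
d = truncate_and_normalize(doc_embedding, dim)

# Both vectors are unit length, so the dot product equals the cosine similarity.
similarity = dot(q, d)
print(f"dimensions: {len(q)}, similarity: {similarity:.4f}")
```

The same truncation length has to be applied to both queries and documents, otherwise their vectors are not comparable.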

Awesome! 🚀 We can now generate embeddings and integrate the endpoint into our RAG application.

To clean up, we can delete the model and the endpoint.

emb.delete_model()
emb.delete_endpoint()

📍 Find the complete example on GitHub.
