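How to deploy embedding models to Amazon SageMaker using the new Hugging Face Embedding DLC

This is an example of how to deploy open embedding models, such as Snowflake/snowflake-arctic-embed-l, BAAI/bge-large-en-v1.5, or sentence-transformers/all-MiniLM-L6-v2, to Amazon SageMaker for inference using the new Hugging Face Embedding Inference Container. We will deploy Snowflake/snowflake-arctic-embed-m, one of the best open embedding models for retrieval and ranking on the MTEB leaderboard.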

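This example covers the following steps:

Set up the development environment
Retrieve the new Hugging Face Embedding Container
Deploy Snowflake Arctic to Amazon SageMaker
Run and evaluate inference performance
Delete the model and endpoint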

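What is the Hugging Face Embedding DLC?

The Hugging Face Embedding DLC is a new purpose-built inference container for easily deploying embedding models in a secure and managed environment. The DLC is powered by Text Embedding Inference (TEI), a fast and memory-efficient solution for deploying and serving embedding models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE, and E5. TEI implements many features, such as: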

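No model graph compilation step
Small Docker images and fast boot times
Token-based dynamic batching
Optimized Transformers code for inference using Flash Attention, Candle, and cuBLASLt
Safetensors weight loading
Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)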
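To make the idea of token-based dynamic batching in the list above more concrete, here is a minimal sketch. It is a simplified illustration and not TEI's actual implementation: the function name `batch_by_token_budget` and the `max_batch_tokens` parameter are assumptions made for this example. Queued requests are grouped in arrival order so that the total number of tokens in a batch stays within a budget, rather than capping the number of requests.

```python
from typing import List


def batch_by_token_budget(token_counts: List[int], max_batch_tokens: int) -> List[List[int]]:
    """Group request indices into batches whose total token count stays within the budget.

    Hypothetical helper for illustration only; TEI's real scheduler is more involved.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for idx, n_tokens in enumerate(token_counts):
        if n_tokens > max_batch_tokens:
            raise ValueError(f"request {idx} has {n_tokens} tokens, above the budget")
        # Close the current batch when adding this request would exceed the budget.
        if current and current_tokens + n_tokens > max_batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches


# Token counts of six queued requests (made-up values).
queued = [256, 256, 32, 500, 16, 16]
print(batch_by_token_budget(queued, max_batch_tokens=512))
```

With a token budget, many short inputs can share a batch while a single long input may fill one on its own, which keeps the amount of work per batch roughly constant.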

TEI supports the following model architectures:

BERT/CamemBERT, e.g. BAAI/bge-large-en-v1.5 or Snowflake/snowflake-arctic-embed-m
RoBERTa, e.g. sentence-transformers/all-roberta-large-v1
XLM-RoBERTa, e.g. sentence-transformers/paraphrase-xlm-r-multilingual-v1
NomicBert, e.g. nomic-ai/nomic-embed-text-v1.5
JinaBert, e.g. jinaai/jina-embeddings-v2-base-en

Let's get started!

1. Set up the development environment

We will use the sagemaker Python SDK to deploy Snowflake Arctic to Amazon SageMaker. We need an AWS account configured and the sagemaker Python SDK installed.

!pip install "sagemaker>=2.221.1" --upgrade --quiet

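If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. See the SageMaker documentation on execution roles for details.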

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

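2. Retrieve the new Hugging Face Embedding Container

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and pass it to the HuggingFaceModel class through its image_uri argument. To retrieve the new Hugging Face Embedding Container in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK, which returns the URI of the desired Hugging Face Embedding Container. Note that TEI comes in two different versions, one for CPU and one for GPU, so we create a helper function that retrieves the correct image URI based on the instance type.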

from sagemaker.huggingface import get_huggingface_llm_image_uri


# retrieve the image uri based on instance type
def get_image_uri(instance_type):
    key = (
        "huggingface-tei"
        if instance_type.startswith("ml.g") or instance_type.startswith("ml.p")
        else "huggingface-tei-cpu"
    )
    return get_huggingface_llm_image_uri(key, version="1.4.0")
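The selection rule inside the helper can be checked without calling the SDK. The sketch below only mirrors the prefix test used above; `tei_backend_for` is a hypothetical name introduced for this illustration, and the instance types in the loop are just examples.

```python
def tei_backend_for(instance_type: str) -> str:
    """Return the image key the helper above would pick for an instance type."""
    # GPU instance families on SageMaker start with "ml.g" or "ml.p";
    # everything else falls back to the CPU image.
    is_gpu = instance_type.startswith(("ml.g", "ml.p"))
    return "huggingface-tei" if is_gpu else "huggingface-tei-cpu"


for instance_type in ["ml.g5.xlarge", "ml.p4d.24xlarge", "ml.c6i.2xlarge", "ml.m5.xlarge"]:
    print(f"{instance_type:>16} -> {tei_backend_for(instance_type)}")
```

Note that the rule is a simple prefix match, so it treats any instance type outside the `ml.g` and `ml.p` families as CPU.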

3. Deploy Snowflake Arctic to Amazon SageMaker

To deploy Snowflake/snowflake-arctic-embed-m to Amazon SageMaker, we create a HuggingFaceModel class and define our endpoint configuration, including the HF_MODEL_ID, instance_type, etc. We will use an ml.c6i.2xlarge instance type, which has 4 Intel Ice Lake vCPUs and 8GB of memory and costs around $0.204 per hour.

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.c6i.2xlarge"  # CPU instance described in the text; use "ml.g5.xlarge" for a GPU endpoint

# Define Model and Endpoint configuration parameter
config = {
    "HF_MODEL_ID": "Snowflake/snowflake-arctic-embed-m",  # model_id from hf.co/models
}

# create HuggingFaceModel with the image uri
emb_model = HuggingFaceModel(role=role, image_uri=get_image_uri(instance_type), env=config)

After creating the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.c6i.2xlarge instance type.

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
emb = emb_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
)

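SageMaker will now create our endpoint and deploy the model to it. This can take around 5 minutes.

4. Run and evaluate inference performance

After our endpoint is deployed, we can run inference on it using the predict method of the predictor.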

data = {
    "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = emb.predict(data=data)


# print some results
print(f"length of embeddings: {len(res[0])}")
print(f"first 10 elements of embeddings: {res[0][:10]}")
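Embeddings like the one returned above are typically compared with cosine similarity, for example to rank documents against a query. The following is a minimal sketch in plain Python. The vectors are made-up 4-dimensional stand-ins for `res[0]` of different inputs, and `cosine_similarity` is a helper written for this illustration rather than part of the SageMaker SDK.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical vectors standing in for the embeddings of a query and two documents.
query = [0.1, 0.3, -0.2, 0.9]
doc_close = [0.1, 0.25, -0.1, 0.8]
doc_far = [-0.7, 0.1, 0.6, -0.2]

print(f"query vs. close document: {cosine_similarity(query, doc_close):.3f}")
print(f"query vs. far document:   {cosine_similarity(query, doc_far):.3f}")
```

In practice you would pass the vectors returned by the endpoint instead of the hard-coded lists.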

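Awesome! We can now generate embeddings with our model. Let's test its performance.

We will send 3,900 requests to our endpoint using 10 concurrent threads and measure the average latency and throughput of the endpoint. Each request carries an input of 256 tokens, for a total of about 1 million tokens. We chose 256 tokens as the input length to strike a balance between shorter and longer inputs.

Note: When running the load test, the requests were sent from Europe and the endpoint was deployed in us-east-1. This adds network overhead.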

import threading
import time

number_of_threads = 10
number_of_requests = int(3900 // number_of_threads)
print(f"number of threads: {number_of_threads}")
print(f"number of requests per thread: {number_of_requests}")


def send_requests():
    for _ in range(number_of_requests):
        # input counted at https://huggingface.co/spaces/Xenova/the-tokenizer-playground for 256 tokens
        emb.predict(
            data={
                "inputs": "Hugging Face is a company and a popular platform in the field of natural language processing (NLP) and machine learning. They are known for their contributions to the development of state-of-the-art models for various NLP tasks and for providing a platform that facilitates the sharing and usage of pre-trained models. One of the key offerings from Hugging Face is the Transformers library, which is an open-source library for working with a variety of pre-trained transformer models, including those for text generation, translation, summarization, question answering, and more. The library is widely used in the research and development of NLP applications and is supported by a large and active community. Hugging Face also provides a model hub where users can discover, share, and download pre-trained models. Additionally, they offer tools and frameworks to make it easier for developers to integrate and use these models in their own projects. The company has played a significant role in advancing the field of NLP and making cutting-edge models more accessible to the broader community. Hugging Face also provides a model hub where users can discover, share, and download pre-trained models. Additionally, they offer tools and frameworks to make it easier for developers and ma"
            }
        )


# Create multiple threads
threads = [threading.Thread(target=send_requests) for _ in range(number_of_threads)]
# start all threads and time the whole run
start = time.time()
for t in threads:
    t.start()
# wait for all threads to finish
for t in threads:
    t.join()
print(f"total time: {round(time.time() - start)} seconds")
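The load test above only reports the total wall-clock time. If you also want client-side latency per request, one possible variation is sketched below using the standard library. It is an illustration under stated assumptions, not the method used for the numbers in this post: `fake_predict` is a stand-in that sleeps for 10 ms so the example runs without an endpoint, and `timed_call` and `run_load_test` are hypothetical helper names.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def timed_call(send: Callable[[], object]) -> float:
    """Run one request and return its duration in seconds."""
    start = time.perf_counter()
    send()
    return time.perf_counter() - start


def run_load_test(send: Callable[[], object], total_requests: int, concurrency: int) -> List[float]:
    """Send total_requests calls using a pool of `concurrency` threads and collect latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed_call, send) for _ in range(total_requests)]
        return [f.result() for f in futures]


def fake_predict() -> None:
    # Stand-in for emb.predict(data=...); sleeps to simulate a 10 ms request.
    time.sleep(0.01)


latencies = run_load_test(fake_predict, total_requests=50, concurrency=10)
p95 = statistics.quantiles(latencies, n=20)[-1]  # last of 19 cut points = 95th percentile
print(f"mean: {statistics.mean(latencies) * 1000:.1f} ms, p95: {p95 * 1000:.1f} ms")
```

To measure the real endpoint, you could pass `lambda: emb.predict(data=data)` instead of `fake_predict`. Keep in mind that client-side latency includes the network round trip, unlike the ModelLatency metric in CloudWatch.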

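Sending 3,900 requests, or embedding about 1 million tokens, took about 841 seconds. This means we can run around 5 requests per second. Keep in mind that this includes the network latency from Europe to us-east-1. When we inspect the latency of the endpoint through CloudWatch, we can see that our embedding model has a latency of 2 seconds at 10 concurrent requests. That is quite impressive for a small and old CPU instance that costs about $150 per month. You can deploy the model to a GPU instance to get faster inference times.

Note: We ran the same test on an ml.g5.xlarge with 1 NVIDIA A10G GPU. Embedding 1 million tokens took around 30 seconds, which means we can run around 130 requests per second. The latency of the endpoint was 4 ms at 10 concurrent requests. The ml.g5.xlarge costs around $1.408 per hour on Amazon SageMaker.

GPU instances are much faster than CPU instances, but they are also more expensive. If you want to bulk process embeddings, you can use a GPU instance. If you want to run a small endpoint at low cost, you can use a CPU instance. We plan to run a dedicated benchmark for the Hugging Face Embedding DLC in the future.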
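The arithmetic behind these figures can be reproduced with a few lines of Python. The sketch below uses only the numbers quoted in the text. The 730 hours per month is an assumption made here to relate the hourly price to the monthly figure, and `summarize` is a helper introduced for this illustration.

```python
REQUESTS = 3900
TOKENS_PER_REQUEST = 256
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month

# Run time and hourly price for each instance, as quoted in the text.
runs = {
    "ml.c6i.2xlarge (CPU)": {"seconds": 841, "usd_per_hour": 0.204},
    "ml.g5.xlarge (GPU)": {"seconds": 30, "usd_per_hour": 1.408},
}

total_tokens = REQUESTS * TOKENS_PER_REQUEST  # 998,400, i.e. roughly 1 million


def summarize(seconds: float, usd_per_hour: float) -> dict:
    """Derive throughput and cost figures from run time and hourly price."""
    return {
        "requests_per_second": REQUESTS / seconds,
        "tokens_per_second": total_tokens / seconds,
        "usd_for_run": usd_per_hour * seconds / 3600,
        "usd_per_month": usd_per_hour * HOURS_PER_MONTH,
    }


summary = {name: summarize(**run) for name, run in runs.items()}
for name, s in summary.items():
    print(
        f"{name}: {s['requests_per_second']:.1f} req/s, "
        f"{s['tokens_per_second']:.0f} tokens/s, "
        f"${s['usd_for_run']:.4f} for the run, "
        f"${s['usd_per_month']:.0f} per month if always on"
    )
```

Under these figures, the GPU instance costs less per embedded token while it is fully busy, and the CPU instance costs less to keep running around the clock, which matches the recommendation above. Treat this as a rough comparison, since the measured times include network overhead.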

print(
    f"https://console.aws.amazon.com/cloudwatch/home?region={sess.boto_region_name}#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'ModelLatency~'EndpointName~'{emb.endpoint_name}~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~region~'{sess.boto_region_name}~start~'-PT5M~end~'P0D~stat~'Average~period~30);query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20{emb.endpoint_name}"
)

5. Delete the model and endpoint

To clean up, we can delete the model and the endpoint.

emb.delete_model()
emb.delete_endpoint()

📍 Find the complete example on GitHub.
