Deploy models to Amazon SageMaker

Deploying a 🤗 Transformers model in SageMaker for inference is straightforward:

from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class and deploy it as SageMaker endpoint
huggingface_model = HuggingFaceModel(...).deploy()

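This guide shows you how to deploy models with zero code using the Inference Toolkit. The Inference Toolkit builds on the pipeline feature of 🤗 Transformers. Learn how to:

Install and set up the Inference Toolkit.
Deploy a 🤗 Transformers model trained in SageMaker.
Deploy a 🤗 Transformers model from the Hugging Face [model hub](https://huggingface.co/models).
Run a batch transform job using 🤗 Transformers and Amazon SageMaker.
Create a custom inference module.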

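Installation and setup

Before deploying a 🤗 Transformers model to SageMaker, you need to sign up for an AWS account. If you don't have an AWS account yet, learn more here.

Once you have an AWS account, get started using one of the following approaches.

To start training locally, you need to set up an appropriate IAM role.

Upgrade to the latest sagemaker version.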

pip install sagemaker --upgrade

SageMaker environment

Set up your SageMaker environment as shown below:

import sagemaker
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

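Note: The execution role is only available when running a notebook within SageMaker. If you run get_execution_role in a notebook outside SageMaker, you will get a region error.

Local environment

Set up your local environment as shown below: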

import sagemaker
import boto3

iam_client = boto3.client('iam')
role = iam_client.get_role(RoleName='role-name-of-your-iam-role-with-right-permissions')['Role']['Arn']
sess = sagemaker.Session()

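Deploy a 🤗 Transformers model trained in SageMaker

There are two ways to deploy a Hugging Face model that you trained in SageMaker:

Deploy it right after training finishes.
Deploy your saved model from S3 later using model_data.

📓 Open the deploy_transformer_model_from_s3.ipynb notebook for an example of how to deploy a model from S3 to SageMaker for inference.

Deploy after training

To deploy your model directly after training, make sure all required files, including the tokenizer and the model, are saved in your training script.

If you use the Hugging Face Trainer, you can pass your tokenizer as an argument to the Trainer. It will be saved automatically when you call trainer.save_model().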

from sagemaker.huggingface import HuggingFace

############ pseudo code start ############

# create Hugging Face Estimator for training
huggingface_estimator = HuggingFace(....)

# start the train job with our uploaded datasets as input
huggingface_estimator.fit(...)

############ pseudo code end ############

# deploy model to SageMaker Inference
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

# example request: you always need to define "inputs"
data = {
   "inputs": "Camera - You are awarded a SiPix Digital Camera! call 09061221066 fromm landline. Delivery within 28 days."
}

# request
predictor.predict(data)

After running the request, you can delete the endpoint as shown:

# delete endpoint
predictor.delete_endpoint()

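Deploy with model_data

If you have already trained your model and want to deploy it later, use the model_data argument to specify the location of your tokenizer and model weights.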

from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data="s3://models/my-bert-model/model.tar.gz",  # path to your trained SageMaker model
   role=role,                                            # IAM role with permissions to create an endpoint
   transformers_version="4.26",                           # Transformers version used
   pytorch_version="1.13",                                # PyTorch version used
   py_version='py39',                                    # Python version used
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)

# example request: you always need to define "inputs"
data = {
   "inputs": "Camera - You are awarded a SiPix Digital Camera! call 09061221066 fromm landline. Delivery within 28 days."
}

# request
predictor.predict(data)

After running the request, you can delete the endpoint again with:

# delete endpoint
predictor.delete_endpoint()

Create a model artifact for deployment

For later deployment, you can create a model.tar.gz file that contains all the required files, such as:

pytorch_model.bin
tf_model.h5
tokenizer.json
tokenizer_config.json

For example, your file should look like this:

model.tar.gz/
|- pytorch_model.bin
|- vocab.txt
|- tokenizer_config.json
|- config.json
|- special_tokens_map.json
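
The layout above places every file at the root of the archive rather than under a parent folder. As a minimal sketch of how that could be produced from Python with the standard library, the hypothetical helper below (build_model_archive is not part of any SDK) packs a directory that way and lists the members it wrote. The placeholder files are empty stand-ins for real model files.

```python
import tarfile
import tempfile
from pathlib import Path


def build_model_archive(model_dir: str, archive_path: str) -> list[str]:
    """Pack every file in model_dir at the root of a gzip-compressed tar archive."""
    model_root = Path(model_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(model_root.rglob("*")):
            if path.is_file():
                # arcname is relative to model_dir, so files land at the archive
                # root rather than under a parent folder.
                tar.add(path, arcname=path.relative_to(model_root).as_posix())
    # Re-open the archive and report what it contains.
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demo with empty placeholder files (hypothetical stand-ins for real weights).
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "my-model"
    model_dir.mkdir()
    for name in ("config.json", "pytorch_model.bin", "tokenizer_config.json"):
        (model_dir / name).touch()
    archive_members = build_model_archive(str(model_dir), str(Path(tmp) / "model.tar.gz"))
    print(archive_members)
```

Writing the archive outside the model directory avoids packing the archive into itself.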

Create your own model.tar.gz from a model on the 🤗 Hub:

Download a model:

git lfs install
git clone git@hf.co:{repository}

Create a tar file:

cd {repository}
tar zcvf model.tar.gz *

Upload model.tar.gz to S3:

aws s3 cp model.tar.gz s3://{my-s3-path}

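Now you can provide the S3 URI to the model_data argument to deploy your model later.

Deploy a model from the 🤗 Hub

To deploy a model directly from the 🤗 Hub to SageMaker, define two environment variables when you create a HuggingFaceModel:

HF_MODEL_ID defines the model ID, which is automatically loaded from huggingface.co/models when you create a SageMaker endpoint. This environment variable gives access to more than 10,000 models on the 🤗 Hub.
HF_TASK defines the task for the 🤗 Transformers pipeline. A complete list of tasks can be found here.

⚠️ Pipelines are not optimized for parallelism (multi-threading) and tend to consume a lot of RAM. For example, on a GPU-based instance, the pipeline runs on a single vCPU. When this vCPU becomes saturated with preprocessing of inference requests, it can create a bottleneck that prevents the GPU from being fully utilized for model inference. Learn more here.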

from sagemaker.huggingface.model import HuggingFaceModel

# Hub model configuration <https://huggingface.co/models>
hub = {
  'HF_MODEL_ID':'distilbert-base-uncased-distilled-squad', # model_id from hf.co/models
  'HF_TASK':'question-answering'                           # NLP task you want to use for predictions
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=hub,                                                # configuration for loading model from Hub
   role=role,                                              # IAM role with permissions to create an endpoint
   transformers_version="4.26",                             # Transformers version used
   pytorch_version="1.13",                                  # PyTorch version used
   py_version='py39',                                      # Python version used
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)

# example request: you always need to define "inputs"
data = {
   "inputs": {
      "question": "What is used for inference?",
      "context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
   }
}

# request
predictor.predict(data)

After running the request, you can delete the endpoint again with:

# delete endpoint
predictor.delete_endpoint()

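📓 Open the deploy_transformer_model_from_hf_hub.ipynb notebook for an example of how to deploy a model from the 🤗 Hub to SageMaker for inference.

Run batch transform with 🤗 Transformers and SageMaker

After training a model, you can use SageMaker batch transform to perform inference with it. Batch transform accepts your inference data as an S3 URI, and SageMaker then takes care of downloading the data, running the predictions, and uploading the results to S3. For more details about batch transform, see here.

⚠️ Due to the complex structure of text data, the Hugging Face Inference DLC currently supports only the .jsonl format for batch transform.

Note: Make sure your inputs fit the model's max_length during preprocessing.

If you trained your model with the Hugging Face Estimator, call the transformer() method to create a transform job for a model based on that training job (see here for more details):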

batch_job = huggingface_estimator.transformer(
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    strategy='SingleRecord')


batch_job.transform(
    data='s3://s3-uri-to-batch-data',
    content_type='application/json',    
    split_type='Line')

If you want to run your batch transform job later, or with a model from the 🤗 Hub, create a HuggingFaceModel instance and then call the transformer() method:

from sagemaker.huggingface.model import HuggingFaceModel

# Hub model configuration <https://huggingface.co/models>
hub = {
	'HF_MODEL_ID':'distilbert/distilbert-base-uncased-finetuned-sst-2-english',
	'HF_TASK':'text-classification'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=hub,                                                # configuration for loading model from Hub
   role=role,                                              # IAM role with permissions to create an endpoint
   transformers_version="4.26",                             # Transformers version used
   pytorch_version="1.13",                                  # PyTorch version used
   py_version='py39',                                      # Python version used
)

# create transformer to run a batch job
batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    strategy='SingleRecord'
)

# starts batch transform job and uses S3 data as input
batch_job.transform(
    data='s3://sagemaker-s3-demo-test/samples/input.jsonl',
    content_type='application/json',    
    split_type='Line'
)

The input.jsonl file looks like this:

{"inputs":"this movie is terrible"}
{"inputs":"this movie is amazing"}
{"inputs":"SageMaker is pretty cool"}
{"inputs":"SageMaker is pretty cool"}
{"inputs":"this movie is terrible"}
{"inputs":"this movie is amazing"}
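
A file in this shape can be generated from a list of texts with the standard library. The sketch below is illustrative only: to_jsonl is a hypothetical helper, and uploading the result to S3 is left out. It writes one JSON object per line and then checks that each line parses on its own, which is what a line-based split (split_type='Line') relies on.

```python
import json


def to_jsonl(texts):
    """Serialize each text as one {"inputs": ...} JSON object per line."""
    return "\n".join(json.dumps({"inputs": text}) for text in texts) + "\n"


samples = ["this movie is terrible", "this movie is amazing", "SageMaker is pretty cool"]
jsonl_payload = to_jsonl(samples)

# Every line must parse on its own, because the file is split line by line.
records = [json.loads(line) for line in jsonl_payload.splitlines()]
print(records[0])
```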

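📓 Open the sagemaker-notebook.ipynb notebook for an example of how to run a batch transform job for inference.

Deploy an LLM to SageMaker with TGI

If you are interested in using a high-performance serving container for LLMs, you can use the Hugging Face TGI container. It utilizes the Text Generation Inference library. A list of compatible models can be found here.

First, make sure the latest version of the SageMaker SDK is installed:

pip install "sagemaker>=2.231.0"

Then, we import the SageMaker Python SDK and instantiate a sagemaker_session to find the current region and execution role.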

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

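Next, we retrieve the LLM image URI. We use the helper function get_huggingface_llm_image_uri() to generate the appropriate image URI for Hugging Face Large Language Model (LLM) inference. The function takes a required parameter backend and several optional parameters. The backend specifies the type of backend to use for the model: "huggingface" refers to the Hugging Face TGI backend.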

image_uri = get_huggingface_llm_image_uri(
  backend="huggingface",
  region=region
)

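Now that we have the image URI, the next step is to configure the model object. We specify a unique name, the image_uri for the managed TGI container, and the execution role for the endpoint. Additionally, we specify several environment variables, including HF_MODEL_ID, which corresponds to the model from the Hugging Face Hub that will be deployed, and HF_TASK, which configures the inference task the model performs.

You should also define SM_NUM_GPUS, which specifies the tensor parallelism degree of the model. Tensor parallelism can be used to split the model across multiple GPUs, which is necessary when working with LLMs that are too big for a single GPU. To learn more about tensor parallelism with inference, see our previous blog post. Here, you should set SM_NUM_GPUS to the number of available GPUs on your selected instance type. For example, in this tutorial we set SM_NUM_GPUS to 1 because our selected instance type, ml.g5.2xlarge, has 1 available GPU.

Note that you can optionally reduce the memory and compute footprint of the model by setting the HF_MODEL_QUANTIZE environment variable to true, but this lower weight precision could affect the quality of the output for some models.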
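
One way to keep SM_NUM_GPUS in step with the instance type is to derive it from a lookup table. The sketch below assumes a small hand-written table (GPUS_PER_INSTANCE) and a hypothetical helper (build_tgi_env). Neither is provided by the SageMaker SDK, and the GPU counts should be checked against the AWS instance documentation before use.

```python
# Hand-written GPU counts for a few instance types; an illustrative table,
# not something the SageMaker SDK provides. Verify against the AWS documentation.
GPUS_PER_INSTANCE = {
    "ml.g5.2xlarge": 1,
    "ml.g4dn.12xlarge": 4,
    "ml.g5.12xlarge": 4,
}


def build_tgi_env(model_id: str, instance_type: str) -> dict:
    """Return container environment variables with SM_NUM_GPUS matched to the instance."""
    if instance_type not in GPUS_PER_INSTANCE:
        raise ValueError(f"GPU count unknown for {instance_type}; add it to the table")
    return {
        "HF_MODEL_ID": model_id,
        # Environment variable values must be strings.
        "SM_NUM_GPUS": str(GPUS_PER_INSTANCE[instance_type]),
    }


tgi_env = build_tgi_env("meta-llama/Llama-3.1-8B-Instruct", "ml.g5.2xlarge")
print(tgi_env)
```

Raising on an unknown instance type is a deliberate choice here: a silently wrong GPU count would only surface later, when the container starts.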

model_name = "llama-3-1-8b-instruct" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    'HF_MODEL_ID':'meta-llama/Llama-3.1-8B-Instruct',
    'SM_NUM_GPUS':'1',
    'HUGGING_FACE_HUB_TOKEN': '<REPLACE WITH YOUR TOKEN>',
}

assert hub['HUGGING_FACE_HUB_TOKEN'] != '<REPLACE WITH YOUR TOKEN>', "You have to provide a token."


model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri
)

Next, we call the deploy method to deploy the model.

predictor = model.deploy(
  initial_instance_count=1,
  instance_type="ml.g5.2xlarge",
  endpoint_name=model_name
)

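Once the model is deployed, we can invoke it to generate text. We pass an input prompt and run the predict method to generate a text response from the LLM running in the TGI container.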

input_data = {
  "inputs": "The diamondback terrapin was the first reptile to",
  "parameters": {
    "do_sample": True,
    "max_new_tokens": 100,
    "temperature": 0.7,
    "watermark": True
  }
}

predictor.predict(input_data)

We receive the following auto-generated text response:

[{'generated_text': 'The diamondback terrapin was the first reptile to make the list, followed by the American alligator, the American crocodile, and the American box turtle. The polecat, a ferret-like animal, and the skunk rounded out the list, both having gained their slots because they have proven to be particularly dangerous to humans.\n\nCalifornians also seemed to appreciate the new list, judging by the comments left after the election.\n\n“This is fantastic,” one commenter declared.\n\n“California is a very'}]
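
The response is a list containing one dictionary with a generated_text key, and in the output above the text begins with the prompt. As a small sketch of post-processing, the hypothetical helper extract_completion returns only the newly generated part. The sample response is a shortened, hand-written stand-in in the same shape, not real model output.

```python
def extract_completion(response, prompt):
    """Return the generated text with the echoed prompt removed, if present."""
    generated = response[0]["generated_text"]
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    return generated


# Shortened, hand-written response in the same shape as the one above.
sample_response = [{"generated_text": "The diamondback terrapin was the first reptile to make the list."}]
completion = extract_completion(sample_response, "The diamondback terrapin was the first reptile to")
print(completion)
```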

Once we are done experimenting, we delete the endpoint and the model resources.

predictor.delete_model()
predictor.delete_endpoint()

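User-defined code and modules

The Hugging Face Inference Toolkit lets users override the default methods of HuggingFaceHandlerService. You need to create a folder named code/ with an inference.py file in it. See here for more details on how to archive your model artifacts. For example: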

model.tar.gz/
|- pytorch_model.bin
|- ....
|- code/
  |- inference.py
  |- requirements.txt

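The inference.py file contains your custom inference module, and the requirements.txt file contains additional dependencies that should be added. The custom module can override the following methods:

model_fn(model_dir) overrides the default method for loading the model. The return value model will be used in predict for prediction. model_fn receives the argument model_dir, the path to your unzipped model.tar.gz.
transform_fn(model, data, content_type, accept_type) overrides the default transform function with your custom implementation. You need to implement your own preprocess, predict, and postprocess steps in transform_fn. This method cannot be combined with input_fn, predict_fn, or output_fn mentioned below.
input_fn(input_data, content_type) overrides the default method for preprocessing. The return value data will be used in predict for prediction. The inputs are:
- input_data is the raw body of your request.
- content_type is the content type from the request header.
predict_fn(processed_data, model) overrides the default method for prediction. The return value predictions will be used in postprocess. The input is processed_data, the result of preprocess.
output_fn(prediction, accept) overrides the default method for postprocessing. The return value result will be the response to your request (for example, JSON). The inputs are:
- prediction is the result of predict.
- accept is the return accept type from the HTTP request, for example application/json.

Here is an example of a custom inference module with model_fn, input_fn, predict_fn, and output_fn: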

from sagemaker_huggingface_inference_toolkit import decoder_encoder

def model_fn(model_dir):
    # implement custom code to load the model
    loaded_model = ...
    
    return loaded_model 

def input_fn(input_data, content_type):
    # decode the input data  (e.g. JSON string -> dict)
    data = decoder_encoder.decode(input_data, content_type)
    return data

def predict_fn(data, model):
    # call your custom model with the data
    predictions = model(data, ...)
    return predictions

def output_fn(prediction, accept):
    # convert the model output to the desired output format (e.g. dict -> JSON string)
    response = decoder_encoder.encode(prediction, accept)
    return response
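
Before packaging the module, it can help to exercise the four hooks locally. The sketch below is a dry run under stated assumptions: the standard json module stands in for decoder_encoder, the model is a trivial placeholder function, and handle is a hypothetical driver that chains the hooks in the order load, preprocess, predict, postprocess. It is not the toolkit's actual dispatch code.

```python
import json


def model_fn(model_dir):
    # Hypothetical stand-in model: uppercases the text instead of loading weights.
    return lambda data: {"label": data["inputs"].upper()}


def input_fn(input_data, content_type):
    # json stands in for decoder_encoder.decode in this local sketch.
    if content_type != "application/json":
        raise ValueError(f"unsupported content type: {content_type}")
    return json.loads(input_data)


def predict_fn(data, model):
    return model(data)


def output_fn(prediction, accept):
    # json stands in for decoder_encoder.encode in this local sketch.
    if accept != "application/json":
        raise ValueError(f"unsupported accept type: {accept}")
    return json.dumps(prediction)


def handle(body, content_type="application/json", accept="application/json", model_dir="."):
    """Chain the hooks: load, preprocess, predict, postprocess."""
    model = model_fn(model_dir)
    data = input_fn(body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, accept)


result = handle('{"inputs": "hello"}')
print(result)
```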

Customize your inference module with only model_fn and transform_fn:

from sagemaker_huggingface_inference_toolkit import decoder_encoder

def model_fn(model_dir):
    # implement custom code to load the model
    loaded_model = ...
    
    return loaded_model 

def transform_fn(model, input_data, content_type, accept):
    # decode the input data (e.g. JSON string -> dict)
    data = decoder_encoder.decode(input_data, content_type)

    # call your custom model with the data
    outputs = model(data, ...)

    # convert the model output to the desired output format (e.g. dict -> JSON string)
    response = decoder_encoder.encode(outputs, accept)

    return response

