Deploy models to Amazon SageMaker

Deploying a 🤗 Transformers model in SageMaker for inference is as easy as:

from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class and deploy it as SageMaker endpoint
huggingface_model = HuggingFaceModel(...).deploy()

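This guide shows you how to deploy models with zero code using the Inference Toolkit. The Inference Toolkit builds on top of the pipeline feature from 🤗 Transformers. Learn how to:

Install and set up the Inference Toolkit.
Deploy a 🤗 Transformers model trained in SageMaker.
Deploy a 🤗 Transformers model from the Hugging Face [model Hub](https://huggingface.co/models).
Run a batch transform job using 🤗 Transformers and Amazon SageMaker.
Create a custom inference module.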

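Installation and setup

Before deploying a 🤗 Transformers model to SageMaker, you need to sign up for an AWS account. If you don't have an AWS account yet, create one first.

Once you have an AWS account, get started in one of the following environments: a SageMaker environment or a local environment.

To start locally, you need to set up an appropriate IAM role.

Upgrade to the latest sagemaker version.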

pip install sagemaker --upgrade

SageMaker environment

Set up your SageMaker environment as shown below:

import sagemaker
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

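Note: The execution role is only available when running a notebook within SageMaker. If you run get_execution_role in a notebook that is not on SageMaker, expect a region error.

Local environment

Set up your local environment as shown below: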

import sagemaker
import boto3

iam_client = boto3.client('iam')
role = iam_client.get_role(RoleName='role-name-of-your-iam-role-with-right-permissions')['Role']['Arn']
sess = sagemaker.Session()

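Deploy a 🤗 Transformers model trained in SageMaker

There are two ways to deploy your Hugging Face model trained in SageMaker:

Deploy it right after training finishes.
Deploy your saved model later from S3 using model_data.

📓 Open the deploy_transformer_model_from_s3.ipynb notebook for an example of how to deploy a model from S3 to SageMaker for inference.

Deploy after training

To deploy your model directly after training, make sure all required files, including the tokenizer and the model, are saved in your training script.

If you use the Hugging Face Trainer, you can pass your tokenizer as an argument to the Trainer. It is then saved automatically when you call trainer.save_model().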

from sagemaker.huggingface import HuggingFace

############ pseudo code start ############

# create Hugging Face Estimator for training
huggingface_estimator = HuggingFace(....)

# start the train job with our uploaded datasets as input
huggingface_estimator.fit(...)

############ pseudo code end ############

# deploy model to SageMaker Inference
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

# example request: you always need to define "inputs"
data = {
   "inputs": "Camera - You are awarded a SiPix Digital Camera! call 09061221066 from landline. Delivery within 28 days."
}

# request
predictor.predict(data)

After you run your request, you can delete the endpoint as shown:

# delete endpoint
predictor.delete_endpoint()

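Deploy with model_data

If you have already trained your model and want to deploy it later, use the model_data argument to specify the location of the model.tar.gz archive that contains your tokenizer and model weights.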

from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data="s3://models/my-bert-model/model.tar.gz",  # path to your trained SageMaker model
   role=role,                                            # IAM role with permissions to create an endpoint
   transformers_version="4.26",                           # Transformers version used
   pytorch_version="1.13",                                # PyTorch version used
   py_version='py39',                                    # Python version used
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)

# example request: you always need to define "inputs"
data = {
   "inputs": "Camera - You are awarded a SiPix Digital Camera! call 09061221066 from landline. Delivery within 28 days."
}

# request
predictor.predict(data)

After you run your request, you can delete the endpoint again with:

# delete endpoint
predictor.delete_endpoint()

Create a model artifact for deployment

For later deployment, you can create a model.tar.gz file that contains all the required files, such as:

pytorch_model.bin
tf_model.h5
tokenizer.json
tokenizer_config.json

For example, your file should look like this:

model.tar.gz/
|- pytorch_model.bin
|- vocab.txt
|- tokenizer_config.json
|- config.json
|- special_tokens_map.json

Create your own model.tar.gz from a model on the 🤗 Hub

Download a model:

git lfs install
git clone git@hf.co:{repository}

Create a tar file:

cd {repository}
tar zcvf model.tar.gz *

Upload model.tar.gz to S3:

aws s3 cp model.tar.gz s3://{my-s3-path}

Now you can provide the S3 URI to the model_data argument to deploy your model later.
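The archiving step can also be scripted. The following is a minimal sketch that uses only Python's standard tarfile module. The helper name make_model_archive is an illustrative assumption and not part of the SageMaker SDK. It stores each entry of the model directory at the root of the archive and skips hidden entries such as .git, which mirrors running `tar zcvf model.tar.gz *` inside the directory.

```python
import tarfile
from pathlib import Path


def make_model_archive(model_dir, output_path="model.tar.gz"):
    """Pack the contents of model_dir into a gzip-compressed tar archive.

    Hypothetical helper: entries are stored at the archive root (no parent
    folder), and hidden entries (for example .git) are skipped.
    """
    model_dir = Path(model_dir)
    output_path = Path(output_path)
    with tarfile.open(output_path, "w:gz") as archive:
        for entry in sorted(model_dir.iterdir()):
            # Mirror the shell glob `*`, which does not match dotfiles.
            if entry.name.startswith("."):
                continue
            # Do not add the archive to itself if it lives in model_dir.
            if entry.resolve() == output_path.resolve():
                continue
            # arcname drops the parent path so the entry sits at the root.
            archive.add(entry, arcname=entry.name)
    return output_path
```

Calling make_model_archive("my-model-directory") writes model.tar.gz into the current working directory; the directory name here is a placeholder. The resulting file can then be uploaded with the aws s3 cp command shown earlier.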

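Deploy a model from the 🤗 Hub

To deploy a model directly from the 🤗 Hub to SageMaker, define two environment variables when you create a HuggingFaceModel:

HF_MODEL_ID defines the model ID, which is automatically loaded from huggingface.co/models when you create a SageMaker endpoint. Access 10,000+ models on the 🤗 Hub through this environment variable.
HF_TASK defines the task for the 🤗 Transformers pipeline. The complete list of tasks is given in the 🤗 Transformers pipeline documentation.

⚠️ Pipelines are not optimized for parallelism (multi-threading) and tend to consume a lot of RAM. For example, on a GPU-based instance, the pipeline operates on a single vCPU. When this vCPU becomes saturated with preprocessing of inference requests, it can create a bottleneck that prevents the GPU from being fully utilized for model inference.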

from sagemaker.huggingface.model import HuggingFaceModel

# Hub model configuration <https://huggingface.co/models>
hub = {
  'HF_MODEL_ID':'distilbert-base-uncased-distilled-squad', # model_id from hf.co/models
  'HF_TASK':'question-answering'                           # NLP task you want to use for predictions
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=hub,                                                # configuration for loading model from Hub
   role=role,                                              # IAM role with permissions to create an endpoint
   transformers_version="4.26",                             # Transformers version used
   pytorch_version="1.13",                                  # PyTorch version used
   py_version='py39',                                      # Python version used
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)

# example request: you always need to define "inputs"
data = {
    "inputs": {
        "question": "What is used for inference?",
        "context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
    }
}

# request
predictor.predict(data)

After you run your request, you can delete the endpoint again with:

# delete endpoint
predictor.delete_endpoint()

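📓 Open the deploy_transformer_model_from_hf_hub.ipynb notebook for an example of how to deploy a model from the 🤗 Hub to SageMaker for inference.

Run batch transform with 🤗 Transformers and SageMaker

After training a model, you can use SageMaker batch transform to perform inference with it. Batch transform accepts your inference data as an S3 URI, and SageMaker then takes care of downloading the data, running the predictions, and uploading the results to S3. For more details, see the SageMaker batch transform documentation.

⚠️ The Hugging Face Inference DLC currently only supports .jsonl for batch transform, because of the complex structure of textual data.

Note: Make sure your inputs fit the max_length of the model during preprocessing.

If you trained a model using the Hugging Face Estimator, call the transformer() method to create a transform job for a model based on the training job (see the SageMaker Python SDK documentation for more details):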

batch_job = huggingface_estimator.transformer(
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    strategy='SingleRecord')


batch_job.transform(
    data='s3://s3-uri-to-batch-data',
    content_type='application/json',    
    split_type='Line')

If you want to run your batch transform job later or with a model from the 🤗 Hub, create a HuggingFaceModel instance and then call the transformer() method:

from sagemaker.huggingface.model import HuggingFaceModel

# Hub model configuration <https://huggingface.co/models>
hub = {
    'HF_MODEL_ID':'distilbert/distilbert-base-uncased-finetuned-sst-2-english',
    'HF_TASK':'text-classification'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=hub,                                                # configuration for loading model from Hub
   role=role,                                              # IAM role with permissions to create an endpoint
   transformers_version="4.26",                             # Transformers version used
   pytorch_version="1.13",                                  # PyTorch version used
   py_version='py39',                                      # Python version used
)

# create transformer to run a batch job
batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    strategy='SingleRecord'
)

# starts batch transform job and uses S3 data as input
batch_job.transform(
    data='s3://sagemaker-s3-demo-test/samples/input.jsonl',
    content_type='application/json',    
    split_type='Line'
)

The input.jsonl looks like this:

{"inputs":"this movie is terrible"}
{"inputs":"this movie is amazing"}
{"inputs":"SageMaker is pretty cool"}
{"inputs":"SageMaker is pretty cool"}
{"inputs":"this movie is terrible"}
{"inputs":"this movie is amazing"}
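A file in this format can be produced from a list of texts with a few lines of Python. The sketch below uses only the standard json module; the helper name write_jsonl is a hypothetical choice, and uploading the file to S3 is left out.

```python
import json


def write_jsonl(texts, path):
    """Write one {"inputs": ...} JSON object per line (JSON Lines).

    Hypothetical helper: each text becomes a separate record, so that
    batch transform can split the file line by line.
    """
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes newlines inside the text, so every
            # record stays on a single line.
            f.write(json.dumps({"inputs": text}, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    write_jsonl(
        ["this movie is terrible", "this movie is amazing"],
        "input.jsonl",
    )
```

Writing one JSON object per line is what allows split_type='Line' to hand each record to the model separately.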

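📓 Open the sagemaker-notebook.ipynb notebook for an example of how to run a batch transform job for inference.

Deploy an LLM to SageMaker using TGI

If you are interested in a high-performance serving container for LLMs, you can use the Hugging Face TGI container. It uses the Text Generation Inference library. The list of compatible models is given in the Text Generation Inference documentation.

First, make sure that a recent version of the SageMaker SDK is installed:

pip install "sagemaker>=2.231.0"

Then, we import the SageMaker Python SDK and instantiate a sagemaker_session to find the current region and execution role.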

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

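Next, we retrieve the LLM image URI. The helper function get_huggingface_llm_image_uri() generates the appropriate image URI for Hugging Face large language model (LLM) inference. It takes one required parameter, backend, and several optional parameters. backend specifies the type of backend to use for the model: "huggingface" refers to the Hugging Face TGI backend.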

image_uri = get_huggingface_llm_image_uri(
  backend="huggingface",
  region=region
)

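Now that we have the image URI, the next step is to configure the model object. We specify a unique name, the image_uri of the managed TGI container, and the execution role for the endpoint. We also specify a number of environment variables, including HF_MODEL_ID, which corresponds to the model on the Hugging Face Hub that will be deployed, and HF_TASK, which configures the inference task to be performed by the model. The example below sets only HF_MODEL_ID and SM_NUM_GPUS, plus an access token in HUGGING_FACE_HUB_TOKEN.

You should also define SM_NUM_GPUS, which specifies the tensor parallelism degree of the model. Tensor parallelism can be used to split the model across multiple GPUs, which is necessary when working with LLMs that are too big for a single GPU. To learn more about tensor parallelism for inference, see our earlier blog post. Set SM_NUM_GPUS to the number of GPUs available on your selected instance type. For example, in this tutorial we set SM_NUM_GPUS to 1 because the selected instance type, ml.g5.2xlarge, has a single GPU; on ml.g4dn.12xlarge, which has 4 GPUs, you would set it to 4.

Note that you can optionally reduce the memory and compute footprint of the model by setting the HF_MODEL_QUANTIZE environment variable to true, but this lower weight precision could affect the quality of the output for some models.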
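One way to keep SM_NUM_GPUS in step with the instance type is to derive both from a single lookup. The sketch below is only an illustration: build_tgi_env and GPUS_PER_INSTANCE are hypothetical names, and the table is a small, non-exhaustive sample that you should verify against the AWS instance documentation for the type you actually use.

```python
# Illustrative, non-exhaustive GPU counts per instance type (assumption:
# verify against the AWS documentation before relying on these values).
GPUS_PER_INSTANCE = {
    "ml.g5.2xlarge": 1,
    "ml.g4dn.12xlarge": 4,
}


def build_tgi_env(model_id, instance_type, hub_token=None):
    """Return the environment dict for the model object (hypothetical helper)."""
    if instance_type not in GPUS_PER_INSTANCE:
        raise KeyError(f"Unknown GPU count for instance type: {instance_type}")
    env = {
        "HF_MODEL_ID": model_id,
        # Environment variable values are passed as strings.
        "SM_NUM_GPUS": str(GPUS_PER_INSTANCE[instance_type]),
    }
    if hub_token:
        # Only needed for models that require authentication.
        env["HUGGING_FACE_HUB_TOKEN"] = hub_token
    return env


if __name__ == "__main__":
    print(build_tgi_env("meta-llama/Llama-3.1-8B-Instruct", "ml.g5.2xlarge"))
```

The returned dict can be passed as the env argument of HuggingFaceModel, with the same instance type then given to deploy.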

model_name = "llama-3-1-8b-instruct" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    'HF_MODEL_ID':'meta-llama/Llama-3.1-8B-Instruct',
    'SM_NUM_GPUS':'1',
    'HUGGING_FACE_HUB_TOKEN': '<REPLACE WITH YOUR TOKEN>',
}

assert hub['HUGGING_FACE_HUB_TOKEN'] != '<REPLACE WITH YOUR TOKEN>', "You have to provide a token."


model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri
)

Next, we call the deploy method to deploy the model.

predictor = model.deploy(
  initial_instance_count=1,
  instance_type="ml.g5.2xlarge",
  endpoint_name=model_name
)

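Once the model is deployed, we can invoke it to generate text. We pass an input prompt and run the predict method to generate a text response from the LLM running in the TGI container.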

input_data = {
  "inputs": "The diamondback terrapin was the first reptile to",
  "parameters": {
    "do_sample": True,
    "max_new_tokens": 100,
    "temperature": 0.7,
    "watermark": True
  }
}

predictor.predict(input_data)

We receive the following auto-generated text response:

[{'generated_text': 'The diamondback terrapin was the first reptile to make the list, followed by the American alligator, the American crocodile, and the American box turtle. The polecat, a ferret-like animal, and the skunk rounded out the list, both having gained their slots because they have proven to be particularly dangerous to humans.\n\nCalifornians also seemed to appreciate the new list, judging by the comments left after the election.\n\n“This is fantastic,” one commenter declared.\n\n“California is a very'}]
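In the response above, generated_text begins with the original prompt. If only the newly generated part is needed, it can be separated with a small helper. This is a sketch that assumes the response has the shape shown above (a list containing one dict with a generated_text key); extract_continuation is a hypothetical name.

```python
def extract_continuation(response, prompt):
    """Return the generated text without the leading prompt.

    Assumes `response` looks like [{"generated_text": "..."}], as in the
    sample output above. If the text does not start with the prompt, it
    is returned unchanged.
    """
    text = response[0]["generated_text"]
    if text.startswith(prompt):
        return text[len(prompt):].lstrip()
    return text


if __name__ == "__main__":
    sample = [{"generated_text": "The diamondback terrapin was the first reptile to make the list"}]
    prompt = "The diamondback terrapin was the first reptile to"
    print(extract_continuation(sample, prompt))
```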

Once we are done experimenting, we delete the endpoint and the model resources.

predictor.delete_model()
predictor.delete_endpoint()

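User-defined code and modules

The Hugging Face Inference Toolkit allows users to override the default methods of the HuggingFaceHandlerService. To do so, create a folder named code/ with an inference.py file in it. See the section on creating a model artifact above for details on how to archive your model artifacts. For example: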

model.tar.gz/
|- pytorch_model.bin
|- ....
|- code/
  |- inference.py
  |- requirements.txt

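The inference.py file contains your custom inference module, and the requirements.txt file contains additional dependencies that should be added. The custom module can override the following methods:

model_fn(model_dir) overrides the default method for loading a model. The return value model is used in predict for predictions. model_fn receives the argument model_dir, the path to your unzipped model.tar.gz.
transform_fn(model, data, content_type, accept_type) overrides the default transform function with your custom implementation. You need to implement your own preprocess, predict, and postprocess steps in transform_fn. This method cannot be combined with input_fn, predict_fn, or output_fn mentioned below.
input_fn(input_data, content_type) overrides the default method for preprocessing. The return value data is used in predict for predictions. The inputs are:
- input_data is the raw body of your request.
- content_type is the content type from the request header.
predict_fn(processed_data, model) overrides the default method for predictions. The return value predictions is used in postprocess. The input is processed_data, the result from preprocess.
output_fn(prediction, accept) overrides the default method for postprocessing. The return value result is the response to your request (for example, JSON). The inputs are:
- prediction is the result from predict.
- accept is the return accept type from the HTTP request, for example application/json.

Here is an example of a custom inference module with model_fn, input_fn, predict_fn, and output_fn: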

from sagemaker_huggingface_inference_toolkit import decoder_encoder

def model_fn(model_dir):
    # implement custom code to load the model
    loaded_model = ...
    
    return loaded_model 

def input_fn(input_data, content_type):
    # decode the input data  (e.g. JSON string -> dict)
    data = decoder_encoder.decode(input_data, content_type)
    return data

def predict_fn(data, model):
    # call your custom model with the data
    predictions = model(data, ...)
    return predictions

def output_fn(prediction, accept):
    # convert the model output to the desired output format (e.g. dict -> JSON string)
    response = decoder_encoder.encode(prediction, accept)
    return response

Customize your inference module with only model_fn and transform_fn:

from sagemaker_huggingface_inference_toolkit import decoder_encoder

def model_fn(model_dir):
    # implement custom code to load the model
    loaded_model = ...
    
    return loaded_model 

def transform_fn(model, input_data, content_type, accept):
    # decode the input data (e.g. JSON string -> dict)
    data = decoder_encoder.decode(input_data, content_type)

    # call your custom model with the data
    outputs = model(data, ...)

    # convert the model output to the desired output format (e.g. dict -> JSON string)
    response = decoder_encoder.encode(outputs, accept)

    return response
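To see how the separate hooks fit together, the following sketch simulates one request locally. It is an assumption-laden illustration, not the toolkit's implementation: it uses the standard json module in place of decoder_encoder so that it runs without the toolkit installed, UppercaseModel is a stand-in for a real model, and handle is a hypothetical driver that calls the hooks in the order described above (load, preprocess, predict, postprocess).

```python
import json


class UppercaseModel:
    """Stand-in for a real model (hypothetical); upper-cases the input."""

    def __call__(self, text):
        return {"label": text.upper()}


def model_fn(model_dir):
    # A real implementation would load weights from model_dir.
    return UppercaseModel()


def input_fn(input_data, content_type):
    # Decode the raw request body (JSON string -> dict).
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(input_data)


def predict_fn(data, model):
    # Requests always carry their payload under the "inputs" key.
    return model(data["inputs"])


def output_fn(prediction, accept):
    # Encode the prediction (dict -> JSON string).
    if accept != "application/json":
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps(prediction)


def handle(body, content_type, accept, model_dir="."):
    """Local simulation of a single request passing through the hooks."""
    model = model_fn(model_dir)
    data = input_fn(body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, accept)


if __name__ == "__main__":
    print(handle('{"inputs": "hello"}', "application/json", "application/json"))
```

Running the hooks this way before packaging inference.py into model.tar.gz is a cheap check that the pieces accept and return the types each next step expects.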
