Run any HuggingFace model on a SageMaker endpoint: a walkthrough with a Cross Encoder model example

Community article, published December 14, 2023

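Introduction

In this document, we walk you step by step through deploying any HuggingFace model as a SageMaker endpoint. The approach works for any model, not only those that support the `text-generation` or `text2text-generation` tasks. We use https://huggingface.co/BAAI/bge-reranker-base as the example.

The example code was tested on OSX and in the us-west-2 AWS region.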

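Infrastructure overview

1. TorchServe lets you run a HuggingFace model as a web server.
2. The AWS team created https://github.com/aws/sagemaker-pytorch-inference-toolkit to make it easy to run TorchServe as a SageMaker endpoint.
3. The AWS team also publishes container images on ECR, built from Dockerfiles. One of them, `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.12.0-gpu-py38`, uses sagemaker-pytorch-inference-toolkit.
4. On AWS SageMaker,
   1. You create a **SageMaker model** by specifying the TorchServe-based docker image and the location of the model zip file in an S3 bucket.
   2. You create a **SageMaker endpoint configuration** by choosing the model and the desired instance type.
   3. You create a **SageMaker endpoint** from the endpoint configuration. This is the actual web service instance.

To run any model on a SageMaker endpoint, you only need to know how to create the model zip file (the one referenced in item 4.1 above), which is the main topic of this document.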

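Steps to run a HuggingFace model on a SageMaker endpoint

1. Figure out how to use the model in a bare Python environment, such as a SageMaker Notebook terminal.
2. Based on #1, write `inference.py` and test it locally.
3. Package #2 as a zip file and upload it to an S3 bucket.
4. Create the SageMaker model, the endpoint configuration, and then the endpoint. Test.
5. Write client helper code so that the service is easy to use. Test.
6. (Optional) Modify the package file to include the model binaries.

The following sections cover each step in detail.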

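1. Figure out how to use the model

Following the HuggingFace documentation for the BAAI/bge-reranker-base model, we worked out how to use the model.

Start `python3`.
Copy and paste the following, and make sure it works.
- Note which packages you had to install to make it work. You will use this list later to define `requirements.txt`.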

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base')

pairs = [['I love you', 'i like you'], ['I love you', 'i hate you']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)

2. Write inference.py

Here is an example `inference.py`. Most of what you did in step #1 goes into the `CrossEncoder` class.

import json
import logging
import torch
from typing import List
from sagemaker_inference import encoder
from transformers import AutoModelForSequenceClassification, AutoTokenizer

PAIRS = "pairs"
SCORES = "scores"

class CrossEncoder:
    def __init__(self) -> None:
        self.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
        logging.info(f"Using device: {self.device}")
        model_name = 'BAAI/bge-reranker-base'
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model = self.model.to(self.device)

    def __call__(self, pairs: List[List[str]]) -> List[float]:
        with torch.inference_mode():
            inputs = self.tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
            inputs = inputs.to(self.device)
            scores = self.model(**inputs, return_dict=True).logits.view(-1, ).float()

        return scores.detach().cpu().tolist()

def model_fn(model_dir: str) -> CrossEncoder:
    try:
        return CrossEncoder()
    except Exception:
        logging.exception(f"Failed to load model from: {model_dir}")
        raise

def transform_fn(cross_encoder: CrossEncoder, input_data: bytes, content_type: str, accept: str) -> bytes:
    payload = json.loads(input_data)
    model_output = cross_encoder(**payload)
    output = {SCORES: model_output}
    return encoder.encode(output, accept)

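In `inference.py`, we define `model_fn()` and `transform_fn()`. A brief explanation follows; for more information, see https://www.philschmid.de/custom-inference-huggingface-sagemaker.

2.1. model_fn()

This function is responsible for loading the model and returning a reference to it.

The `CrossEncoder` class downloads the required model on the fly. This keeps the model package very small, at the cost of a runtime dependency on the HuggingFace service. SageMaker Jumpstart models, by contrast, include the model binaries in the zip file on S3. A later section covers how to do that.

2.2. transform_fn()

Here you define how the request payload is parsed and what the output looks like.

Note that the `__init__()` of the `CrossEncoder` class is called only once, by `model_fn()`, whereas `__call__()` is invoked on every `transform_fn()` call.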
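The request contract that `transform_fn()` relies on can be made explicit with a small validation helper. The sketch below is only an illustration: `parse_pairs` is a hypothetical name that is not part of the original `inference.py` or of the SageMaker toolkit, and it assumes the payload shape `{"pairs": [[text_a, text_b], ...]}` used throughout this article.

```python
import json
from typing import List, Union


def parse_pairs(input_data: Union[str, bytes]) -> List[List[str]]:
    """Parse a request body and return the list of text pairs.

    Hypothetical helper: raises ValueError with a readable message when the
    payload does not look like {"pairs": [["a", "b"], ...]}.
    """
    payload = json.loads(input_data)  # invalid JSON raises a ValueError subclass
    if not isinstance(payload, dict) or "pairs" not in payload:
        raise ValueError('payload must be a JSON object with a "pairs" key')
    pairs = payload["pairs"]
    if not isinstance(pairs, list):
        raise ValueError('"pairs" must be a list')
    for pair in pairs:
        ok = (
            isinstance(pair, list)
            and len(pair) == 2
            and all(isinstance(text, str) for text in pair)
        )
        if not ok:
            raise ValueError(f"each pair must be a list of two strings, got: {pair!r}")
    return pairs
```

Inside `transform_fn()`, one option would be to call `cross_encoder(parse_pairs(input_data))` instead of `cross_encoder(**payload)`, so that malformed requests fail with a clear message in the container logs.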

2.3. Test

In a local terminal where Python is installed, run `python3 -i inference.py` and execute

model = model_fn("")
transform_fn(model, "{\"pairs\": [[\"I love you\", \"i like you\"], [\"I love you\", \"i hate you\"]]}", "application/json", "application/json")

You should get the same scores as in the earlier test.

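3. Package the model and upload it to the S3 bucket

1. Create the model package root folder.
2. Under the root folder, create a `code` subfolder.
3. In the `code` folder, place `inference.py` along with `__init__.py`, `requirements.txt`, and `version`.
4. From the root folder of the model package, compress the package and upload it.

Example directory structure for #3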

<model package root>
└── code
    ├── __init__.py          # the content is empty
    ├── inference.py         # the content is from step #2. Write `inference.py`
    ├── requirements.txt
    └── version              # the content can be a one-line string "1.0.0"

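Note that `requirements.txt` does not need to list a package that you had to install locally if that package is already included in the container image. Here is an example `requirements.txt`.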

accelerate==0.24.1
bitsandbytes==0.41.2.post2
transformers==4.30.0
sentencepiece==0.1.99
protobuf==3.20.1

Example commands for the compress-and-upload step above, run from the model package root folder

tar zcvf BAAI_bge-reranker-base.tar.gz *
aws s3 cp BAAI_bge-reranker-base.tar.gz s3://<<YOUR_S3_BUCKET_NAME>>/huggingface-models/
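If you prefer to build the archive from Python, the standard `tarfile` module can produce the same layout. This is a minimal sketch under the assumption that the important property is having `code/` at the top level of the archive, as the `tar zcvf ... *` command above does; `build_model_archive` is a hypothetical helper name.

```python
import tarfile
from pathlib import Path
from typing import List


def build_model_archive(package_root: str, archive_path: str) -> List[str]:
    """Create a .tar.gz whose top-level entries mirror package_root's contents.

    Mirrors running `tar zcvf <archive> *` inside package_root: `code/` ends up
    at the root of the archive, not nested under a parent folder.
    Returns the member names of the finished archive.
    """
    root = Path(package_root)
    archive = Path(archive_path).resolve()
    with tarfile.open(archive, "w:gz") as tar:
        for entry in sorted(root.iterdir()):
            # Skip hidden entries (the shell glob `*` skips them too)
            # and the archive itself if it lives inside package_root.
            if entry.name.startswith(".") or entry.resolve() == archive:
                continue
            tar.add(entry, arcname=entry.name)
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()
```

Checking the returned names for `code/inference.py` before uploading is a cheap way to catch an archive that was accidentally built one directory too high.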

4. Create the SageMaker endpoint

1. Create the model.
   1. Log in to the AWS console and open https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/models
   2. Click the "**Create model**" button.
   3. In `Location of inference code image`, enter `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.12.0-gpu-py38`
   4. In `Location of model artifacts - optional`, enter the S3 path from step #3 (Package the model and upload it to the S3 bucket).
2. Create the endpoint configuration.
   1. Open https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/endpointConfig
   2. Click "**Create endpoint configuration**".
   3. Click **Create production variant** and select the model you created in item 1.
3. Create the endpoint.
   1. Open https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/endpoints
   2. Click "**Create endpoint**".
   3. Select the endpoint configuration you created in item 2.
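The same three console steps can also be scripted with boto3. The sketch below is not from the original walkthrough: `create_endpoint_stack` is a hypothetical helper, the `ml.g4dn.xlarge` default and the `AllTraffic` variant name are assumptions, and you need to supply your own execution role ARN. The client is passed in, so the function expects something like `boto3.client("sagemaker", region_name="us-west-2")`.

```python
from typing import Any, Dict

IMAGE_URI = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.12.0-gpu-py38"


def create_endpoint_stack(
    sm_client: Any,
    name: str,
    model_data_url: str,
    role_arn: str,
    instance_type: str = "ml.g4dn.xlarge",  # assumption: pick a GPU type you have quota for
) -> Dict[str, str]:
    """Create a model, an endpoint configuration, and an endpoint, in that order."""
    config_name = f"{name}-config"

    # 1. SageMaker model: container image plus the model archive on S3.
    sm_client.create_model(
        ModelName=name,
        PrimaryContainer={"Image": IMAGE_URI, "ModelDataUrl": model_data_url},
        ExecutionRoleArn=role_arn,
    )
    # 2. Endpoint configuration: which model runs on which instance type.
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "ModelName": name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
            }
        ],
    )
    # 3. Endpoint: the actual web service instance.
    sm_client.create_endpoint(EndpointName=name, EndpointConfigName=config_name)
    return {"model": name, "endpoint_config": config_name, "endpoint": name}
```

Endpoint creation is asynchronous, so the endpoint will still be in the `Creating` state when this function returns.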

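4.1. Test

Once the endpoint is in the InService state, run the following in your Python terminal and confirm that you get the same result as in the earlier tests. Replace `my-bge-reranker-base` with the name you gave your endpoint.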

import boto3, json
session = boto3.Session()
client = session.client("sagemaker-runtime", region_name="us-west-2")
output = client.invoke_endpoint(EndpointName="my-bge-reranker-base", Body="{\"pairs\": [[\"I love you\", \"i like you\"], [\"I love you\", \"i hate you\"]]}", ContentType="application/json")
json.loads(output["Body"].read().decode("utf-8"))
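If you script the test, it helps to wait for the InService state before invoking the endpoint. The following is a rough sketch, not part of the original article: `wait_until_in_service` is a hypothetical helper that polls `describe_endpoint`, and it expects a `boto3.client("sagemaker")` client rather than the `sagemaker-runtime` client used above. The polling interval and limit are arbitrary assumptions.

```python
import time
from typing import Any, Callable


def wait_until_in_service(
    sm_client: Any,
    endpoint_name: str,
    poll_seconds: float = 30.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll describe_endpoint until the endpoint status is InService.

    Raises RuntimeError if the endpoint reports Failed, and TimeoutError if it
    is still not InService after max_polls checks.
    """
    for _ in range(max_polls):
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to start; check the container logs")
        sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not InService after {max_polls} polls")
```

boto3 also ships a built-in waiter for this, `sm_client.get_waiter("endpoint_in_service")`, which is a reasonable alternative if you do not need custom error messages.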

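4.2. Troubleshooting tip

Quick tip for updating the model: if you are adjusting the model package, for example `inference.py`, you do not need to start over. Update the model zip file in the S3 bucket, then delete the endpoint and recreate it with the existing endpoint configuration. This saves time and effort.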
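The delete-and-recreate cycle can be scripted as well. This is a sketch under the assumption that the endpoint configuration already exists and is reused unchanged; `recreate_endpoint` is a hypothetical helper, and `sm_client` is expected to be a `boto3.client("sagemaker")` client.

```python
from typing import Any


def recreate_endpoint(sm_client: Any, endpoint_name: str, endpoint_config_name: str) -> None:
    """Delete an endpoint and create it again from the same endpoint configuration."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    # The endpoint name cannot be reused until the deletion has finished,
    # so block on boto3's built-in "endpoint_deleted" waiter first.
    sm_client.get_waiter("endpoint_deleted").wait(EndpointName=endpoint_name)
    sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )
```

Run this after uploading the new archive to the same S3 path, so that the recreated endpoint loads the updated package.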

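4.3. Troubleshooting: the endpoint service does not respond

The endpoint should respond as quickly as the model does when you run it locally. If it does not, the most likely cause is that the model failed to start.

1. In the AWS console, open the SageMaker endpoint page.
2. Click the "**Model container logs**" link.
3. Check the logs to see what went wrong.

At this point, the most likely cause is a missing package, which you can fix by modifying `requirements.txt`.

4.4. Troubleshooting: the endpoint service responds, but slowly

Make sure it is using the GPU. The `logging.info(f"Using device: {self.device}")` line in `inference.py` should log `cuda`. If it logs `cpu`, the endpoint is not using the GPU.

This can happen for various reasons. One case I ran into was caused by an incorrect PyTorch version. I removed `torch` from `requirements.txt`, and once the endpoint used the PyTorch bundled in the container image, the problem went away.

4.5. Troubleshooting: anything else

You can add debug messages to `inference.py`.

You can also download the container and run the model locally. If you go that far, note that your model package root directory should be mapped to `/opt/ml/model` in the docker container. (In other words, `inference.py` should be located at `/opt/ml/model/code/inference.py`.)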

5. Write a client wrapper

I wrote the following code to make the endpoint easier to use from a client.

import json
from typing import Any, Dict, List, Optional

from langchain.pydantic_v1 import BaseModel, Extra, root_validator
from langchain.schema.cross_encoder import CrossEncoder


class CrossEncoderContentHandler:
    """Content handler for CrossEncoder class."""
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, pairs: List[List[str]]) -> bytes:
        input_str = json.dumps({"pairs": pairs})
        return input_str.encode('utf-8')

    def transform_output(self, output: Any) -> List[float]:
        response_json = json.loads(output.read().decode("utf-8"))
        scores = response_json["scores"]
        return scores

class SagemakerEndpointCrossEncoder(BaseModel):
    client: Any  #: :meta private:

    endpoint_name: str = ""
    region_name: str = ""
    credentials_profile_name: Optional[str] = None
    content_handler: CrossEncoderContentHandler = CrossEncoderContentHandler()
    model_kwargs: Optional[Dict] = None
    endpoint_kwargs: Optional[Dict] = None

    class Config:
        extra = Extra.forbid
        arbitrary_types_allowed = True

    @root_validator()
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate that AWS credentials and the boto3 package exist in the environment."""
        import boto3

        if values["credentials_profile_name"] is not None:
            session = boto3.Session(
                profile_name=values["credentials_profile_name"]
            )
        else:
            # use default credentials
            session = boto3.Session()

        values["client"] = session.client(
            "sagemaker-runtime", region_name=values["region_name"]
        )
        return values

    def score(self, pairs: List[List[str]]) -> List[float]:
        """Call out to SageMaker Inference CrossEncoder endpoint."""
        _model_kwargs = self.model_kwargs or {}
        _endpoint_kwargs = self.endpoint_kwargs or {}

        body = self.content_handler.transform_input(pairs)
        content_type = self.content_handler.content_type
        accepts = self.content_handler.accepts

        # send request
        try:
            response = self.client.invoke_endpoint(
                EndpointName=self.endpoint_name,
                Body=body,
                ContentType=content_type,
                Accept=accepts,
                **_endpoint_kwargs,
            )
        except Exception as e:
            raise ValueError(f"Error raised by inference endpoint: {e}")

        return self.content_handler.transform_output(response["Body"])

def _setup_sagemaker_endpoint_for_cross_encoder(reranker_endpoint_name: str,
                                                 region: str) -> SagemakerEndpointCrossEncoder:
    sm_llm = SagemakerEndpointCrossEncoder(
        endpoint_name=reranker_endpoint_name,
        region_name=region,
        model_kwargs={},
        content_handler=CrossEncoderContentHandler())
    return sm_llm

Test: confirm that the result of `llm.score()` matches the earlier tests.

llm = _setup_sagemaker_endpoint_for_cross_encoder("my-bge-reranker-base", "us-west-2")
llm.score([["I love you", "i like you"], ["I love you", "i hate you"]])

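6. (Optional) Modify the package zip file to include the model binaries

For examples of how to include the model in the zip file, see other Jumpstart models. You can list them in the S3 bucket by running the following command.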

aws s3 ls s3://jumpstart-cache-prod-us-west-2/huggingface-infer/prepack/ --recursive

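The `inference.py` files in those packages also contain example code for passing `kwargs` to the model instantiation and for adding parameter validation logic.

Including the model binaries is standard practice for SageMaker Jumpstart. Personally, I am not sure it is necessarily the better approach, because it takes longer to change the image or load the endpoint.

You need to modify `code/inference.py` so that the model is loaded from the model directory passed to `model_fn()` instead of being downloaded from HuggingFace.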

class CrossEncoder:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        self.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
        logging.info(f"Using device: {self.device}")

        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        self.model = self.model.to(self.device)
...
def model_fn(model_dir: str) -> CrossEncoder:
    try:
        return CrossEncoder(model_dir)
    except Exception:
        logging.exception(f"Failed to load model from: {model_dir}")
        raise

Then create `build-and-upload.sh` with the following content, place it in the model package root folder, and run it. The new model package is uploaded to the S3 bucket.

#!/bin/sh
rm -rf build
mkdir build
cd build
# Requires git-lfs; without it the clone contains only pointer files instead of the model weights.
git clone https://huggingface.co/BAAI/bge-reranker-base
cp -r ../code bge-reranker-base/
cd bge-reranker-base
tar zcvf BAAI_bge-reranker-base.tar.gz *
aws s3 cp BAAI_bge-reranker-base.tar.gz s3://<<YOUR_S3_BUCKET_NAME>>/huggingface-models/

Make sure the endpoint test from section 4 (Create the SageMaker endpoint) still works.
