🤗 Deploy Any Model with Inference Endpoints + Custom Handlers

Community article, published November 22, 2024

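TL;DR Inference Endpoints is a secure production solution for easily deploying any Transformers, Sentence-Transformers, or Diffusers model from the Hugging Face Hub on dedicated, autoscaling infrastructure managed by Hugging Face. Inference Endpoints can also run custom code through handlers, which lets you tailor the pre-processing, inference, and post-processing steps to your needs. This post explains how to deploy any model on Inference Endpoints with a custom handler and walks through practical use cases that anyone can reproduce.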

What is Inference Endpoints?

Inference Endpoints is a secure production solution for easily deploying any Transformers, Sentence-Transformers, or Diffusers model from the Hugging Face Hub on dedicated, autoscaling infrastructure managed by Hugging Face.

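Models can be deployed as dedicated endpoints through the Inference Endpoints UI, which is available for any model on the Hugging Face Hub that carries the Inference Endpoints tag. Alternatively, models can be used without deploying anything through the Serverless Inference API, which is available for any model whose inference status is "warm" or "cold".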

If you are not yet familiar with Inference Endpoints, we recommend reading the documentation first.

What are custom handlers?

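Custom handlers are Python classes that define the pre-processing, inference, and post-processing steps required to run inference with a model. The Inference Endpoints backend uses classes of this kind internally when running the default container (the PyTorch container), which supports most of the model architectures and tasks defined on the Hugging Face Hub and supported by Transformers, Sentence-Transformers, and Diffusers.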

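Custom handlers extend Inference Endpoints beyond what is supported natively and give you more flexibility and control over the inference process. With them you can adjust the pre-processing, inference, and post-processing steps, pull in additional dependencies, or add features such as custom metrics or logging. When the default solution does not cover your needs, you are not tied to a one-size-fits-all setup and can instead adapt the handler to your specific requirements.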

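A custom handler is provided as a `handler.py` file in the model repository, optionally accompanied by a `requirements.txt` file. When these files are present, the Inference Endpoints backend detects and uses them automatically at startup.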

Getting started!

There are several ways to start using a custom handler on the Hugging Face Hub:

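Duplicate the repository that contains the model weights, and add the handler.py and requirements.txt (if applicable) files to the resulting separate repository.
Open a PR (or commit to main if you are the sole owner) to add the handler.py and requirements.txt (if applicable) files to the existing repository.
Create a brand-new model repository that contains only handler.py and requirements.txt (if applicable).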

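Note that for the "Deploy" button to appear in the model repository, README.md must include pipeline_tag: ... set to a valid pipeline supported by Inference Endpoints. This applies even when the repository contains no model weights.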
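As a rough illustration of where that metadata lives, the sketch below builds README.md content with the YAML front matter at the top. The helper name `build_readme` and the `text-to-image` value are assumptions made for the example, not part of any Hugging Face API; use the task that matches your model.

```python
def build_readme(pipeline_tag: str, body: str = "") -> str:
    """Return README.md content whose YAML front matter declares `pipeline_tag`."""
    # The model card metadata sits between two `---` lines at the top of the file.
    front_matter = f"---\npipeline_tag: {pipeline_tag}\n---\n"
    return front_matter + body


if __name__ == "__main__":
    # `text-to-image` is only an example value.
    print(build_readme("text-to-image", "\n# Repository with a custom handler\n"))
```

The resulting text can be committed as README.md in the repository that holds handler.py.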

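Once the repository is set up, with or without model weights, create a file named handler.py at the root of the repository and implement the following interface: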

from typing import Any, Dict

class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        ...

    def __call__(self, data: Dict[str, Any]) -> Any:
        ...

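You can include any other functionality in handler.py, but the class must be named EndpointHandler and must implement both the __init__ and __call__ methods. Beyond that, you are free to add other methods to the class, or functions outside it, and call them from those two methods.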

Finally, once the handler is written, you can debug it locally by running the following snippet:

if __name__ == "__main__":
    handler = EndpointHandler(model_dir=...)
    assert handler(data=...) == ...
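
To make the interface and the local check concrete, here is a minimal sketch of a handler that loads no model at all and simply upper-cases the incoming text. The placeholder behavior and the `outputs` key are assumptions chosen so the snippet runs anywhere; a real handler would load a model from `model_dir` and run inference instead.

```python
from typing import Any, Dict


class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        # A real handler would load the model or pipeline from `model_dir` here.
        self.model_dir = model_dir

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        inputs = data.get("inputs")
        if not isinstance(inputs, str) or not inputs:
            raise ValueError(
                "The request body must contain a non-empty string under `inputs`."
            )
        # Placeholder "inference": replace with a call to your model or pipeline.
        return {"outputs": inputs.upper()}


if __name__ == "__main__":
    handler = EndpointHandler(model_dir=".")
    assert handler(data={"inputs": "hello"}) == {"outputs": "HELLO"}
```

Running the file directly executes the check at the bottom, which is a quick way to catch payload-handling mistakes before deploying.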

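Additionally, if your pipeline needs specific dependency versions, or dependencies that are not included in the default PyTorch container, you can list them in a `requirements.txt` file as follows: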

diffusers>=0.31.0

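And that's it! After clicking "Deploy" and selecting "Inference Endpoints (dedicated)", you should be able to deploy your custom handler on Inference Endpoints. Alternatively, you can go straight to the Inference Endpoints UI and search the Hub for the model repository that contains the custom handler.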

Tips and tricks

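The most convenient way to copy model weights from one repository to another is the Repository Duplicator, a Hugging Face Space that copies everything within Hugging Face, so you do not need to pull and push all the LFS files locally first.
Duplicating an existing repository is the preferred approach, because the hardware recommendations still work when you create the Inference Endpoint (except for LoRA adapter weights that are not hosted together with the base model weights). Otherwise, when a custom handler only pulls the model inside the EndpointHandler.__init__ method, the hardware recommendations are ignored.
Because the core engine behind all of this is huggingface-inference-toolkit, you can use some of the utilities it defines. For logging, for example, import the logger with from huggingface_inference_toolkit.logging import logger and use it as usual (logger.info, logger.debug, and so on); all of those logs show up in the Inference Endpoints logs.
When selecting the task for the default container (the PyTorch container) in the Inference Endpoints UI, set it to the same task as the model (unless it is not supported) so that the Playground UI works properly. The Playground does not work with modified input payloads or unsupported tasks, so in those cases select the "Custom" task instead.
If the model weights are not in the current repository and are hosted in a gated repository, you need to set a secret variable manually in the Inference Endpoints configuration so that the gated weights can be downloaded. The best way to enforce this is to add the following snippet to the EndpointHandler.__init__ method, before any other initialization step: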
```
if os.getenv("HF_TOKEN") is None:
    raise ValueError(
        "Since the model weights are gated, you will need to provide a valid `HF_TOKEN` with read-access"
        " to the repository where the weights are hosted."
    )
```
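Note that the token is not needed if the model weights are hosted in the current repository.
When deploying an Inference Endpoint from a duplicated or existing repository, not every file in that repository may be needed, since it can contain the weights in several formats, such as `safetensors`, `bin`, and so on. Because all of those files are downloaded at startup, you may want to delete the unused ones first. This does not happen when the repository contains only `handler.py` and `requirements.txt` (if applicable) and `handler.py` points to another repository through, for example, `transformers.pipeline(task=..., model=...)`; in that case only the required files are downloaded rather than everything in the repository.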
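
Two of the tips above, the toolkit logger and the early token check, can be combined in a small helper. The sketch below is one possible arrangement rather than a prescribed pattern: the function name `require_hf_token` and the fallback to the standard `logging` module (so the handler can also be debugged outside the container) are assumptions made for this example.

```python
import logging
import os

try:
    # Available inside the default Inference Endpoints container.
    from huggingface_inference_toolkit.logging import logger
except ImportError:
    # Assumed fallback so the same code also runs during local debugging.
    logger = logging.getLogger("handler")


def require_hf_token() -> str:
    """Return the `HF_TOKEN` value, failing early if it is not set."""
    token = os.getenv("HF_TOKEN")
    if token is None:
        raise ValueError(
            "Since the model weights are gated, you will need to provide a valid "
            "`HF_TOKEN` with read-access to the repository where the weights are hosted."
        )
    logger.info("HF_TOKEN is set; gated weights can be requested.")
    return token
```

Calling `require_hf_token()` as the first statement of `EndpointHandler.__init__` surfaces a missing secret immediately instead of partway through a large download.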

Use cases

Below are a few use cases that show why custom handlers are valuable, along with simple code snippets showing how to reproduce them and adapt them to your needs.

Serving LoRA adapters for Diffusion models


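Suppose you want to serve a fine-tuned LoRA adapter for a Diffusers model, for example alvarobartt/ghibli-characters-flux-lora, which is a LoRA fine-tune of black-forest-labs/FLUX.1-dev. Trying to deploy it on Inference Endpoints as is results in an error.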

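As the error says, the model repository that contains the LoRA adapter needs to include a handler.py file that first loads the base model and then loads the adapter, as described in the Diffusers documentation on how to load adapters.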

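Note that the base model here (the part that is not the adapter, black-forest-labs/FLUX.1-dev in this case) is gated, so you need to create a Hugging Face Hub token with read access to the gated model and set it as the value of the HF_TOKEN environment variable.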

import os
from typing import Any, Dict

from diffusers import DiffusionPipeline  # type: ignore
from PIL.Image import Image
import torch

from huggingface_inference_toolkit.logging import logger


class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:  # type: ignore
        """The current `EndpointHandler` works with any FLUX.1-dev LoRA Adapter."""
        if os.getenv("HF_TOKEN") is None:
            raise ValueError(
                "Since `black-forest-labs/FLUX.1-dev` is a gated model, you will need to provide a valid "
                "`HF_TOKEN` as an environment variable for the handler to work properly."
            )

        self.pipeline = DiffusionPipeline.from_pretrained(
            "black-forest-labs/FLUX.1-dev",
            torch_dtype=torch.bfloat16,
            token=os.getenv("HF_TOKEN"),
        )
        self.pipeline.load_lora_weights(model_dir)
        self.pipeline.to("cuda")

    def __call__(self, data: Dict[str, Any]) -> Image:
        logger.info(f"Received incoming request with {data=}")

        if "inputs" in data and isinstance(data["inputs"], str):
            prompt = data.pop("inputs")
        elif "prompt" in data and isinstance(data["prompt"], str):
            prompt = data.pop("prompt")
        else:
            raise ValueError(
                "Provided input body must contain either the key `inputs` or `prompt` with the"
                " prompt to use for the image generation, and it needs to be a non-empty string."
            )

        parameters = data.pop("parameters", {})

        num_inference_steps = parameters.get("num_inference_steps", 30)
        width = parameters.get("width", 1024)
        height = parameters.get("height", 768)
        guidance_scale = parameters.get("guidance_scale", 3.5)

        # seed generator (seed cannot be provided as is but via a generator)
        seed = parameters.get("seed", 0)
        generator = torch.manual_seed(seed)

        return self.pipeline(  # type: ignore
            prompt,
            height=height,
            width=width,
            guidance_scale=guidance_scale,
            num_inference_steps=num_inference_steps,
            generator=generator,
        ).images[0]

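The code above can be reused without changes as the `handler.py` file of any LoRA adapter available for black-forest-labs/FLUX.1-dev. Switching the base model to, for example, stabilityai/stable-diffusion-3.5-large requires only minor changes, since most of the code is shared across `text-to-image` use cases.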
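
The part that stays the same across `text-to-image` handlers is mostly the request parsing, so it can be factored into a function and checked without a GPU. The sketch below mirrors the defaults used in the handler above; the function name `parse_request` is an assumption, and unlike the handler it also rejects an empty prompt, which matches the wording of the error message.

```python
from typing import Any, Dict, Tuple


def parse_request(data: Dict[str, Any]) -> Tuple[str, Dict[str, Any], int]:
    """Split a request body into the prompt, the pipeline kwargs, and the seed."""
    # Accept the prompt under either `inputs` or `prompt`, as the handler does.
    prompt = next(
        (data[key] for key in ("inputs", "prompt") if isinstance(data.get(key), str)),
        None,
    )
    if not prompt:
        raise ValueError(
            "Provided input body must contain either the key `inputs` or `prompt` with the"
            " prompt to use for the image generation, and it needs to be a non-empty string."
        )

    parameters = data.get("parameters") or {}
    kwargs = {
        "num_inference_steps": parameters.get("num_inference_steps", 30),
        "width": parameters.get("width", 1024),
        "height": parameters.get("height", 768),
        "guidance_scale": parameters.get("guidance_scale", 3.5),
    }
    # The seed is returned separately because the pipeline takes a generator, not a seed.
    seed = parameters.get("seed", 0)
    return prompt, kwargs, seed
```

Inside `__call__`, the result could then be used as `self.pipeline(prompt, generator=torch.manual_seed(seed), **kwargs)`, which leaves only the model-loading code in `__init__` to change when the base model changes.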

Once deployed on Inference Endpoints, it looks like this:

Find the custom handler at alvarobartt/ghibli-characters-flux-lora.

Deploying models that are not natively supported on the Hub


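Suppose you want to deploy nvidia/NVLM-D-72B, an image-text-to-text model, that is, a vision language model (VLM). It is supported neither by Text Generation Inference (TGI) nor by the default PyTorch container, because image-text-to-text does not yet have a predefined AutoPipeline implementation, although one should be available soon according to https://github.com/huggingface/transformers/pull/34170.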

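In that case you need to define a custom handler in the handler.py file that runs the pre-processing, inference, and post-processing for the task, and list any additional requirements in requirements.txt. The latter is rarely necessary, since the default PyTorch container already ships with most of the Hugging Face dependencies for Transformers, Sentence-Transformers, and Diffusers, as well as some of their commonly used extras.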

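Note that in this case the custom handler serves not only to cover an unsupported model, but also to define a custom device map, add custom pre-processing code, and emit some custom log messages, among other things.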

import math
from typing import Any, Dict, List

import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode

import requests
from io import BytesIO
from PIL import Image

from transformers import AutoTokenizer, AutoModel

from huggingface_inference_toolkit.logging import logger


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float("inf")
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(
    image, min_num=1, max_num=12, image_size=448, use_thumbnail=False
):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j)
        for n in range(min_num, max_num + 1)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if i * j <= max_num and i * j >= min_num
    )
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio,
        target_ratios,
        orig_width,
        orig_height,
        image_size,
    )

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size,
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_url, input_size=448, max_num=12):
    # fail fast on slow or broken URLs instead of hanging or decoding an error page
    response = requests.get(image_url, timeout=30)
    response.raise_for_status()
    image = Image.open(BytesIO(response.content)).convert("RGB")
    transform = build_transform(input_size=input_size)
    # split the image into tiles (plus a thumbnail) and stack them into one tensor
    images = dynamic_preprocess(
        image, image_size=input_size, use_thumbnail=True, max_num=max_num
    )
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 80
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f"language_model.model.layers.{layer_cnt}"] = i
            layer_cnt += 1
    device_map["vision_model"] = 0
    device_map["mlp1"] = 0
    device_map["language_model.model.tok_embeddings"] = 0
    device_map["language_model.model.embed_tokens"] = 0
    device_map["language_model.output"] = 0
    device_map["language_model.model.norm"] = 0
    device_map["language_model.lm_head"] = 0
    device_map[f"language_model.model.layers.{num_layers - 1}"] = 0

    return device_map


IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose(
        [
            T.Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
            T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
            T.ToTensor(),
            T.Normalize(mean=MEAN, std=STD),
        ]
    )
    return transform


class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        self.model = AutoModel.from_pretrained(
            model_dir,
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
            use_flash_attn=False,
            trust_remote_code=True,
            device_map=split_model(),
        ).eval()

        self.tokenizer = AutoTokenizer.from_pretrained(
            model_dir, trust_remote_code=True, use_fast=False
        )

    def __call__(self, data: Dict[str, Any]) -> Dict[str, List[Any]]:
        logger.info(f"Received incoming request with {data=}")
        
        if "instances" in data:
            logger.warning("Using `instances` instead of `inputs` is deprecated.")
            data["inputs"] = data.pop("instances")

        if "inputs" not in data:
            raise ValueError(
                "The request body must contain a key 'inputs' with a list of inputs."
            )

        if not isinstance(data["inputs"], list):
            raise ValueError(
                "The request inputs must be a list of dictionaries with either the key"
                " 'prompt' or 'prompt' + 'image_url', and optionally including the key"
                " 'generation_config'."
            )

        if not all(isinstance(input, dict) and "prompt" in input.keys() for input in data["inputs"]):
            raise ValueError(
                "The request inputs must be a list of dictionaries with either the key"
                " 'prompt' or 'prompt' + 'image_url', and optionally including the key"
                " 'generation_config'."
            )

        predictions = []
        for input in data["inputs"]:
            if "prompt" not in input:
                raise ValueError(
                    "The request input body must contain at least the key 'prompt' with the prompt to use."
                )

            generation_config = input.get("generation_config", dict(max_new_tokens=1024, do_sample=False))

            if "image_url" not in input:
                # pure-text conversation
                response, history = self.model.chat(
                    self.tokenizer,
                    None,
                    input["prompt"],
                    generation_config,
                    history=None,
                    return_history=True,
                )
            else:
                # single-image single-round conversation
                pixel_values = load_image(input["image_url"], max_num=6).to(
                    torch.bfloat16
                )
                response = self.model.chat(
                    self.tokenizer,
                    pixel_values,
                    f"<image>\n{input['prompt']}",
                    generation_config,
                )

            predictions.append(response)
        return {"predictions": predictions}

Once deployed on Inference Endpoints, it looks like this:

Find the custom handler at alvarobartt/NVLM-D-72B-IE-compatible.
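
The custom device map is the least obvious part of that handler, and its layer arithmetic can be inspected without any GPU by passing the GPU count in as an argument. The sketch below reproduces only the layer assignment from `split_model`; the function name `build_device_map` is an assumption, the entries for the vision model and embeddings are left out, and, as a deliberate deviation, indices past the last layer are skipped.

```python
import math
from typing import Dict


def build_device_map(world_size: int, num_layers: int = 80) -> Dict[str, int]:
    """Assign language-model layers to GPUs, counting GPU 0 as half a GPU."""
    if world_size < 1:
        raise ValueError("At least one GPU is required.")

    # GPU 0 also hosts the vision model, so it receives about half as many layers.
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    layers_per_gpu = [per_gpu] * world_size
    layers_per_gpu[0] = math.ceil(per_gpu * 0.5)

    device_map: Dict[str, int] = {}
    layer = 0
    for gpu, count in enumerate(layers_per_gpu):
        for _ in range(count):
            if layer >= num_layers:  # deviation: rounding up can overshoot the layer count
                break
            device_map[f"language_model.model.layers.{layer}"] = gpu
            layer += 1

    # As in the handler, the last layer is placed on GPU 0 next to the output head.
    device_map[f"language_model.model.layers.{num_layers - 1}"] = 0
    return device_map
```

With four GPUs and 80 layers, this places layers 0 to 11 on GPU 0, 12 to 34 on GPU 1, 35 to 57 on GPU 2, and 58 to 78 on GPU 3, with layer 79 moved back to GPU 0.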

Defining a custom specification for the I/O payloads


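Note that when using a custom specification for the I/O payloads, the "Task" that runs within the "Default" container on Inference Endpoints needs to be set to "Custom". Otherwise the Playground in the UI is created for the given task and fails because of its predefined output parsing, whereas the Custom task prints the raw response as JSON.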

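Suppose you have a UI or an SDK that expects the API to receive or produce a given payload, and the default Inference Endpoints payload format (for the input, the output, or both) does not match it, but you still want to use Hugging Face Inference Endpoints seamlessly from your application.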

You then need to implement a custom handler. Take a task such as zero-shot-classification, whose default input is:

{"inputs": "I have a problem with my iphone that needs to be resolved asap!", "parameters": {"candidate_labels": ["urgent", "not urgent", "phone", "tablet", "computer"]}}

But you want it to expect the following instead:

{"sequence": "I have a problem with my iphone that needs to be resolved asap!", "labels": ["urgent", "not urgent", "phone", "tablet", "computer"]}

And by default it produces the following output:

{"sequence": "I have a problem with my iphone that needs to be resolved asap!", "labels": ["urgent", "phone", "computer", "not urgent", "tablet"], "scores": [0.504, 0.479, 0.013, 0.003, 0.002]}

But you want it to produce:

{"sequence": "I have a problem with my iphone that needs to be resolved asap!", "label": "urgent", "timestamp": 1732028280}

The custom handler would then look similar to the following:

import os
from typing import Any, Dict
import time

from transformers import pipeline
import torch

from huggingface_inference_toolkit.logging import logger


class EndpointHandler:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        """Initialize the EndpointHandler for zero-shot classification."""
        self.classifier = pipeline(
            "zero-shot-classification",
            model=model_dir,
            device=0 if torch.cuda.is_available() else -1
        )

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        logger.info(f"Received incoming request with {data=}")

        if "sequence" not in data or not isinstance(data["sequence"], str):
            raise ValueError(
                "Provided input body must contain the key `sequence` with the text to classify, "
                "and it needs to be a non-empty string."
            )

        if "labels" not in data or not isinstance(data["labels"], list):
            raise ValueError(
                "Provided input body must contain the key `labels` with a list of classification labels."
            )

        sequence = data["sequence"]
        labels = data["labels"]

        output = self.classifier(sequence, candidate_labels=labels)

        return {
            "sequence": sequence,
            "label": output["labels"][0],
            "timestamp": int(time.time())
        }
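
The translation between the two payload formats in the handler above can also be isolated into pure functions, which makes it possible to unit-test the I/O contract without loading a model. This is a sketch under that assumption; the names `to_pipeline_inputs` and `to_custom_output`, and the optional `timestamp` argument used to make the output deterministic in tests, are hypothetical.

```python
import time
from typing import Any, Dict, Optional


def to_pipeline_inputs(data: Dict[str, Any]) -> Dict[str, Any]:
    """Map the custom request body to the default zero-shot-classification body."""
    return {
        "inputs": data["sequence"],
        "parameters": {"candidate_labels": data["labels"]},
    }


def to_custom_output(
    output: Dict[str, Any], timestamp: Optional[int] = None
) -> Dict[str, Any]:
    """Keep only the top label and attach a Unix timestamp."""
    return {
        "sequence": output["sequence"],
        # The pipeline returns labels sorted by descending score, so index 0 is the best.
        "label": output["labels"][0],
        "timestamp": int(time.time()) if timestamp is None else timestamp,
    }
```

Keeping the mapping separate from the model call also means the same functions can be reused on the client side if the payload contract changes.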

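These are just a few of the many use cases for custom handlers. Others include downloading model weights from private storage outside the Hub, such as Google Cloud Storage (GCS), adding custom metric reporting or logging, and many more.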

Conclusion

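While the default implementations in Inference Endpoints should cover most use cases for models hosted on the Hugging Face Hub that are compatible with Text Generation Inference (TGI), Text Embeddings Inference (TEI), or PyTorch, there are cases where those implementations are limited or do not fit your specific needs or specifications. As shown above, custom code in an endpoint handler addresses those cases easily.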

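Custom handlers give Inference Endpoints a great deal of flexibility, making it possible to deploy almost any model while Hugging Face manages it securely and reliably. The solution is hosted on the infrastructure of major cloud providers, such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, which provides robust and scalable deployment options.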
