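Inference Endpoints (dedicated)

Have you ever wanted to create your own machine learning API? That is what we will do in this tutorial with HF Dedicated Inference Endpoints. Inference Endpoints let you pick any of the hundreds of thousands of models on the HF Hub and create your own API on a deployment platform you control and on hardware you choose.

Serverless Inference APIs are great for initial testing, but they are limited to a pre-configured selection of popular models and they are rate limited, because the serverless API's hardware is shared by many users at the same time. With a Dedicated Inference Endpoint, you can customize the deployment of your model, and the hardware is dedicated exclusively to you.

In this tutorial, we will:

Create an Inference Endpoint via a simple UI and send standard HTTP requests to the endpoint
Create and manage different Inference Endpoints programmatically with the huggingface_hub library
Cover three use cases: text generation with an LLM, image generation with Stable Diffusion, and reasoning over images with Idefics2.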

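Install and login

If you don't have an HF account, you can create one here. If you work in a larger team, you can also create an HF organization and manage all your models, datasets, and endpoints through that organization. Dedicated Inference Endpoints are a paid service, so you need to add a credit card to the billing settings of your personal HF account or of your HF organization.

You can then create a user access token here. A token with read or write permissions works for this guide, but we encourage the use of fine-grained tokens for better security. For this notebook, you need a fine-grained token with User Permissions > Inference > Make calls to Inference Endpoints & Manage Inference Endpoints and Repository permissions > google/gemma-1.1-2b-it & HuggingFaceM4/idefics2-8b-chatty.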

!pip install huggingface_hub~=0.23.3
!pip install transformers~=4.41.2

# Login to the HF Hub. We recommend using this login method
# to avoid the need for explicitly storing your HF token in variables
import huggingface_hub

huggingface_hub.interpreter_login()

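Creating your first endpoint

With the initial setup done, we can now create our first endpoint. Navigate to https://ui.endpoints.huggingface.co/ and click + New next to "Dedicated Endpoints". You will then see the interface for creating a new endpoint with the following options (see image below):

Model Repository: Here you can insert the identifier of any model on the HF Hub. For this initial demonstration, we use google/gemma-1.1-2b-it, a small generative LLM (2.5B parameters).
Endpoint Name: The endpoint name is generated automatically from the model identifier, but you are free to change it. Valid endpoint names may only contain lower-case characters, numbers or hyphens ("-") and must be between 4 and 32 characters long.
Instance Configuration: Here you can choose from a wide range of CPUs or GPUs offered by all major cloud platforms. You can also adjust the region, for example if you need to host your endpoint in the EU.
Automatic Scale-to-Zero: You can configure your endpoint to scale down to zero GPUs/CPUs after a certain amount of time. Scaled-to-zero endpoints are no longer billed. Note that restarting the endpoint requires the model to be reloaded into memory (and potentially re-downloaded), which can take several minutes for large models.
Endpoint Security Level: The standard security level is Protected, which requires an authorized HF token to access the endpoint. Public endpoints can be accessed by anyone without token authentication. Private endpoints are only reachable through an intra-region secured AWS or Azure PrivateLink connection.
Advanced configuration: Here you can select some advanced options such as the Docker container type. As Gemma is compatible with the Text Generation Inference (TGI) container, the system automatically selects TGI as the container type along with other good default values.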
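
The naming rule above is easy to check before an endpoint is created. The following is a minimal sketch, assuming the rule is exactly as stated (lower-case characters, numbers or hyphens, 4 to 32 characters); the service may enforce further constraints. The helper names `is_valid_endpoint_name` and `suggest_endpoint_name` are hypothetical and not part of huggingface_hub.

```python
import re

# Rule as stated in the text: lower-case characters, numbers or hyphens, 4 to 32 characters.
_ENDPOINT_NAME_RE = re.compile(r"[a-z0-9-]{4,32}")


def is_valid_endpoint_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rule quoted above."""
    return _ENDPOINT_NAME_RE.fullmatch(name) is not None


def suggest_endpoint_name(model_id: str, suffix: str = "001") -> str:
    """Hypothetical helper: derive a candidate endpoint name from a model identifier."""
    # keep only the part after the namespace, e.g. "gemma-1.1-2b-it"
    base = model_id.split("/")[-1].lower()
    # replace every run of disallowed characters (dots, underscores, ...) with a hyphen
    base = re.sub(r"[^a-z0-9-]+", "-", base).strip("-")
    # leave room for "-<suffix>" within the 32 character limit
    max_base = 32 - len(suffix) - 1
    return f"{base[:max_base].rstrip('-')}-{suffix}"


print(suggest_endpoint_name("google/gemma-1.1-2b-it"))
```

Checking the name locally avoids a failed request when you later create endpoints from a script.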

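For this guide, select the options shown in the image below and click Create Endpoint.

After roughly one minute, your endpoint will be created and you will see a page similar to the image below.

On the endpoint's Overview page, you will find the URL for querying the endpoint, a Playground for testing the model, and additional tabs for Analytics, Usage & Cost, Logs and Settings.

Creating and managing endpoints programmatically

When moving to production, you don't always want to start, stop, and modify your endpoints manually. The huggingface_hub library provides good functionality for managing your endpoints programmatically. See the documentation here and the details on all functions here. Here are some key functions: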

# list all your inference endpoints
huggingface_hub.list_inference_endpoints()

# get an existing endpoint and check its status
endpoint = huggingface_hub.get_inference_endpoint(
    name="gemma-1-1-2b-it-yci",  # the name of the endpoint
    namespace="MoritzLaurer",  # your user name or organization name
)
print(endpoint)

# Pause endpoint to stop billing
endpoint.pause()

# Resume and wait until the endpoint is ready
# endpoint.resume()
# endpoint.wait()

# Update the endpoint to a different GPU
# You can find the correct arguments for different hardware types in this table: https://huggingface.co/docs/inference-endpoints/pricing#gpu-instances
# endpoint.update(
#    instance_size="x1",
#    instance_type="nvidia-a100",  # nvidia-a10g
# )

You can also create an Inference Endpoint programmatically. Let's recreate the same gemma LLM endpoint that we created through the UI.

from huggingface_hub import create_inference_endpoint


model_id = "google/gemma-1.1-2b-it"
endpoint_name = "gemma-1-1-2b-it-001"  # Valid Endpoint names must only contain lower-case characters, numbers or hyphens ("-") and are between 4 to 32 characters long.
namespace = "MoritzLaurer"  # your user or organization name


# check if endpoint with this name already exists from previous tests
available_endpoints_names = [endpoint.name for endpoint in huggingface_hub.list_inference_endpoints()]
if endpoint_name in available_endpoints_names:
    endpoint_exists = True
else:
    endpoint_exists = False
print("Does the endpoint already exist?", endpoint_exists)


# create new endpoint
if not endpoint_exists:
    endpoint = create_inference_endpoint(
        endpoint_name,
        repository=model_id,
        namespace=namespace,
        framework="pytorch",
        task="text-generation",
        # see the available hardware options here: https://huggingface.co/docs/inference-endpoints/pricing#pricing
        accelerator="gpu",
        vendor="aws",
        region="us-east-1",
        instance_size="x1",
        instance_type="nvidia-a10g",
        min_replica=0,
        max_replica=1,
        type="protected",
        # since the LLM is compatible with TGI, we specify that we want to use the latest TGI image
        custom_image={
            "health_route": "/health",
            "env": {"MODEL_ID": "/repository"},
            "url": "ghcr.io/huggingface/text-generation-inference:latest",
        },
    )
    print("Waiting for endpoint to be created")
    endpoint.wait()
    print("Endpoint ready")

# if endpoint with this name already exists, get and resume existing endpoint
else:
    endpoint = huggingface_hub.get_inference_endpoint(name=endpoint_name, namespace=namespace)
    if endpoint.status in ["paused", "scaledToZero"]:
        print("Resuming endpoint")
        endpoint.resume()
    print("Waiting for endpoint to start")
    endpoint.wait()
    print("Endpoint ready")

# access the endpoint url for API calls
print(endpoint.url)

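Querying your endpoint

Now let's query this endpoint like any other LLM API. First, copy the endpoint URL from the interface (or use endpoint.url) and assign it to API_URL below. We then use the standardized messages format for the text input, i.e. a list of dictionaries with user and assistant messages, which you might know from other LLM API services. Next, we need to apply the chat template to the messages: LLMs like Gemma, Llama-3, etc. have been trained to expect this format (see details in the docs). For most recent generative LLMs, applying the chat template is essential; otherwise the model's output quality degrades silently, without any error being raised.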

>>> import requests
>>> from transformers import AutoTokenizer

>>> # paste your endpoint URL here or reuse endpoint.url if you created the endpoint programmatically
>>> API_URL = endpoint.url  # or paste link like "https://dz07884a53qjqb98.us-east-1.aws.endpoints.huggingface.cloud"
>>> HEADERS = {"Authorization": f"Bearer {huggingface_hub.get_token()}"}


>>> # function for standard http requests
>>> def query(payload=None, api_url=None):
...     response = requests.post(api_url, headers=HEADERS, json=payload)
...     return response.json()


>>> # define conversation input in messages format
>>> # you can also provide multiple turns between user and assistant
>>> messages = [
...     {"role": "user", "content": "Please write a short poem about open source for me."},
...     # {"role": "assistant", "content": "I am not in the mood."},
...     # {"role": "user", "content": "Can you please do this for me?"},
... ]

>>> # apply the chat template for the respective model
>>> model_id = "google/gemma-1.1-2b-it"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> messages_with_template = tokenizer.apply_chat_template(messages, tokenize=False)
>>> print("Your text input looks like this, after the chat template has been applied:\n")
>>> print(messages_with_template)

Your text input looks like this, after the chat template has been applied:

user
Please write a short poem about open source for me.
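
To make concrete what a chat template does, here is a minimal sketch that renders a messages list into a single prompt string by hand. The turn markers (`<start_of_turn>`, `<end_of_turn>`) and the `model` role name are modeled on Gemma's prompt format and are an assumption for illustration; the function `render_gemma_style_prompt` is hypothetical. The template shipped with the tokenizer remains the authoritative source and may add further tokens.

```python
def render_gemma_style_prompt(messages, add_generation_prompt=True):
    """Illustrative only: wrap each turn in Gemma-style turn markers (assumed format)."""
    parts = []
    for message in messages:
        # in this assumed format the assistant role is called "model"
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n")
    if add_generation_prompt:
        # open a new model turn so that the LLM continues as the assistant
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


example_messages = [
    {"role": "user", "content": "Please write a short poem about open source for me."},
]
print(render_gemma_style_prompt(example_messages))
```

In practice you should keep using tokenizer.apply_chat_template, since every model family defines its own markers and hand-written templates drift out of sync easily.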

>>> # send standard http request to endpoint
>>> output = query(
...     payload={
...         "inputs": messages_with_template,
...         "parameters": {"temperature": 0.2, "max_new_tokens": 100, "seed": 42, "return_full_text": False},
...     },
...     api_url=API_URL,
... )

>>> print("The output from your API/Endpoint call:\n")
>>> print(output)

The output from your API/Endpoint call:

[{'generated_text': "Free to use, free to share,\nA collaborative code, a community's care.\n\nCode transparent, bugs readily found,\nContributions welcome, stories unbound.\nOpen source, a gift to all,\nBuilding the future, one line at a call.\n\nSo join the movement, embrace the light,\nOpen source, shining ever so bright."}]

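That's it, you have made the first request to your endpoint (your very own API!).

If you want the endpoint to handle the chat template automatically and your LLM runs on a TGI container, you can also use the messages API by appending the /v1/chat/completions path to the URL. With the /v1/chat/completions path, the TGI container running on the endpoint applies the chat template automatically and is fully compatible with OpenAI's API structure for easier interoperability. See the TGI Swagger UI for all available parameters. Note that the parameters accepted by the default / path and by the /v1/chat/completions path are slightly different. Here is the slightly modified code for using the messages API: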

>>> API_URL_CHAT = API_URL + "/v1/chat/completions"

>>> output = query(
...     payload={
...         "messages": messages,
...         "model": "tgi",
...         # in the OpenAI-compatible schema, sampling options are top-level fields
...         "temperature": 0.2,
...         "max_tokens": 100,
...         "seed": 42,
...     },
...     api_url=API_URL_CHAT,
... )

>>> print("The output from your API/Endpoint call with the OpenAI-compatible messages API route:\n")
>>> print(output)

The output from your API/Endpoint call with the OpenAI-compatible messages API route:

{'id': '', 'object': 'text_completion', 'created': 1718283608, 'model': '/repository', 'system_fingerprint': '2.0.5-dev0-sha-90184df', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '**Open Source**\n\nA license for the mind,\nTo share, distribute, and bind,\nIdeas freely given birth,\nFor the good of all to sort.\n\nCode transparent, eyes open wide,\nA permission for the wise,\nTo learn, to build, to use at will,\nA future bright, we help fill.\n\nFrom servers vast to candles low,\nOpen source, a guiding key,\nFor progress made, knowledge shared,\nA future brimming with'}, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 20, 'completion_tokens': 100, 'total_tokens': 120}}
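
The response above is a nested dictionary, and in most applications you only need a few of its fields. The following is a minimal sketch that pulls out the generated text, checks whether generation stopped because the token limit was hit, and reads the token usage. It assumes the response has the shape printed above; the helper name `summarize_chat_response` is hypothetical and the sample content is abbreviated.

```python
def summarize_chat_response(response: dict) -> dict:
    """Extract the commonly needed fields from a chat completions response."""
    choice = response["choices"][0]
    return {
        "text": choice["message"]["content"],
        # "length" means the output was cut off at the token limit
        "truncated": choice.get("finish_reason") == "length",
        "total_tokens": response.get("usage", {}).get("total_tokens"),
    }


# abbreviated copy of the response shown above
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "**Open Source**\n\nA license for the mind,"},
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 20, "completion_tokens": 100, "total_tokens": 120},
}

summary = summarize_chat_response(sample_response)
print(summary["truncated"], summary["total_tokens"])
```

In the output above, finish_reason is 'length', which explains why the poem ends mid-sentence: the model reached the 100 token limit.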

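Simplifying endpoint usage with the InferenceClient

You can also use the InferenceClient to send requests to your endpoint easily. The client is a convenient utility in the huggingface_hub Python library that lets you call both Dedicated Inference Endpoints and the Serverless Inference API. See the docs for details.

This is the most succinct way of sending a request to your endpoint: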

from huggingface_hub import InferenceClient

client = InferenceClient()

output = client.chat_completion(
    messages,  # the chat template is applied automatically, if your endpoint uses a TGI container
    model=API_URL,
    temperature=0.2,
    max_tokens=100,
    seed=42,
)

print("The output from your API/Endpoint call with the InferenceClient:\n")
print(output)

# pause the endpoint to stop billing
# endpoint.pause()

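Creating endpoints for a wide variety of models

Following the same process, you can create endpoints for any model on the HF Hub. Let's illustrate a few other use cases.

Image generation with Stable Diffusion

We can create an image generation endpoint with almost exactly the same code as for the LLM. The only difference is that we do not use the TGI container in this case, because TGI is designed only for LLMs (and vision LMs).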

>>> !pip install Pillow  # for image processing

Collecting Pillow
  Downloading pillow-10.3.0-cp39-cp39-manylinux_2_28_x86_64.whl (4.5 MB)
Installing collected packages: Pillow
Successfully installed Pillow-10.3.0

>>> from huggingface_hub import create_inference_endpoint

>>> model_id = "stabilityai/stable-diffusion-xl-base-1.0"
>>> endpoint_name = "stable-diffusion-xl-base-1-0-001"  # Valid Endpoint names must only contain lower-case characters, numbers or hyphens ("-") and are between 4 to 32 characters long.
>>> namespace = "MoritzLaurer"  # your user or organization name
>>> task = "text-to-image"

>>> # check if endpoint with this name already exists from previous tests
>>> available_endpoints_names = [endpoint.name for endpoint in huggingface_hub.list_inference_endpoints()]
>>> if endpoint_name in available_endpoints_names:
...     endpoint_exists = True
>>> else:
...     endpoint_exists = False
>>> print("Does the endpoint already exist?", endpoint_exists)


>>> # create new endpoint
>>> if not endpoint_exists:
...     endpoint = create_inference_endpoint(
...         endpoint_name,
...         repository=model_id,
...         namespace=namespace,
...         framework="pytorch",
...         task=task,
...         # see the available hardware options here: https://huggingface.co/docs/inference-endpoints/pricing#pricing
...         accelerator="gpu",
...         vendor="aws",
...         region="us-east-1",
...         instance_size="x1",
...         instance_type="nvidia-a100",
...         min_replica=0,
...         max_replica=1,
...         type="protected",
...     )
...     print("Waiting for endpoint to be created")
...     endpoint.wait()
...     print("Endpoint ready")

>>> # if endpoint with this name already exists, get existing endpoint
>>> else:
...     endpoint = huggingface_hub.get_inference_endpoint(name=endpoint_name, namespace=namespace)
...     if endpoint.status in ["paused", "scaledToZero"]:
...         print("Resuming endpoint")
...         endpoint.resume()
...     print("Waiting for endpoint to start")
...     endpoint.wait()
...     print("Endpoint ready")

Does the endpoint already exist? True
Waiting for endpoint to start
Endpoint ready

>>> prompt = "A whimsical illustration of a fashionably dressed llama proudly holding a worn, vintage cookbook, with a warm cup of tea and a few freshly baked treats scattered around, set against a cozy background of rustic wood and blooming flowers."

>>> image = client.text_to_image(
...     prompt=prompt,
...     model=endpoint.url,  # "stabilityai/stable-diffusion-xl-base-1.0",
...     guidance_scale=8,
... )

>>> print("PROMPT: ", prompt)
>>> display(image.resize((image.width // 2, image.height // 2)))

PROMPT:  A whimsical illustration of a fashionably dressed llama proudly holding a worn, vintage cookbook, with a warm cup of tea and a few freshly baked treats scattered around, set against a cozy background of rustic wood and blooming flowers.

We pause the endpoint again to stop billing.

endpoint.pause()

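Vision Language Models: Reasoning over text and images

Now let's create an endpoint for a vision language model (VLM). VLMs are very similar to LLMs, except that they can take both text and images as input simultaneously. Their output is autoregressively generated text, just like for a standard LLM. VLMs can tackle many tasks, from visual question answering to document understanding. For this example, we use Idefics2, a powerful 8B parameter VLM.

We first need to convert the PIL image we generated with Stable Diffusion into a base64 encoded string so that we can send it to the model over the network.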

import base64
from io import BytesIO


def pil_image_to_base64(image):
    buffered = BytesIO()
    image.save(buffered, format="JPEG")
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return img_str


image_b64 = pil_image_to_base64(image)
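
The base64 string is later wrapped in a data URI before it is placed in the prompt. As a minimal sketch of that encoding step using only the standard library, the helpers below build a data URI from raw bytes, decode it again, and estimate how much base64 inflates the payload. The function names `to_data_uri`, `from_data_uri` and `base64_length` are hypothetical, and the byte string stands in for real JPEG data.

```python
import base64


def to_data_uri(raw: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(raw).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def from_data_uri(data_uri: str) -> bytes:
    """Decode a base64 data URI back into the original bytes."""
    header, encoded = data_uri.split(",", 1)
    if not header.endswith(";base64"):
        raise ValueError("only base64 data URIs are supported in this sketch")
    return base64.b64decode(encoded)


def base64_length(n_bytes: int) -> int:
    """Length of the padded base64 text for n_bytes of input (4 characters per 3 bytes)."""
    return 4 * ((n_bytes + 2) // 3)


fake_image_bytes = b"\xff\xd8\xff\xe0" + bytes(range(10))  # placeholder, not a real image
data_uri = to_data_uri(fake_image_bytes)
print(data_uri)
```

Because base64 grows the payload by roughly one third, downscaling large images before encoding keeps requests small.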

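Because VLMs and LLMs are so similar, we can again use almost the same messages format and chat template, with only some additional code for including the image in the prompt. See the Idefics2 model card for specific details on prompt formatting.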

from transformers import AutoProcessor

# load the processor
model_id_vlm = "HuggingFaceM4/idefics2-8b-chatty"
processor = AutoProcessor.from_pretrained(model_id_vlm)

# define the user messages
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image"
            },  # the image is placed here in the prompt. You can add multiple images throughout the conversation.
            {"type": "text", "text": "Write a short limerick about this image."},
        ],
    },
]

# apply the chat template to the messages
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# the chat template places a special "<image>" token at the position where the image should go
# here we replace the "<image>" token with the base64 encoded image string in the prompt
# to be able to send the image via an API request
image_input = f"data:image/jpeg;base64,{image_b64}"
image_input = f"![]({image_input})"
prompt = prompt.replace("<image>", image_input)

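For VLMs, an image is represented by a certain number of tokens. For Idefics2, for example, one image is represented by 64 tokens at low resolution and by 5*64=320 tokens at high resolution. High resolution is the default in TGI (see do_image_splitting in the model card for details). This means that one image consumes 320 tokens of the input budget.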
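
The arithmetic above determines how much room is left for text once images are in the prompt. The following is a minimal sketch of that budget calculation, using the 64 and 5*64 figures from the text and the MAX_INPUT_LENGTH value of 1024 used in this tutorial's endpoint configuration. The function names are hypothetical, and the sketch ignores any extra special tokens the chat template may add around an image.

```python
TOKENS_PER_IMAGE_LOW_RES = 64  # figure from the text
HIGH_RES_FACTOR = 5  # figure from the text: 5 * 64 = 320 tokens


def image_token_cost(n_images: int, do_image_splitting: bool = True) -> int:
    """Tokens consumed by n_images at high (splitting on) or low resolution."""
    factor = HIGH_RES_FACTOR if do_image_splitting else 1
    return n_images * factor * TOKENS_PER_IMAGE_LOW_RES


def remaining_text_tokens(max_input_length: int, n_images: int, do_image_splitting: bool = True) -> int:
    """Input tokens left for text after the images are accounted for."""
    remaining = max_input_length - image_token_cost(n_images, do_image_splitting)
    if remaining < 0:
        raise ValueError("the images alone exceed the maximum input length")
    return remaining


print(remaining_text_tokens(max_input_length=1024, n_images=1))
```

With a 1024 token input limit, three high-resolution images already leave only 64 tokens for text, so multi-image conversations need a larger limit or low-resolution images.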

Several VLMs like Idefics2 are also supported by TGI (see the list of supported models), so we use the TGI container again when creating the endpoint.

>>> from huggingface_hub import create_inference_endpoint

>>> endpoint_name = "idefics2-8b-chatty-001"
>>> namespace = "MoritzLaurer"
>>> task = "text-generation"

>>> # check if endpoint with this name already exists from previous tests
>>> available_endpoints_names = [endpoint.name for endpoint in huggingface_hub.list_inference_endpoints()]
>>> if endpoint_name in available_endpoints_names:
...     endpoint_exists = True
>>> else:
...     endpoint_exists = False
>>> print("Does the endpoint already exist?", endpoint_exists)


>>> if endpoint_exists:
...     endpoint = huggingface_hub.get_inference_endpoint(name=endpoint_name, namespace=namespace)
...     if endpoint.status in ["paused", "scaledToZero"]:
...         print("Resuming endpoint")
...         endpoint.resume()
...     print("Waiting for endpoint to start")
...     endpoint.wait()
...     print("Endpoint ready")

>>> else:
...     endpoint = create_inference_endpoint(
...         endpoint_name,
...         repository=model_id_vlm,
...         namespace=namespace,
...         framework="pytorch",
...         task=task,
...         accelerator="gpu",
...         vendor="aws",
...         region="us-east-1",
...         type="protected",
...         instance_size="x1",
...         instance_type="nvidia-a100",
...         min_replica=0,
...         max_replica=1,
...         custom_image={
...             "health_route": "/health",
...             "env": {
...                 "MAX_BATCH_PREFILL_TOKENS": "2048",
...                 "MAX_INPUT_LENGTH": "1024",
...                 "MAX_TOTAL_TOKENS": "1536",
...                 "MODEL_ID": "/repository",
...             },
...             "url": "ghcr.io/huggingface/text-generation-inference:latest",
...         },
...     )

...     print("Waiting for endpoint to be created")
...     endpoint.wait()
...     print("Endpoint ready")

Does the endpoint already exist? False
Waiting for endpoint to be created
Endpoint ready

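>>> # query the dedicated endpoint through its URL; passing the model id would call the serverless API instead
>>> output = client.text_generation(prompt, model=endpoint.url, max_new_tokens=200, seed=42)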

>>> print(output)

In a quaint little café, there lived a llama,
With glasses on his face, he was quite a charm.
He'd sit at the table,
With a book and a mable,
And sip from a cup of warm tea.

endpoint.pause()

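Additional information

When creating several endpoints, you will probably get an error message that your GPU quota has been reached. Don't hesitate to send a message to the email address in the error message and we will most likely increase your GPU quota.
What is the difference between paused and scaled-to-zero endpoints? A scaled-to-zero endpoint can be flexibly woken up and scaled up by user requests, while a paused endpoint needs to be unpaused manually by the creator of the endpoint. Moreover, scaled-to-zero endpoints count towards your GPU quota (with the maximum number of replicas they could be scaled up to), while paused endpoints do not. A simple way of freeing up GPU quota is therefore to pause some endpoints.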
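
The quota rule can be expressed as a small calculation. The following is a minimal sketch under stated assumptions: each replica occupies one GPU, running endpoints are counted with their maximum replica count in the same way as scaled-to-zero endpoints, and endpoints are represented as plain dictionaries. The function `gpu_quota_in_use` and the dictionary layout are hypothetical and not part of huggingface_hub.

```python
def gpu_quota_in_use(endpoints) -> int:
    """Sum the maximum replicas of all endpoints that are not paused."""
    return sum(e["max_replica"] for e in endpoints if e["status"] != "paused")


# hypothetical snapshot of three endpoints
endpoints = [
    {"name": "gemma-1-1-2b-it-001", "status": "running", "max_replica": 1},
    {"name": "stable-diffusion-xl-base-1-0-001", "status": "scaledToZero", "max_replica": 2},
    {"name": "idefics2-8b-chatty-001", "status": "paused", "max_replica": 4},
]
print(gpu_quota_in_use(endpoints))
```

In this snapshot the scaled-to-zero endpoint still reserves two GPUs of quota even though it is not billed; pausing it would release them.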

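Conclusion and next steps

That's it, you have created three different endpoints (your own APIs!) for text-to-text, text-to-image, and image-to-text generation, and the same is possible for many other models and tasks.

We encourage you to read the Dedicated Inference Endpoint docs to learn more. If you are using generative LLMs and VLMs, we also recommend reading the TGI docs, as the most popular LLMs/VLMs are also supported by TGI, which makes your endpoints significantly more efficient.

You can, for example, use JSON mode or function calling with open-source models via TGI Guidance (see also this recipe for an example of RAG with structured generation).

When moving your endpoints into production, you will want to make several additional improvements to make your setup more efficient. When using TGI, you should send batches of requests to the endpoint with asynchronous function calls to fully utilize the endpoint's hardware, and you can adjust several container parameters to optimize latency and throughput for your use case. We will cover these optimizations in another recipe.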
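
As a rough illustration of the asynchronous pattern mentioned above, the following sketch sends a batch of prompts concurrently while capping the number of in-flight requests with a semaphore. The coroutine `fake_generate` is a stand-in that only simulates latency; in a real setup you would replace it with an asynchronous call to your endpoint, for example through the AsyncInferenceClient in huggingface_hub. The function names and the concurrency limit of 4 are assumptions for illustration.

```python
import asyncio


async def fake_generate(prompt: str) -> str:
    """Stand-in for a real asynchronous request to the endpoint (hypothetical)."""
    await asyncio.sleep(0.01)  # simulate network and inference latency
    return prompt.upper()


async def generate_batch(prompts, generate=fake_generate, max_concurrency: int = 4):
    """Run `generate` for all prompts concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(prompt):
        async with semaphore:
            return await generate(prompt)

    # gather returns the results in the same order as the prompts
    return await asyncio.gather(*(worker(p) for p in prompts))


results = asyncio.run(generate_batch(["first prompt", "second prompt", "third prompt"]))
print(results)
```

Keeping several requests in flight lets the container batch them on the GPU, while the semaphore prevents the client from overwhelming the endpoint. In a notebook, where an event loop is already running, you would use `await generate_batch(...)` instead of asyncio.run.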
