Inference Endpoints

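Inference Endpoints provides a secure production solution to easily deploy any transformers, sentence-transformers, and diffusers model on dedicated, autoscaling infrastructure managed by Hugging Face. An Inference Endpoint is built from a model on the Hub. In this guide, we will learn how to manage Inference Endpoints programmatically with huggingface_hub. For more information about the Inference Endpoints product itself, check out its official documentation.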

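This guide assumes that huggingface_hub is correctly installed and that your machine is logged in. If that is not the case yet, check out the Quick Start guide. The minimum version supporting the Inference Endpoints API is v0.19.0.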

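New: it is now possible to deploy an Inference Endpoint from the HF model catalog with a simple API call. The catalog is a carefully curated list of models that can be deployed with optimized settings. You don't need to configure anything; we take care of all the heavy lifting. All models and settings are guaranteed to have been tested to provide the best cost/performance balance. create_inference_endpoint_from_catalog() works the same way as create_inference_endpoint(), with far fewer parameters to pass. You can use list_inference_catalog() to retrieve the catalog programmatically.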

Note that this is still an experimental feature. If you use it, let us know what you think!

Create an Inference Endpoint

The first step is to create an Inference Endpoint using create_inference_endpoint():

>>> from huggingface_hub import create_inference_endpoint

>>> endpoint = create_inference_endpoint(
...     "my-endpoint-name",
...     repository="gpt2",
...     framework="pytorch",
...     task="text-generation",
...     accelerator="cpu",
...     vendor="aws",
...     region="us-east-1",
...     type="protected",
...     instance_size="x2",
...     instance_type="intel-icl"
... )

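In this example, we created a protected Inference Endpoint named "my-endpoint-name" to serve gpt2 for text generation. A protected Inference Endpoint means that a token is required to access the API. We also need to provide additional information to configure the hardware requirements, such as the vendor, region, accelerator, instance type, and instance size. You can check out the list of available resources here. Alternatively, you can create an Inference Endpoint manually using the Web interface for convenience. Refer to this guide for details on advanced settings and their usage.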

The value returned by create_inference_endpoint() is an InferenceEndpoint object:

>>> endpoint
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2', status='pending', url=None)

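It is a dataclass that holds information about the endpoint. You can access important attributes such as name, repository, status, task, created_at, and updated_at. If needed, you can also access the raw response from the server with endpoint.raw.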
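
As a small illustration of reading those attributes, the sketch below builds a one-line summary from any object exposing name, status, and url. The summarize helper is hypothetical (it is not part of huggingface_hub), and a SimpleNamespace stands in for a real InferenceEndpoint so the snippet runs offline.

```python
from types import SimpleNamespace


def summarize(endpoint) -> str:
    """Return a one-line summary built from attributes the guide names."""
    # `url` stays None until the endpoint is deployed, so use a placeholder.
    url = endpoint.url or "<not deployed yet>"
    return f"{endpoint.name} [{endpoint.status}] -> {url}"


# Stand-in object (assumption): a real InferenceEndpoint exposes the same names.
fake = SimpleNamespace(name="my-endpoint-name", status="pending", url=None)
print(summarize(fake))
```

With a real endpoint, the same helper could be called as summarize(endpoint) right after creation.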

Once your Inference Endpoint is created, you can find it on your personal dashboard.

Using a custom image

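By default, the Inference Endpoint is built from a Docker image provided by Hugging Face. However, you can specify any Docker image using the custom_image parameter. A common use case is to run LLMs using the text-generation-inference framework. This can be done as follows: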

# Start an Inference Endpoint running Zephyr-7b-beta on TGI
>>> from huggingface_hub import create_inference_endpoint
>>> endpoint = create_inference_endpoint(
...     "aws-zephyr-7b-beta-0486",
...     repository="HuggingFaceH4/zephyr-7b-beta",
...     framework="pytorch",
...     task="text-generation",
...     accelerator="gpu",
...     vendor="aws",
...     region="us-east-1",
...     type="protected",
...     instance_size="x1",
...     instance_type="nvidia-a10g",
...     custom_image={
...         "health_route": "/health",
...         "env": {
...             "MAX_BATCH_PREFILL_TOKENS": "2048",
...             "MAX_INPUT_LENGTH": "1024",
...             "MAX_TOTAL_TOKENS": "1512",
...             "MODEL_ID": "/repository"
...         },
...         "url": "ghcr.io/huggingface/text-generation-inference:1.1.0",
...     },
... )

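The value passed as custom_image is a dictionary containing the URL of the Docker container and the configuration used to run it. For more details, check out the Swagger documentation.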
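
To make the shape of that dictionary concrete, here is a minimal sketch of a builder that assembles it from numeric settings. The build_tgi_image helper is hypothetical, and the check that the input length must be smaller than the total token budget is an assumption about how these two limits relate, not something this guide states.

```python
def build_tgi_image(url: str, max_input_length: int, max_total_tokens: int,
                    max_batch_prefill_tokens: int) -> dict:
    """Assemble a custom_image dict shaped like the one in the example above."""
    # Assumption: a prompt must leave room for at least one generated token.
    if max_input_length >= max_total_tokens:
        raise ValueError("max_input_length must be smaller than max_total_tokens")
    return {
        "health_route": "/health",
        "url": url,
        "env": {
            # Environment variables are passed as strings, as in the example.
            "MAX_BATCH_PREFILL_TOKENS": str(max_batch_prefill_tokens),
            "MAX_INPUT_LENGTH": str(max_input_length),
            "MAX_TOTAL_TOKENS": str(max_total_tokens),
            "MODEL_ID": "/repository",
        },
    }


image = build_tgi_image(
    "ghcr.io/huggingface/text-generation-inference:1.1.0",
    max_input_length=1024,
    max_total_tokens=1512,
    max_batch_prefill_tokens=2048,
)
print(image["env"])
```

The resulting dictionary could then be passed as custom_image=image to create_inference_endpoint().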

Get or list existing Inference Endpoints

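In some cases, you might need to manage Inference Endpoints you created previously. If you know the name, you can fetch it using get_inference_endpoint(), which returns an InferenceEndpoint object. Alternatively, you can use list_inference_endpoints() to retrieve a list of all Inference Endpoints. Both methods accept an optional namespace parameter. You can set namespace to any organization you belong to. Otherwise, it defaults to your username.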

>>> from huggingface_hub import get_inference_endpoint, list_inference_endpoints

# Get one
>>> get_inference_endpoint("my-endpoint-name")
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2', status='pending', url=None)

# List all endpoints from an organization
>>> list_inference_endpoints(namespace="huggingface")
[InferenceEndpoint(name='aws-starchat-beta', namespace='huggingface', repository='HuggingFaceH4/starchat-beta', status='paused', url=None), ...]

# List all endpoints from all organizations the user belongs to
>>> list_inference_endpoints(namespace="*")
[InferenceEndpoint(name='aws-starchat-beta', namespace='huggingface', repository='HuggingFaceH4/starchat-beta', status='paused', url=None), ...]
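
Once you have such a list, a common follow-up is to see which endpoints are in which state. The sketch below groups endpoint names by status. The group_by_status helper is hypothetical, and plain SimpleNamespace objects stand in for the InferenceEndpoint objects returned by list_inference_endpoints().

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_status(endpoints) -> dict:
    """Map each status value to the names of the endpoints in that state."""
    groups = defaultdict(list)
    for ep in endpoints:
        groups[ep.status].append(ep.name)
    return dict(groups)


# Stand-in data (assumption): real objects carry the same `name` and `status`.
endpoints = [
    SimpleNamespace(name="aws-starchat-beta", status="paused"),
    SimpleNamespace(name="my-endpoint-name", status="pending"),
    SimpleNamespace(name="batch-endpoint", status="paused"),
]
print(group_by_status(endpoints))
```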

Check deployment status

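In the rest of this guide, we will assume that we have an InferenceEndpoint object called endpoint. You might have noticed that the endpoint has a status attribute of type InferenceEndpointStatus. When the Inference Endpoint is deployed and accessible, the status should be "running" and the url attribute is set: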

>>> endpoint
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2', status='running', url='https://jpj7k2q4j805b727.us-east-1.aws.endpoints.huggingface.cloud')

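Before reaching the "running" state, the Inference Endpoint typically goes through an "initializing" or "pending" phase. You can fetch the latest state of the endpoint by running fetch(). Like every other InferenceEndpoint method that makes a request to the server, it updates the internal attributes of endpoint in place: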

>>> endpoint.fetch()
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2', status='pending', url=None)

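Instead of repeatedly fetching the Inference Endpoint status while waiting for it to run, you can call wait() directly. This helper takes a timeout and a refresh_every parameter (both in seconds) and blocks the thread until the Inference Endpoint is deployed. The default values are None (no timeout) and 5 seconds, respectively.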

# Pending endpoint
>>> endpoint
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2', status='pending', url=None)

# Wait 10s => raises an InferenceEndpointTimeoutError
>>> endpoint.wait(timeout=10)
    raise InferenceEndpointTimeoutError("Timeout while waiting for Inference Endpoint to be deployed.")
huggingface_hub._inference_endpoints.InferenceEndpointTimeoutError: Timeout while waiting for Inference Endpoint to be deployed.

# Wait more
>>> endpoint.wait()
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2', status='running', url='https://jpj7k2q4j805b727.us-east-1.aws.endpoints.huggingface.cloud')

If timeout is set and the Inference Endpoint takes too long to load, an InferenceEndpointTimeoutError is raised.
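
For cases where you would rather get a boolean than handle an exception, the polling that wait() performs can be approximated by hand with fetch(). The following is a minimal sketch, not the library's implementation: wait_until_running and FakeEndpoint are hypothetical names, and the comparison with the string "running" assumes the status value compares equal to that string.

```python
import time


def wait_until_running(endpoint, timeout: float, poll_every: float = 5.0) -> bool:
    """Poll fetch() until status is "running"; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        endpoint.fetch()  # refreshes the endpoint attributes in place
        if endpoint.status == "running":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_every)


class FakeEndpoint:
    """Offline stand-in that becomes "running" after a few fetch() calls."""

    def __init__(self, fetches_needed: int):
        self.status = "pending"
        self._remaining = fetches_needed

    def fetch(self):
        self._remaining -= 1
        if self._remaining <= 0:
            self.status = "running"
        return self


quick = wait_until_running(FakeEndpoint(3), timeout=1.0, poll_every=0.01)
slow = wait_until_running(FakeEndpoint(10**6), timeout=0.05, poll_every=0.01)
print(quick, slow)
```

In most cases calling endpoint.wait(timeout=...) directly is simpler; this variant is only useful when a timeout should not interrupt the surrounding control flow.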

Run inference

Once your Inference Endpoint is up and running, you can finally run inference on it!

InferenceEndpoint has two properties, client and async_client, which return an InferenceClient and an AsyncInferenceClient object, respectively.

# Run text_generation task:
>>> endpoint.client.text_generation("I am")
' not a fan of the idea of a "big-budget" movie. I think it\'s a'

# Or in an asyncio context:
>>> await endpoint.async_client.text_generation("I am")

If the Inference Endpoint is not running, an InferenceEndpointError exception is raised:

>>> endpoint.client
huggingface_hub._inference_endpoints.InferenceEndpointError: Cannot create a client for this Inference Endpoint as it is not yet deployed. Please wait for the Inference Endpoint to be deployed using `endpoint.wait()` and try again.

For more details about how to use the InferenceClient, check out the Inference guide.
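
One way to avoid that exception is to check the status before touching client. The sketch below is a conservative guard under that assumption: generate_if_running is a hypothetical helper, and the stand-in objects only mimic the status and client.text_generation names used in this guide.

```python
from types import SimpleNamespace


def generate_if_running(endpoint, prompt: str):
    """Call text_generation only when the endpoint reports "running"."""
    if endpoint.status != "running":
        # Not deployed yet: skip the call instead of raising.
        return None
    return endpoint.client.text_generation(prompt)


# Stand-in client (assumption): echoes the prompt instead of calling a model.
fake_client = SimpleNamespace(text_generation=lambda prompt: prompt + " ...")
running = SimpleNamespace(status="running", client=fake_client)
pending = SimpleNamespace(status="pending", client=None)

print(generate_if_running(running, "I am"))
print(generate_if_running(pending, "I am"))
```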

Manage lifecycle

Now that we have seen how to create an Inference Endpoint and run inference on it, let's see how to manage its lifecycle.

In this section, we will see methods such as pause(), resume(), scale_to_zero(), update(), and delete(). All of these methods are aliases added to InferenceEndpoint for convenience. If you prefer, you can also use the generic methods defined in HfApi: pause_inference_endpoint(), resume_inference_endpoint(), scale_to_zero_inference_endpoint(), update_inference_endpoint(), and delete_inference_endpoint().

Pause or scale to zero

To reduce costs when your Inference Endpoint is not in use, you can either pause it using pause() or scale it to zero using scale_to_zero().

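An Inference Endpoint that is paused or scaled to zero does not cost anything. The difference between the two is that a paused endpoint must be explicitly resumed using resume(). By contrast, an endpoint scaled to zero starts automatically when an inference call is made to it, at the cost of an additional cold-start delay. An Inference Endpoint can also be configured to scale to zero automatically after a period of inactivity.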

# Pause and resume endpoint
>>> endpoint.pause()
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2', status='paused', url=None)
>>> endpoint.resume()
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2', status='pending', url=None)
>>> endpoint.wait().client.text_generation(...)
...

# Scale to zero
>>> endpoint.scale_to_zero()
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2', status='scaledToZero', url='https://jpj7k2q4j805b727.us-east-1.aws.endpoints.huggingface.cloud')
# Endpoint is not 'running' but still has a URL and will restart on first call.
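
The choice between the two can be captured in a small helper, sketched here under the assumption that the only deciding factor is whether incoming requests should be able to wake the endpoint. The park function and the FakeEndpoint class are hypothetical; only pause() and scale_to_zero() come from this guide.

```python
def park(endpoint, wake_on_request: bool):
    """Stop billing: scale to zero if requests should wake it, else pause."""
    if wake_on_request:
        # Restarts automatically on the first call, with a cold-start delay.
        return endpoint.scale_to_zero()
    # Stays down until resume() is called explicitly.
    return endpoint.pause()


class FakeEndpoint:
    """Offline stand-in that mimics the status changes shown above."""

    def __init__(self):
        self.status = "running"

    def pause(self):
        self.status = "paused"
        return self

    def scale_to_zero(self):
        self.status = "scaledToZero"
        return self


print(park(FakeEndpoint(), wake_on_request=True).status)
print(park(FakeEndpoint(), wake_on_request=False).status)
```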

Update model or hardware requirements

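In some cases, you might also want to update your Inference Endpoint without creating a new one. You can update either the hosted model or the hardware requirements used to run the model. You can do this using update():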

# Change target model
>>> endpoint.update(repository="gpt2-large")
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2-large', status='pending', url=None)

# Update number of replicas
>>> endpoint.update(min_replica=2, max_replica=6)
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2-large', status='pending', url=None)

# Update to larger instance
>>> endpoint.update(accelerator="cpu", instance_size="x4", instance_type="intel-icl")
InferenceEndpoint(name='my-endpoint-name', namespace='Wauplin', repository='gpt2-large', status='pending', url=None)

Delete the endpoint

Finally, if you no longer need the Inference Endpoint, you can simply call InferenceEndpoint.delete().

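This is an irreversible action that completely removes the endpoint, including its configuration, logs, and usage metrics. You cannot restore a deleted Inference Endpoint.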

An end-to-end example

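A typical use case for Inference Endpoints is to process a batch of jobs at once to limit infrastructure costs. You can automate this process using what we saw in this guide: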

>>> import asyncio
>>> from huggingface_hub import create_inference_endpoint

# Start endpoint + wait until initialized
>>> endpoint = create_inference_endpoint(name="batch-endpoint",...).wait()

# Run inference
>>> client = endpoint.client
>>> results = [client.text_generation(...) for job in jobs]

# Or with asyncio (inside a coroutine or an asyncio REPL)
>>> async_client = endpoint.async_client
>>> results = await asyncio.gather(*[async_client.text_generation(...) for job in jobs])

# Pause endpoint
>>> endpoint.pause()

Or, if your Inference Endpoint already exists and is paused:

>>> import asyncio
>>> from huggingface_hub import get_inference_endpoint

# Get endpoint + wait until initialized
>>> endpoint = get_inference_endpoint("batch-endpoint").resume().wait()

# Run inference (inside a coroutine or an asyncio REPL)
>>> async_client = endpoint.async_client
>>> results = await asyncio.gather(*[async_client.text_generation(...) for job in jobs])

# Pause endpoint
>>> endpoint.pause()
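
In the flows above, the final pause() is skipped if a job raises, which would leave the endpoint running and billed. One possible safeguard is a context manager that always pauses on exit. This is a sketch rather than a library feature: resumed and FakeEndpoint are hypothetical names, and the job processing is replaced by a placeholder so the snippet runs offline.

```python
from contextlib import contextmanager


@contextmanager
def resumed(endpoint):
    """Resume and wait on entry; always pause on exit, even on errors."""
    endpoint.resume()
    endpoint.wait()
    try:
        yield endpoint
    finally:
        endpoint.pause()


class FakeEndpoint:
    """Offline stand-in that records the lifecycle calls it receives."""

    def __init__(self):
        self.calls = []

    def resume(self):
        self.calls.append("resume")
        return self

    def wait(self):
        self.calls.append("wait")
        return self

    def pause(self):
        self.calls.append("pause")
        return self


fake = FakeEndpoint()
with resumed(fake) as ep:
    # Placeholder for the real inference calls on ep.client.
    results = [job.upper() for job in ["job-a", "job-b"]]
print(fake.calls)
```

With a real endpoint, the body of the with block would hold the client.text_generation(...) calls from the examples above.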
