Serverless Inference API

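Hugging Face provides a Serverless Inference API that lets users quickly test and evaluate thousands of publicly accessible (or your own privately permissioned) machine learning models with simple API calls, for free!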

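In this notebook recipe, we'll demonstrate several different ways to query the Serverless Inference API while exploring various tasks, including:

Generating text with open LLMs
Creating images with Stable Diffusion
Reasoning over images with VLMs
Generating speech from text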

The goal is to help you get started by covering the basics!

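Because we offer the Serverless Inference API for free, regular Hugging Face users are subject to rate limits (roughly a few hundred requests per hour). For higher rate limits, you can upgrade to a PRO account for just $9 per month. For high-volume production inference workloads, however, check out our Dedicated Inference Endpoints solution.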

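Let's get started

To begin using the Serverless Inference API, you'll need a Hugging Face Hub profile: you can register if you don't have one, or log in if you already do.

Next, you'll need to create a User Access Token. A token with either read or write permissions will work. However, we strongly recommend using fine-grained tokens.

For this notebook, you'll need a fine-grained token with the "Inference > Make calls to the serverless Inference API" user permission, along with read access to the meta-llama/Meta-Llama-3-8B-Instruct and HuggingFaceM4/idefics2-8b-chatty repos, since we must download their tokenizers to run this notebook.

With those steps out of the way, we can install the required packages and authenticate to the Hub with our User Access Token.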

%pip install -U huggingface_hub transformers

import os
from huggingface_hub import interpreter_login, whoami, get_token

# running this will prompt you to enter your Hugging Face credentials
interpreter_login()

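We used interpreter_login() above to log in to the Hub programmatically. As alternatives, we could use other methods such as notebook_login() from the Hub Python library or the login command from the Hugging Face CLI tool.

Now, let's verify that we're properly logged in using whoami(), which prints out the active username and the organizations your profile belongs to.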

whoami()

Querying the Serverless Inference API

The Serverless Inference API exposes models on the Hub through a simple API:

https://api-inference.huggingface.co/models/<MODEL_ID>

where <MODEL_ID> corresponds to the name of the model repo on the Hub.

For example, codellama/CodeLlama-7b-hf becomes https://api-inference.huggingface.co/models/codellama/CodeLlama-7b-hf
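As a small illustration of the URL scheme above, the mapping from a model ID to its endpoint can be sketched as a helper. The name build_api_url is hypothetical and not part of any library; it only joins the base URL shown above with the repo name.

```python
# Base URL of the Serverless Inference API, as given above.
API_BASE = "https://api-inference.huggingface.co/models"


def build_api_url(model_id: str) -> str:
    """Return the endpoint URL for a Hub model repo such as 'org/name'.

    build_api_url is a hypothetical helper used for illustration only.
    """
    model_id = model_id.strip().strip("/")
    if not model_id:
        raise ValueError("model_id must be a non-empty repo name")
    return f"{API_BASE}/{model_id}"


print(build_api_url("codellama/CodeLlama-7b-hf"))
```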

With an HTTP request

We can easily call this API with a simple POST request using the requests library.

>>> import requests

>>> API_URL = "https://api-inference.huggingface.co/models/codellama/CodeLlama-7b-hf"
>>> HEADERS = {"Authorization": f"Bearer {get_token()}"}


>>> def query(payload):
...     response = requests.post(API_URL, headers=HEADERS, json=payload)
...     return response.json()


>>> print(
...     query(
...         payload={
...             "inputs": "A HTTP POST request is used to ",
...             "parameters": {"temperature": 0.8, "max_new_tokens": 50, "seed": 42},
...         }
...     )
... )

[{'generated_text': 'A HTTP POST request is used to send data to a web server.\n\n# Example\n```javascript\npost("localhost:3000", {foo: "bar"})\n  .then(console.log => console.log(\'success\'))\n```\n\n'}]

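Nice! The API responded with a continuation of our input prompt. But you might be wondering... how did the API know what to do with the payload? And how do I, as a user, know which parameters can be passed for a given model?

Behind the scenes, the Inference API dynamically loads the requested model onto shared compute infrastructure to serve predictions. When the model is loaded, the Serverless Inference API uses the pipeline_tag specified in the model card to determine the appropriate inference task. You can refer to the corresponding task or pipeline documentation to find the allowed parameters.

If the requested model isn't already loaded into memory at the time of the request (which depends on recent requests for that model), the Serverless Inference API will initially return a 503 response before it can respond successfully with the prediction. Wait a moment to give the model time to spin up, then try again. You can also check which models are loaded and available at any time using InferenceClient().list_deployed_models().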
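The wait-and-retry advice above can be sketched as a small loop. This is a minimal illustration, not the library's own retry logic: query_with_retry, its parameters, and the default wait time are all hypothetical, and the sender is passed in as a callable so the example runs offline with a fake.

```python
import time
from typing import Any, Callable, Tuple


def query_with_retry(
    send: Callable[[dict], Tuple[int, Any]],
    payload: dict,
    max_attempts: int = 5,
    wait_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send(payload) until it stops returning HTTP 503.

    send is any callable returning (status_code, parsed_body). With the
    requests library it could wrap requests.post(...) and return
    (response.status_code, response.json()). Any non-503 body, including
    other error bodies, is returned to the caller unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send(payload)
        if status != 503:
            return body
        if attempt < max_attempts:
            # The model is probably still loading: wait, then try again.
            sleep(wait_seconds)
    raise TimeoutError(f"model still unavailable after {max_attempts} attempts")


# Offline demonstration with a fake sender: two 503s, then a success.
_responses = iter(
    [
        (503, {"error": "loading"}),
        (503, {"error": "loading"}),
        (200, [{"generated_text": "ok"}]),
    ]
)
result = query_with_retry(lambda p: next(_responses), {"inputs": "hi"}, sleep=lambda s: None)
print(result)
```

Passing the sender and the sleep function as arguments keeps the loop independent of any particular HTTP client and makes it easy to test without waiting.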

With the huggingface_hub Python library

To send requests in Python, you can use the InferenceClient, a convenient utility in the huggingface_hub Python library that makes it easy to call the Serverless Inference API.

>>> from huggingface_hub import InferenceClient

>>> client = InferenceClient()
>>> response = client.text_generation(
...     prompt="A HTTP POST request is used to ",
...     model="codellama/CodeLlama-7b-hf",
...     temperature=0.8,
...     max_new_tokens=50,
...     seed=42,
...     return_full_text=True,
... )
>>> print(response)

A HTTP POST request is used to send data to a web server.

# Example
```javascript
post("localhost:3000", {foo: "bar"})
  .then(console.log => console.log('success'))
```

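Notice that with the InferenceClient, we specify only the model ID and pass the parameters directly to the text_generation() method. We can easily inspect the function signature to see more details about how to use the task and its allowed parameters.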

# uncomment the following line to see the function signature
# help(client.text_generation)

Beyond Python, you can also use JavaScript to integrate inference calls into your JS or Node apps. Take a look at huggingface.js to get started.

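Applications

Now that we know how the Serverless Inference API works, let's take it for a spin and learn a few tricks along the way.

1. Generating text with open LLMs

Text generation is a very common use case. However, interacting with open LLMs involves some subtleties that are important to understand in order to avoid silent performance degradation. When it comes to text generation, the underlying language model may come in a couple of different flavors: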

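Base models: plain, pre-trained language models such as codellama/CodeLlama-7b-hf or meta-llama/Meta-Llama-3-8B. These models are good at continuing generation from a provided prompt (as we saw in the example above). However, they have not been fine-tuned for conversational use such as answering questions.
Instruction-tuned models: trained in a multi-task manner to follow a broad range of instructions such as "Write me a recipe for chocolate cake". Models like meta-llama/Meta-Llama-3-8B-Instruct or mistralai/Mistral-7B-Instruct-v0.3 are trained this way. Instruction-tuned models respond better to instructions than base models do. Often, these models are also fine-tuned for multi-turn chat dialogs, making them a good fit for conversational use cases.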

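These subtle differences matter because they affect how we should query a particular model. Instruct models are trained with model-specific chat templates, so you need to check the format the model expects and replicate it in your queries.

For example, meta-llama/Meta-Llama-3-8B-Instruct uses the following prompt structure to delineate system, user, and assistant dialog turns: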

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>
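To make the structure above concrete, here is a hand-rolled sketch that assembles a prompt in that layout from a list of messages. The function format_llama3_prompt is hypothetical and mirrors only the structure displayed above; the chat template shipped with the model's tokenizer remains the authoritative source for the format.

```python
def format_llama3_prompt(messages, add_generation_prompt=True):
    """Hand-rolled sketch of the Llama 3 Instruct prompt layout shown above.

    messages is a list of {"role": ..., "content": ...} dicts.
    """
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn token.
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply next.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


example = format_llama3_prompt(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello."},
    ]
)
print(example)
```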

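Special tokens and prompt formats vary from model to model. To make sure we're using the correct format, we can rely on the model's chat template via its tokenizer, as shown below.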

>>> from transformers import AutoTokenizer

>>> # define the system and user messages
>>> system_input = "You are an expert prompt engineer with artistic flair."
>>> user_input = "Write a concise prompt for a fun image containing a llama and a cookbook. Only return the prompt."
>>> messages = [
...     {"role": "system", "content": system_input},
...     {"role": "user", "content": user_input},
... ]

>>> # load the model and tokenizer
>>> model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)

>>> # apply the chat template to the messages
>>> prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> print(f"\nPROMPT:\n-----\n\n{prompt}")

PROMPT:
-----

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert prompt engineer with artistic flair.<|eot_id|><|start_header_id|>user<|end_header_id|>

Write a concise prompt for a fun image containing a llama and a cookbook. Only return the prompt.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

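Notice how the apply_chat_template() method turned the familiar list of messages into the correctly formatted string that the model expects. We can pass this formatted string to the Serverless Inference API's text_generation method.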

>>> llm_response = client.text_generation(prompt, model=model_id, max_new_tokens=250, seed=42)
>>> print(llm_response)

"A whimsical illustration of a llama proudly holding a cookbook, with a sassy expression and a sprinkle of flour on its nose, surrounded by a colorful kitchen backdrop with utensils and ingredients scattered about, as if the llama is about to whip up a culinary masterpiece."

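Querying an LLM without adhering to the model's prompt template will not produce any outright errors! However, it will result in poor-quality outputs. Take a look at what happens when we pass the same system and user input without formatting it according to the chat template.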

>>> out = client.text_generation(system_input + " " + user_input, model=model_id, max_new_tokens=250, seed=42)
>>> print(out)

Do not write the... 1 answer below »

You are an expert prompt engineer with artistic flair. Write a concise prompt for a fun image containing a llama and a cookbook. Only return the prompt. Do not write the image description.

A llama is sitting at a kitchen table, surrounded by cookbooks and utensils, with a cookbook open in front of it. The llama is wearing a chef's hat and holding a spatula. The cookbook is titled "Llama's Favorite Recipes" and has a llama on the cover. The llama is surrounded by a warm, golden light, and the kitchen is filled with the aroma of freshly baked bread. The llama is smiling and looking directly at the viewer, as if inviting them to join in the cooking fun. The image should be colorful, whimsical, and full of texture and detail. The llama should be the main focus of the image, and the cookbook should be prominently displayed. The background should be a warm, earthy color, such as terracotta or sienna. The overall mood of the image should be playful, inviting, and joyful. 1 answer below »

You are an expert prompt engineer with artistic flair. Write a concise prompt for a fun image containing a llama and a

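Yikes! The LLM hallucinated a nonsensical intro, repeated the prompt unexpectedly, and failed to stay concise. To simplify prompting and make sure the proper chat template is used, the InferenceClient also offers a chat_completion method that abstracts away the chat_template details. This lets you simply pass a list of messages: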

>>> for token in client.chat_completion(messages, model=model_id, max_tokens=250, stream=True, seed=42):
...     print(token.choices[0].delta.content)

"A
 whims
ical
 illustration
 of
 a
 fashion
ably
 dressed
 llama
 proudly
 holding
 a
 worn
,
 vintage
 cookbook
,
 with
 a
 warm
 cup
 of
 tea
 and
 a
 few
 freshly
 baked
 treats
 scattered
 around
,
 set
 against
 a
 cozy
 background
 of
 rustic
 wood
 and
 blo
oming
 flowers
."

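Streaming

In the example above, we also set stream=True to enable streaming text from the endpoint. To learn more about functionality like this and about best practices when querying LLMs, we recommend reading the supporting resources.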

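2. Creating images with Stable Diffusion

The Serverless Inference API can be used for many different tasks. Here we'll use it to generate images with Stable Diffusion.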

>>> image = client.text_to_image(
...     prompt=llm_response,
...     model="stabilityai/stable-diffusion-xl-base-1.0",
...     guidance_scale=8,
...     seed=42,
... )

>>> display(image.resize((image.width // 2, image.height // 2)))
>>> print("PROMPT: ", llm_response)

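Caching

The InferenceClient caches API responses by default. That means if you query the API with the same payload multiple times, you'll see that the API returns exactly the same result. Take a look: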

>>> image = client.text_to_image(
...     prompt=llm_response,
...     model="stabilityai/stable-diffusion-xl-base-1.0",
...     guidance_scale=8,
...     seed=42,
... )

>>> display(image.resize((image.width // 2, image.height // 2)))
>>> print("PROMPT: ", llm_response)

To force a different response each time, we can use an HTTP header to have the client ignore the cache and run a new generation: x-use-cache: 0.

>>> # turn caching off
>>> client.headers["x-use-cache"] = "0"

>>> # generate a new image with the same prompt
>>> image = client.text_to_image(
...     prompt=llm_response,
...     model="stabilityai/stable-diffusion-xl-base-1.0",
...     guidance_scale=8,
...     seed=42,
... )

>>> display(image.resize((image.width // 2, image.height // 2)))
>>> print("PROMPT: ", llm_response)
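When calling the API over raw HTTP, as in the requests example earlier, the same header would go into the request headers alongside the authorization token. The helper below is a hypothetical sketch (build_headers is not a library function, and "hf_xxx" is a placeholder token); it reuses the header name and "0" value from the caching example above.

```python
def build_headers(token, use_cache=True):
    """Build request headers for a raw HTTP call to the API.

    build_headers is a hypothetical helper; the header name and the "0"
    value are the ones used in the caching example above.
    """
    headers = {"Authorization": f"Bearer {token}"}
    if not use_cache:
        # Ask the API to skip its cache and run a fresh generation.
        headers["x-use-cache"] = "0"
    return headers


print(build_headers("hf_xxx", use_cache=False))
```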

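3. Reasoning over images with Idefics2

Vision language models (VLMs) can take both text and images as input simultaneously and produce text as output. This lets them tackle many tasks, from visual question answering to image captioning. Let's use the Serverless Inference API to query Idefics2, a powerful 8B-parameter VLM, and have it write us a poem about our newly generated image.

We first need to convert our PIL image to a base64-encoded string so that we can send it to the model over the network.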

import base64
from io import BytesIO


def pil_image_to_base64(image):
    buffered = BytesIO()
    image.save(buffered, format="JPEG")
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return img_str


image_b64 = pil_image_to_base64(image)
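One practical consequence of base64 encoding is that the payload grows: every 3 bytes of input become 4 characters of output. The standalone sketch below uses arbitrary stand-in bytes instead of a real JPEG to show the round trip and the size increase.

```python
import base64

raw = bytes(range(256)) * 4  # stand-in for JPEG bytes (1024 bytes)
encoded = base64.b64encode(raw).decode("utf-8")

# Decoding restores the original bytes exactly.
assert base64.b64decode(encoded) == raw

# Every 3 input bytes become 4 output characters (with padding).
expected_len = 4 * ((len(raw) + 2) // 3)
print(len(raw), len(encoded), expected_len)
```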

Then, we need to properly format our text + image prompt using a chat template. See the Idefics2 model card for specific details on prompt formatting.

from transformers import AutoProcessor

# load the processor
vlm_model_id = "HuggingFaceM4/idefics2-8b-chatty"
processor = AutoProcessor.from_pretrained(vlm_model_id)

# define the user messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Write a short limerick about this image."},
        ],
    },
]

# apply the chat template to the messages
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# add the base64 encoded image to the prompt
image_input = f"data:image/jpeg;base64,{image_b64}"
image_input = f"![]({image_input})"
prompt = prompt.replace("<image>", image_input)
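The substitution step above can be wrapped in a small function that also guards against a missing placeholder. This is a sketch under stated assumptions: embed_image is a hypothetical helper, and the demo prompt text is made up for illustration, since the real prompt comes from the processor's chat template.

```python
import base64


def embed_image(prompt, image_bytes, placeholder="<image>", mime="image/jpeg"):
    """Replace the image placeholder with a markdown image holding a data URI."""
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    data_uri = f"data:{mime};base64,{image_b64}"
    if placeholder not in prompt:
        raise ValueError(f"prompt has no {placeholder!r} placeholder")
    return prompt.replace(placeholder, f"![]({data_uri})")


# Hypothetical prompt text; the real one comes from the processor's chat template.
demo_prompt = "User:<image>Write a short limerick about this image.\nAssistant:"
print(embed_image(demo_prompt, b"abc"))
```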

Then, finally, we call the Serverless API to get a prediction. In our case, a fun limerick about our generated image!

>>> limerick = client.text_generation(prompt, model=vlm_model_id, max_new_tokens=200, seed=42)
>>> print(limerick)

In the heart of a kitchen, so bright and so clean,
Lived a llama named Lulu, quite the culinary queen.
With a book in her hand, she'd read and she'd cook,
Her recipes were magic, her skills were so nook.
In her world, there was no room for defeat,
For Lulu, the kitchen was where she'd meet.

4. Generating speech from text

To finish up, let's use a transformer-based text-to-audio model called Bark to generate an audible voiceover for our poem.

tts_model_id = "suno/bark"
speech_out = client.text_to_speech(text=limerick, model=tts_model_id)

>>> from IPython.display import Audio

>>> display(Audio(speech_out, rate=24000))
>>> print(limerick)

In the heart of a kitchen, so bright and so clean,
Lived a llama named Lulu, quite the culinary queen.
With a book in her hand, she'd read and she'd cook,
Her recipes were magic, her skills were so nook.
In her world, there was no room for defeat,
For Lulu, the kitchen was where she'd meet.

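Next steps

That's it! In this notebook, we learned how to use the Serverless Inference API to query a variety of powerful transformer models. We've only scratched the surface of what you can do, and we recommend checking out the docs to learn more about what's possible.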
