Consuming Text Generation Inference

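There are many ways to consume a Text Generation Inference (TGI) server in your applications. After launching the server, you can send POST requests to the Messages API route /v1/chat/completions to get results from it. To have TGI return a stream of tokens, pass "stream": true in the request body.

For more information on the API, consult the OpenAPI documentation of text-generation-inference available here.

You can make the requests using any tool of your preference, such as curl, Python, or TypeScript. For an end-to-end experience, we've open-sourced ChatUI, a chat interface for open-access models.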

curl

After a successful server launch, you can query the model using the v1/chat/completions route to get responses that comply with the OpenAI Chat Completion spec:

curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
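With "stream": true, the route answers with a stream of server-sent events rather than a single JSON body. The sketch below shows one way to pull the generated text out of such a stream in Python. It assumes each event arrives as a "data: {...}" line holding an OpenAI-style chunk, and that the stream may end with a "data: [DONE]" sentinel. iter_stream_text is a hypothetical helper name, not part of TGI.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        # Skip keep-alive blanks and anything that is not a data event.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        # Assumption: the stream may close with a "[DONE]" sentinel.
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample events, shaped like the assumption above.
sample_events = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": "Deep"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": " learning"}}]}',
    "data: [DONE]",
]
stream_result = "".join(iter_stream_text(sample_events))
print(stream_result)
```

In practice the lines would come from the HTTP response body, read line by line, instead of a hard-coded list.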

For non-chat use cases, you can also use the /generate and /generate_stream routes.

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
  "inputs":"What is Deep Learning?",
  "parameters":{
    "max_new_tokens":20
  }
}' \
    -H 'Content-Type: application/json'
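The same /generate call can be made from Python without extra dependencies. The sketch below uses only the standard library. build_generate_request and generate are hypothetical helper names, and the default URL assumes the local server from the curl example. Reading generated_text from the reply is an assumption about the response shape that should be checked against the OpenAPI documentation.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=20,
                           base_url="http://127.0.0.1:8080"):
    """Build the POST request that mirrors the curl call above."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request; this needs a running TGI server."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        reply = json.load(resp)
    # Assumption: the reply is a JSON object with a "generated_text" field.
    return reply["generated_text"]


# Usage, once a server is listening on the assumed address:
# print(generate("What is Deep Learning?"))
```

Keeping request construction separate from sending makes the payload easy to inspect before any network traffic happens.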

Python

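Inference Client

huggingface_hub is a Python library to interact with the Hugging Face Hub, including its endpoints. It provides a high-level class, huggingface_hub.InferenceClient, which makes it easy to make calls to TGI's Messages API. InferenceClient also takes care of parameter validation and provides a simple-to-use interface.

Install the huggingface_hub package via pip.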

pip install huggingface_hub

You can now use InferenceClient the exact same way you would use the OpenAI client in Python:

from huggingface_hub import InferenceClient

client = InferenceClient(
    # Assumes TGI is served locally on port 8080, as in the curl examples.
    base_url="http://localhost:8080/v1/",
)

output = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)

You can check out more details about OpenAI compatibility here.

There is also an async version of the client, AsyncInferenceClient, based on asyncio and aiohttp. You can find its docs here.
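A minimal sketch of the async variant follows. It assumes AsyncInferenceClient mirrors the synchronous chat.completions.create call and, with stream=True, returns an async iterator of chunks once awaited. collect_stream is a hypothetical helper, and the library import sits inside main so the helper can be exercised without a server or the package installed.

```python
import asyncio


async def collect_stream(chunks):
    """Concatenate the text deltas from an async stream of chat chunks."""
    parts = []
    async for chunk in chunks:
        text = chunk.choices[0].delta.content
        if text:  # some chunks may carry no text
            parts.append(text)
    return "".join(parts)


async def main():
    # Imported here because it needs `pip install huggingface_hub`.
    from huggingface_hub import AsyncInferenceClient

    # Assumption: TGI is served locally on port 8080.
    client = AsyncInferenceClient(base_url="http://localhost:8080/v1/")
    stream = await client.chat.completions.create(
        model="tgi",
        messages=[{"role": "user", "content": "Count to 10"}],
        stream=True,
        max_tokens=1024,
    )
    print(await collect_stream(stream))


# asyncio.run(main())  # uncomment with a TGI server running
```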

OpenAI Client

You can directly use the OpenAI Python or JS clients to interact with TGI.

Install the OpenAI Python package via pip.

pip install openai

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    # Assumes TGI is served locally on port 8080, as in the curl examples.
    base_url="http://localhost:8080/v1/",
    api_key="-"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

# iterate and print stream
for message in chat_completion:
    print(message)

UI

Gradio

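Gradio is a Python library that helps you build web applications for your machine learning models with a few lines of code. It has a ChatInterface wrapper that helps create neat UIs for chatbots. Let's take a look at how to create a chatbot with streaming mode using TGI and Gradio. Let's install Gradio and the Hub Python library first.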

pip install huggingface-hub gradio

Assume you are serving your model on port 8080; we will query it through InferenceClient.

import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://127.0.0.1:8080")

def inference(message, history):
    partial_message = ""
    output = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": message},
        ],
        stream=True,
        max_tokens=1024,
    )

    for chunk in output:
        # delta.content can be None on some chunks; treat it as empty text
        partial_message += chunk.choices[0].delta.content or ""
        yield partial_message

gr.ChatInterface(
    inference,
    type="messages",
    description="This is the demo for Gradio UI consuming TGI endpoint.",
    title="Gradio 🤝 TGI",
    examples=["Are tomatoes vegetables?"],
).queue().launch()
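The inference function above sends only the latest message, so the model does not see earlier turns. Below is a minimal sketch of folding the conversation into the request. It assumes that with type="messages" Gradio passes history as a list of dicts with "role" and "content" keys. build_messages is a hypothetical helper, not part of Gradio or TGI.

```python
def build_messages(message, history,
                   system_prompt="You are a helpful assistant."):
    """Return system prompt + earlier turns + the new user message."""
    messages = [{"role": "system", "content": system_prompt}]
    for turn in history:
        # Keep only the keys the chat route expects; drop any UI metadata.
        messages.append({"role": turn["role"], "content": turn["content"]})
    messages.append({"role": "user", "content": message})
    return messages


example_history = [
    {"role": "user", "content": "Are tomatoes vegetables?"},
    {"role": "assistant", "content": "Botanically, they are fruits."},
]
example_messages = build_messages("And cucumbers?", example_history)
print(len(example_messages))
```

Inside inference, the result would be passed as messages=build_messages(message, history) in place of the hard-coded two-item list.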

You can check out the UI and try the demo directly here 👇

You can read more about how to customize a ChatInterface here.

ChatUI

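ChatUI is an open-source interface built for consuming LLMs. It offers many customization options, such as web search with SERP API and more. ChatUI can automatically consume the TGI server and even provides an option to switch between different TGI endpoints. You can try it out at Hugging Chat, or use the ChatUI Docker Space to deploy your own Hugging Chat to Spaces.

To serve both ChatUI and TGI in the same environment, simply add your own endpoints to the MODELS variable in the .env.local file inside the chat-ui repository. Provide the endpoints pointing to where TGI is served.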

{
// rest of the model config here
"endpoints": [{"url": "https://HOST:PORT/generate_stream"}]
}

