Migrating from OpenAI to Open LLMs Using TGI's Messages API

Author: Andrew Reed

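This notebook demonstrates how to transition from OpenAI models to open LLMs without refactoring any existing code.

Text Generation Inference (TGI) now offers a Messages API, making it directly compatible with the OpenAI Chat Completions API. This means that any existing script that uses OpenAI models (via the OpenAI client library or third-party tools such as LangChain or LlamaIndex) can be pointed at any open LLM running on a TGI endpoint instead.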
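To make the compatibility claim concrete, here is a minimal sketch of the raw HTTP request that an OpenAI-style client would send to a TGI endpoint. It only builds the request and does not send it. The endpoint URL and token are placeholders, and the helper name `build_chat_request` is ours, not part of any library.

```python
import json
import urllib.request


def build_chat_request(endpoint_url: str, token: str, messages: list) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request for a TGI endpoint."""
    # The Messages API is served under the OpenAI-compatible route.
    url = endpoint_url.rstrip("/") + "/v1/chat/completions"
    payload = {"model": "tgi", "messages": messages, "stream": False, "max_tokens": 100}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    "https://example.endpoints.huggingface.cloud",  # placeholder URL (assumption)
    "hf_xxx",  # placeholder token (assumption)
    [{"role": "user", "content": "Why is open-source software important?"}],
)
print(request.full_url)
```

Because the payload has the same shape as an OpenAI chat completion request, client libraries only need a different base URL and API key.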

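This lets you quickly test open models and benefit from the advantages they offer, for example:

  • Complete control and transparency over models and data
  • No more worrying about rate limits
  • The ability to fully customize systems to your specific needs

In this notebook, we'll show you how to:

  1. Create an Inference Endpoint to deploy a model with TGI
  2. Query the Inference Endpoint with OpenAI client libraries
  3. Integrate the endpoint with LangChain and LlamaIndex workflows

Let's dive in!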

Setup

First we need to install dependencies and set an HF API key.

!pip install --upgrade -q huggingface_hub langchain langchain-community langchainhub langchain-openai llama-index chromadb bs4 sentence_transformers torch torchvision torchaudio llama-index-llms-openai-like llama-index-embeddings-huggingface
import os
import getpass

# enter API key
os.environ["HF_TOKEN"] = HF_API_KEY = getpass.getpass()

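1. Create an Inference Endpoint

To get started, let's deploy Nous-Hermes-2-Mixtral-8x7B-DPO, a fine-tuned Mixtral model, to Inference Endpoints using TGI.

We can deploy the model in a few clicks from the UI, or use the huggingface_hub Python library to create and manage Inference Endpoints programmatically.

Here we use the Hub library, specifying an endpoint name and model repository along with the text-generation task. In this example we use the protected type, so access to the deployed model requires a valid Hugging Face token. We also need to configure the hardware requirements: vendor, region, accelerator, instance type, and size. You can list the available resource options through the Inference Endpoints API, and the model catalog shows recommended configurations for selected models.

Note: You may need to request a quota upgrade by sending an email to api-enterprise@huggingface.co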
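The env block passed through custom_image in this section sets several token limits that depend on one another. Below is a small sanity check we might run before creating the endpoint. The three relationships it enforces reflect our understanding of TGI's startup validation and should be treated as assumptions to confirm against the TGI documentation. The helper name `check_tgi_env` is hypothetical.

```python
def check_tgi_env(env: dict) -> list:
    """Return a list of human-readable problems found in TGI token-limit settings."""
    max_input = int(env["MAX_INPUT_LENGTH"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    prefill = int(env["MAX_BATCH_PREFILL_TOKENS"])
    batch_total = int(env["MAX_BATCH_TOTAL_TOKENS"])

    problems = []
    # Assumption: the prompt budget must leave room for generated tokens.
    if max_input >= max_total:
        problems.append("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    # Assumption: a prefill batch must be able to hold at least one full prompt.
    if prefill < max_input:
        problems.append("MAX_BATCH_PREFILL_TOKENS must be at least MAX_INPUT_LENGTH")
    # Assumption: a batch must be able to hold at least one full sequence.
    if batch_total < max_total:
        problems.append("MAX_BATCH_TOTAL_TOKENS must be at least MAX_TOTAL_TOKENS")
    return problems


# Values taken from the endpoint configuration used in this notebook.
env = {
    "MAX_INPUT_LENGTH": "4096",
    "MAX_BATCH_PREFILL_TOKENS": "4096",
    "MAX_TOTAL_TOKENS": "32000",
    "MAX_BATCH_TOTAL_TOKENS": "1024000",
}
print(check_tgi_env(env))
```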

>>> from huggingface_hub import create_inference_endpoint

>>> endpoint = create_inference_endpoint(
...     "nous-hermes-2-mixtral-8x7b-demo",
...     repository="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
...     framework="pytorch",
...     task="text-generation",
...     accelerator="gpu",
...     vendor="aws",
...     region="us-east-1",
...     type="protected",
...     instance_type="p4de",
...     instance_size="2xlarge",
...     custom_image={
...         "health_route": "/health",
...         "env": {
...             "MAX_INPUT_LENGTH": "4096",
...             "MAX_BATCH_PREFILL_TOKENS": "4096",
...             "MAX_TOTAL_TOKENS": "32000",
...             "MAX_BATCH_TOTAL_TOKENS": "1024000",
...             "MODEL_ID": "/repository",
...         },
...         "url": "ghcr.io/huggingface/text-generation-inference:sha-1734540",  # must be >= 1.4.0
...     },
... )

>>> endpoint.wait()
>>> print(endpoint.status)
running

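Our deployment takes a few minutes to spin up. We can use the .wait() utility to block the running thread until the endpoint reaches a final "running" state. Once it is running, we can confirm its status and try it out in the UI Playground.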
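To illustrate what blocking until the endpoint is running involves, here is a generic polling loop with a timeout. This is a sketch of the idea only, not the huggingface_hub implementation. The status strings and the `wait_until_running` helper are assumptions made for the example, and the status source is simulated so the snippet runs without a real endpoint.

```python
import time
from typing import Callable


def wait_until_running(
    get_status: Callable[[], str],
    timeout: float = 600.0,
    refresh_every: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns "running"; raise on failure or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "running":
            return status
        if status == "failed":
            raise RuntimeError("endpoint deployment failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint still '{status}' after {timeout} seconds")
        sleep(refresh_every)


# Simulated status sequence standing in for a real endpoint (illustration only).
statuses = iter(["pending", "initializing", "running"])
final_status = wait_until_running(lambda: next(statuses), sleep=lambda seconds: None)
print(final_status)
```

Setting a timeout is worth considering in scripts, so that a deployment stuck in a pending state does not block forever.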

[Image: Inference Endpoints UI overview]

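Great, we now have a working endpoint!

Note: When deploying with huggingface_hub, your endpoint scales to zero after 15 minutes of idle time by default, to optimize cost during periods of inactivity. See the Hub Python library documentation for all the functionality available to manage your endpoint's lifecycle.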

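2. Query the Inference Endpoint with OpenAI Client Libraries

As mentioned above, since our model is hosted with TGI, it now supports the Messages API, which means we can query it directly with the familiar OpenAI client libraries.

With the Python client

The example below shows how to make this transition using the OpenAI Python library. Simply replace <ENDPOINT_URL> with your endpoint URL (be sure to include the v1/ suffix) and populate the <HF_API_KEY> field with a valid Hugging Face user token. The <ENDPOINT_URL> can be obtained from the Inference Endpoints UI, or from the endpoint object we created above, via endpoint.url.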
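Since the v1/ suffix is easy to forget or to add twice, a small helper can normalize the URL before it is handed to a client. The function name `openai_base_url` and the example URL are ours and purely illustrative.

```python
def openai_base_url(endpoint_url: str) -> str:
    """Return the endpoint URL with exactly one trailing "/v1/" suffix."""
    trimmed = endpoint_url.rstrip("/")
    if trimmed.endswith("/v1"):
        # The suffix is already present; only restore the trailing slash.
        return trimmed + "/"
    return trimmed + "/v1/"


print(openai_base_url("https://example.endpoints.huggingface.cloud"))
```

Plain string handling is used here rather than os.path.join, which is meant for filesystem paths and uses backslashes on Windows.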

We can then use the client as usual, passing a list of messages to stream responses from our Inference Endpoint.

>>> from openai import OpenAI

>>> BASE_URL = endpoint.url

>>> # init the client but point it to TGI
>>> client = OpenAI(
...     base_url=BASE_URL.rstrip("/") + "/v1/",  # build the URL with string ops; os.path.join is for file paths
...     api_key=HF_API_KEY,
... )
>>> chat_completion = client.chat.completions.create(
...     model="tgi",
...     messages=[
...         {"role": "system", "content": "You are a helpful assistant."},
...         {"role": "user", "content": "Why is open-source software important?"},
...     ],
...     stream=True,
...     max_tokens=500,
... )

>>> # iterate and print stream
>>> for message in chat_completion:
...     print(message.choices[0].delta.content or "", end="")  # content can be None on some chunks
Open-source software is important due to a number of reasons, including:

1. Collaboration: The collaborative nature of open-source software allows developers from around the world to work together, share their ideas and improve the code. This often results in faster progress and better software.

2. Transparency: With open-source software, the code is publicly available, making it easy to see exactly how the software functions, and allowing users to determine if there are any security vulnerabilities.

3. Customization: Being able to access the code also allows users to customize the software to better suit their needs. This makes open-source software incredibly versatile, as users can tweak it to suit their specific use case.

4. Quality: Open-source software is often developed by large communities of dedicated developers, who work together to improve the software. This results in a higher level of quality than might be found in proprietary software.

5. Cost: Open-source software is often provided free of charge, which makes it accessible to a wider range of users. This can be especially important for organizations with limited budgets for software.

6. Shared Benefit: By sharing the code of open-source software, everyone can benefit from the hard work of the developers. This contributes to the overall advancement of technology, as users and developers work together to improve and build upon the software.

In summary, open-source software provides a collaborative platform that leads to high-quality, customizable, and transparent software, all available at little or no cost, benefiting both individuals and the technology community as a whole.<|im_end|>
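If the full response is needed as a single string rather than printed piece by piece, the streamed deltas can be accumulated. The sketch below assumes chunks shaped like the ones iterated over above (each with choices[0].delta.content) and uses small stand-in objects so it runs without a live endpoint. `collect_stream` and `fake_chunk` are hypothetical helper names.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        content = chunk.choices[0].delta.content
        if content:  # skip None or empty deltas
            parts.append(content)
    return "".join(parts)


def fake_chunk(content):
    """Build a stand-in object with the same attribute layout as a stream chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=content))])


text = collect_stream([fake_chunk("Open-source "), fake_chunk("matters."), fake_chunk(None)])
print(text)
```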

Behind the scenes, TGI's Messages API automatically converts the list of messages into the model's required instruction format using its chat template.
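As a rough picture of that conversion, the sketch below renders a message list in the ChatML style, which is consistent with the <|im_end|> marker visible at the end of the streamed output. It is an approximation for illustration only: the real template ships with the model's tokenizer configuration and may differ in details, and `render_chatml` is a hypothetical helper, not a TGI function.

```python
def render_chatml(messages: list, add_generation_prompt: bool = True) -> str:
    """Render chat messages as a ChatML-style prompt string."""
    prompt = ""
    for message in messages:
        prompt += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        prompt += "<|im_start|>assistant\n"
    return prompt


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is open-source software important?"},
]
print(render_chatml(messages))
```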

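Note: Some OpenAI features, such as function calling, are not compatible with TGI. Currently, the Messages API supports the following chat completion parameters:

  • stream
  • max_new_tokens
  • frequency_penalty
  • logprobs
  • seed
  • temperature
  • top_p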

With the JavaScript client

Here is the same streaming example as above, this time using the OpenAI JavaScript/TypeScript library.

import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "<ENDPOINT_URL>" + "/v1/", // replace with your endpoint url
  apiKey: "<HF_API_TOKEN>", // replace with your token
});

async function main() {
  const stream = await openai.chat.completions.create({
    model: "tgi",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Why is open-source software important?" },
    ],
    stream: true,
    max_tokens: 500,
  });
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content || "");
  }
}

main();

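3. Integrate with LangChain and LlamaIndex

Now, let's see how to use this newly created endpoint with popular RAG frameworks such as LangChain and LlamaIndex.

How to use with LangChain

To use it in LangChain, simply create an instance of ChatOpenAI and pass your <ENDPOINT_URL> and <HF_API_TOKEN> as follows: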

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model_name="tgi",
    openai_api_key=HF_API_KEY,
    openai_api_base=BASE_URL.rstrip("/") + "/v1/",  # build the URL with string ops; os.path.join is for file paths
)
llm.invoke("Why is open-source software important?")

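We are able to use the same ChatOpenAI class that we would have used with OpenAI models. This allows all previous code to work with our endpoint by changing just one line of code.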
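One way to keep that switch in a single place is to build the constructor arguments from configuration. The sketch below only returns a dictionary of keyword arguments (the same ones used in the ChatOpenAI call above) and does not import LangChain, so it runs on its own. The helper name `chat_model_kwargs`, the example model name "gpt-3.5-turbo", and the placeholder URL and keys are assumptions for illustration.

```python
from typing import Optional


def chat_model_kwargs(api_key: str, tgi_base_url: Optional[str] = None) -> dict:
    """Keyword arguments for ChatOpenAI: OpenAI by default, TGI when a base URL is given."""
    if tgi_base_url is None:
        # Example OpenAI configuration (the model name is only an illustration).
        return {"model_name": "gpt-3.5-turbo", "openai_api_key": api_key}
    return {
        "model_name": "tgi",
        "openai_api_key": api_key,
        "openai_api_base": tgi_base_url.rstrip("/") + "/v1/",
    }


openai_kwargs = chat_model_kwargs("sk-placeholder")
tgi_kwargs = chat_model_kwargs("hf_xxx", "https://example.endpoints.huggingface.cloud")
print(tgi_kwargs["openai_api_base"])
```

With LangChain installed, the result could be passed as ChatOpenAI(**tgi_kwargs), leaving the rest of the pipeline unchanged.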

Now, let's use our Mixtral model in a simple RAG pipeline to answer a question about the contents of a HF blog post.

from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load, chunk and index the contents of the blog
loader = WebBaseLoader(
    web_paths=("https://huggingface.co/blog/open-source-llms-as-agents",),
)
docs = loader.load()

# declare an HF embedding model
hf_embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")

text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=hf_embeddings)

# Retrieve and generate using the relevant snippets of the blog
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"]))) | prompt | llm | StrOutputParser()
)

rag_chain_with_source = RunnableParallel({"context": retriever, "question": RunnablePassthrough()}).assign(
    answer=rag_chain_from_docs
)

rag_chain_with_source.invoke("According to this article which open-source model is the best for an agent behaviour?")
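The chunk_size and chunk_overlap arguments passed to the text splitter above control how much text each chunk holds and how much neighbouring chunks share. The sketch below shows the idea with a plain sliding window over characters. It is a deliberately simplified stand-in: RecursiveCharacterTextSplitter also splits on separators such as paragraph and line breaks, so its chunk boundaries will differ. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap lowers the chance that a relevant sentence is cut in half at a chunk boundary, at the cost of indexing more text.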

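How to use with LlamaIndex

Similarly, you can also use a TGI endpoint in LlamaIndex. We'll use the OpenAILike class and instantiate it with a few additional arguments (namely is_local, is_function_calling_model, is_chat_model, and context_window).

Note: The context_window argument should match the value previously set for MAX_TOTAL_TOKENS on your endpoint.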

from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    model="tgi",
    api_key=HF_API_KEY,
    api_base=BASE_URL + "/v1/",
    is_chat_model=True,
    is_local=False,
    is_function_calling_model=False,
    context_window=4096,
)

llm.complete("Why is open-source software important?")

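We can now use it in a similar RAG pipeline. Keep in mind that the MAX_INPUT_LENGTH you chose earlier for your Inference Endpoint directly affects the number of retrieved chunks (similarity_top_k) the model can process.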
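A back-of-the-envelope calculation makes that relationship concrete. The sketch below estimates how many retrieved chunks fit into the input budget once some tokens are set aside for the question and the prompt template. The chunk size and the reserved amount are hypothetical numbers chosen for illustration, and `max_similarity_top_k` is our own helper name.

```python
def max_similarity_top_k(max_input_length: int, chunk_tokens: int, reserved_tokens: int) -> int:
    """Estimate how many chunks of chunk_tokens fit in the remaining input budget."""
    available = max_input_length - reserved_tokens
    if available <= 0:
        return 0
    return available // chunk_tokens


# 4096 matches the MAX_INPUT_LENGTH set on the endpoint; the other values are assumptions.
print(max_similarity_top_k(max_input_length=4096, chunk_tokens=1024, reserved_tokens=512))  # 3
```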

from llama_index.core import VectorStoreIndex, download_loader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.query_engine import CitationQueryEngine

SimpleWebPageReader = download_loader("SimpleWebPageReader")

documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["https://huggingface.co/blog/open-source-llms-as-agents"]
)

# Load embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

# Pass LLM to pipeline
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model, show_progress=True)

# Query the index
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=2,
)
response = query_engine.query("According to this article which open-source model is the best for an agent behaviour?")

response.response

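Wrap up

When you are done with your endpoint, you can either pause or delete it. This step can be completed through the UI, or programmatically as shown below.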

# pause our running endpoint
endpoint.pause()

# optionally delete
# endpoint.delete()
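In scripts, it can be easy to forget this step when an error interrupts the run. One option is to wrap the work in a context manager that always pauses the endpoint on exit. The sketch below relies only on the endpoint object having a pause() method, as used above. `paused_on_exit` and `FakeEndpoint` are hypothetical names, and the fake class exists only so the example runs without a real deployment.

```python
from contextlib import contextmanager


@contextmanager
def paused_on_exit(endpoint):
    """Yield the endpoint and pause it afterwards, even if the body raises."""
    try:
        yield endpoint
    finally:
        endpoint.pause()


class FakeEndpoint:
    """Stand-in for a deployed endpoint (illustration only)."""

    def __init__(self):
        self.status = "running"

    def pause(self):
        self.status = "paused"


fake = FakeEndpoint()
with paused_on_exit(fake):
    pass  # queries against the endpoint would go here
print(fake.status)
```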