Migrating from OpenAI to Open LLMs Using TGI's Messages API

Authored by: Andrew Reed

This notebook demonstrates how to switch from OpenAI models to open LLMs without refactoring any existing code.

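Text Generation Inference (TGI) now offers a Messages API, making it directly compatible with the OpenAI Chat Completion API. This means that any existing script that uses OpenAI models (via the OpenAI client library or third-party tools such as LangChain or LlamaIndex) can be pointed at any open LLM running on a TGI endpoint instead!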
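
To make the compatibility concrete, the sketch below assembles (without sending) the HTTP request that an OpenAI-style client would issue against a TGI endpoint. The endpoint URL and token are placeholders, and build_chat_request is a hypothetical helper written for this illustration, not part of any library. The /v1/chat/completions route and the bearer-token header follow the OpenAI Chat Completions convention that the Messages API mirrors.

```python
import json


def build_chat_request(base_url, token, messages, **params):
    """Assemble URL, headers, and JSON body for an OpenAI-style chat completion call.

    Hypothetical helper for illustration only; nothing is sent over the network.
    """
    url = base_url.rstrip("/") + "/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    # "tgi" is the placeholder model name used throughout this notebook.
    body = {"model": "tgi", "messages": messages, **params}
    return url, headers, json.dumps(body)


url, headers, payload = build_chat_request(
    "https://example.endpoints.huggingface.cloud",  # placeholder endpoint URL
    "hf_xxx",  # placeholder token
    [{"role": "user", "content": "Why is open-source software important?"}],
    max_tokens=500,
    stream=True,
)
print(url)
```

Because the request has the same shape for both providers, switching mostly comes down to changing the base URL and the token.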

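This lets you quickly test open models and benefit from the advantages they offer, such as:

  • Complete control and transparency over models and data
  • No more worrying about rate limits
  • The ability to fully customize systems according to your specific needs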

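In this notebook, we'll show you how to:

  1. Create an Inference Endpoint to deploy a model with TGI
  2. Query the Inference Endpoint with the OpenAI client libraries
  3. Integrate the endpoint with LangChain and LlamaIndex workflows

Let's dive in!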

Setup

First, we need to install the dependencies and set an HF API key.

!pip install --upgrade -q huggingface_hub langchain langchain-community langchainhub langchain-openai llama-index chromadb bs4 sentence_transformers torch torchvision torchaudio llama-index-llms-openai-like llama-index-embeddings-huggingface
import os
import getpass

# enter API key
os.environ["HF_TOKEN"] = HF_API_KEY = getpass.getpass()

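1. Create an Inference Endpoint

To get started, let's deploy Nous-Hermes-2-Mixtral-8x7B-DPO, a fine-tuned Mixtral model, to Inference Endpoints using TGI.

We can deploy the model in just a few clicks from the UI, or use the huggingface_hub Python library to create and manage Inference Endpoints programmatically.

We'll use the Hub library here, specifying an endpoint name and model repository along with the text-generation task. In this example we use the protected type, so access to the deployed model requires a valid Hugging Face token. We also need to configure the hardware requirements: vendor, region, accelerator, instance type, and size. The available resource options can be listed with an API call, and recommended configurations for select models are given in the catalog.

Note: You may need to request a quota upgrade by sending an email to api-enterprise@huggingface.co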

>>> from huggingface_hub import create_inference_endpoint

>>> endpoint = create_inference_endpoint(
...     "nous-hermes-2-mixtral-8x7b-demo",
...     repository="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
...     framework="pytorch",
...     task="text-generation",
...     accelerator="gpu",
...     vendor="aws",
...     region="us-east-1",
...     type="protected",
...     instance_type="p4de",
...     instance_size="2xlarge",
...     custom_image={
...         "health_route": "/health",
...         "env": {
...             "MAX_INPUT_LENGTH": "4096",
...             "MAX_BATCH_PREFILL_TOKENS": "4096",
...             "MAX_TOTAL_TOKENS": "32000",
...             "MAX_BATCH_TOTAL_TOKENS": "1024000",
...             "MODEL_ID": "/repository",
...         },
...         "url": "ghcr.io/huggingface/text-generation-inference:sha-1734540",  # must be >= 1.4.0
...     },
... )

>>> endpoint.wait()
>>> print(endpoint.status)
running

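Our deployment will take a few minutes to spin up. The .wait() utility blocks the running thread until the endpoint reaches a final "running" state. Once it is running, we can confirm its status and try it out in the UI Playground: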
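
Conceptually, a blocking wait is a polling loop. The sketch below is not how huggingface_hub implements .wait(); it uses a stand-in FakeEndpoint class and a hypothetical wait_until_running helper purely to illustrate polling a status until it reaches "running" or a timeout expires.

```python
import time


class FakeEndpoint:
    """Stand-in for a deployed endpoint; reports 'running' after a few polls."""

    def __init__(self, polls_until_ready=3):
        self._remaining = polls_until_ready
        self.status = "pending"

    def fetch(self):
        # Pretend to refresh the status from the server.
        self._remaining -= 1
        if self._remaining <= 0:
            self.status = "running"
        return self


def wait_until_running(endpoint, timeout=10.0, refresh_every=0.01):
    """Poll until the endpoint is running, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while endpoint.fetch().status != "running":
        if time.monotonic() > deadline:
            raise TimeoutError("endpoint did not reach 'running' in time")
        time.sleep(refresh_every)
    return endpoint


endpoint_stub = wait_until_running(FakeEndpoint())
print(endpoint_stub.status)
```

In huggingface_hub, .wait() also accepts an optional timeout argument, which is worth setting in scripts so that a failed deployment does not block forever.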

[Image: Inference Endpoints UI overview]

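Great, we now have a working endpoint!

Note: When deploying with huggingface_hub, your endpoint will scale to zero after 15 minutes of idle time by default, to optimize cost during periods of inactivity. Check out the Hub Python Library documentation to see all the functionality available for managing your endpoint lifecycle.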

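2. Query the Inference Endpoint with OpenAI Client Libraries

As mentioned above, since our model is hosted with TGI it now supports a Messages API, meaning we can query it directly using the familiar OpenAI client libraries.

With the Python client

The example below shows how to make this transition using the OpenAI Python library. Simply replace <ENDPOINT_URL> with your endpoint URL (be sure to include the v1/ suffix) and populate the <HF_API_KEY> field with a valid Hugging Face user token. The <ENDPOINT_URL> can be gathered from the Inference Endpoints UI, or from the endpoint object we created above with endpoint.url.

We can then use the client as usual, passing a list of messages to stream responses from our Inference Endpoint.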

>>> from openai import OpenAI

>>> BASE_URL = endpoint.url

>>> # init the client but point it to TGI
>>> client = OpenAI(
...     base_url=BASE_URL.rstrip("/") + "/v1/",  # build the URL with string ops, not os.path.join
...     api_key=HF_API_KEY,
... )
>>> chat_completion = client.chat.completions.create(
...     model="tgi",
...     messages=[
...         {"role": "system", "content": "You are a helpful assistant."},
...         {"role": "user", "content": "Why is open-source software important?"},
...     ],
...     stream=True,
...     max_tokens=500,
... )

>>> # iterate and print the streamed tokens as they arrive
>>> for message in chat_completion:
...     print(message.choices[0].delta.content or "", end="")
Open-source software is important due to a number of reasons, including:

1. Collaboration: The collaborative nature of open-source software allows developers from around the world to work together, share their ideas and improve the code. This often results in faster progress and better software.

2. Transparency: With open-source software, the code is publicly available, making it easy to see exactly how the software functions, and allowing users to determine if there are any security vulnerabilities.

3. Customization: Being able to access the code also allows users to customize the software to better suit their needs. This makes open-source software incredibly versatile, as users can tweak it to suit their specific use case.

4. Quality: Open-source software is often developed by large communities of dedicated developers, who work together to improve the software. This results in a higher level of quality than might be found in proprietary software.

5. Cost: Open-source software is often provided free of charge, which makes it accessible to a wider range of users. This can be especially important for organizations with limited budgets for software.

6. Shared Benefit: By sharing the code of open-source software, everyone can benefit from the hard work of the developers. This contributes to the overall advancement of technology, as users and developers work together to improve and build upon the software.

In summary, open-source software provides a collaborative platform that leads to high-quality, customizable, and transparent software, all available at little or no cost, benefiting both individuals and the technology community as a whole.<|im_end|>
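
When the full reply is needed as a single string (for logging or post-processing), the streamed deltas can be accumulated instead of printed. The sketch below uses SimpleNamespace objects as stand-ins for the chunks yielded by the client, so it runs without an endpoint; collect_stream and fake_chunk are hypothetical helpers. It also skips chunks whose delta.content is None, which OpenAI-style streams can emit, for example on the final chunk.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of an OpenAI-style streaming response."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices at all
        content = chunk.choices[0].delta.content
        if content:  # skip None or empty deltas
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in chunk with the same attribute layout as the client's."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk("Open-source "), fake_chunk("matters."), fake_chunk(None)]
print(collect_stream(fake_stream))  # Open-source matters.
```

With a real stream, the chat_completion object from the example above would take the place of fake_stream.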

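Behind the scenes, TGI's Messages API automatically converts the list of messages into the model's required instruction format using its chat template.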
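
As an illustration of what that conversion produces, the sketch below renders a message list by hand in the ChatML layout, the prompt format used by the Nous-Hermes-2 models (the <|im_end|> token at the end of the output above comes from it). This is a simplified stand-in written for this notebook: TGI applies the template stored with the model's tokenizer, and the exact whitespace and special tokens of that template may differ. to_chatml is a hypothetical helper.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render OpenAI-style messages as a ChatML prompt string (simplified)."""
    prompt = ""
    for message in messages:
        prompt += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        prompt += "<|im_start|>assistant\n"
    return prompt


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is open-source software important?"},
]
print(to_chatml(messages))
```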

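Note: Certain OpenAI features, such as function calling, are not compatible with TGI. Currently, the Messages API supports the following chat completion parameters: stream, max_new_tokens (passed as max_tokens in the client calls in this notebook), frequency_penalty, logprobs, seed, temperature, and top_p.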

With the JavaScript client

Here's the same streaming example as above, but using the OpenAI JavaScript/TypeScript library.

import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "<ENDPOINT_URL>" + "/v1/", // replace with your endpoint url
  apiKey: "<HF_API_TOKEN>", // replace with your token
});

async function main() {
  const stream = await openai.chat.completions.create({
    model: "tgi",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Why is open-source software important?" },
    ],
    stream: true,
    max_tokens: 500,
  });
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content || "");
  }
}

main();

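3. Integrate with LangChain and LlamaIndex

Now, let's see how to use this newly created endpoint with popular RAG frameworks such as LangChain and LlamaIndex.

How to use with LangChain

To use it in LangChain, simply create an instance of ChatOpenAI and pass your <ENDPOINT_URL> and <HF_API_TOKEN> as follows: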

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model_name="tgi",
    openai_api_key=HF_API_KEY,
    openai_api_base=BASE_URL.rstrip("/") + "/v1/",  # build the URL with string ops, not os.path.join
)
llm.invoke("Why is open-source software important?")

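We're able to directly use the same ChatOpenAI class that we would have used with the OpenAI models. This allows all previous code to work with our endpoint by changing just one line of code.

Let's now use our Mixtral model in a simple RAG pipeline to answer a question over the contents of a HF blog post.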

from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load, chunk and index the contents of the blog
loader = WebBaseLoader(
    web_paths=("https://huggingface.co/blog/open-source-llms-as-agents",),
)
docs = loader.load()

# declare an HF embedding model
hf_embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")

text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=hf_embeddings)

# Retrieve and generate using the relevant snippets of the blog
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"]))) | prompt | llm | StrOutputParser()
)

rag_chain_with_source = RunnableParallel({"context": retriever, "question": RunnablePassthrough()}).assign(
    answer=rag_chain_from_docs
)

rag_chain_with_source.invoke("According to this article which open-source model is the best for an agent behaviour?")
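
The chunk_size and chunk_overlap settings in the pipeline above control how the article is cut up before embedding. RecursiveCharacterTextSplitter splits on separators such as paragraph and line breaks first, so the sketch below is not its algorithm. It is a plain fixed-width sliding window, written as a hypothetical helper, that only illustrates how an overlap makes text near a boundary appear in both neighbouring chunks.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each printed window repeats the last four characters of the previous one. A larger overlap reduces the chance of splitting a relevant passage, at the cost of more chunks to embed and store.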

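How to use with LlamaIndex

Similarly, you can also use a TGI endpoint in LlamaIndex. We'll use the OpenAILike class and instantiate it by configuring some additional arguments: is_local, is_function_calling_model, is_chat_model, and context_window.

Note: The context window argument should match the value previously set for MAX_TOTAL_TOKENS of your endpoint.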

from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    model="tgi",
    api_key=HF_API_KEY,
    api_base=BASE_URL + "/v1/",
    is_chat_model=True,
    is_local=False,
    is_function_calling_model=False,
    context_window=4096,
)

llm.complete("Why is open-source software important?")

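We can now use it in a similar RAG pipeline. Keep in mind that the MAX_INPUT_LENGTH you chose for your Inference Endpoint directly limits the number of retrieved chunks (similarity_top_k) the model can process.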
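
A rough way to reason about this budget: the retrieved chunks, the prompt template, and the question must together stay within MAX_INPUT_LENGTH. The sketch below does the arithmetic with assumed numbers. The value 4096 comes from the endpoint configuration above, while the 512-token chunk size and the 300-token allowance for template and question are illustrative guesses, not measured values. max_top_k is a hypothetical helper.

```python
def max_top_k(max_input_length, tokens_per_chunk, prompt_overhead):
    """Largest number of retrieved chunks that fits in the input budget."""
    available = max_input_length - prompt_overhead
    if available <= 0 or tokens_per_chunk <= 0:
        return 0
    return available // tokens_per_chunk


# 4096 matches MAX_INPUT_LENGTH set on the endpoint; the other values are assumptions.
print(max_top_k(max_input_length=4096, tokens_per_chunk=512, prompt_overhead=300))  # 7
```

For exact figures, count tokens with the model's own tokenizer, since chunk lengths in tokens vary with the text.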

from llama_index.core import VectorStoreIndex, download_loader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.query_engine import CitationQueryEngine

SimpleWebPageReader = download_loader("SimpleWebPageReader")

documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["https://huggingface.co/blog/open-source-llms-as-agents"]
)

# Load embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

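# Build the vector index with the HF embedding model
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model, show_progress=True)

# Query the index, passing the TGI-backed LLM explicitly so the default LLM is not used
query_engine = CitationQueryEngine.from_args(
    index,
    llm=llm,
    similarity_top_k=2,
)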
response = query_engine.query("According to this article which open-source model is the best for an agent behaviour?")

response.response

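Wrap up

After you are done with your endpoint, you can either pause or delete it. This step can be completed via the UI, or programmatically as shown below.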

# pause our running endpoint
endpoint.pause()

# optionally delete
# endpoint.delete()