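Migrating from OpenAI to Open LLMs Using TGI's Messages API
Authored by: Andrew Reed
This notebook demonstrates how you can easily transition from OpenAI models to open LLMs without needing to refactor any existing code.
Text Generation Inference (TGI) now offers a Messages API, making it directly compatible with the OpenAI Chat Completion API. This means that any existing script that uses OpenAI models (via the OpenAI client library or third-party tools like LangChain or LlamaIndex) can be directly swapped out to use any open LLM running on a TGI endpoint!
This allows you to quickly test out and benefit from the numerous advantages offered by open models. Things like:
- Complete control and transparency over models and data
- No more worrying about rate limits
- The ability to fully customize systems according to your specific needs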
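In this notebook, we'll show you how to:
1. Create an Inference Endpoint to deploy a model with TGI
2. Query the Inference Endpoint using OpenAI client libraries
3. Integrate the endpoint with LangChain and LlamaIndex
Let's get started!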
Setup
First, we need to install dependencies and set an HF API key.
!pip install --upgrade -q huggingface_hub langchain langchain-community langchainhub langchain-openai llama-index chromadb bs4 sentence_transformers torch torchvision torchaudio llama-index-llms-openai-like llama-index-embeddings-huggingface
import os
import getpass
# enter API key
os.environ["HF_TOKEN"] = HF_API_KEY = getpass.getpass()
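1. Create an Inference Endpoint
To get started, let's deploy Nous-Hermes-2-Mixtral-8x7B-DPO, a fine-tuned Mixtral model, to Inference Endpoints using TGI.
We can deploy the model in just a few clicks from the UI, or take advantage of the huggingface_hub Python library to programmatically create and manage Inference Endpoints.
We'll use the Hub library here by specifying an endpoint name and model repository, along with the task of text-generation. In this example, we use a protected type, so access to the deployed model will require a valid Hugging Face token. We also need to configure the hardware requirements, such as vendor, region, accelerator, instance type, and size. You can check out the list of available resource options using this API call, and view recommended configurations for select models in the catalog here.
Note: You may need to request a quota upgrade by sending an email to [email protected].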
>>> from huggingface_hub import create_inference_endpoint
>>> endpoint = create_inference_endpoint(
...     "nous-hermes-2-mixtral-8x7b-demo",
...     repository="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
...     framework="pytorch",
...     task="text-generation",
...     accelerator="gpu",
...     vendor="aws",
...     region="us-east-1",
...     type="protected",
...     instance_type="p4de",
...     instance_size="2xlarge",
...     custom_image={
...         "health_route": "/health",
...         "env": {
...             "MAX_INPUT_LENGTH": "4096",
...             "MAX_BATCH_PREFILL_TOKENS": "4096",
...             "MAX_TOTAL_TOKENS": "32000",
...             "MAX_BATCH_TOTAL_TOKENS": "1024000",
...             "MODEL_ID": "/repository",
...         },
...         "url": "ghcr.io/huggingface/text-generation-inference:sha-1734540",  # must be >= 1.4.0
...     },
... )
>>> endpoint.wait()
>>> print(endpoint.status)
running
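Our deployment will take a few minutes to spin up. We can use the .wait() utility to block the running thread until the endpoint reaches a final "running" state. Once it is running, we can confirm its status and try it out via the UI Playground.
Great, we now have a working endpoint!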
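Note: When deploying with huggingface_hub, your endpoint will scale to zero after 15 minutes of idle time by default to optimize cost during periods of inactivity. Check out the Hub Python Library documentation to see all the functionality available for managing your endpoint lifecycle.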
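The token limits in the env block of the deployment above depend on one another. The sketch below collects the ordering constraints we would expect a TGI deployment to satisfy. The helper name check_tgi_limits and the exact rule set are assumptions made for illustration, not something this notebook specifies, so verify them against the TGI launcher documentation before relying on them.

```python
def check_tgi_limits(env):
    """Return a list of problems found in a TGI-style env dict.

    Hypothetical helper: the rules below are assumed ordering constraints,
    not an official validation routine.
    """
    max_input = int(env["MAX_INPUT_LENGTH"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    prefill = int(env["MAX_BATCH_PREFILL_TOKENS"])
    batch_total = int(env["MAX_BATCH_TOTAL_TOKENS"])

    problems = []
    # The prompt must leave room for at least one generated token.
    if not max_input < max_total:
        problems.append("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    # A single full-length prompt must fit into one prefill batch.
    if not max_input <= prefill:
        problems.append("MAX_BATCH_PREFILL_TOKENS must be at least MAX_INPUT_LENGTH")
    # A single full-length request must fit into the batch budget.
    if not max_total <= batch_total:
        problems.append("MAX_BATCH_TOTAL_TOKENS must be at least MAX_TOTAL_TOKENS")
    return problems


# Same values as in the deployment above
env = {
    "MAX_INPUT_LENGTH": "4096",
    "MAX_BATCH_PREFILL_TOKENS": "4096",
    "MAX_TOTAL_TOKENS": "32000",
    "MAX_BATCH_TOTAL_TOKENS": "1024000",
}

print(check_tgi_limits(env) or "limits look consistent")
# Tokens left for generation when a prompt uses the full input budget
print(int(env["MAX_TOTAL_TOKENS"]) - int(env["MAX_INPUT_LENGTH"]))
```

With these values, a prompt that fills the whole input budget still leaves 27904 tokens for the response.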
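2. Query the Inference Endpoint with OpenAI Client Libraries
As mentioned above, since our model is hosted with TGI, it now supports a Messages API, meaning we can query it directly using the familiar OpenAI client libraries.
With the Python client
The example below shows how to make this transition using the OpenAI Python Library. Simply replace <ENDPOINT_URL> with your endpoint URL (be sure to include the v1/ suffix) and populate the <HF_API_KEY> field with a valid Hugging Face user token. The <ENDPOINT_URL> can be gathered from the Inference Endpoints UI, or from the endpoint object we created above via endpoint.url.
We can then use the client as usual, passing a list of messages to stream responses from our Inference Endpoint.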
>>> from openai import OpenAI
>>> BASE_URL = endpoint.url
>>> # init the client but point it to TGI
>>> client = OpenAI(
...     base_url=BASE_URL + "/v1/",  # plain string join; os.path.join is meant for filesystem paths
...     api_key=HF_API_KEY,
... )
>>> chat_completion = client.chat.completions.create(
...     model="tgi",
...     messages=[
...         {"role": "system", "content": "You are a helpful assistant."},
...         {"role": "user", "content": "Why is open-source software important?"},
...     ],
...     stream=True,
...     max_tokens=500,
... )
>>> # iterate and print stream; a chunk without content prints nothing
>>> for message in chat_completion:
...     print(message.choices[0].delta.content or "", end="")
Open-source software is important due to a number of reasons, including: 1. Collaboration: The collaborative nature of open-source software allows developers from around the world to work together, share their ideas and improve the code. This often results in faster progress and better software. 2. Transparency: With open-source software, the code is publicly available, making it easy to see exactly how the software functions, and allowing users to determine if there are any security vulnerabilities. 3. Customization: Being able to access the code also allows users to customize the software to better suit their needs. This makes open-source software incredibly versatile, as users can tweak it to suit their specific use case. 4. Quality: Open-source software is often developed by large communities of dedicated developers, who work together to improve the software. This results in a higher level of quality than might be found in proprietary software. 5. Cost: Open-source software is often provided free of charge, which makes it accessible to a wider range of users. This can be especially important for organizations with limited budgets for software. 6. Shared Benefit: By sharing the code of open-source software, everyone can benefit from the hard work of the developers. This contributes to the overall advancement of technology, as users and developers work together to improve and build upon the software. In summary, open-source software provides a collaborative platform that leads to high-quality, customizable, and transparent software, all available at little or no cost, benefiting both individuals and the technology community as a whole.<|im_end|>
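Forgetting or doubling the v1/ suffix is an easy mistake when the endpoint URL is copied from different places. A minimal sketch of a helper that normalizes it is shown below. The function name messages_api_base is hypothetical and the URL is a placeholder, not a real endpoint.

```python
def messages_api_base(endpoint_url: str) -> str:
    """Return the endpoint URL ending in exactly one 'v1/' segment."""
    base = endpoint_url.rstrip("/")
    if base.endswith("/v1"):
        # Suffix already present: only restore the trailing slash.
        return base + "/"
    return base + "/v1/"


# Placeholder URL used purely for illustration
print(messages_api_base("https://my-endpoint.example.invalid"))
print(messages_api_base("https://my-endpoint.example.invalid/v1"))
```

Both calls print the same base URL, so the result can be passed as base_url to the client regardless of how the endpoint URL was written.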
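Behind the scenes, TGI's Messages API automatically converts the list of messages into the model's required instruction format using its chat template.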
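To make that conversion concrete, here is a hand-written sketch of a ChatML-style rendering, the format suggested by the <|im_end|> token at the end of the streamed output above. This is an assumption for illustration only: the real template ships with the model's tokenizer configuration and is applied by TGI on the server, and the helper name to_chatml is hypothetical.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts as a ChatML-style prompt.

    Illustrative only: the server applies the model's own chat template.
    """
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is open-source software important?"},
]
print(to_chatml(messages))
```

Because the server does this rendering for you, client code only ever deals with the structured list of messages.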
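Note: Certain OpenAI features, like function calling, are not compatible with TGI. Currently, the Messages API supports the following chat completion parameters: stream, max_tokens, frequency_penalty, logprobs, seed, temperature, and top_p.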
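When porting an existing script, it can help to see which request arguments fall outside that supported list before sending anything. The sketch below is one way to do that, not part of any client library. The name split_supported is hypothetical, the set simply mirrors the note above, and the token limit is written as max_tokens to match the keyword the request examples in this notebook pass.

```python
# Mirrors the parameters listed in the note above; update it if TGI's support changes.
SUPPORTED_PARAMS = {
    "stream",
    "max_tokens",
    "frequency_penalty",
    "logprobs",
    "seed",
    "temperature",
    "top_p",
}


def split_supported(params):
    """Split request kwargs into (supported, names_of_unsupported)."""
    kept = {name: value for name, value in params.items() if name in SUPPORTED_PARAMS}
    dropped = sorted(name for name in params if name not in SUPPORTED_PARAMS)
    return kept, dropped


# Example arguments as they might appear in an existing OpenAI script
kept, dropped = split_supported(
    {"temperature": 0.2, "max_tokens": 500, "tools": [], "presence_penalty": 0.1}
)
print(kept)
print(dropped)
```

The kept dictionary can be unpacked into client.chat.completions.create alongside model and messages, while dropped tells you which features need a different approach.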
With the JavaScript client
Here's the same streaming example as above, but using the OpenAI JavaScript/TypeScript Library.
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "<ENDPOINT_URL>" + "/v1/", // replace with your endpoint url
  apiKey: "<HF_API_TOKEN>", // replace with your token
});

async function main() {
  const stream = await openai.chat.completions.create({
    model: "tgi",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Why is open-source software important?" },
    ],
    stream: true,
    max_tokens: 500,
  });
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content || "");
  }
}

main();
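3. Integrate with LangChain and LlamaIndex
Now, let's see how to use this newly created endpoint with popular RAG frameworks like LangChain and LlamaIndex.
How to use with LangChain
To use it in LangChain, simply create an instance of ChatOpenAI and pass your <ENDPOINT_URL> and <HF_API_TOKEN> as follows: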
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model_name="tgi",
    openai_api_key=HF_API_KEY,
    openai_api_base=BASE_URL + "/v1/",  # plain string join; os.path.join is meant for filesystem paths
)
llm.invoke("Why is open-source software important?")
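We're able to directly use the same ChatOpenAI class that we would have used with the OpenAI models. This allows all previous code to work with our endpoint by changing just one line of code.
Now, let's use our Mixtral model in a simple RAG pipeline to answer a question over the contents of a HF blog post.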
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load, chunk and index the contents of the blog
loader = WebBaseLoader(
    web_paths=("https://huggingface.co/blog/open-source-llms-as-agents",),
)
docs = loader.load()

# declare an HF embedding model
hf_embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=hf_embeddings)

# Retrieve and generate using the relevant snippets of the blog
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"]))) | prompt | llm | StrOutputParser()
)
rag_chain_with_source = RunnableParallel({"context": retriever, "question": RunnablePassthrough()}).assign(
    answer=rag_chain_from_docs
)
rag_chain_with_source.invoke("According to this article which open-source model is the best for an agent behaviour?")
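How to use with LlamaIndex
Similarly, you can also use a TGI endpoint in LlamaIndex. We'll use the OpenAILike class and instantiate it by configuring some additional arguments (i.e. is_local, is_function_calling_model, is_chat_model, context_window).
Note: The context window argument should match the value previously set for MAX_TOTAL_TOKENS of your endpoint.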
from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    model="tgi",
    api_key=HF_API_KEY,
    api_base=BASE_URL + "/v1/",
    is_chat_model=True,
    is_local=False,
    is_function_calling_model=False,
    context_window=4096,
)
llm.complete("Why is open-source software important?")
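We can now use it in a similar RAG pipeline. Keep in mind that the previous choice of MAX_INPUT_LENGTH in your Inference Endpoint will directly influence the number of retrieved chunks (similarity_top_k) the model can process.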
from llama_index.core import Settings, VectorStoreIndex, download_loader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.query_engine import CitationQueryEngine

SimpleWebPageReader = download_loader("SimpleWebPageReader")
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["https://huggingface.co/blog/open-source-llms-as-agents"]
)

# Load embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

# Pass LLM to pipeline so the query engine uses the TGI endpoint instead of the default LLM
Settings.llm = llm
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model, show_progress=True)

# Query the index
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=2,
)
response = query_engine.query("According to this article which open-source model is the best for an agent behaviour?")
response.response
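The relationship between MAX_INPUT_LENGTH and similarity_top_k noted above can be estimated with simple arithmetic. The sketch below is a rough budget calculation, not something LlamaIndex computes for you. The function name max_top_k is hypothetical, and the chunk and overhead sizes are made-up numbers, since real token counts depend on the tokenizer, the prompt template, and the question.

```python
def max_top_k(max_input_length, chunk_tokens, overhead_tokens):
    """Estimate how many retrieved chunks fit into the input budget.

    max_input_length: the endpoint's MAX_INPUT_LENGTH
    chunk_tokens: assumed upper bound on tokens per retrieved chunk
    overhead_tokens: assumed tokens used by the prompt template and question
    """
    budget = max_input_length - overhead_tokens
    if budget <= 0 or chunk_tokens <= 0:
        return 0
    return budget // chunk_tokens


# Hypothetical sizes, chosen only to illustrate the calculation
print(max_top_k(max_input_length=4096, chunk_tokens=1024, overhead_tokens=512))
print(max_top_k(max_input_length=4096, chunk_tokens=512, overhead_tokens=600))
```

If the estimate comes out below the similarity_top_k you want, either retrieve fewer or smaller chunks, or redeploy the endpoint with a larger MAX_INPUT_LENGTH.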
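Wrap up
After you are done with your endpoint, you can either pause or delete it. This step can be completed via the UI, or programmatically as follows.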
# pause our running endpoint
endpoint.pause()
# optionally delete
# endpoint.delete()