Open-source LLMs as LangChain Agents

Published January 24, 2024


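TL;DR

Introduction

Table of Contents

What are agents?
Toy example of a ReAct agent's inner working

Challenges of agent systems

Running agents with LangChain

Agents Showdown: how do open-source LLMs perform as general purpose reasoning agents?
Evaluation

Models

Results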

TL;DR

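Open-source LLMs have now reached a performance level that makes them suitable reasoning engines for powering agent workflows: Mixtral even surpasses GPT-3.5 on our benchmark, and its performance could easily be further enhanced with fine-tuning.

We have released smolagents, the simplest agent library out there! See the smolagents introduction blog post.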

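Introduction

Large Language Models (LLMs) trained for causal language modeling can tackle a wide range of tasks, but they often struggle with basic tasks like logic, calculation, and search. The worst scenario is when they perform poorly in a domain, such as math, yet still attempt to handle all the calculations themselves.

To overcome this weakness, amongst other approaches, one can integrate the LLM into a system where it can call tools: such a system is called an LLM agent.

In this post, we explain the inner workings of ReAct agents, then show how to build them using the ChatHuggingFace class recently integrated in LangChain. Finally, we benchmark several open-source LLMs against GPT-3.5 and GPT-4.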

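What are agents?

The definition of LLM agents is quite broad: LLM agents are all systems that use LLMs as their engine and can perform actions on their environment based on observations. They can use several iterations of the "Perception => Reflection => Action" cycle to achieve their task, and are often augmented with planning or knowledge management systems to enhance their performance. You can find a good review of the agents landscape in Xi et al., 2023.

Today, we are focusing on ReAct agents. ReAct is an approach to building agents whose name combines the two words "Reasoning" and "Acting". In the prompt, we describe the tools the model can use and ask it to think "step by step" (also called Chain-of-Thought behavior) to plan and execute its next actions and reach the final answer.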

[Figure: high-level diagram of the ReAct agent approach]

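Toy example of a ReAct agent's inner working

The diagram above looks very high-level, but under the hood it is quite simple.

Take a look at this notebook: we implement a barebones tool call example with the Transformers library.

The LLM is called in a loop with a prompt that contains, in essence: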

Here is a question: "{question}" 
You have access to these tools: {tools_descriptions}. 
You should first reflect with ‘Thought: {your_thoughts}’, then you either:
- call a tool with the proper JSON formatting,
- or print your final answer starting with the prefix ‘Final Answer:’

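Then you parse the LLM's output:

- if it contains the string 'Final Answer:', the loop ends and you print the answer,
- otherwise, the LLM should have output a tool call: you parse this output to get the tool name and arguments, then call said tool with said arguments. The output of this tool call is appended to the prompt, and you call the LLM again with this extended information, repeating until it has enough information to give a final answer to the question.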
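The parsing step described above can be sketched in a few lines of Python. This is only an illustration, not the code from the notebook: the function name `parse_llm_output` is hypothetical, and we assume the tool call is the first JSON object that follows `Action:`.

```python
import json


def parse_llm_output(text: str):
    """Return ("final", answer) or ("tool", name, args) from a ReAct-style completion."""
    marker = "Final Answer:"
    if marker in text:
        # Everything after the marker is the answer shown to the user.
        return ("final", text.split(marker, 1)[1].strip())

    # Otherwise we expect a JSON tool call somewhere after "Action:".
    _, _, after_action = text.partition("Action:")
    start = after_action.find("{")
    if start == -1:
        raise ValueError("no 'Final Answer:' and no JSON tool call found")
    # raw_decode parses the first JSON object and ignores any trailing text.
    call, _ = json.JSONDecoder().raw_decode(after_action[start:])
    return ("tool", call["action"], call["action_input"])
```

Raising an error when neither pattern is found is a design choice of this sketch; a real agent might instead feed the error back to the model and ask it to retry.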

For instance, when answering the question "How many seconds are in 1:23:45?", the LLM's output can look like this:

Thought: I need to convert the time string into seconds.

Action:
{
    "action": "convert_time",
    "action_input": {
    "time": "1:23:45"
    }
}

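Since this output does not contain the string 'Final Answer:', it must be a tool call. We therefore parse it to get the tool call parameters: call the tool convert_time with arguments {"time": "1:23:45"}. Running this tool call returns {'seconds': '5025'}.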
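The article does not show the body of convert_time, so the implementation below is an assumption; only the tool name, the "hours:minutes:seconds" input, and the shape of the returned dictionary come from the example.

```python
def convert_time(time: str) -> dict:
    """Convert an "hours:minutes:seconds" string into a total number of seconds."""
    hours, minutes, seconds = (int(part) for part in time.split(":"))
    total = hours * 3600 + minutes * 60 + seconds
    # The example returns the value as a string, so this sketch does the same.
    return {"seconds": str(total)}


print(convert_time("1:23:45"))  # {'seconds': '5025'}
```

Here 1 hour is 3600 seconds and 23 minutes are 1380 seconds, so the total is 3600 + 1380 + 45 = 5025.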

So we append this whole blob of information to the prompt.

The new prompt is now (in a slightly more elaborate version):

Here is a question: "How many seconds are in 1:23:45?"
You have access to these tools:
    - convert_time: converts a time given in hours:minutes:seconds into seconds.

You should first reflect with ‘Thought: {your_thoughts}’, then you either:
- call a tool with the proper JSON formatting,
- or print your final answer starting with the prefix ‘Final Answer:’

Thought: I need to convert the time string into seconds.

Action:
{
    "action": "convert_time",
    "action_input": {
    "time": "1:23:45"
    }
}
Observation: {'seconds': '5025'}

➡️ We call the LLM again with this new prompt. Given that it now has access to the tool call's result in the Observation, the LLM is most likely to output:

Thought: I now have the information needed to answer the question.
Final Answer: There are 5025 seconds in 1:23:45.

And the task is solved!
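The whole walkthrough can be condensed into one short loop. The sketch below is our own illustration rather than the notebook's code: `run_agent`, `scripted_llm`, and the body of `convert_time` are hypothetical, and the "model" is a stub that simply replays the two completions from the example so the code runs without any LLM.

```python
import json

PROMPT_TEMPLATE = (
    'Here is a question: "{question}"\n'
    "You have access to these tools:\n{tools_descriptions}\n"
    "You should first reflect with 'Thought: {{your_thoughts}}', then you either:\n"
    "- call a tool with the proper JSON formatting,\n"
    "- or print your final answer starting with the prefix 'Final Answer:'\n"
)


def convert_time(time: str) -> dict:
    """Assumed tool body: "h:m:s" string to total seconds."""
    hours, minutes, seconds = (int(p) for p in time.split(":"))
    return {"seconds": str(hours * 3600 + minutes * 60 + seconds)}


TOOLS = {"convert_time": convert_time}


def run_agent(llm, question: str, max_steps: int = 5) -> str:
    """Call `llm` (any function str -> str) in a loop until it gives a final answer."""
    prompt = PROMPT_TEMPLATE.format(
        question=question,
        tools_descriptions="    - convert_time: converts a time given in "
        "hours:minutes:seconds into seconds.",
    )
    for _ in range(max_steps):
        output = llm(prompt)
        if "Final Answer:" in output:
            return output.split("Final Answer:", 1)[1].strip()
        # Parse the JSON tool call that follows "Action:".
        json_start = output.index("{", output.index("Action:"))
        call, _ = json.JSONDecoder().raw_decode(output[json_start:])
        observation = TOOLS[call["action"]](**call["action_input"])
        # Append the model output and the tool result, then loop again.
        prompt += f"\n{output.strip()}\nObservation: {observation}\n"
    raise RuntimeError("no final answer within max_steps")


def scripted_llm(prompt: str) -> str:
    """Stand-in for a real model: replays the two completions of the example."""
    if "Observation:" not in prompt:
        return (
            "Thought: I need to convert the time string into seconds.\n\n"
            'Action:\n{"action": "convert_time", "action_input": {"time": "1:23:45"}}'
        )
    return (
        "Thought: I now have the information needed to answer the question.\n"
        "Final Answer: There are 5025 seconds in 1:23:45."
    )


print(run_agent(scripted_llm, "How many seconds are in 1:23:45?"))
```

Swapping `scripted_llm` for a function that queries a real model would, in principle, turn this into a working (if fragile) agent; the `max_steps` cap is there so that a model that never produces 'Final Answer:' cannot loop forever.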

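Challenges of agent systems

Generally, the difficult parts of running an agent system for the LLM engine are:

- Choosing, from the supplied tools, the one that will help advance towards the desired goal: e.g. when asked "What is the smallest prime number greater than 30,000?", the agent could be tempted to call the Search tool with "What is the height of K2", but that won't help.
- Calling tools with rigorous argument formatting: for instance, when trying to calculate the speed of a car that went 3 km in 10 minutes, you have to call the Calculator tool to divide distance by time. Even if your Calculator tool accepts calls in the JSON format {"tool": "Calculator", "args": "3km/10min"}, there are many pitfalls, for instance:
    - Misspelling the tool name: "calculator" or "Compute" wouldn't work
    - Giving the names of the arguments instead of their values: "args": "distance/time"
    - Non-standardized formatting: "args": "3km in 10minutes"
- Efficiently ingesting and using the information gathered in past observations, be it the initial context or the observations returned after using tools.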
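The formatting pitfalls listed above can be made concrete with a small validator that checks a tool call before executing it. Everything here is a hypothetical sketch: the function `validate_tool_call`, the set of known tools, and in particular the regular expression describing what the Calculator accepts are assumptions made for illustration.

```python
import json
import re

KNOWN_TOOLS = {"Calculator", "Search"}
# Assumed argument grammar for the Calculator: "<number><unit>/<number><unit>".
CALCULATOR_ARGS = re.compile(r"\d+(\.\d+)?[a-z]+/\d+(\.\d+)?[a-z]+")


def validate_tool_call(raw: str) -> list:
    """Return a list of problems found in a JSON tool call (empty if it looks valid)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]
    problems = []
    tool = call.get("tool")
    if tool not in KNOWN_TOOLS:
        # Tool names are matched exactly, so "calculator" is rejected too.
        problems.append(f"unknown tool {tool!r}")
    args = str(call.get("args", ""))
    if tool == "Calculator" and not CALCULATOR_ARGS.fullmatch(args):
        problems.append(f"malformed args {args!r}")
    return problems


for raw in [
    '{"tool": "Calculator", "args": "3km/10min"}',
    '{"tool": "calculator", "args": "3km/10min"}',
    '{"tool": "Calculator", "args": "distance/time"}',
    '{"tool": "Calculator", "args": "3km in 10minutes"}',
]:
    print(raw, "->", validate_tool_call(raw) or "ok")
```

Returning the list of problems, rather than raising on the first one, makes it easy to append the feedback to the prompt so the model can correct its call on the next iteration.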

So, what would a complete agent setup look like?

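Running agents with LangChain

We have just integrated a ChatHuggingFace wrapper that lets you create agents based on open-source models in 🦜🔗LangChain.

The code to create the ChatModel and give it tools is really simple; you can check it all in the LangChain documentation.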

from langchain_community.llms import HuggingFaceEndpoint
from langchain_community.chat_models.huggingface import ChatHuggingFace

llm = HuggingFaceEndpoint(repo_id="HuggingFaceH4/zephyr-7b-beta")

chat_model = ChatHuggingFace(llm=llm)

You can turn the chat_model into an agent by giving it a ReAct-style prompt and tools:

from langchain import hub
from langchain.agents import AgentExecutor, load_tools
from langchain.agents.format_scratchpad import format_log_to_str
from langchain.agents.output_parsers import (
    ReActJsonSingleInputOutputParser,
)
from langchain.tools.render import render_text_description

# setup tools
tools = load_tools(["serpapi", "llm-math"], llm=llm)

# setup ReAct style prompt
prompt = hub.pull("hwchase17/react-json")
prompt = prompt.partial(
    tools=render_text_description(tools),
    tool_names=", ".join([t.name for t in tools]),
)

# define the agent
chat_model_with_stop = chat_model.bind(stop=["\nObservation"])
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_log_to_str(x["intermediate_steps"]),
    }
    | prompt
    | chat_model_with_stop
    | ReActJsonSingleInputOutputParser()
)

# instantiate AgentExecutor
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

agent_executor.invoke(
    {
        "input": "Who is the current holder of the speed skating world record on 500 meters? What is her current age raised to the 0.43 power?"
    }
)

And the agent will process the input:

Thought: To answer this question, I need to find age of the current speedskating world record holder.  I will use the search tool to find this information.
Action:
{
    "action": "search",
    "action_input": "speed skating world record holder 500m age"
}
Observation: ...
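The `stop=["\nObservation"]` binding in the agent definition makes generation halt before the model can write an observation of its own, so that the real tool output is inserted instead. What this truncation amounts to can be illustrated in plain Python. The helper `truncate_at_stop` is hypothetical and only mimics the effect; in practice the inference backend applies the stop sequence during generation.

```python
def truncate_at_stop(text: str, stop_sequences: list) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# A completion where the model keeps going and invents its own observation.
raw_generation = (
    "Thought: I will use the search tool.\n"
    'Action:\n{"action": "search", "action_input": "speed skating world record holder 500m age"}'
    "\nObservation: (text the model would otherwise make up)"
)
print(truncate_at_stop(raw_generation, ["\nObservation"]))
```

The printed text ends right after the JSON tool call, which is exactly what the output parser needs to see.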

Agents Showdown: how do open-source LLMs perform as general purpose reasoning agents?

You can find the code for this benchmark here.

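Evaluation

We want to measure how open-source LLMs perform as general purpose reasoning agents. We therefore select questions that require logic and the use of basic tools: a calculator and internet search. The final dataset combines samples from 3 other datasets:

- To test internet search capability, we selected questions from HotpotQA: this is originally a retrieval dataset, but it can be used for general question answering with access to the internet. Some questions originally require combining information from various sources: in our setting, this means performing several internet search steps and combining the results.
- For calculator usage, we added questions from GSM8K: this dataset tests grade-school math ability and is entirely solvable by correctly leveraging the 4 operators (add, subtract, multiply, divide).
- We also picked questions from GAIA, a very difficult benchmark for General AI Assistants. The questions in the original dataset can require many other tools, such as a code interpreter or a PDF reader: we hand-picked questions that require no tools other than search and calculator.

Evaluation was performed with GPT-4 as a judge, using a prompt based on the Prometheus prompt format and giving results on a 5-point Likert scale: see the exact prompt here.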

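Models

We evaluate a few strong open-source models, among them Zephyr-7b, Llama2-70b, and Mixtral-8x7B.

These models are evaluated in LangChain's ReAct implementation. This means we prompt them to output their function calls in this format: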

{
  "action": $TOOL_NAME,
  "action_input": $INPUT
}

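For comparison, we also evaluated GPT-3.5 and GPT-4 on the same examples using LangChain's OpenAI-specific agent. Since this agent uses OpenAI's own function-calling template, on which these models were fine-tuned, the OpenAI models should be at their top performance.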

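Results

Since the open-source models were not specifically fine-tuned for calling functions in the given output format, they are at a slight disadvantage compared to the OpenAI agents.

Despite this, some models perform really well! 💪

Here is an example of Mixtral-8x7B answering the question: "Which city has a larger population, Guiyang or Tacheng?"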

Thought: To answer this question, I need to find the current populations of both Guiyang and Tacheng. I will use the search tool to find this information.
Action:
{
    "action": "search",
    "action_input": "current population of Guiyang"
}
4.697 million (2016)

Thought: I have obtained the population of Guiyang, which is 4.697 million (as of 2016). Now, I will use the search tool again to find the population of Tacheng.
Action:
{
    "action": "search",
    "action_input": "current population of Tacheng"
}
Population: Xinjiang: Tacheng data was reported at 1,110.000 Persons in 2021. This records a decrease from the previous number of 1,138.638 Persons for 2020.
Population: Xinjiang: Tacheng data is updated yearly, averaging 1,023.000 Persons from Dec 2005 to 2021, with 17 observations.

I have obtained the population of Tacheng, which is approximately 1.11 million (as of 2021). Comparing the two populations, Guiyang has a larger population than Tacheng.

Thought: I now know the final answer
Final Answer: Guiyang has a larger population, which is approximately 4.697 million (as of 2016), compared to Tacheng's population of approximately 1.11 million (as of 2021).

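And here is a benchmark of the models on our evaluation dataset (the average scores, originally on a 1-5 scale, have been converted to a 0-100% scale for readability):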

[Figure: benchmark of agents performance]
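The article does not give the formula used to rescale the 1-5 scores to percentages. One natural choice, shown here purely as an assumption, is a linear map that sends a score of 1 to 0% and a score of 5 to 100%; the function name `likert_to_percent` is hypothetical.

```python
def likert_to_percent(mean_score: float, low: int = 1, high: int = 5) -> float:
    """Linearly rescale a mean Likert score from [low, high] to [0, 100]."""
    if not low <= mean_score <= high:
        raise ValueError(f"score {mean_score} is outside [{low}, {high}]")
    return 100 * (mean_score - low) / (high - low)


print(likert_to_percent(3.0))  # 50.0
```

Under this mapping the midpoint of the scale lands at 50%. A different convention, such as dividing the score by 5, would give different percentages, so the exact numbers in the figure should not be reverse-engineered from this sketch.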

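As you can see, some open-source models do not perform well in powering agent workflows: while this was expected for the small Zephyr-7b, Llama2-70b performs surprisingly poorly.

👉 But Mixtral-8x7B performs really well: it even beats GPT-3.5! 🏆

And this is out-of-the-box performance: contrary to GPT-3.5, Mixtral was not fine-tuned for agent workflows (to our knowledge), which somewhat hinders its performance. For instance, on GAIA, 10% of questions fail because Mixtral tries to call a tool with incorrectly formatted arguments. With proper fine-tuning for function calling and task planning skills, Mixtral's score would likely be even higher.

➡️ We strongly recommend that open-source builders start fine-tuning Mixtral for agents, to surpass the next challenger: GPT-4! 🚀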

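Closing remarks

- The GAIA benchmark, although tried here on only a small subsample of questions and with a few tools, seems to be a very robust indicator of overall model performance for agent workflows, since it generally involves several reasoning steps and rigorous logic.
- Agent workflows allow LLMs to increase performance: for instance, on GSM8K, GPT-4's technical report gives 92% accuracy for 5-shot CoT prompting, whereas giving it a calculator lets us reach 95% in zero-shot. For Mixtral-8x7B, the LLM Leaderboard reports 57.6% with 5-shot, and we get 73% in zero-shot. (Keep in mind that we tested only 20 questions of GSM8K.)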
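To give a sense of how much a 20-question sample limits these GSM8K figures, the sketch below computes the binomial standard error of an accuracy estimate. Treating the reported score as a plain proportion of correct answers is an assumption on our part, since the benchmark actually averages graded scores, so the result is only an order of magnitude.

```python
import math


def standard_error(accuracy: float, n_questions: int) -> float:
    """Standard error of a proportion estimated from n independent questions."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)


for name, accuracy in [("GPT-4", 0.95), ("Mixtral-8x7B", 0.73)]:
    se = standard_error(accuracy, 20)
    print(f"{name}: {accuracy:.0%} +/- {se:.1%} (one standard error, n=20)")
```

With 20 questions, one standard error is roughly 5 percentage points at 95% and roughly 10 points at 73%, so small gaps between models on this subset should be read with caution.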
