调用许可：Transformers Agents 2.0 介绍

发布于 2024 年 5 月 13 日

在 GitHub 上更新

134

摘要

目录

什么是智能体？

Transformers Agents 的方法
主要元素

示例用例
自校正检索增强生成

使用简单的多智能体设置🤝实现高效网页浏览

测试我们的智能体
LLM 引擎基准测试

使用多模态智能体攀登 GAIA 排行榜

结论

TL;DR

我们正在发布 Transformers Agents 2.0！

⇒ 🎁 在我们现有的智能体类型之上，我们引入了两种新智能体，它们**可以根据过去的观察结果进行迭代以解决复杂任务**。

⇒ 💡 我们的目标是让代码**清晰、模块化，并让最终的提示和工具等通用属性透明可见**。

⇒ 🤝 我们增加了**共享选项**，以促进社区智能体的发展。

⇒ 💪 **极其高性能的新智能体框架**，让 Llama-3-70B-Instruct 智能体在 GAIA 排行榜上超越了基于 GPT-4 的智能体！

🚀 赶快尝试一下，在 GAIA 排行榜上更上一层楼吧！

transformers.agents 现已升级为独立库 smolagents！这两个库的 API 非常相似，因此切换很容易。请查看 smolagents 介绍博客。

什么是智能体？

大型语言模型 (LLM) 可以处理各种任务，但它们在逻辑、计算和搜索等特定任务上常常遇到困难。当在这些它们表现不佳的领域被提示时，它们经常无法生成正确的答案。

克服这一弱点的一种方法是创建一个**智能体**，它只是一个由 LLM 驱动的程序。智能体通过**工具**获得能力，以帮助其执行操作。当智能体需要特定技能来解决特定问题时，它会依赖其工具箱中合适的工具。

因此，当智能体在解决问题时需要特定技能时，它只需依赖其工具箱中合适的工具即可。

实验表明，智能体框架通常表现非常好，在多项基准测试中取得了最先进的性能。例如，请查看HumanEval 的顶级提交：它们都是智能体系统。

Transformers Agents 的方法

构建智能体工作流是复杂的，我们认为这些系统需要很高的清晰度和模块化。一年前我们发布了 Transformers Agents，现在我们正在加倍努力实现我们的核心设计目标。

我们的框架力求

通过简洁实现清晰：我们尽可能减少抽象。简单的错误日志和可访问的属性让您可以轻松检查正在发生的事情，并提供更高的清晰度。
模块化：我们倾向于提供构建块，而不是完整、复杂的特征集。您可以自由选择最适合您项目的构建块。
- 例如，由于任何智能体系统都只是由 LLM 引擎驱动的载体，我们决定在概念上将两者分离，这使您可以用任何底层 LLM 创建任何智能体类型。

最重要的是，我们有**共享功能**，让您可以站在巨人的肩膀上！

主要元素

Tool：这是让您使用工具或实现新工具的类。它主要由一个可调用前向方法组成，该方法执行工具操作，以及一组几个基本属性：name、descriptions、inputs和output_type。这些属性用于为工具动态生成使用手册并将其插入到 LLM 的提示中。
Toolbox：它是一组提供给智能体的工具，作为解决特定任务的资源。出于性能考虑，工具箱中的工具已经实例化并准备就绪。这是因为某些工具需要时间进行初始化，因此通常最好重用现有工具箱并只交换一个工具，而不是在每次智能体初始化时从头开始重新构建一组工具。
CodeAgent：一个非常简单的智能体，将其操作生成为单个 Python 代码块。它无法根据先前的观察进行迭代。
ReactAgent：ReAct 智能体遵循思考 ⇒ 行动 ⇒ 观察的循环，直到它们解决了任务。我们提供了两类 ReactAgent
- ReactCodeAgent 将其动作生成为 python 代码块。
- ReactJsonAgent 将其动作生成为 JSON 块。

查看文档以了解如何使用每个组件！

智能体在底层是如何工作的？

本质上，智能体的作用是“允许 LLM 使用工具”。智能体有一个关键的agent.run()方法，它会：

以**特定提示**的形式向 LLM 提供工具使用信息。这样，LLM 就可以选择要运行的工具来解决任务。
**解析** LLM 输出中的工具调用（可以是代码、JSON 格式或任何其他格式）。
**执行**调用。
如果智能体被设计为在先前输出上进行迭代，它会**保留一个带有先前工具调用和观察的内存**。这个内存的粒度可以根据你希望它的长期性而或多或少地精细。

graph of agent workflows

有关智能体的更多一般背景信息，您可以阅读 Lilian Weng 的这篇优秀博客文章，或我们早期关于使用 LangChain 构建智能体的博客文章。

要更深入地了解我们的包，请查看智能体文档。

示例用例

为了能够提前体验此功能，请首先从其main分支安装transformers

pip install "git+https://github.com/huggingface/transformers.git#egg=transformers[agents]"

Agents 2.0 将在 5 月中旬发布的 v4.41.0 版本中发布。

自校正检索增强生成

快速定义：检索增强生成 (RAG) 是指“使用 LLM 回答用户查询，但答案基于从知识库中检索到的信息”。它比使用普通或微调的 LLM 有许多优点：举几个例子，它允许将答案基于真实事实并减少胡编乱造，它允许为 LLM 提供特定领域的知识，并且它允许对知识库信息的访问进行细粒度控制。

假设我们要执行 RAG，并且一些参数必须动态生成。例如，根据用户查询，我们可能希望将搜索限制在知识库的特定子集，或者我们可能希望调整检索文档的数量。困难在于：如何根据用户查询动态调整这些参数？

好吧，我们可以通过让我们的智能体访问这些参数来做到这一点！

让我们设置这个系统。

运行下面这行命令来安装所需的依赖项

pip install langchain sentence-transformers faiss-cpu

我们首先加载一个知识库，我们希望在该知识库上执行 RAG：该数据集是许多huggingface包的文档页面的编译，以 markdown 格式存储。

import datasets
knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")

现在我们通过处理数据集并将其存储到向量数据库中来准备知识库，以供检索器使用。我们将使用 LangChain，因为它具有出色的向量数据库实用程序

from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

source_docs = [
    Document(
        page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]}
    ) for doc in knowledge_base
]

docs_processed = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(source_docs)[:1000]

embedding_model = HuggingFaceEmbeddings("thenlper/gte-small")
vectordb = FAISS.from_documents(
    documents=docs_processed,
    embedding=embedding_model
)

现在数据库已准备就绪，让我们构建一个基于它回答用户查询的 RAG 系统！

我们希望我们的系统根据查询只从最相关的信息源中进行选择。

我们的文档页面来自以下来源

>>> all_sources = list(set([doc.metadata["source"] for doc in docs_processed]))
>>> print(all_sources)

['blog', 'optimum', 'datasets-server', 'datasets', 'transformers', 'course',
'gradio', 'diffusers', 'evaluate', 'deep-rl-class', 'peft',
'hf-endpoints-documentation', 'pytorch-image-models', 'hub-docs']

我们如何根据用户查询选择相关来源？

👉 让我们将 RAG 系统构建为一个智能体，它可以自由选择其来源！

我们创建一个检索器工具，智能体可以使用它调用其选择的参数

import json
from transformers.agents import Tool
from langchain_core.vectorstores import VectorStore

class RetrieverTool(Tool):
    name = "retriever"
    description = "Retrieves some documents from the knowledge base that have the closest embeddings to the input query."
    inputs = {
        "query": {
            "type": "text",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        },
        "source": {
            "type": "text", 
            "description": ""
        },
    }
    output_type = "text"
    
    def __init__(self, vectordb: VectorStore, all_sources: str, **kwargs):
        super().__init__(**kwargs)
        self.vectordb = vectordb
        self.inputs["source"]["description"] = (
            f"The source of the documents to search, as a str representation of a list. Possible values in the list are: {all_sources}. If this argument is not provided, all sources will be searched."
          )

    def forward(self, query: str, source: str = None) -> str:
        assert isinstance(query, str), "Your search query must be a string"

        if source:
            if isinstance(source, str) and "[" not in str(source): # if the source is not representing a list
                source = [source]
            source = json.loads(str(source).replace("'", '"'))

        docs = self.vectordb.similarity_search(query, filter=({"source": source} if source else None), k=3)

        if len(docs) == 0:
            return "No documents found with this filtering. Try removing the source filter."
        return "Retrieved documents:\n\n" + "\n===Document===\n".join(
            [doc.page_content for doc in docs]
        )

现在，创建一个利用此工具的智能体就很容易了！

智能体在初始化时需要以下参数

tools：智能体可以调用的工具列表。
llm_engine：为智能体提供动力的 LLM。

我们的 llm_engine 必须是一个可调用对象，它以消息列表为输入并返回文本。它还需要接受一个 stop_sequences 参数，指示何时停止生成。为了方便起见，我们直接使用包中提供的 HfEngine 类来获取一个调用我们 Inference API 的 LLM 引擎。

from transformers.agents import HfEngine, ReactJsonAgent

llm_engine = HfEngine("meta-llama/Meta-Llama-3-70B-Instruct")

agent = ReactJsonAgent(
    tools=[RetrieverTool(vectordb, all_sources)],
    llm_engine=llm_engine
)

agent_output = agent.run("Please show me a LORA finetuning script")

print("Final output:")
print(agent_output)

由于我们将智能体初始化为ReactJsonAgent，它已自动获得一个默认系统提示，该提示告诉LLM引擎逐步处理并将工具调用生成为JSON blob（您可以根据需要用您自己的提示模板替换此提示）。

然后，当其 .run() 方法启动时，智能体会负责调用 LLM 引擎、解析工具调用 JSON blob 并执行这些工具调用，所有这些都在一个循环中，只有当最终答案提供时才结束。

我们得到以下输出

Calling tool: retriever with arguments: {'query': 'LORA finetuning script', 'source': "['transformers', 'datasets-server', 'datasets']"}
Calling tool: retriever with arguments: {'query': 'LORA finetuning script'}
Calling tool: retriever with arguments: {'query': 'LORA finetuning script example', 'source': "['transformers', 'datasets-server', 'datasets']"}
Calling tool: retriever with arguments: {'query': 'LORA finetuning script example'}
Calling tool: final_answer with arguments: {'answer': 'Here is an example of a LORA finetuning script: https://github.com/huggingface/diffusers/blob/dd9a5caf61f04d11c0fa9f3947b69ab0010c9a0f/examples/text_to_image/train_text_to_image_lora.py#L371'}

Final output:
Here is an example of a LORA finetuning script: https://github.com/huggingface/diffusers/blob/dd9a5caf61f04d11c0fa9f3947b69ab0010c9a0f/examples/text_to_image/train_text_to_image_lora.py#L371

我们可以看到自我纠正在起作用：智能体首先尝试限制来源，但由于缺乏相应的文档，它最终根本没有限制来源。

我们可以通过检查步骤 2 的日志中的 LLM 输出进行验证：print(agent.logs[2]['llm_output'])

Thought: I'll try to retrieve some documents related to LORA finetuning scripts from the entire knowledge base, without any source filtering.

Action:
{
  "action": "retriever",
  "action_input": {"query": "LORA finetuning script"}
}

使用简单的多智能体设置🤝实现高效网页浏览

在此示例中，我们希望构建一个智能体并在 GAIA 基准测试（Mialon et al. 2023）上对其进行测试。GAIA 是一个极其困难的基准测试，大多数问题需要使用不同的工具进行多步推理。一个特别困难的要求是拥有一个强大的网页浏览器，能够导航到具有特定限制的页面：使用网站的内部导航发现页面，及时选择特定文章……

网页浏览需要深入子页面并滚动浏览大量文本标记，这些标记对于更高层次的任务解决是不必要的。我们将网页浏览子任务分配给一个专门的网页浏览智能体。我们为其提供了一些用于浏览网页的工具和一个特定的提示（请查看仓库以查找具体实现）。

定义这些工具超出了本篇帖子的范围：但您可以查看仓库以查找具体实现。

from transformers.agents import ReactJsonAgent, HfEngine

WEB_TOOLS = [
    SearchInformationTool(),
    NavigationalSearchTool(),
    VisitTool(),
    DownloadTool(),
    PageUpTool(),
    PageDownTool(),
    FinderTool(),
    FindNextTool(),
]

websurfer_llm_engine = HfEngine(
    model="CohereForAI/c4ai-command-r-plus"
)  # We choose Command-R+ for its high context length

websurfer_agent = ReactJsonAgent(
    tools=WEB_TOOLS,
    llm_engine=websurfer_llm_engine,
)

为了让这个智能体能够被更高层次的任务解决智能体调用，我们可以简单地将其封装在另一个工具中

class SearchTool(Tool):
    name = "ask_search_agent"
    description = "A search agent that will browse the internet to answer a question. Use it to gather informations, not for problem-solving."

    inputs = {
        "question": {
            "description": "Your question, as a natural language sentence. You are talking to an agent, so provide them with as much context as possible.",
            "type": "text",
        }
    }
    output_type = "text"

    def forward(self, question: str) -> str:
        return websurfer_agent.run(question)

然后我们用这个搜索工具初始化任务解决智能体

from transformers.agents import ReactCodeAgent

llm_engine = HfEngine(model="meta-llama/Meta-Llama-3-70B-Instruct")
react_agent_hf = ReactCodeAgent(
    tools=[SearchTool()],
    llm_engine=llm_engine,
)

让我们用以下任务运行智能体

使用由 Marisa Alviar-Agnew 和 Henry Agnew 根据 CK-12 许可证在 LibreText 的《入门化学材料》中编译的（2023 年 8 月 21 日）密度测量数据。我有一加仑蜂蜜和一加仑蛋黄酱，温度均为 25C。我每次从一加仑蜂蜜中取出“一杯”蜂蜜。我需要取出多少次才能让蜂蜜的重量小于蛋黄酱？假设容器本身的重量相同。

Thought: I will use the 'ask_search_agent' tool to find the density of honey and mayonnaise at 25C.
==== Agent is executing the code below:
density_honey = ask_search_agent(question="What is the density of honey at 25C?")
print("Density of honey:", density_honey)
density_mayo = ask_search_agent(question="What is the density of mayonnaise at 25C?")
print("Density of mayo:", density_mayo)
===
Observation:
Density of honey: The density of honey is around 1.38-1.45kg/L at 20C. Although I couldn't find information specific to 25C, minor temperature differences are unlikely to affect the density that much, so it's likely to remain within this range.
Density of mayo: The density of mayonnaise at 25°C is 0.910 g/cm³.

===== New step =====
Thought: I will convert the density of mayonnaise from g/cm³ to kg/L and then calculate the initial weights of the honey and mayonnaise in a gallon. After that, I will calculate the weight of honey after removing one cup at a time until it weighs less than the mayonnaise.
==== Agent is executing the code below:
density_honey = 1.42 # taking the average of the range
density_mayo = 0.910 # converting g/cm³ to kg/L
density_mayo = density_mayo * 1000 / 1000 # conversion

gallon_to_liters = 3.785 # conversion factor
initial_honey_weight = density_honey * gallon_to_liters
initial_mayo_weight = density_mayo * gallon_to_liters

cup_to_liters = 0.236 # conversion factor
removed_honey_weight = cup_to_liters * density_honey
===
Observation:

===== New step =====
Thought: Now that I have the initial weights of honey and mayonnaise, I'll try to calculate the number of cups to remove from the honey to make it weigh less than the mayonnaise using a simple arithmetic operation.
==== Agent is executing the code below:
cups_removed = int((initial_honey_weight - initial_mayo_weight) / removed_honey_weight) + 1
print("Cups removed:", cups_removed)
final_answer(cups_removed)
===
>>> Final answer: 6

✅ 答案是**正确**的！

测试我们的智能体

让我们试用一下我们的智能体框架，并用它来对不同的模型进行基准测试！

以下所有实验代码都可以在这里找到。

LLM 引擎基准测试

agents_reasoning_benchmark 是一个小型但强大的推理测试，用于评估智能体的性能。此基准测试已在我们之前的博客文章中更详细地使用和解释过。

核心思想是，您与智能体一起使用的工具选择会极大地改变某些任务的性能。因此，此基准测试将使用的工具集限制为计算器和基本搜索工具。我们从几个可以使用这两种工具解决的数据集中挑选了问题

来自HotpotQA的 30 个问题（Yang et al., 2018），用于测试搜索工具的使用。
来自GSM8K的 40 个问题（Cobbe et al., 2021），用于测试计算器工具的使用。
来自GAIA的 20 个问题（Mialon et al., 2023），用于测试两种工具在解决难题时的使用。

这里我们尝试了 3 种不同的引擎：Mixtral-8x7B、Llama-3-70B-Instruct 和 GPT-4 Turbo。

benchmark of agent performances

结果如上所示——为了更精确，取两次完整运行的平均值。我们还测试了Command-R+和Mixtral-8x22B，但为了清晰起见，未显示它们。

⇒ Llama-3-70B-Instruct 在开源模型中处于领先地位：它与 GPT-4 不相上下，而且由于 Llama 3 强大的编码性能，它在 ReactCodeAgent 中表现尤其出色！

💡 比较基于 JSON 和基于代码的 React 智能体很有趣：对于 Mixtral-8x7B 等性能较低的 LLM 引擎，基于代码的智能体表现不如 JSON，因为 LLM 引擎经常无法生成高质量的代码。但基于代码的版本在与更强大的模型作为引擎配合时表现出色：根据我们的经验，基于代码的版本甚至在 Llama-3-70B-Instruct 上优于 JSON。因此，我们在下一个挑战中使用基于代码的版本：在完整的 GAIA 基准测试中进行测试。

使用多模态智能体攀登 GAIA 排行榜

GAIA（Mialon et al., 2023）是一个极其困难的基准测试：您可以在上面的agent_reasoning_benchmark中看到，即使我们挑选了可以用 2 个基本工具解决的任务，模型也无法达到 50% 以上的性能。

现在我们希望在完整数据集上获得分数，我们不再挑选问题。因此，我们必须涵盖所有模态，这促使我们使用这些特定工具

SearchTool：上面定义的网络浏览器。
TextInspectorTool：将文档作为文本文件打开并返回其内容。
SpeechToTextTool：将音频文件转录为文本。我们使用基于distil-whisper的默认工具。
VisualQATool：视觉分析图像。为此，我们使用了闪亮的新Idefics2-8b-chatty！

我们首先初始化这些工具（更多详细信息，请检查仓库中的代码）。

然后我们初始化我们的智能体

from transformers.agents import ReactCodeAgent, HfEngine

TASK_SOLVING_TOOLBOX = [
    SearchTool(),
    VisualQATool(),
    SpeechToTextTool(),
    TextInspectorTool(),
]

react_agent_hf = ReactCodeAgent(
    tools=TASK_SOLVING_TOOLBOX,
    llm_engine=HfEngine(model="meta-llama/Meta-Llama-3-70B-Instruct"),
    memory_verbose=True,
)

在完成 165 个问题所需的时间之后，我们将结果提交到 GAIA 排行榜，然后……🥁🥁🥁

GAIA leaderboard

⇒ 我们的智能体排名第四：它击败了许多基于 GPT-4 的智能体，现在是开源类别中的卫冕者！

结论

我们将在未来几个月内继续改进此软件包。我们已经确定了开发路线图中几个令人兴奋的方向

更多智能体共享选项：目前您可以从 Hub 推送或加载工具，我们也将实现推送/加载智能体。
更好的工具，特别是图像处理工具。
长期记忆管理。
多智能体协作。

👉 去试试 Transformers Agents 吧！我们期待收到您的反馈和想法。

让我们用更多的开源模型填满排行榜顶部！🚀

transformers.agents 现已升级为独立库 smolagents！这两个库的 API 非常相似，因此切换很容易。请查看 smolagents 介绍博客。

更多博客文章

作为LangChain Agents的开源LLM

作者： 2024年1月24日 • 69

CodeAgents + Structure: 一种更好的执行操作的方式

作者： 2025年5月28日 • 71

社区

通过拖放到文本输入框、粘贴或点击此处上传图片、音频和视频。

点击或粘贴此处以上传图片

· 注册或登录发表评论

134