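Deep Search with Visual RAG in an Agentic Framework 🔎

Community Article, published March 21, 2025

Today, we explore how visual retrieval methods such as ColPali can significantly enhance Retrieval-Augmented Generation (RAG) systems, particularly when they are integrated into an agentic setting. These advances improve retrieval quality, at the cost of increased test-time compute.

Example query to the deep search agent. The agent answers from internal sources and carefully cross-checks them against online information to assign a confidence label.

Our tool is available on HuggingFace!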

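Understanding RAG

Retrieval-Augmented Generation (**RAG**) enhances the capabilities of large language models (**LLMs**) by retrieving relevant external information to support the generation process. A traditional RAG system typically:

**Retrieves** relevant context from a knowledge corpus based on the user query.
**Prepends** this context to the original query.
**Forwards** the enriched query to the LLM to generate a response.

Illustration of a standard RAG pipeline.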
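
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the toy corpus, the word-overlap `retrieve` function, and the `echo_llm` stand-in are all assumptions made for the example, and a real system would use an embedding-based retriever and an actual LLM call.

```python
from typing import Callable, List

# Toy corpus standing in for a real knowledge base (illustrative data only).
CORPUS = [
    "ColPali embeds document page images directly.",
    "RAG retrieves external context before generation.",
    "Agents can call tools such as web search.",
]


def retrieve(query: str, corpus: List[str], k: int = 1) -> List[str]:
    """Step 1: rank passages by word overlap with the query (stand-in retriever)."""
    query_words = set(query.lower().split())

    def overlap(passage: str) -> int:
        return len(query_words & set(passage.lower().split()))

    return sorted(corpus, key=overlap, reverse=True)[:k]


def build_prompt(query: str, context: List[str]) -> str:
    """Step 2: prepend the retrieved context to the original query."""
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query


def rag(query: str, llm: Callable[[str], str], k: int = 1) -> str:
    """Step 3: forward the enriched prompt to the LLM."""
    context = retrieve(query, CORPUS, k)
    return llm(build_prompt(query, context))


def echo_llm(prompt: str) -> str:
    """Hypothetical LLM stand-in that returns the prompt it receives."""
    return prompt


prompt = rag("What does RAG retrieve?", llm=echo_llm)
print(prompt)
```

Passing the LLM as a callable keeps the sketch independent of any particular provider.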

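The Rise of Visual RAG

An exciting extension of RAG is **Visual RAG**, which is particularly useful for document question answering. Traditional document QA approaches typically rely on complex pipelines involving optical character recognition (**OCR**) and document layout detection. However, the recent paper "ColPali: Efficient Document Retrieval with Vision Language Models" shows how vision language models (**VLMs**) such as ColPali can simplify this process by **directly embedding** screenshots of document pages, which removes the need for a text-extraction pipeline. This both **improves computational efficiency** and improves overall retrieval performance.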
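
To give an intuition for how page screenshots are matched to a query, here is a small sketch of the late-interaction (MaxSim) scoring described in the ColPali paper: each page is represented by many patch embeddings, each query by several token embeddings, and the score sums, over query tokens, the best similarity against any patch. The 2-D vectors below are made up for illustration; real embeddings are much higher-dimensional and come from the model.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """Sum over query tokens of the similarity to the best-matching page patch."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


# Toy embeddings (assumed values, not model outputs).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]  # has a good match for both query tokens
page_b = [[0.9, 0.1], [0.8, 0.2]]  # matches only the first query token well

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a > score_b)
```

Because every query token looks for its own best patch, a page is rewarded for covering all parts of the query rather than matching a single averaged vector.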

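Limitations of Current RAG Systems

Despite these advances, traditional RAG systems (visual or textual) still face significant limitations:

**Query sensitivity:** The system is overly sensitive to how the query is phrased and may miss relevant information when the query style differs from the corpus content.
**Single-hop retrieval:** A typical RAG system performs only one retrieval step, which limits its ability to handle complex queries that need information from several parts of a document (for example, multiple tables or scattered definitions).

Previous attempts to address these challenges have brought only marginal improvements.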

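Enter Agentic RAG

A promising way to address these limitations is to integrate the retriever into an **agentic framework**.

In an agentic setting, a RAG system gains considerable flexibility:

**Interactive query handling:** The agent dynamically reformulates the query into multiple sub-queries.
**Iterative retrieval:** The agent performs several rounds of retrieval, progressively synthesizing information until it reaches a comprehensive and satisfactory response.
**External tool use:** The agent can use external tools such as web search, and can also generate and run code, which lets it perform mathematical operations, produce charts, and more.

This iterative, agent-driven approach yields richer, deeper, and more contextually relevant responses, and it significantly strengthens visual RAG systems such as ColPali.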
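
The query reformulation and iterative retrieval described above can be sketched as a simple loop. This is an illustrative outline only: `decompose`, `retrieve`, and `is_sufficient` are hypothetical callables (in a real agent an LLM would play these roles), and `FAKE_INDEX` holds made-up passages.

```python
from typing import Callable, Dict, List


def agentic_retrieve(
    query: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[List[str]], bool],
    max_rounds: int = 3,
) -> List[str]:
    """Run one sub-query per round until the evidence is judged sufficient."""
    evidence: List[str] = []
    pending = decompose(query)  # reformulate the query into sub-queries
    for _ in range(max_rounds):
        if not pending:
            break
        sub_query = pending.pop(0)
        for passage in retrieve(sub_query):
            if passage not in evidence:  # avoid duplicate passages
                evidence.append(passage)
        if is_sufficient(evidence):
            break
    return evidence


# Made-up index and stand-in callables, for illustration only.
FAKE_INDEX: Dict[str, List[str]] = {
    "revenue 2022": ["Table 1: revenue figures for 2022."],
    "revenue 2023": ["Table 2: revenue figures for 2023."],
}

evidence = agentic_retrieve(
    "How did revenue change from 2022 to 2023?",
    decompose=lambda q: ["revenue 2022", "revenue 2023"],
    retrieve=lambda sub_query: FAKE_INDEX.get(sub_query, []),
    is_sufficient=lambda ev: len(ev) >= 2,
)
print(evidence)
```

A single-hop retriever would have issued only the original question; here each sub-query targets one table, which is the multi-hop case that standard RAG struggles with.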

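Designing a Visual RAG Agent

Let's build a simple visual RAG agent step by step with smolagents.

Agents work effectively when they have specialized tools tailored to their goals. To demonstrate this, we build a dedicated **Visual RAG tool**.

Concretely, we build the tool by creating a custom class that inherits from `smolagents.Tool`.

First, we define the class and override the `setup` method, which prepares the tool before it is used.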

from smolagents import Tool


class VisualRAGTool(Tool):
    name = "visual_rag"
    description = """Performs a RAG query on your internal PDF documents and returns the generated text response."""
    inputs = {...}
    output_type = "string"

    def _init_models(self, model_name: str) -> None:
        # Heavy imports are deferred until the models are actually loaded
        import torch
        from colpali_engine.models import ColQwen2, ColQwen2Processor

        self.device = "cuda" if torch.cuda.is_available() else "cpu"  # or 'mps' for Apple silicon

        # Init the model and processor
        self.model = ColQwen2.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            # Requires the flash-attn package and a supported GPU
            attn_implementation="flash_attention_2",
        ).eval()
        self.processor = ColQwen2Processor.from_pretrained(model_name)

    def setup(self):
        """
        Override this method for any expensive operation that must run before the tool is first used, such as loading a large model.
        """
        # Init the models
        self._init_models(self.model_name)

        # Initialize the in-memory stores for embeddings and pages
        self.embds = []
        self.pages = []

        self.is_initialized = True

Next, we define our indexing method by creating an `index` function that fills the `self.pages` and `self.embds` attributes.

def index(self, files: list, contextualize: bool = True, api_key: str = None) -> int:
    """Indexes the uploaded files."""
    if not self.is_initialized:
        self.setup()
        
    # Convert files to images and extract metadata
    pgs = self.preprocess(files, contextualize=contextualize, api_key=api_key or self.api_key)

    # Embed the images
    embds = self.compute_embeddings(pgs)
    
    # Extend the pages
    self.pages.extend(pgs)
    self.embds.extend(embds)
    
    return len(embds)

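Here, `preprocess` converts the files into `Page` objects, a type that represents a document page image together with its metadata. The pages are then embedded and stored in the attributes.

⚠️ We use a very basic strategy here. Indexing with a vector database instead of plain arrays can improve the tool's performance!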
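
To make the trade-off concrete, here is a minimal sketch of what array-based retrieval amounts to: every query scans all stored embeddings. The `InMemoryIndex` class, its method names, and the single-vector dot-product score are assumptions for illustration; they are not the tool's actual `retrieve` implementation, which is not shown in this article.

```python
import heapq
from typing import Callable, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b))


class InMemoryIndex:
    """Brute-force index: embeddings live in a list and are all scanned per query."""

    def __init__(self, score_fn: Callable[[Sequence[float], Sequence[float]], float] = dot):
        self.score_fn = score_fn
        self.embds: List[Sequence[float]] = []
        self.pages: List[str] = []

    def add(self, page: str, embedding: Sequence[float]) -> None:
        self.pages.append(page)
        self.embds.append(embedding)

    def top_k(self, query_embedding: Sequence[float], k: int = 1) -> List[Tuple[float, str]]:
        """Score every stored page (linear in corpus size) and keep the k best."""
        scored = (
            (self.score_fn(query_embedding, embd), page)
            for embd, page in zip(self.embds, self.pages)
        )
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy embeddings (assumed values, not model outputs).
index = InMemoryIndex()
index.add("page-1", [1.0, 0.0])
index.add("page-2", [0.0, 1.0])
index.add("page-3", [0.7, 0.7])

results = index.top_k([1.0, 0.2], k=2)
print([page for _, page in results])
```

The linear scan is fine for a handful of PDFs; a vector database replaces it with an approximate nearest-neighbor search that scales to much larger corpora.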

Finally, we need to define the tool's main function: `forward`. In our case, `forward` runs the whole RAG pipeline and returns the LLM's response.

def forward(self, query: str, k: int = 1, api_key: str = None) -> str:
    assert isinstance(query, str), "Your search query must be a string"

    # Retrieve the top k documents and generate response. We return the second element of the tuple only (the RAG answer)
    return self.search(
        query=query, 
        k=k, 
        api_key=api_key
    )[1]

def search(self, query: str, k: int = 1, api_key: str = None) -> tuple:
    """Searches for the most relevant pages based on the query."""
    # Retrieve the top k documents
    context = self.retrieve(query, k)

    # Generate response from GPT-4o-mini
    rag_answer = self.generate_answer(
        query=query, 
        docs=context, 
        api_key=api_key
    )

    return context, rag_answer.content

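Note that we go through an intermediate `search` function, which exposes the retrieved context alongside the answer. `forward` returns only the answer string because the `Tool` class allows a single output type!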

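The Visual RAG Tool in Action

Let's test our tool!

First, we index our entire document corpus with **ColQwen2**. Then, when the system receives a query at runtime, it uses **ColQwen2** to fetch the **k** most relevant document pages. These retrieved pages, along with some additional context, are passed to GPT-4o-mini, which directly generates a text response to the user's query. We take particular care that GPT-4o-mini accurately cites the exact pages and documents used to form each response. Our tool is available on HuggingFace. Let's illustrate this with a practical example, using a youth magazine published by the European Commission that gives an overview of climate change science.

Using it is this simple: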

from smolagents import load_tool
# Load the visual RAG tool
visual_rag_tool = load_tool(
    "vidore/visual-rag-tool",
    trust_remote_code=True,
    api_key="YOUR_OPENAI_KEY"
)

# Index the PDF document
visual_rag_tool.index(["./report.pdf"])

# Query the tool
visual_rag_tool("What share of the world's water is suitable for human consumption?", k=3)

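Here is the helpful answer we obtained:

Only 2.5% of the water on Earth is freshwater. Of that freshwater, more than two-thirds is frozen in glaciers and polar ice caps, which makes most of it unavailable for consumption. As a result, the share of water suitable for human consumption is very small [1, p. 11].

Sources
[1] climate_youth_magazine.pdf

As shown, the Visual RAG tool gives a precise answer and accurately cites its source. Feel free to experiment with the number of retrieved pages (`k`) used as context.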

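Integrating the Tool into an Agentic Framework

DeepSearch is an emerging AI framework in which agents carry out complex, multi-step research tasks to answer a user's query thoroughly. When the agent receives a query, it draws up a structured plan, breaks the problem down, and strategically runs searches across a range of trusted sources and external tools.

Example of the plan the agent generates when it receives a query. First, it analyzes the question with the Visual RAG tool. Then, it checks the information with the team member `verifier`. Finally, it makes sure the expected format is followed by consulting the `formatter`.

Integrating our Visual RAG tool into a DeepSearch framework is a particularly compelling direction. The main advantage of combining Visual RAG with a more autonomous orchestrating agent is that the agent can decompose the user query and independently work through every step needed to produce a complex answer that satisfies all required constraints.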

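Here is what a DeepSearch setup enhanced with Visual RAG could look like:

In this setup, the **QA agent** acts as the orchestrator: it handles incoming queries and plans the best way to respond by interacting with our Visual RAG tool and with two specialized agents:

**Web verification agent**: Cross-references internally retrieved information against external web sources and assigns a confidence score to ensure accuracy and reliability.
**Formatting agent**: Improves the clarity and readability of responses, structures them effectively for the user, and can visualize key data insights.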
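
The division of labor between the three components can be sketched with plain functions. This is a simplified outline under stated assumptions: `orchestrate`, `VerifiedAnswer`, and the lambda stand-ins are hypothetical, and a real orchestrating agent decides dynamically which component to call rather than following a fixed sequence. The confidence labels mirror the ones used in the verifier's description (high, medium, low, unverified).

```python
from dataclasses import dataclass
from typing import Callable

ALLOWED_LABELS = {"high", "medium", "low", "unverified"}


@dataclass
class VerifiedAnswer:
    text: str
    confidence: str  # one of ALLOWED_LABELS


def orchestrate(
    query: str,
    rag_tool: Callable[[str], str],
    verifier: Callable[[str, str], str],
    formatter: Callable[[VerifiedAnswer], str],
) -> str:
    """Draft an answer from internal documents, verify it, then format it."""
    draft = rag_tool(query)
    label = verifier(query, draft).strip().lower()
    if label not in ALLOWED_LABELS:
        label = "unverified"  # fall back when the verifier returns something unexpected
    return formatter(VerifiedAnswer(text=draft, confidence=label))


# Stand-in components (placeholders, not real agent outputs).
answer = orchestrate(
    "example query",
    rag_tool=lambda q: "draft answer",
    verifier=lambda q, draft: "HIGH",
    formatter=lambda a: f"{a.text} (confidence: {a.confidence})",
)
print(answer)
```

Normalizing the label in one place keeps the final response predictable even if the verifier's wording drifts.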

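Building the QA Agent with `smolagents`

We implemented this orchestrator with the user-friendly `smolagents` framework. Here is how to set it up in practice, plugging in the Visual RAG tool we created earlier: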

from smolagents import CodeAgent, DuckDuckGoSearchTool
from smolagents import OpenAIServerModel

gpt_4o_mini = OpenAIServerModel(
    model_id="gpt-4o-mini",
    api_base="https://api.openai.com/v1",
    api_key="YOUR_API_KEY",
)

# Define the verifier agent
VERIFIER_AGENT = CodeAgent(
    tools=[
        DuckDuckGoSearchTool(),
    ],
    model=gpt_4o_mini,
    max_steps=6,
    verbosity_level=2,
    planning_interval=3,
    name="verifier",
    description=\
        """This agent takes the user query as an input, associated information and context found by the previous agent, and must output a response that confirms the veracity of the previous agent's response using web searches.
           The verifier should provide a confidence score (high, medium, low) and a textual explanation of the confidence score.
            If the verifier cannot find relevant information, it should state it as 'unverified'."""
)

# Define the formatter agent
FORMATTER_AGENT = CodeAgent(
    tools=[
        DuckDuckGoSearchTool(),
    ],
    model=gpt_4o_mini,
    max_steps=3,
    verbosity_level=2,
    planning_interval=1,
    name="formatter",
    description=\
        """This agent takes the agent's response as an input and must output a formatted response that is easy to read and understand.
            The response should follow the user's specifications and be as clear as possible.
            The agent can ask for additional information if needed."""
)

With these in place, we can define our `QA_AGENT`, which uses the `visual_rag_tool` created earlier.

gpt_4o = OpenAIServerModel(
    model_id="gpt-4o",
    api_base="https://api.openai.com/v1",
    api_key="YOUR_API_KEY",
)

# Define the QA Agent
QA_AGENT = CodeAgent(
        name="qa_agent",
        description=\
        """The agent takes a user query as input and is tasked with providing a detailed response to the query. 
            It uses internal documents via the RAG Tool as its first source of information to answer the questions. It can use external sources (such as web searches) only as a fallback when no relevant information is found within the internal sources.
           Once the agent has gathered the information, it will assess the confidence level in the sources found using the `verifier` agent. This confidence score will be included in the final response to give the user an understanding of how reliable the provided information is.
           The final response should be detailed and cite the information sources. It should follow the format specified by the user using the `formatter` agent.""",
        tools=[
            visual_rag_tool,
            DuckDuckGoSearchTool(),
        ],
        managed_agents=[
            VERIFIER_AGENT, 
            FORMATTER_AGENT
        ],
        model=gpt_4o,
        max_steps=10,
        verbosity_level=2,
        planning_interval=3,
        add_base_tools=True,
        additional_authorized_imports=["pandas", "seaborn", "numpy", "matplotlib", "PIL", "io"],
    )

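A few remarks:

`DuckDuckGoSearchTool` is a tool for running online searches. It is powered by the DuckDuckGo search engine.
The `description` field of the `CodeAgent` class is essential because it states the agent's purpose within the ecosystem. Other agents rely on it when they plan whether to call this agent.
`max_steps` caps the number of steps the agent may take, while `planning_interval` sets how often the agent runs a planning step. Feel free to adjust these parameters to your needs.
We chose `gpt-4o` and `gpt-4o-mini` as the backbone LLMs for our agents, but you are free to use any model supported by smolagents.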
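
To get a feel for how `max_steps` and `planning_interval` interact, the sketch below lists the steps at which a planning phase would run. It assumes that planning happens before the first step and then once every `planning_interval` steps; the exact schedule is an assumption and may differ between smolagents versions, so treat this as an approximation rather than the library's behavior.

```python
from typing import List


def planning_steps(max_steps: int, planning_interval: int) -> List[int]:
    """Steps (1-indexed) that start with a planning phase, under the stated assumption."""
    return [
        step
        for step in range(1, max_steps + 1)
        if (step - 1) % planning_interval == 0
    ]


# Values taken from the agent definitions above.
qa_schedule = planning_steps(max_steps=10, planning_interval=3)
verifier_schedule = planning_steps(max_steps=6, planning_interval=3)
formatter_schedule = planning_steps(max_steps=3, planning_interval=1)

print(qa_schedule, verifier_schedule, formatter_schedule)
```

A smaller interval makes the agent re-plan more often, which costs extra LLM calls but helps it recover when an earlier plan turns out to be wrong.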

We interact with the agent through the front end that `smolagents` provides out of the box, simply by running:

from smolagents import GradioUI

# Launch the built-in Gradio chat interface for the QA agent
GradioUI(QA_AGENT).launch()

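Now let's see how the agent performs. We use the same query that we submitted to the RAG tool earlier.

Output

Based on internal and external references, less than 1% of Earth's total water is suitable for human drinking. This share follows from the fact that only 2.5% of Earth's water is freshwater and that more than 2/3 of it is locked in glaciers, polar ice caps, or inaccessible groundwater [1]. External sources confirm this, indicating that roughly 0.5% to 1% of Earth's total water is available for human drinking, which is consistent with estimates from the U.S. Bureau of Reclamation and with various WHO-backed guidelines on water safety and availability [2][3]. The verification task confirmed that this conclusion is highly reliable, with a confidence rating of *high*.

Sources

[1] Climate Youth Magazine. "Water Availability: Facts and Impacts." PDF document, page 11.

[2] U.S. Bureau of Reclamation. "Water Facts - Worldwide Water Supply." Accessed October 2023. https://www.usbr.gov/mp/arwec/water-facts-ww-water-sup.html

[3] World Water Reserve. "What Percentage of Earth's Water Is Drinkable?" Accessed October 2023. https://worldwaterreserve.com/percentage-of-drinkable-water-on-earth/

Conclusion

The share of the world's water that is suitable for human consumption is estimated at between 0.5% and 1%. Given the reliability of the cited sources and the verification process applied, this answer is highly accurate and trustworthy, and it confirms that only a small fraction of Earth's vast water reserves is available for human use.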