Vision Agents with smolagents

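The examples in this section require access to a powerful VLM. We tested them with the GPT-4o API. However, the "Why use smolagents" section discusses alternative solutions supported by smolagents and Hugging Face. If you would like to explore other options, be sure to check that section.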

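Giving agents visual capabilities is essential for tasks that go beyond text processing. Many real-world challenges, such as web browsing or document understanding, require analyzing rich visual content. Fortunately, smolagents provides built-in support for vision-language models (VLMs), enabling agents to process and interpret images effectively.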

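In this example, imagine that Alfred, the butler at Wayne Manor, is tasked with verifying the identities of the guests attending the party. As you can imagine, Alfred may not be familiar with everyone who arrives. To help him, we can use an agent that verifies each guest's identity by using a VLM to search for visual information about their appearance. This will allow Alfred to make informed decisions about who may enter. Let's build this example!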

Providing Images at the Start of the Agent's Execution

You can follow along with the code in this notebook, which you can run using Google Colab.

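In this approach, images are passed to the agent at the start and stored as task_images alongside the task prompt. The agent then processes these images throughout its execution.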

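Consider this scenario: Alfred wants to verify the identities of the superheroes attending the party. He already has a dataset of images from previous parties, labeled with the guests' names. Given a new visitor's image, the agent can compare it with the existing dataset and decide whether to let them in.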

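In this case, a guest is trying to enter, and Alfred suspects that the visitor might be The Joker impersonating Wonder Woman. Alfred needs to verify the visitor's identity to keep anyone unwanted from entering.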

Let's build the example. First, load the images. Here we use images from Wikipedia to keep the example minimal, but imagine the possible use cases!

from PIL import Image
import requests
from io import BytesIO

image_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e8/The_Joker_at_Wax_Museum_Plus.jpg", # Joker image
    "https://upload.wikimedia.org/wikipedia/en/9/98/Joker_%28DC_Comics_character%29.jpg" # Joker image
]

images = []
for url in image_urls:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail early if the download did not succeed
    image = Image.open(BytesIO(response.content)).convert("RGB")
    images.append(image)

Now that we have the images, the agent will tell us whether a guest is actually a superhero (Wonder Woman) or a villain (The Joker).

from smolagents import CodeAgent, OpenAIServerModel

model = OpenAIServerModel(model_id="gpt-4o")

# Instantiate the agent
agent = CodeAgent(
    tools=[],
    model=model,
    max_steps=20,
    verbosity_level=2
)

response = agent.run(
    """
    Describe the costume and makeup that the comic character in these photos is wearing and return the description.
    Tell me if the guest is The Joker or Wonder Woman.
    """,
    images=images
)

In my run, the output was the following, although yours may differ, as we've already discussed:

    {
        'Costume and Makeup - First Image': (
            'Purple coat and a purple silk-like cravat or tie over a mustard-yellow shirt.',
            'White face paint with exaggerated features, dark eyebrows, blue eye makeup, red lips forming a wide smile.'
        ),
        'Costume and Makeup - Second Image': (
            'Dark suit with a flower on the lapel, holding a playing card.',
            'Pale skin, green hair, very red lips with an exaggerated grin.'
        ),
        'Character Identity': 'This character resembles known depictions of The Joker from comic book media.'
    }

In this case, the output reveals that the person is impersonating someone else, so we can prevent The Joker from entering the party!
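One way to turn such a report into an admit-or-deny decision is a small post-processing step. The sketch below is only an illustration: it assumes the agent returned a dict shaped like the sample above (the key name 'Character Identity' comes from that single run, and other runs may use different keys or plain text), and is_impostor is a hypothetical helper, not part of smolagents.

```python
def is_impostor(report: dict, claimed_identity: str) -> bool:
    """Return True when the report's identity text does not mention the claimed name.

    Hypothetical helper: assumes `report` has a 'Character Identity' entry,
    as in the sample output above. Real agent output may be shaped differently.
    """
    identity_text = str(report.get("Character Identity", "")).lower()
    return claimed_identity.lower() not in identity_text


# The identity line from the sample run above.
sample_report = {
    "Character Identity": (
        "This character resembles known depictions of The Joker "
        "from comic book media."
    )
}

# The guest claims to be Wonder Woman, but the report points to The Joker.
deny_entry = is_impostor(sample_report, claimed_identity="Wonder Woman")
print("Deny entry" if deny_entry else "Allow entry")
```

A plain substring check like this is crude. In practice it may be more reliable to ask the agent for a fixed answer format in the task prompt and parse that instead.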

Providing Images with Dynamic Retrieval

You can follow along with the code in this Python file.

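The previous approach is valuable and has many potential use cases. However, when the guest is not in the database, we need other ways to identify them. One possible solution is to retrieve images and information dynamically from external sources, for example by browsing the web for details.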

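In this approach, images are added to the agent's memory dynamically during execution. As we know, agents in smolagents are based on the MultiStepAgent class, which is an abstraction of the ReAct framework. This class operates in a structured loop in which various variables and knowledge are logged at different stages: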

SystemPromptStep: Stores the system prompt.
TaskStep: Logs the user query and any provided input.
ActionStep: Captures logs from the agent's actions and results.

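This structured approach allows agents to incorporate visual information dynamically and respond adaptively to evolving tasks. Below is the diagram we have already seen, illustrating the dynamic workflow and how the different steps fit into the agent lifecycle. While browsing, the agent can take screenshots and save them as observations_images in the ActionStep.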

Dynamic image retrieval
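To make the three step types concrete, here is a simplified plain-Python model of the agent's memory. The classes below are stand-ins written for this illustration; they borrow only the names and the fields mentioned above (task_images, observations_images), and the real smolagents classes carry more fields and behavior.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class SystemPromptStep:
    system_prompt: str  # stores the system prompt


@dataclass
class TaskStep:
    task: str  # the user query
    task_images: List[Any] = field(default_factory=list)  # images given up front


@dataclass
class ActionStep:
    step_number: int
    observations: Optional[str] = None  # text logged from the action's result
    observations_images: Optional[List[Any]] = None  # e.g. browser screenshots


# A toy memory for a two-step browsing run. Strings stand in for real images.
memory = [
    SystemPromptStep(system_prompt="You are a helpful browsing agent."),
    TaskStep(task="Verify the guest's identity.", task_images=["guest_photo"]),
    ActionStep(step_number=1, observations="Current url: https://en.wikipedia.org"),
    ActionStep(step_number=2),
]

# Images can be attached to an ActionStep while the run is in progress.
memory[-1].observations_images = ["screenshot_step_2"]

steps_with_images = [
    s.step_number
    for s in memory
    if isinstance(s, ActionStep) and s.observations_images
]
print(steps_with_images)
```

The point of the sketch is the split between the two image fields: task_images is filled once when the task is submitted, while observations_images can be written on any ActionStep as the run proceeds.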

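Now that we understand the requirements, let's build the complete example. In this case, Alfred wants full control over the guest verification process, so browsing for details becomes a viable solution. To complete this example, we need a new set of tools for the agent. We will also use Selenium and Helium, which are browser automation tools. This will let us build an agent that explores the web, searches for details about a potential guest, and retrieves verification information. Let's install the required tools: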

pip install "smolagents[all]" helium selenium python-dotenv

We'll need a set of agent tools designed specifically for browsing, such as search_item_ctrl_f, go_back, and close_popups. These tools let the agent act like a person navigating the web.

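from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from smolagents import tool

# The tools below use a module-level Selenium `driver`. It is assumed to have
# been created beforehand, for example by starting a browser with Helium.


@tool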
def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
    """
    Searches for text on the current page via Ctrl + F and jumps to the nth occurrence.
    Args:
        text: The text to search for
        nth_result: Which occurrence to jump to (default: 1)
    """
    elements = driver.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
    if nth_result > len(elements):
        raise Exception(f"Match n°{nth_result} not found (only {len(elements)} matches found)")
    result = f"Found {len(elements)} matches for '{text}'."
    elem = elements[nth_result - 1]
    driver.execute_script("arguments[0].scrollIntoView(true);", elem)
    result += f" Focused on element {nth_result} of {len(elements)}"
    return result


@tool
def go_back() -> None:
    """Goes back to previous page."""
    driver.back()


@tool
def close_popups() -> str:
    """
    Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows! This does not work on cookie consent banners.
    """
    webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()
    return "Sent Escape to dismiss any open pop-up."  # Match the declared str return type
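One detail worth noting: search_item_ctrl_f places text directly inside single quotes in its XPath expression, so a search term that itself contains a single quote (for example "Diana's lasso") produces an invalid expression. XPath 1.0 has no escape character, and the usual workaround is to switch quote style or build the string with concat(). The helper names below (xpath_literal, build_text_xpath) are hypothetical and written for this sketch; they are not part of Selenium or smolagents.

```python
def xpath_literal(text: str) -> str:
    """Quote `text` as an XPath 1.0 string literal.

    XPath 1.0 has no escape character, so a string containing both quote
    types has to be assembled with concat().
    """
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Split on single quotes and glue the pieces back with a quoted "'".
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join(f"'{part}'" for part in parts) + ")"


def build_text_xpath(text: str) -> str:
    """Build the same kind of expression search_item_ctrl_f uses, safely quoted."""
    return f"//*[contains(text(), {xpath_literal(text)})]"


print(build_text_xpath("Wonder Woman"))
print(build_text_xpath("Diana's lasso"))
```

Inside the tool, the resulting string could be passed to driver.find_elements(By.XPATH, ...) in place of the hand-written f-string.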

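We also need a way to save screenshots, since they are essential for our VLM agent to complete the task. The function below captures a screenshot and stores it with step_log.observations_images = [image.copy()], allowing the agent to store and process images dynamically as it navigates.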

from io import BytesIO
from time import sleep

import helium
from PIL import Image
from smolagents import CodeAgent
from smolagents.agents import ActionStep


def save_screenshot(step_log: ActionStep, agent: CodeAgent) -> None:
    sleep(1.0)  # Let JavaScript animations happen before taking the screenshot
    driver = helium.get_driver()
    if driver is None:
        return  # No browser is running, so there is nothing to capture

    current_step = step_log.step_number
    # Remove screenshots from older steps to keep processing lean.
    # Recent smolagents releases expose past steps as agent.memory.steps
    # (older releases used agent.logs).
    for previous_step in agent.memory.steps:
        if isinstance(previous_step, ActionStep) and previous_step.step_number <= current_step - 2:
            previous_step.observations_images = None

    png_bytes = driver.get_screenshot_as_png()
    image = Image.open(BytesIO(png_bytes))
    print(f"Captured a browser screenshot: {image.size} pixels")
    step_log.observations_images = [image.copy()]  # Create a copy to ensure it persists, important!

    # Update observations with current URL
    url_info = f"Current url: {driver.current_url}"
    step_log.observations = url_info if step_log.observations is None else step_log.observations + "\n" + url_info

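This function is passed to the agent through step_callbacks, so it is triggered at the end of every step of the agent's execution. This lets the agent capture and store screenshots dynamically throughout its run.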
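The effect of such a callback on memory can be seen in isolation with a toy loop. The sketch below does not use smolagents or a browser: ToyStep, fake_screenshot_callback, and run_toy_agent are stand-ins invented for this illustration, and strings take the place of screenshots. It only demonstrates the pruning rule from save_screenshot, where images from steps numbered current_step - 2 or lower are dropped.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyStep:
    step_number: int
    observations_images: Optional[List[str]] = None


def fake_screenshot_callback(step: ToyStep, history: List[ToyStep]) -> None:
    """Mimic save_screenshot: prune old images, then attach a new one."""
    for previous in history:
        if previous.step_number <= step.step_number - 2:
            previous.observations_images = None
    step.observations_images = [f"screenshot_{step.step_number}"]


def run_toy_agent(num_steps: int, callbacks: List[Callable]) -> List[ToyStep]:
    """Run a fake agent loop that fires every callback at the end of each step."""
    history: List[ToyStep] = []
    for number in range(1, num_steps + 1):
        step = ToyStep(step_number=number)
        history.append(step)
        for callback in callbacks:
            callback(step, history)
    return history


history = run_toy_agent(num_steps=4, callbacks=[fake_screenshot_callback])
steps_with_images = [s.step_number for s in history if s.observations_images]
print(steps_with_images)
```

After four steps, only the two most recent steps still hold an image, which keeps the amount of visual context sent to the VLM small.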

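Now we can create our vision agent for browsing the web, giving it the tools we created along with DuckDuckGoSearchTool to explore the web. This tool will help the agent retrieve the information needed to verify guests' identities based on visual cues.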

from smolagents import CodeAgent, OpenAIServerModel, DuckDuckGoSearchTool
model = OpenAIServerModel(model_id="gpt-4o")

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), go_back, close_popups, search_item_ctrl_f],
    model=model,
    additional_authorized_imports=["helium"],
    step_callbacks=[save_screenshot],
    max_steps=20,
    verbosity_level=2,
)

With this, Alfred is ready to check guests' identities and make informed decisions about whether to let them into the party:

agent.run("""
I am Alfred, the butler of Wayne Manor, responsible for verifying the identity of guests at the party. A superhero has arrived at the entrance claiming to be Wonder Woman, but I need to confirm if she is who she says she is.

Please search for images of Wonder Woman and generate a detailed visual description based on those images. Additionally, navigate to Wikipedia to gather key details about her appearance. With this information, I can determine whether to grant her access to the event.
""" + helium_instructions)

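As you can see, we include helium_instructions as part of the task. This special prompt is meant to control the agent's navigation, ensuring that it follows the correct steps while browsing the web.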
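The contents of helium_instructions are not reproduced in this section. As a rough idea of what such a prompt can contain, here is an abbreviated, illustrative version. The wording is an assumption written for this example rather than the course's actual prompt. It refers to Helium functions such as go_to, click, scroll_down, and Link, and to the close_popups tool defined earlier.

```python
# Illustrative only: a shortened stand-in for the real helium_instructions prompt.
example_helium_instructions = """
You can use helium to browse websites. The browser is already running.
Start your code with `from helium import *`.

To open a page:
go_to('en.wikipedia.org')

To click an element by its visible text:
click("Top products")

To click a link specifically:
click(Link("Top products"))

To scroll the page by a number of pixels:
scroll_down(num_pixels=1200)

If a pop-up blocks the page, call the `close_popups` tool instead of hunting for its close button.
After each action, stop and wait for the next screenshot before deciding what to do.
"""

# The instructions are simply appended to the task text, as in the call above.
task = "Find a visual description of Wonder Woman." + example_helium_instructions
print(task)
```

Keeping the navigation rules in a separate string makes it easy to reuse them across different tasks for the same browsing agent.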

Let's see how this works in the video below:

This is the final output:

Final answer: Wonder Woman is typically depicted wearing a red and gold bustier, blue shorts or skirt with white stars, a golden tiara, silver bracelets, and a golden Lasso of Truth. She is Princess Diana of Themyscira, known as Diana Prince in the world of men.

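With all of this, we've successfully created our identity verifier for the party! Alfred now has the tools he needs to ensure that only the right guests make it through the door. Everything is set for a good time at Wayne Manor!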
