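ScreenEnv: Deploy Your Full-Stack Desktop Agent

Published July 10, 2025

TL;DR: ScreenEnv is a Python library that creates isolated Ubuntu desktop environments in Docker containers for testing and deploying GUI agents (also known as computer-use agents). With built-in support for the Model Context Protocol (MCP), it simplifies deploying desktop agents that can see, click, and interact with real applications.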

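What is ScreenEnv?

Imagine you need to automate desktop tasks, test GUI applications, or build an AI agent that can interact with software. In the past, this required complex VM setups and brittle automation frameworks.

ScreenEnv changes this by providing a **sandboxed desktop environment** that runs in a Docker container. Think of it as a complete virtual desktop session under your code's full control: beyond clicking buttons and typing text, it manages the whole desktop experience, including launching applications, organizing windows, handling files, executing terminal commands, and recording the entire session.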

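Why ScreenEnv?

🖥️ Full desktop control: complete mouse and keyboard automation, window management, application launching, file operations, terminal access, and screen recording
🤖 Dual integration modes: supports both the Model Context Protocol (MCP) for AI systems and a direct Sandbox API, so it adapts to any agent or backend logic
🐳 Docker native: no complex VM setup, just Docker. Environments are isolated, reproducible, and can be deployed anywhere in less than 10 seconds. Both AMD64 and ARM64 architectures are supported.

🎯 One-Line Setup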

from screenenv import Sandbox
sandbox = Sandbox()  # That's it!

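Two Integration Approaches

ScreenEnv offers **two complementary ways** to integrate with your agents and backend systems, so you can choose the approach that best fits your architecture.

Option 1: Direct Sandbox API

A good fit for custom agent frameworks, existing backends, or cases where you need fine-grained control.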

from io import BytesIO

from PIL import Image
from screenenv import Sandbox

# Direct programmatic control
sandbox = Sandbox(headless=False)
sandbox.launch("xfce4-terminal")
sandbox.write("echo 'Custom agent logic'")
screenshot_bytes = sandbox.screenshot()  # raw image bytes
image = Image.open(BytesIO(screenshot_bytes))
...
sandbox.close()
# If close() isn’t called, you might need to shut down the container yourself.
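
Because a forgotten close() can leave the container running, it may help to tie cleanup to a with block. The sketch below uses the standard library's contextlib.closing, which calls close() on exit even when an exception is raised. FakeSandbox is a hypothetical stand-in (not part of screenenv) so the example runs without Docker; with the real library you would pass Sandbox() instead, assuming its close() behaves as described above.

```python
from contextlib import closing


class FakeSandbox:
    """Hypothetical stand-in for screenenv.Sandbox (not part of the library)."""

    def __init__(self) -> None:
        self.closed = False
        self.typed: list[str] = []

    def write(self, text: str) -> None:
        self.typed.append(text)

    def close(self) -> None:
        self.closed = True


def run_with_cleanup(sandbox, text: str) -> None:
    """Type `text`, then always call sandbox.close(), even on errors."""
    with closing(sandbox) as sb:
        sb.write(text)


sandbox = FakeSandbox()
run_with_cleanup(sandbox, "echo 'Custom agent logic'")
print(sandbox.closed)  # True
```

A plain try/finally block around the sandbox calls would achieve the same effect; closing() is just the shorter spelling.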

Option 2: MCP Server Integration

A good fit for AI systems that support the Model Context Protocol.

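import asyncio
import base64
from io import BytesIO

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client
from PIL import Image
from screenenv import MCPRemoteServer

# Start MCP server for AI integration
server = MCPRemoteServer(headless=False)
print(f"MCP Server URL: {server.server_url}")

# AI agents can now connect and control the desktop
async def mcp_session():
    # streamablehttp_client yields (read_stream, write_stream, get_session_id)
    async with streamablehttp_client(server.server_url) as (read_stream, write_stream, _):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            print(await session.list_tools())

            # The screenshot data is base64-encoded, so decode it before opening
            response = await session.call_tool("screenshot", {})
            image_bytes = base64.b64decode(response.content[0].data)
            image = Image.open(BytesIO(image_bytes))

asyncio.run(mcp_session())  # run the session before shutting the server down
server.close()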
# If close() isn’t called, you might need to shut down the container yourself.
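
The MCP example decodes the screenshot from base64 before opening it. A small guard can catch a malformed payload before it reaches the image library. The sketch below uses only the standard library; decode_image_payload is a hypothetical helper, and the assumption that the tool returns PNG or JPEG data is ours, since the post does not state the format.

```python
import base64
import binascii

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def decode_image_payload(data: str) -> tuple[str, bytes]:
    """Decode a base64 string and report the image format from its magic bytes."""
    try:
        raw = base64.b64decode(data, validate=True)
    except binascii.Error as exc:
        raise ValueError("payload is not valid base64") from exc
    if raw.startswith(PNG_SIGNATURE):
        return "png", raw
    if raw.startswith(JPEG_SIGNATURE):
        return "jpeg", raw
    raise ValueError("payload is neither PNG nor JPEG")


# Stand-in for response.content[0].data: a bare PNG signature, not a full image.
sample = base64.b64encode(PNG_SIGNATURE).decode("ascii")
fmt, raw = decode_image_payload(sample)
print(fmt, len(raw))  # png 8
```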

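This dual approach means ScreenEnv adapts to your existing infrastructure rather than forcing you to change your agent architecture.

✨ Create a Desktop Agent with screenenv and smolagents

screenenv natively supports smolagents, making it easy to build your own custom desktop agent for automation. Here is how to create your own AI-powered desktop agent in a few steps:

1. Choose Your Model

Pick the backend VLM you want to power your agent. The snippets below are alternatives; keep only the one you need.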

import os

from smolagents import OpenAIServerModel
model = OpenAIServerModel(
    model_id="gpt-4.1",
    api_key=os.getenv("OPENAI_API_KEY"),
)

# Inference Endpoints
from smolagents import HfApiModel
model = HfApiModel(
    model_id="Qwen/Qwen2.5-VL-7B-Instruct",
    token=os.getenv("HF_TOKEN"),
    provider="nebius",
)

# Transformer models
from smolagents import TransformersModel
model = TransformersModel(
    model_id="Qwen/Qwen2.5-VL-7B-Instruct",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

# Other providers
from smolagents import LiteLLMModel
model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514")

# see smolagents to get the list of available model connectors
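
Since the connectors above are alternatives, one way to pick among them is to check which credentials are present. The sketch below only returns the connector class name, so it runs without smolagents installed; choose_connector is a hypothetical helper, and the preference order (hosted APIs first, local Transformers as the fallback) is an arbitrary choice for illustration.

```python
from typing import Mapping


def choose_connector(env: Mapping[str, str]) -> str:
    """Return the name of the smolagents connector class to use.

    The environment variable names match the snippets above; the
    preference order is an assumption made for this sketch.
    """
    if env.get("OPENAI_API_KEY"):
        return "OpenAIServerModel"
    if env.get("HF_TOKEN"):
        return "HfApiModel"
    return "TransformersModel"


print(choose_connector({"HF_TOKEN": "hf_xxx"}))  # HfApiModel
```

In a real script you would pass os.environ and then import and construct only the selected class.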

2. Define Your Custom Desktop Agent

Inherit from DesktopAgentBase and implement the _setup_desktop_tools method to build your own action space!

from screenenv import DesktopAgentBase, Sandbox
from smolagents import Model, Tool, tool
from smolagents.monitoring import LogLevel
from typing import List

class CustomDesktopAgent(DesktopAgentBase):
    """Agent for desktop automation"""

    def __init__(
        self,
        model: Model,
        data_dir: str,
        desktop: Sandbox,
        tools: List[Tool] | None = None,
        max_steps: int = 200,
        verbosity_level: LogLevel = LogLevel.INFO,
        planning_interval: int | None = None,
        use_v1_prompt: bool = False,
        **kwargs,
    ):
        super().__init__(
            model=model,
            data_dir=data_dir,
            desktop=desktop,
            tools=tools,
            max_steps=max_steps,
            verbosity_level=verbosity_level,
            planning_interval=planning_interval,
            use_v1_prompt=use_v1_prompt,
            **kwargs,
        )

        # OPTIONAL: Add a custom prompt template - see src/screenenv/desktop_agent/desktop_agent_base.py for more details about the default prompt template
        # self.prompt_templates["system_prompt"] = CUSTOM_PROMPT_TEMPLATE.replace(
        #     "<<resolution_x>>", str(self.width)
        # ).replace("<<resolution_y>>", str(self.height))
        # Important: Adjust the prompt based on your action space to improve results.

    def _setup_desktop_tools(self) -> None:
        """Define your custom tools here."""
        
        
        @tool
        def click(x: int, y: int) -> str:
            """
            Clicks at the specified coordinates.
            Args:
                x: The x-coordinate of the click
                y: The y-coordinate of the click
            """
            self.desktop.left_click(x, y)
            # self.click_coordinates = (x, y) to add the click coordinate to the observation screenshot 
            return f"Clicked at ({x}, {y})"
        
        self.tools["click"] = click
        

        @tool
        def write(text: str) -> str:
            """
            Types the specified text at the current cursor position.
            Args:
                text: The text to type
            """
            self.desktop.write(text, delay_in_ms=10)
            return f"Typed text: '{text}'"

        self.tools["write"] = write

        @tool
        def press(key: str) -> str:
            """
            Presses a keyboard key or combination of keys
            Args:
                key: The key to press (e.g. "enter", "space", "backspace", etc.) or a multiple keys string to press, for example "ctrl+a" or "ctrl+shift+a".
            """
            self.desktop.press(key)
            return f"Pressed key: {key}"

        self.tools["press"] = press
        
        @tool
        def open(file_or_url: str) -> str:
            """
            Directly opens a browser with the specified url or opens a file with the default application.
            Args:
                file_or_url: The URL or file to open
            """

            self.desktop.open(file_or_url)
            # Give it time to load
            self.logger.log(f"Opening: {file_or_url}")
            return f"Opened: {file_or_url}"

        self.tools["open"] = open

        @tool
        def launch_app(app_name: str) -> str:
            """
            Launches the specified application.
            Args:
                app_name: The name of the application to launch
            """
            self.desktop.launch(app_name)
            return f"Launched application: {app_name}"

        self.tools["launch_app"] = launch_app

        ... # Continue implementing your own action space.
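
A vision model may occasionally propose coordinates outside the screen. One option when extending the action space is to clamp them before calling self.desktop.left_click. The helper below is hypothetical (not part of screenenv); its width and height parameters correspond to the self.width and self.height values referenced in the prompt-template comment above.

```python
def clamp_to_screen(x: int, y: int, width: int, height: int) -> tuple[int, int]:
    """Clamp (x, y) into the valid pixel range [0, width-1] x [0, height-1]."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return max(0, min(x, width - 1)), max(0, min(y, height - 1))


print(clamp_to_screen(2500, -20, 1920, 1080))  # (1919, 0)
```

Whether to clamp silently or to return an error message to the model is a design choice; an error message gives the model a chance to correct itself on the next step.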

3. Run the Agent on a Desktop Task

from screenenv import Sandbox

# Define your sandbox environment
sandbox = Sandbox(headless=False, resolution=(1920, 1080))

# Create your agent
agent = CustomDesktopAgent(
    model=model,
    data_dir="data",
    desktop=sandbox,
)

# Run a task
task = "Open LibreOffice, write a report of approximately 300 words on the topic ‘AI Agent Workflow in 2025’, and save the document."

result = agent.run(task)
print(f"📄 Result: {result}")

sandbox.close()
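
To run several tasks in a row, for instance when evaluating an agent, a small loop that records each outcome keeps one failure from ending the batch. This is a minimal sketch: EchoAgent is a hypothetical stand-in that exposes the same run(task) call as the agent above, and run_tasks is our own helper, not part of screenenv or smolagents.

```python
from typing import Iterable


class EchoAgent:
    """Hypothetical stand-in exposing the same run(task) call as the agent above."""

    def run(self, task: str) -> str:
        if not task:
            raise ValueError("empty task")
        return f"done: {task}"


def run_tasks(agent, tasks: Iterable[str]) -> list[dict]:
    """Run each task in order, recording the result or the error message."""
    records = []
    for task in tasks:
        try:
            records.append({"task": task, "ok": True, "output": agent.run(task)})
        except Exception as exc:  # keep going so one failure does not end the batch
            records.append({"task": task, "ok": False, "output": str(exc)})
    return records


records = run_tasks(EchoAgent(), ["Open LibreOffice", ""])
print([r["ok"] for r in records])  # [True, False]
```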

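If you hit a Docker permission-denied error, try running the agent with sudo -E python test.py (where test.py is your script), or add your user to the docker group.

💡 For a complete implementation, see the CustomDesktopAgent source code on GitHub.

Get Started Today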

# Install ScreenEnv
pip install screenenv

# Try the examples
git clone git@github.com:huggingface/screenenv.git
cd screenenv
python -m examples.desktop_agent
# use `sudo -E python -m examples.desktop_agent` if you're not in the 'docker' group

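What's Next?

ScreenEnv aims to expand beyond Linux to support **Android, macOS, and Windows**, enabling truly cross-platform GUI automation. This will let developers and researchers build agents that generalize across environments with minimal setup.

These advances pave the way for **reproducible sandboxed environments** that are well suited to benchmarking and evaluation.

Repository: https://github.com/huggingface/screenenv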

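Community

tc-wolf

about 1 month ago

This is really cool, and I'd like to try it out to see whether I can run it on Mac OS with a local MCP server and tool-calling implementation.

Where are the existing tool-call implementations for click, screenshot, and so on?

That is, I see a left_click method in the Sandbox class, but these all make requests to the IP address of the Docker provider (?).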

def left_click(self, x: Optional[int] = None, y: Optional[int] = None):
    """
    Clicks the left button of the mouse at the specified coordinates.
    """
    self._make_request("POST", "/left_click", params={"x": x, "y": y})

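I see that the Docker image amhma/ubuntu-desktop:22.04-0.0.1-dev is used, but the server code (if it lives there) and the Dockerfile would be very helpful, especially for rebuilding and running with an aarch64 Linux image.

A-Mahla

Article author, about 1 month ago (edited about 1 month ago)

Hi @tc-wolf.
Thanks for your interest! We are actively working on open-sourcing the Docker image. We will announce it once we have a stable version to share.
In the meantime, you can already use the Docker image on a Mac with the arm64 architecture. The image supports both amd64 and arm64 (aarch64). Have you already tested it on your macOS machine?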

YacineMk

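26 days ago (edited 26 days ago)

Great work as always, thank you HuggingFace team!
Is there a roadmap or a Discord community server for this library?
I'm currently developing a multi-agent system for computer use (with smolagents). My test suite usually runs locally, and I've been looking for a way to run it in a containerized environment for consistency and reproducibility, so this is perfect!
I'd love to connect more closely with the team building this project and contribute.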

nageshsomayajula

10 days ago

Very useful!!
