ScreenEnv: Deploy Your Full-Stack Desktop Agent
TL;DR: ScreenEnv is a powerful Python library for creating isolated Ubuntu desktop environments inside Docker containers, built for testing and deploying GUI agents (a.k.a. computer-use agents). With built-in support for the Model Context Protocol (MCP), deploying desktop agents that can see, click, and interact with real applications has never been easier.
What is ScreenEnv?
Imagine you need to automate desktop tasks, test GUI applications, or build an AI agent that interacts with software. Traditionally, this meant complex virtual machine setups and fragile automation frameworks.
ScreenEnv changes that by providing a **sandboxed desktop environment** that runs in a Docker container. Think of it as a complete virtual desktop session under your code's full control: not just clicking buttons and typing text, but managing the entire desktop experience, including launching applications, organizing windows, handling files, executing terminal commands, and recording the whole session.
Why ScreenEnv?
- 🖥️ Full desktop control: complete mouse and keyboard automation, window management, application launching, file operations, terminal access, and screen recording (see the sketch just after the setup snippet below)
- 🤖 Dual integration modes: supports both the Model Context Protocol (MCP) for AI systems and a direct Sandbox API, so it fits any agent or backend logic
- 🐳 Docker native: no complex VM setup, just Docker. Environments are isolated, reproducible, and deploy anywhere in under 10 seconds. Both AMD64 and ARM64 architectures are supported.
🎯 One-Line Setup
from screenenv import Sandbox
sandbox = Sandbox() # That's it!
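To make the "full desktop control" bullet above concrete, here is a minimal sketch of a session that strings together the Sandbox calls used throughout this post (launch, write, press, open, left_click, screenshot). Treat it as an illustration of the surface area, not a complete API reference.

```python
from io import BytesIO

from PIL import Image
from screenenv import Sandbox

sandbox = Sandbox(headless=False)            # isolated Ubuntu desktop in a Docker container
sandbox.launch("xfce4-terminal")             # start an application
sandbox.write("ls ~")                        # type into the focused window
sandbox.press("enter")                       # press a key
sandbox.open("https://huggingface.co")       # open a URL with the default browser
sandbox.left_click(200, 150)                 # click at absolute screen coordinates
png_bytes = sandbox.screenshot()             # capture the screen as image bytes
Image.open(BytesIO(png_bytes)).save("desktop.png")
sandbox.close()                              # tear the container down when done
```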
Two Integration Approaches
ScreenEnv offers **two complementary ways** to integrate with your agents and backend systems, so you can pick whichever best fits your architecture.
Option 1: Direct Sandbox API
Perfect for custom agent frameworks, existing backends, or whenever you need fine-grained control.
from io import BytesIO

from PIL import Image
from screenenv import Sandbox

# Direct programmatic control
sandbox = Sandbox(headless=False)
sandbox.launch("xfce4-terminal")
sandbox.write("echo 'Custom agent logic'")
screenshot_bytes = sandbox.screenshot()
image = Image.open(BytesIO(screenshot_bytes))
...
sandbox.close()
# If close() isn’t called, you might need to shut down the container yourself.
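Because a forgotten close() can leave the container running, one simple safeguard (plain Python, not a ScreenEnv-specific feature) is to wrap the session in try/finally so teardown happens even if your agent logic raises:

```python
from screenenv import Sandbox

sandbox = Sandbox(headless=False)
try:
    sandbox.launch("xfce4-terminal")
    sandbox.write("echo 'Custom agent logic'")
    # ... your agent logic here ...
finally:
    sandbox.close()  # runs even when the code above fails
```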
Option 2: MCP Server Integration
Perfect for AI systems that support the Model Context Protocol.
import asyncio
import base64
from io import BytesIO

from PIL import Image
from screenenv import MCPRemoteServer
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

# Start MCP server for AI integration
server = MCPRemoteServer(headless=False)
print(f"MCP Server URL: {server.server_url}")

# AI agents can now connect and control the desktop
async def mcp_session():
    async with streamablehttp_client(server.server_url) as streams:
        async with ClientSession(*streams) as session:
            await session.initialize()
            print(await session.list_tools())
            response = await session.call_tool("screenshot", {})
            image_bytes = base64.b64decode(response.content[0].data)
            image = Image.open(BytesIO(image_bytes))

# Run the example client session
asyncio.run(mcp_session())
server.close()
# If close() isn’t called, you might need to shut down the container yourself.
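Within the same session, other desktop actions follow the identical call_tool pattern. Use the output of session.list_tools() to discover the real tool names and argument schemas; the names below are assumptions for illustration only, not confirmed ScreenEnv tool names:

```python
async def drive_desktop(session):
    # Hypothetical tool names and arguments -- check session.list_tools() for the actual schema.
    await session.call_tool("left_click", {"x": 300, "y": 200})
    await session.call_tool("write", {"text": "hello from MCP"})
```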
This dual approach means ScreenEnv adapts to your existing infrastructure instead of forcing you to change your agent architecture.
✨ Create a Desktop Agent with screenenv and smolagents
screenenv natively supports smolagents, so you can easily build your own custom desktop agent for automation. Here is how to create your own AI-powered desktop agent in just a few steps.
1. Choose Your Model
Select the backend VLM you want to power your agent. The snippets below show several interchangeable options; keep only the one that matches your provider.
import os
from smolagents import OpenAIServerModel
model = OpenAIServerModel(
model_id="gpt-4.1",
api_key=os.getenv("OPENAI_API_KEY"),
)
# Inference Endpoints
from smolagents import HfApiModel
model = HfApiModel(
model_id="Qwen/Qwen2.5-VL-7B-Instruct",
token=os.getenv("HF_TOKEN"),
provider="nebius",
)
# Transformer models
from smolagents import TransformersModel
model = TransformersModel(
model_id="Qwen/Qwen2.5-VL-7B-Instruct",
device_map="auto",
torch_dtype="auto",
trust_remote_code=True,
)
# Other providers
from smolagents import LiteLLMModel
model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514")
# see smolagents to get the list of available model connectors
2. Define Your Custom Desktop Agent
Subclass DesktopAgentBase and implement the _setup_desktop_tools method to build your own action space!
from screenenv import DesktopAgentBase, Sandbox
from smolagents import Model, Tool, tool
from smolagents.monitoring import LogLevel
from typing import List

class CustomDesktopAgent(DesktopAgentBase):
    """Agent for desktop automation"""

    def __init__(
        self,
        model: Model,
        data_dir: str,
        desktop: Sandbox,
        tools: List[Tool] | None = None,
        max_steps: int = 200,
        verbosity_level: LogLevel = LogLevel.INFO,
        planning_interval: int | None = None,
        use_v1_prompt: bool = False,
        **kwargs,
    ):
        super().__init__(
            model=model,
            data_dir=data_dir,
            desktop=desktop,
            tools=tools,
            max_steps=max_steps,
            verbosity_level=verbosity_level,
            planning_interval=planning_interval,
            use_v1_prompt=use_v1_prompt,
            **kwargs,
        )

        # OPTIONAL: Add a custom prompt template - see src/screenenv/desktop_agent/desktop_agent_base.py
        # for more details about the default prompt template.
        # self.prompt_templates["system_prompt"] = CUSTOM_PROMPT_TEMPLATE.replace(
        #     "<<resolution_x>>", str(self.width)
        # ).replace("<<resolution_y>>", str(self.height))
        # Important: Adjust the prompt based on your action space to improve results.

    def _setup_desktop_tools(self) -> None:
        """Define your custom tools here."""

        @tool
        def click(x: int, y: int) -> str:
            """
            Clicks at the specified coordinates.
            Args:
                x: The x-coordinate of the click
                y: The y-coordinate of the click
            """
            self.desktop.left_click(x, y)
            # self.click_coordinates = (x, y) to add the click coordinate to the observation screenshot
            return f"Clicked at ({x}, {y})"

        self.tools["click"] = click

        @tool
        def write(text: str) -> str:
            """
            Types the specified text at the current cursor position.
            Args:
                text: The text to type
            """
            self.desktop.write(text, delay_in_ms=10)
            return f"Typed text: '{text}'"

        self.tools["write"] = write

        @tool
        def press(key: str) -> str:
            """
            Presses a keyboard key or combination of keys.
            Args:
                key: The key to press (e.g. "enter", "space", "backspace", etc.) or a multiple keys string to press, for example "ctrl+a" or "ctrl+shift+a".
            """
            self.desktop.press(key)
            return f"Pressed key: {key}"

        self.tools["press"] = press

        @tool
        def open(file_or_url: str) -> str:
            """
            Directly opens a browser with the specified url or opens a file with the default application.
            Args:
                file_or_url: The URL or file to open
            """
            self.desktop.open(file_or_url)
            # Give it time to load
            self.logger.log(f"Opening: {file_or_url}")
            return f"Opened: {file_or_url}"

        self.tools["open"] = open

        @tool
        def launch_app(app_name: str) -> str:
            """
            Launches the specified application.
            Args:
                app_name: The name of the application to launch
            """
            self.desktop.launch(app_name)
            return f"Launched application: {app_name}"

        self.tools["launch_app"] = launch_app

        ...  # Continue implementing your own action space.
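As one example of growing the action space, a tool that runs a shell command by driving the terminal could be added inside _setup_desktop_tools, reusing only desktop calls already shown above (launch, write, press). The tool itself is a hypothetical illustration, not part of ScreenEnv:

```python
        @tool
        def run_in_terminal(command: str) -> str:
            """
            Opens a terminal and executes the given shell command.
            Args:
                command: The shell command to type and run
            """
            self.desktop.launch("xfce4-terminal")
            self.desktop.write(command, delay_in_ms=10)
            self.desktop.press("enter")
            return f"Ran command in terminal: {command}"

        self.tools["run_in_terminal"] = run_in_terminal
```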
3. Run the Agent on a Desktop Task
from screenenv import Sandbox
# Define your sandbox environment
sandbox = Sandbox(headless=False, resolution=(1920, 1080))
# Create your agent
agent = CustomDesktopAgent(
model=model,
data_dir="data",
desktop=sandbox,
)
# Run a task
task = "Open LibreOffice, write a report of approximately 300 words on the topic 'AI Agent Workflow in 2025', and save the document."
result = agent.run(task)
print(f"📄 Result: {result}")
sandbox.close()
If you run into a Docker permission-denied error, try running the agent with sudo -E python -m test.py, or add your user to the docker group.
💡 For the full implementation, see the CustomDesktopAgent source code on GitHub.
Get Started Today
# Install ScreenEnv
pip install screenenv
# Try the examples
git clone git@github.com:huggingface/screenenv.git
cd screenenv
python -m examples.desktop_agent
# use 'sudo -E python -m examples.desktop_agent' if you're not in the 'docker' group
What's Next?
ScreenEnv's goal is to go beyond Linux and support **Android, macOS, and Windows**, enabling truly cross-platform GUI automation. This will let developers and researchers build agents that generalize across environments with minimal setup.
These advances pave the way for **reproducible, sandboxed environments** that are ideal for benchmarking and evaluation.