BorisovAI

Building a Unified Desktop Automation Layer: From Browser Tools to CUA

I just completed a significant phase in our AI agent project — transitioning from isolated browser automation to a comprehensive desktop control system. Here’s how we pulled it off.

The Challenge

Our voice agent needed more than just web browsing. We required desktop GUI automation, clipboard access, process management, and, most ambitiously, Computer Use Agent (CUA) capabilities that let Claude itself drive the entire desktop. The catch? We couldn't afford to repeat the messy patterns from the browser tools across 17+ desktop utilities.

The Pattern Emerges

I started by creating a BrowserManager singleton wrapping Playwright, then built 11 specialized tools (navigate, screenshot, click, fill form) around it. Each tool followed a strict interface: @property name, @property schema (full Claude-compatible JSON), and async def execute(inputs: dict). No shortcuts, no inconsistencies.
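That interface can be sketched in a few lines. This is a hypothetical illustration, not the project's actual code: the abstract base, the `NavigateTool` class, and the schema fields shown here are assumptions layered on the three requirements named above (`name`, `schema`, `async execute`).

```python
# Sketch of the strict tool interface: @property name, @property schema,
# async execute(inputs). Class and field names here are illustrative.
from abc import ABC, abstractmethod


class Tool(ABC):
    """Interface every browser and desktop tool follows, no exceptions."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def schema(self) -> dict:
        """Full Claude-compatible JSON schema for the tool's inputs."""

    @abstractmethod
    async def execute(self, inputs: dict) -> dict: ...


class NavigateTool(Tool):
    """Example implementation wrapping a browser manager singleton."""

    def __init__(self, browser):
        self._browser = browser  # e.g. the BrowserManager wrapping Playwright

    @property
    def name(self) -> str:
        return "browser_navigate"

    @property
    def schema(self) -> dict:
        return {
            "name": self.name,
            "description": "Navigate the managed browser to a URL.",
            "input_schema": {
                "type": "object",
                "properties": {"url": {"type": "string"}},
                "required": ["url"],
            },
        }

    async def execute(self, inputs: dict) -> dict:
        page = await self._browser.get_page()
        await page.goto(inputs["url"])
        return {"status": "ok", "url": inputs["url"]}
```

Because every tool exposes the same three members, registration, schema export to Claude, and dispatch can all be written once.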

This pattern proved bulletproof. I replicated it for desktop tools: DesktopClickTool, DesktopTypeTool, window management, OCR, and process control. The key insight was infrastructure first: a ToolRegistry with approval tiers (SAFE, RISKY, RESTRICTED) meant we could gate dangerous operations like shell execution without tangling business logic.
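The tier names (SAFE, RISKY, RESTRICTED) come straight from the post; everything else in this registry sketch, including the class layout and method names, is an assumption about how such gating might be wired.

```python
# Minimal registry-with-approval-tiers sketch; gating lives here, not in
# the tools themselves, so business logic stays untangled.
from enum import Enum


class Tier(Enum):
    SAFE = "safe"              # runs without confirmation (e.g. screenshot, OCR)
    RISKY = "risky"            # surfaced to the user for approval (click, type)
    RESTRICTED = "restricted"  # hard-gated (shell execution, process control)


class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, tuple[object, Tier]] = {}

    def register(self, tool, tier: Tier) -> None:
        self._tools[tool.name] = (tool, tier)

    def requires_approval(self, name: str) -> bool:
        _, tier = self._tools[name]
        return tier is not Tier.SAFE

    async def dispatch(self, name: str, inputs: dict, approved: bool = False) -> dict:
        tool, tier = self._tools[name]
        if tier is Tier.RESTRICTED and not approved:
            raise PermissionError(f"{name} is restricted; explicit approval required")
        return await tool.execute(inputs)
```

The payoff of centralizing the check: a new dangerous tool only needs to be registered with the right tier, and the gate applies automatically.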

The CUA Gamble

Then came the ambitious part. Instead of Claude calling tools individually, what if Claude could see the screen and decide its next move autonomously? We built a CUA action model — a structured parser that translates Claude’s natural language into click(x, y), type("text"), key(hotkey) primitives. The CUAExecutor runs these actions in a loop, taking screenshots after each move, feeding them back to Claude’s vision API.
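The action primitives above (`click(x, y)`, `type("text")`, `key(hotkey)`) can be parsed with a small grammar. This sketch is illustrative: the regex patterns, the `CUAAction` dataclass, and the loop's stubbed vision call are all assumptions, not the project's real parser.

```python
# Hypothetical parser for the three CUA primitives named in the post,
# plus the screenshot -> model -> action loop in outline form.
import re
from dataclasses import dataclass


@dataclass
class CUAAction:
    kind: str    # "click" | "type" | "key"
    args: tuple


_PATTERNS = [
    ("click", re.compile(r'click\((\d+),\s*(\d+)\)')),
    ("type",  re.compile(r'type\("([^"]*)"\)')),
    ("key",   re.compile(r'key\(([\w+]+)\)')),
]


def parse_action(text: str) -> CUAAction:
    """Translate one model-emitted primitive into a structured action."""
    for kind, pattern in _PATTERNS:
        m = pattern.search(text)
        if m:
            args = tuple(int(g) if g.isdigit() else g for g in m.groups())
            return CUAAction(kind, args)
    raise ValueError(f"unrecognized CUA action: {text!r}")


async def run_cua(model, controller, goal: str, max_steps: int = 20):
    """The executor loop: screenshot, ask the model, parse, act, repeat."""
    screenshot = controller.screenshot()
    for _ in range(max_steps):
        reply = await model.next_action(goal, screenshot)  # vision API call
        if reply.strip() == "done":
            return
        controller.perform(parse_action(reply))
        screenshot = controller.screenshot()
```

Bounding the loop with `max_steps` is the cheap first line of defense against runaway sessions; the kill switch discussed below is the second.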

The technical debt? Thread safety. Multiple CUA sessions were competing for the mouse and keyboard, so we added an asyncio.Lock(): simple, but critical. And there was no kill switch initially; we needed an asyncio.Event to emergency-stop runaway loops.
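Both fixes fit in a few lines. The primitives (asyncio.Lock, asyncio.Event) are the ones named above; the surrounding function and names are a hypothetical arrangement.

```python
# One global lock serializes mouse/keyboard access across CUA sessions;
# one event acts as the emergency stop. Names here are illustrative.
import asyncio

_input_lock = asyncio.Lock()    # only one session drives the desktop at a time
_kill_switch = asyncio.Event()  # set() from anywhere to halt runaway loops


async def guarded_step(perform_action):
    """Run one input action while holding the global input lock."""
    if _kill_switch.is_set():
        raise RuntimeError("CUA emergency stop engaged")
    async with _input_lock:
        await perform_action()


def emergency_stop() -> None:
    _kill_switch.set()
```

Checking the event before acquiring the lock means a stop request takes effect even while another session holds the input device.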

The Testing Gauntlet

We went all-in: 51 tests for desktop tools (schema validation, approval gating, fallback handling), 24 tests for CUA action parsing, 19 tests for the executor, 12 tests for vision API mocking, and 8 tests for the agent loop. Pre-existing ruff lint issues forced careful triage — we fixed only what we broke.
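To give a flavor of the schema-validation checks mentioned above, here is a self-contained sketch; the real suite, its fixtures, and its assertions are not shown, and this helper is a stand-in.

```python
# Illustrative schema-validation check of the kind the desktop-tool tests
# perform: every tool schema must be Claude-compatible before it ships.
REQUIRED_TOP_LEVEL = {"name", "description", "input_schema"}


def validate_tool_schema(schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the schema passes."""
    problems = []
    missing = REQUIRED_TOP_LEVEL - schema.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    input_schema = schema.get("input_schema", {})
    if input_schema.get("type") != "object":
        problems.append("input_schema.type must be 'object'")
    required = set(input_schema.get("required", []))
    declared = set(input_schema.get("properties", {}))
    if not required <= declared:
        problems.append(f"required names not declared: {sorted(required - declared)}")
    return problems
```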

By the end: 856 tests pass. The desktop automation layer is production-ready.

Why It Matters

This isn’t just about clicking buttons. It’s about giving AI agents agency without API keys. Every desktop application becomes accessible — not via SDK, but via vision and action primitives. It’s the difference between a chatbot and an agent.

Self-taught developers often stumble at this junction — no blueprint for multi-tool coordination. But patterns, once found, scale beautifully. 😄

Metadata

Session ID: grouped_C--projects-ai-agents-voice-agent_20260216_2152
Branch: main
Dev Joke
Getting to know Cassandra: day 1, delight; day 30, "why did I even start this?"
