Scaling AI Agent Documentation: From Three Tiers to Four

When you’re building an autonomous voice agent that orchestrates multiple tools—UI automation, API calls, local computation—your architecture docs become just as critical as the code itself. Recently, I faced exactly this challenge: our voice-agent project had evolved beyond its original design, and the documentation was starting to lag behind reality.
The catalyst came from adding CUA (UI-TARS VLM) for visual understanding alongside desktop automation. Suddenly, we weren’t just calling APIs anymore. We had agents controlling Windows UI, processing screenshots through vision models, and managing complex tool chains. The old three-tier capability model—Web APIs, CLI tools, and code execution—didn’t capture this anymore.
Here’s what we discovered while refactoring: local package integration deserved its own tier. We created Tier 4 to explicitly acknowledge dependencies like cua, pyautogui, and custom wrappers that agents load via pip install. This wasn’t just semantic—it changed how we think about dependency management. Web APIs live on someone else’s infrastructure. CLI tools are system-wide. But local packages? Those ship with your agent, versioned and cached. That distinction matters when you’re deploying across different machines.
The real work came in the desktop automation tree. We’d added three new GUI tools—desktop_drag, desktop_scroll, desktop_wait—that weren’t documented. Meanwhile, our old OCR strategy via Tesseract felt clunky compared to CUA’s vision-based approach. So we ripped out the Tesseract section and rewrote it around UI-TARS, which uses actual visual understanding instead of brittle text parsing.
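A minimal sketch of how the three new tools might hang off a dispatch registry. The decorator, handler signatures, and return strings are assumptions for illustration, with the real pyautogui/CUA calls left as comments:

```python
import time
from typing import Callable

# Hypothetical registry: maps a tool name to its handler so the agent
# can dispatch on the name it receives from the planner.
DESKTOP_TOOLS: dict[str, Callable[..., str]] = {}


def desktop_tool(name: str):
    """Register a handler under the given tool name."""
    def wrap(fn):
        DESKTOP_TOOLS[name] = fn
        return fn
    return wrap


@desktop_tool("desktop_drag")
def drag(x1: int, y1: int, x2: int, y2: int) -> str:
    # Real handler would call e.g. pyautogui.moveTo(x1, y1); pyautogui.dragTo(x2, y2)
    return f"dragged ({x1},{y1}) -> ({x2},{y2})"


@desktop_tool("desktop_scroll")
def scroll(clicks: int) -> str:
    # Real handler would call e.g. pyautogui.scroll(clicks)
    return f"scrolled {clicks}"


@desktop_tool("desktop_wait")
def wait(seconds: float) -> str:
    # Useful between UI actions so the screen settles before the next screenshot
    time.sleep(seconds)
    return f"waited {seconds}s"


def dispatch(name: str, **kwargs) -> str:
    """Route a tool call from the agent to the registered handler."""
    return DESKTOP_TOOLS[name](**kwargs)
```

A registry like this also makes the documentation problem visible: the set of keys in `DESKTOP_TOOLS` is the ground truth the docs have to match.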
One decision I wrestled with: should Phase 3 (our most ambitious phase) target 12 tools or 21? The answer came from counting what we’d actually built. Twenty-one tools across FastAPI routes, AgentCore methods, and desktop automation—that was our reality. Keeping old numbers would’ve confused the team about what was actually complete.
I also realized we’d scattered completion markers throughout the docs: "(NEW)" labels here, "(3.1–3.9) complete" notes there, spread across files. Consolidating these into a single task list with checkmarks made the project status transparent at a glance.
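One cheap way to keep such a consolidated task list honest is a doc-drift check in CI. The checklist format and function names below are assumptions, not the project's actual tooling; the idea is just to compare the documented count against the registered count:

```python
import re


def documented_tool_count(doc_text: str) -> int:
    """Count checklist entries like '- [x] desktop_drag' in the doc."""
    return len(re.findall(r"^- \[[ x]\] \w+", doc_text, flags=re.MULTILINE))


def doc_in_sync(doc_text: str, registered_tools) -> bool:
    """Fail CI when the checklist and the tool registry disagree."""
    return documented_tool_count(doc_text) == len(registered_tools)


# Example checklist fragment in the assumed format:
checklist = "- [x] desktop_drag\n- [x] desktop_scroll\n- [ ] desktop_wait\n"
```

Had something like this existed earlier, the stale "12 tools" figure would have failed the build instead of quietly confusing the team.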
The lesson: Architecture documentation isn’t overhead—it’s your agent’s brain blueprint. When your system grows from “call this API” to “understand the screen, move the mouse, run the script, then report back,” that complexity must live in your docs. Otherwise, your team spends cycles re-discovering decisions you’ve already made.
Tools evolved. Documentation caught up. Both are now in sync.
Metadata
- Session ID: grouped_C--projects-ai-agents-voice-agent_20260217_1204
- Branch: main
- Dev joke: Apache, the only technology where "it works" counts as documentation.