BorisovAI — Tools for the community. By the community.

Running 121 Tests Green: When Router Fixes Become a Full Test Suite Victory

The task was straightforward on paper: validate a new probabilistic tool router implementation across the ai-agents project. But what started as a simple “run the tests” moment turned into discovering that we’d accidentally built something far more comprehensive than initially planned.

I kicked off the test suite and watched the results roll in. 120 passed, 1 failed. Not bad for a first run. The culprit was test_threshold_filters_low_scores—a test checking exact name matching for a “weak tool” that was scoring 0.85, just barely creeping above the 0.8 threshold. This wasn’t a bug; it was the router doing exactly what it should. The test’s expectations were outdated. A quick fix later, and we were at 121 passing tests in 1.61 seconds.

But here’s where it got interesting. I needed to verify that nothing broke backward compatibility. The older test suite—15 tests from test_core.py—all came back green within 0.76 seconds. That’s when I realized the scope of what had actually been implemented.

The test coverage told a story of meticulous architectural work. There were 36 tests validating five different adapters: the LLMResponse handler, ToolCall processors, and implementations for Anthropic, Claude CLI, SQLite, SearxNG, and a Telegram platform adapter. Then came the routing layer—30 tests drilling into the four-tier scoring system. We had regex matching, exact name matching, semantic scoring, and keyword-based filtering all working in concert. The orchestrator alone had 26 tests covering initialization, agent wrappers, ChatEvent handling, and tool call handlers. Even the desktop plugin got its due: 29 tests across tray integration, GUI components, and Windows notification support.

Here’s something most developers don’t realize about testing: When you’re building a probabilistic system like a tool router, your tests become documentation. Each test case—especially ones checking scoring thresholds, semantic similarity, and fallback behavior—serves as a specification. Someone reading test_exact_name_matching doesn’t just see verification; they see how the system is meant to behave under specific conditions. That’s invaluable when onboarding new team members or debugging edge cases months later.

The factory functions that generated adapters from settings files passed without issue. The system prompt injection points in the orchestrator held up. The ChatEvent message flow remained consistent. No regressions, no surprises—just a solid foundation.

What struck me most was the discipline here: every component had tests, every scoring algorithm was validated, and every platform integration was verified independently. The backward compatibility suite meant we could refactor with confidence. That’s not luck; that’s architecture done right.

The lesson? Test-driven development doesn’t just catch bugs—it shapes how you think about systems. You end up building more modular code because each piece needs to be testable. You avoid tight coupling because loose coupling is easier to test. You document through tests because tests are executable specifications.

The deployment pipeline was ready. All 121 new tests green. All 15 legacy tests green. The router was production-ready.

😄 What’s the object-oriented way to become wealthy? Inheritance.

121 Tests Green: The Router Victory Nobody Planned

Running 121 Tests Green: When Router Fixes Become a Full Test Suite Victory

Metadata