BorisovAI
New Feature · speech-to-text · Claude Code

Instant Transcription, Silent Improvement: A 48-Hour Pipeline


From Base Model to Production: Building a Hybrid Transcription Pipeline in 48 Hours

The project was clear: make a speech-to-text application that doesn’t frustrate users. Our VoiceInput system was working, but the latency-quality tradeoff was brutal. We could get fast results with the base Whisper model (0.45 seconds) or accurate ones with larger models (3+ seconds). Users shouldn’t have to choose.

That’s when the hybrid approach crystallized: give users instant feedback while silently improving the transcription in the background.

The implementation strategy was unconventional. Instead of waiting for a single model to finish, we set up a two-stage pipeline. When a user releases their hotkey, the base model fires immediately with lightweight inference. Meanwhile, a larger, more accurate model runs concurrently in a background thread, replacing the initial text with something better once it finishes. The magic? By the time the user glances at their screen, around 1.23 seconds total, the improved version is already there, and they’ve been typing the whole time. Zero friction.
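The two-stage pipeline can be sketched in a few lines. This is a minimal illustration, not the project's actual hybrid_transcriber.py: `fast_transcribe`, `accurate_transcribe`, and `on_text` are hypothetical stand-ins for the base model, the larger model, and the UI callback.

```python
import threading
from typing import Callable

def hybrid_transcribe(
    audio: bytes,
    fast_transcribe: Callable[[bytes], str],
    accurate_transcribe: Callable[[bytes], str],
    on_text: Callable[[str, bool], None],
) -> threading.Thread:
    """Deliver a fast provisional transcript, then silently refine it."""
    # Stage 1: the fast model runs synchronously, so the user sees
    # text immediately after releasing the hotkey.
    on_text(fast_transcribe(audio), False)  # False = provisional result

    # Stage 2: the larger model runs in a background thread and
    # replaces the provisional text when it finishes.
    def refine() -> None:
        on_text(accurate_transcribe(audio), True)  # True = final result

    worker = threading.Thread(target=refine, daemon=True)
    worker.start()
    return worker
```

Because stage 1 completes before the background thread is spawned, the provisional text is guaranteed to reach the UI first; the callback's boolean flag lets the UI distinguish a replacement from a fresh insertion.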

The technical architecture required orchestrating multiple model instances simultaneously. We modified src/main.py to integrate a new hybrid_transcriber.py module (220 lines of careful state management), updated the configuration system in src/config.py to expose hybrid mode as a simple toggle, and built comprehensive documentation since “working code” and “understandable code” are different things entirely. The memory footprint increased by 460 MB—a reasonable tradeoff for eliminating the perception of slowness.

Testing this required thinking like a user, not an engineer. We created test_hybrid.py to verify that the fast result actually arrived before the improved one, that the replacement happened seamlessly, and that the WER (word error rate) genuinely improved by 28% on average, dropping from 32.6% to 23.4%. The documentation itself became a strategic asset: QUICK_START_HYBRID.md for impatient users, HYBRID_APPROACH_GUIDE.md for those wanting to understand the decisions, and FINE_TUNING_GUIDE.md for developers ready to push even further with custom models trained on Russian audiobooks.
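The checks above reduce to two properties: ordering (the fast result must land strictly before the improved one) and measurable quality gain. A sketch of that logic, with illustrative timings and helper names rather than the project's actual test_hybrid.py code:

```python
def relative_improvement(before: float, after: float) -> float:
    """Fractional improvement of an error metric, e.g. WER."""
    return (before - after) / before

def fast_arrived_first(events: list[tuple[float, str]]) -> bool:
    """Each event is (timestamp_seconds, label); provisional must precede final."""
    (t_fast, _), (t_accurate, _) = events
    return t_fast < t_accurate

# The reported WER drop, 32.6% -> 23.4%, is a ~28% relative improvement.
assert round(relative_improvement(32.6, 23.4) * 100) == 28

# Illustrative timeline matching the numbers in the post: 0.45 s and 1.23 s.
assert fast_arrived_first([(0.45, "provisional"), (1.23, "improved")])
```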

Here’s something counterintuitive about speech recognition: the history of modern voice assistants reveals an underappreciated shift in philosophy. Amazon’s Alexa, for instance, was largely built on acquired technology: Evi, a system created by British computer scientist William Tunstall-Pedoe, and Ivona, a Polish speech synthesizer, bought in 2012 and 2013. But Alexa’s real innovation wasn’t in raw accuracy—it was in managing expectations through latency and feedback design. From 2023 onward, Amazon even shifted toward in-house models like Nova, sometimes leveraging Anthropic’s Claude for reasoning tasks. The lesson: users tolerate imperfect transcription if the feedback loop feels responsive.

What we accomplished in 48 hours: 125+ lines of production code, 1,300+ lines of documentation, and most importantly, a user experience where improvement feels invisible. The application now returns results at 0.45 seconds (unchanged), but the user sees better text moments later while they’re already working. No interruption. No waiting.

The next phase is optional but tempting: fine-tuning on Russian audiobooks to potentially halve the error rate again, though that requires a GPU and time. For now, the hybrid mode is production-ready, toggled by a single config flag, and solving the fundamental problem we set out to solve: making a speech-to-text tool that respects the user’s time.

😄 Why do Python developers wear glasses? Because they can’t C.

Metadata

Session ID:
grouped_speech-to-text_20260213_0938
Branch:
master
Wiki Fact
Amazon Alexa, or simply Alexa, is a virtual assistant technology marketed by Amazon and implemented in software applications for smart phones, tablets, wireless smart speakers, and other electronic appliances and at Alexa.com. Alexa was largely developed from the British computer scientist William Tunstall-Pedoe's Evi system and a Polish speech synthesizer named Ivona, acquired by Amazon in 2012 and 2013. From 2023 to 2025, Amazon shifted to an in-house large language model named Nova to be used in a new generation of Alexa, called Alexa+, that occasionally used Anthropic's Claude model.
Dev Joke
Why didn’t ZeroMQ show up to the party? It got blocked by a firewall.
