BorisovAI

Tuning Whisper for Russian: The Real-Time Recognition Challenge

I was deep in the ScribeAir project—building real-time speech recognition that had to work in under a second per audio chunk. The bottleneck wasn’t where I expected it.

Everyone kept pointing me toward bigger, better models. Someone mentioned whisper-large-v3-russian from Hugging Face, fine-tuned on Common Voice 17.0, with an impressive WER improvement (word error rate dropping from 9.84 to 6.39). Sounds like a slam dunk, right? Better accuracy, Russian-optimized, problem solved.

But here’s where the constraints bit back.

The full whisper-large-v3 model is about 1.5B parameters. On CPU inference, that’s not a milliseconds problem—it’s a seconds problem. I had a hard real-time budget: roughly 1 second per audio chunk. The fine-tuned Russian model, while phenomenal for accuracy, didn’t magically shrink. It was still the same size under the hood, just with weights adjusted for Cyrillic phonetics and Russian linguistic patterns. No distillation, no architecture compression—just better training data.
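A quick way to frame that budget is the real-time factor (RTF): inference time divided by audio duration. Here’s a minimal sketch of the check—`transcribe_chunk` is a placeholder for whatever model call you’re benchmarking, not ScribeAir’s actual code:

```python
import time

def real_time_factor(transcribe_chunk, audio_chunk, chunk_seconds=1.0):
    """Return (transcript, RTF): wall-clock inference time divided by audio duration.

    RTF < 1.0 means the model keeps up with live audio; RTF > 1.0 means it
    falls further behind with every chunk.
    """
    start = time.perf_counter()
    text = transcribe_chunk(audio_chunk)
    elapsed = time.perf_counter() - start
    return text, elapsed / chunk_seconds

# Usage with a stand-in transcriber; a real model call goes in its place.
text, rtf = real_time_factor(lambda chunk: "stub transcript", audio_chunk=None)
```

A 1.5B-parameter model on CPU will typically land well above RTF 1.0 for 1-second chunks, which is exactly the wall I hit.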

I had to make a choice: chase the accuracy dragon or respect the physics of the system.

That’s when I pivoted to distil-whisper. It’s radically smaller—a genuine distillation of the original Whisper architecture, stripped down to fit the real-time constraint. The tradeoff was obvious: I’d lose some of that Russian-specific fine-tuning, but I’d gain the ability to actually ship something that processes audio in real time on consumer hardware.
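The streaming side of that constraint is simple arithmetic: Whisper-family models expect 16 kHz mono audio, so a 1-second chunk is 16,000 samples. A minimal slicing sketch (the function name and buffer sizes are illustrative, not ScribeAir’s implementation):

```python
SAMPLE_RATE = 16_000  # Whisper-family models expect 16 kHz mono audio
CHUNK_SECONDS = 1.0   # the real-time budget: one second of audio per chunk

def split_into_chunks(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Slice a flat sample buffer into fixed-length chunks; the last may be shorter."""
    step = int(sample_rate * chunk_seconds)
    return [samples[i:i + step] for i in range(0, len(samples), step)]

chunks = split_into_chunks([0.0] * 40_000)  # 2.5 s of silence
# → chunks of 16000, 16000, and 8000 samples
```

Each chunk then gets roughly one second of wall-clock time for inference before the next one arrives—the budget the smaller model had to fit.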

The decision crystallized something I’d been wrestling with: in production systems, the perfect model that can’t run fast enough is just as useless as a broken model. The fine-tuned Russian Whisper is genuinely impressive research—it shows what’s possible when you invest in language-specific training. But it lives in a different problem space than ScribeAir.

If I were building offline batch transcription, a content moderation service, or something where latency wasn’t the primary constraint, that fine-tuned Russian model would be the obvious choice. For real-time streaming, where every millisecond counts and the user is waiting for output now, distil-whisper was the practical answer.

The lesson stuck with me: don’t optimize for the metrics you wish mattered—optimize for the constraints that actually exist. Accuracy is beautiful. Speed is infrastructure.

Both matter. But in production, speed often wins.

Metadata

Session ID: grouped_borisovai-site_20260304_0832
Branch: master
Dev Joke
What did pnpm say during deploy? “Don’t touch me, I’m unstable.”
