BorisovAI

Choosing the Right Whisper Model When Every Millisecond Counts

I was deep in the weeds of a Speech-to-Text project when a comment came in: “Have you tested the HuggingFace Whisper large-v3 Russian finetuned model?” It was a fair question. The model showed impressive metrics—6.39% WER on Common Voice 17, significantly beating the original Whisper’s 9.84%. On paper, it looked like a slam dunk upgrade.

So I did what any engineer should: I dug into the actual constraints of what we were building.

The project had a hard requirement I couldn’t negotiate around: sub-one-second latency for push-to-talk input. That’s not “nice to have”—that’s the user experience. The moment speech recognition lags behind what someone just said, the interface feels broken.

I pulled the specs. The finetuned model is based on Whisper large-v3, which means it inherited the same 3 GB footprint and 1.5 billion parameters. A finetuning job doesn’t shrink the model; it only adjusts weights. On my RTX 4090 test rig, the original large-v3 was clocking 2.30 seconds per utterance. The Russian finetuned version? Same architecture, same inference time ballpark. On CPU? 10–15 seconds. Completely out of bounds.
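For the record, those per-utterance times came from wall-clock measurement, not from spec sheets. A minimal sketch of the kind of harness I used — the `transcribe` callable and `audio` input are placeholders for whatever backend you're benchmarking:

```python
import time

def time_utterance(transcribe, audio, warmup=1, runs=5):
    """Median wall-clock latency (seconds) of a transcribe callable.

    A warmup pass is run first so one-time costs (model load, JIT,
    CUDA context) don't pollute the steady-state numbers.
    """
    for _ in range(warmup):
        transcribe(audio)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        transcribe(audio)
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return samples[len(samples) // 2]  # median is robust to outlier runs
```

Median over a handful of runs matters here: a single timing can be skewed by cache state or background load, which is exactly the noise you don't want when the budget is under a second.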

Meanwhile, I’d already benchmarked GigaAM v3-e2e-rnnt, a smaller RNN-T model purpose-built for low-latency scenarios. It was hitting 3.3% WER on my actual dataset—only half a percentage point worse than the finetuned Whisper—and doing it in 0.66 seconds on CPU. Even accounting for the fact that the finetuned Whisper might perform better on my data than on Common Voice, I was still looking at roughly 3–4× the latency for marginal accuracy gains.
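The tradeoff boils down to a simple decision rule: among the candidates that fit the hard latency budget, take the one with the lowest WER. A hypothetical sketch with the numbers from my benchmarks above — the finetuned Whisper's 2.8% WER on my data is the optimistic assumption (half a point better than GigaAM), and `pick_model` is an illustrative helper, not a real API:

```python
BUDGET_S = 1.0  # hard push-to-talk requirement: sub-one-second on CPU

candidates = [
    # (name, WER %, per-utterance CPU latency in seconds)
    ("whisper-large-v3-ru-finetuned", 2.8, 10.0),  # 10-15 s on CPU; WER assumed
    ("gigaam-v3-e2e-rnnt",            3.3, 0.66),  # measured on my dataset
]

def pick_model(candidates, budget_s):
    """Return the lowest-WER model that meets the latency budget."""
    fitting = [c for c in candidates if c[2] <= budget_s]
    if not fitting:
        raise ValueError("no candidate fits the latency budget")
    return min(fitting, key=lambda c: c[1])[0]

print(pick_model(candidates, BUDGET_S))  # → gigaam-v3-e2e-rnnt
```

The point of writing it this way: latency is a filter, not a weight. A model that misses the budget doesn't get partial credit for accuracy, so even a generous WER assumption can't rescue the finetuned Whisper here.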

This is where real-world constraints collide with benchmark numbers. The HuggingFace model is genuinely good work—if your use case is batch transcription with a GPU available, or offline processing where speed doesn’t matter, it’s well worth a look. But for interactive, real-time push-to-talk? Smaller, purpose-built models win on both accuracy and speed.

I wrote back thanking them for the suggestion, explained the tradeoffs, and stayed with GigaAM. No regrets. Sometimes the best engineering decision isn’t picking the flashiest model—it’s picking the one that actually fits your constraints.

And hey, speaking of models and networks—I’ve got a really good UDP joke, but I’m not sure you’ll get it. 😄

Metadata

Session ID:
grouped_speech-to-text_20260304_0840
Branch:
master
Dev Joke
Getting to know Tailwind CSS: day 1, delight; day 30, “why did I even start this?”
