When Stricter Isn't Better: The Threshold Paradox

Hitting the Ceiling: When Better Thresholds Don’t Mean Better Results
The speech-to-text pipeline was humming along at 34% Word Error Rate (WER)—respectable for a Whisper base model—but the team wanted more. The goal was ambitious: cut that error rate down to 6–8%, a dramatic 80% reduction. To get there, I started tweaking the T5 text corrector that sits downstream of the audio transcription, thinking that tighter filtering could squeeze out those extra percentage points.
First thing I did was add configurable threshold methods to the T5TextCorrector class. The idea was simple: instead of hardcoded similarity thresholds, make them adjustable so we could experiment without rewriting code every iteration. I implemented set_thresholds() and set_ultra_strict() methods, with the ultra-strict mode using aggressive cutoffs of 0.9 and 0.95 similarity, theoretically catching every questionable correction before it could degrade the output.
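A minimal sketch of what that plumbing might look like. The method names mirror the post; the attribute names, defaults, and clamping logic are my assumptions, not the project's actual code:

```python
class T5TextCorrector:
    """Hypothetical sketch: only the threshold configuration, not the model."""

    def __init__(self, sim_threshold: float = 0.8, word_threshold: float = 0.85):
        # Defaults match the baseline 0.8/0.85 configuration from the post
        self.sim_threshold = sim_threshold
        self.word_threshold = word_threshold

    def set_thresholds(self, sim_threshold: float, word_threshold: float) -> None:
        # Clamp to [0, 1] so an experiment can't set a nonsense similarity score
        self.sim_threshold = min(max(sim_threshold, 0.0), 1.0)
        self.word_threshold = min(max(word_threshold, 0.0), 1.0)

    def set_ultra_strict(self) -> None:
        # The aggressive cutoffs tested in the experiment: corrections whose
        # similarity to the original falls below these are rejected
        self.set_thresholds(0.9, 0.95)


corrector = T5TextCorrector()
corrector.set_ultra_strict()
print(corrector.sim_threshold, corrector.word_threshold)  # 0.9 0.95
```

The point of routing set_ultra_strict() through set_thresholds() is that every preset goes through the same validation, so benchmark runs only differ in the numbers they pass.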
Then came the benchmarking. I fixed references in benchmark_aggressive_optimization.py to match the full audio texts we were actually working with, not just snippets, and ran the tests. The results were sobering.
- Baseline (Whisper base + improved T5 at 0.8/0.85 thresholds): 34.0% WER, 0.52 s.
- Ultra-strict T5 (0.9/0.95): 34.9% WER, 0.53 s. Marginally worse.
- Beam search with width=5, tried on the theory that diversity in decoding might help: 42.9% WER, 0.71 s. It crushed performance.
- No T5 at all: 35.8% WER.
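For context on the metric these numbers report: WER is the word-level edit distance (substitutions + insertions + deletions) between the reference transcript and the hypothesis, divided by the reference word count. A self-contained sketch (not the project's benchmark code, which likely uses a library such as jiwer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] = edit distance between
    # the first i reference words and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)


# One substitution (sat -> sit) and one deletion (missing "the")
# against a 6-word reference: 2/6 ≈ 0.333
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis is badly garbled, which is why fixing the benchmark references to full audio texts (rather than snippets) mattered for getting honest numbers.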
The pattern was clear: we’d plateaued. Tightening the screws on T5 correction wasn’t the lever we needed. Higher beam widths actually hurt because they introduced more candidate hypotheses that could mangle the transcription. The fundamental issue wasn’t filtering quality—it was the model’s capacity to understand what it was hearing in the first place.
Here’s the uncomfortable truth: if you want to drop from 34% WER to 6–8%, you need a bigger model. Whisper medium would get you there, but it would shatter our latency budget. The time to run inference would balloon past what the system could tolerate. So we hit a hard constraint: stay fast or get accurate, but not both.
The lesson stuck with me: optimization has diminishing returns, and sometimes the smartest decision is recognizing when you’re chasing ghosts. The team documented the current optimal configuration—Whisper base with improved T5 filtering at 0.8/0.85 thresholds—and filed a ticket for future work. Sometimes shipping what works beats perfecting what breaks.
😄 Optimizing a speech-to-text system at 34% WER is like arguing about which airline has the best peanuts—you’re still missing the entire flight.
Metadata
- Branch: master
- Dev Joke: Why does Kotlin consider itself the best? Because Stack Overflow said so.