BorisovAI
New Feature · speech-to-text · Claude Code

Training a Speech Recognition Model to Handle Real-World Noise

The “zapis” wake-word detector was frustratingly broken. In my testing, it achieved near-perfect accuracy on clean audio—97.7% validation accuracy, 99.9% true positive rate—but the moment I tested it against real microphone input with ambient noise, it completely failed. Zero detection. The model had learned to recognize a perfectly sanitized voice in silence, but that’s not how the world works.

The culprit was obvious once I examined the training data: I’d been padding the audio with artificial zeros—mathematically clean silence. The neural network had essentially learned to exploit that artifact. When it encountered actual background noise during streaming tests, the model didn’t know what to do.
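To make the artifact concrete, here is a minimal sketch of the two padding strategies. This is illustrative, not the actual training code: the sample rate, noise level, and helper names are assumptions.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate, not confirmed by the post

def pad_with_zeros(clip: np.ndarray, target_len: int) -> np.ndarray:
    """The flawed approach: pad with mathematically pure silence."""
    pad = target_len - len(clip)
    return np.pad(clip, (pad // 2, pad - pad // 2))

def pad_with_noise(clip: np.ndarray, target_len: int,
                   noise_rms: float = 0.01) -> np.ndarray:
    """Pad with low-level Gaussian noise approximating an idle microphone."""
    pad = target_len - len(clip)
    left = np.random.randn(pad // 2) * noise_rms
    right = np.random.randn(pad - pad // 2) * noise_rms
    return np.concatenate([left, clip, right])

# Stand-in for a 0.5 s utterance:
clip = np.random.randn(SAMPLE_RATE // 2) * 0.1
padded = pad_with_zeros(clip, SAMPLE_RATE)
# The padding regions are exactly zero, a trivially learnable artifact:
assert np.all(padded[:1000] == 0.0)
```

A model can separate "wake word" from "background" just by checking whether the edges of the window are exactly zero, which is precisely the shortcut that breaks on a real microphone.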

So I retrained from scratch, this time feeding the model realistic scenarios: voice embedded in genuine microphone noise, without the artificial padding. The architecture grew from 6,000 parameters to 107,137—the exported ONNX file ballooned from 22 KB to 433 KB—but the tradeoff was worth it.
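Embedding voice in genuine noise usually means mixing the two at a controlled signal-to-noise ratio. The post doesn't show its augmentation code, so this is a generic sketch of that technique; the function name and SNR range are my own.

```python
import numpy as np

def mix_at_snr(voice: np.ndarray, noise: np.ndarray,
               snr_db: float) -> np.ndarray:
    """Embed a voice clip in background noise at a target SNR (in dB)."""
    voice_rms = np.sqrt(np.mean(voice ** 2))
    noise_rms = np.sqrt(np.mean(noise ** 2))
    # Scale the noise so that voice_rms / scaled_noise_rms matches snr_db.
    target_noise_rms = voice_rms / (10 ** (snr_db / 20))
    scaled_noise = noise * (target_noise_rms / (noise_rms + 1e-12))
    return voice + scaled_noise

rng = np.random.default_rng(0)
voice = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000) * 0.1
noise = rng.standard_normal(16000) * 0.05  # stand-in for recorded mic noise
mixed = mix_at_snr(voice, noise, snr_db=5.0)
```

Sampling `snr_db` randomly per training example (say, 0 to 20 dB) forces the model to find the voice rather than the absence of noise.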

The results were dramatic. Test scenarios that previously scored 0.0 now achieved 0.9997 accuracy. A simulated real-time streaming test with noise-voice-noise sequences? Perfect detection. The model had learned what it actually needed to learn: distinguishing a wake word from the chaotic symphony of real life.
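A noise-voice-noise streaming test amounts to sliding a window over a continuous sample stream and flagging windows the detector scores above a threshold. The sketch below uses an RMS-energy stand-in for the real model; the window size, hop, and threshold are assumptions, not values from the post.

```python
import numpy as np

WINDOW = 16000  # 1 s analysis window (assumed)
HOP = 1600      # 100 ms hop (assumed)

def energy_score(window: np.ndarray) -> float:
    # Stand-in for the real wake-word model: RMS energy clipped to [0, 1].
    return float(min(1.0, np.sqrt(np.mean(window ** 2)) * 10))

def detect_stream(samples: np.ndarray, score_fn,
                  threshold: float = 0.5) -> list[int]:
    """Return start offsets of windows the scorer flags as the wake word."""
    hits = []
    for start in range(0, len(samples) - WINDOW + 1, HOP):
        if score_fn(samples[start:start + WINDOW]) >= threshold:
            hits.append(start)
    return hits

rng = np.random.default_rng(0)
noise = rng.standard_normal(16000) * 0.01  # ~1 s of ambient noise
voice = rng.standard_normal(16000) * 0.3   # stand-in for the wake word
stream = np.concatenate([noise, voice, noise])
hits = detect_stream(stream, energy_score)
```

In this toy stream, the pure-noise windows at the start and end score below the threshold while windows overlapping the "voice" segment fire, which is the noise-voice-noise pattern the real test exercised.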

There were costs, of course. The retrained model now struggles with the artificial-silence test case—accuracy dropped from 0.9998 to 0.118. But that’s not a bug; it’s the correct behavior. In production, microphones never deliver silence; they deliver a constant hum of ambient noise. Optimizing for zeros would be optimizing for a problem that doesn’t exist.

While waiting for the companion “stop” model to finish training on the same realistic data, I realized something: machine learning models are brutally literal. They don’t generalize from clean training data to messy real data the way humans do. They exploit whatever patterns are easiest, whether those patterns are meaningful or just artifacts of how you labeled your examples. The gap between lab conditions and production is where most AI projects fail—not because the algorithms are weak, but because the training data lied about what the world actually looks like.

Next step: test both models end-to-end in an actual voice control loop. But for now, the wake-word detector finally lives in reality instead of a sterile simulation.

Sometimes the best model isn’t the one with the highest accuracy—it’s the one trained on truth. 😄

Metadata

Session ID:
grouped_speech-to-text_20260219_1840
Branch:
master
Dev Joke
If Elasticsearch is working, don't touch it. If it's not working, don't touch it either; it will only get worse.
