Hunting the 79% Signal: When Clean Data Beats Dirty Shortcuts

I was staring at Phase 29a’s numbers when something caught my eye. The peak accuracy on GSM8K hit 79.3% — but there was a problem. I couldn’t replicate it. The intermediate evaluation data was missing, the training logs were patchy, and I had no idea which 150 tasks out of 500 had actually pushed the model over that threshold. It felt like chasing a ghost.
The culprit? Dirty data. Phase 29a had mixed in curriculum-ordered examples without cleaning them first, and while the peak looked impressive, the signal was buried under noise. By the time we hit 500 tasks, the accuracy collapsed to 73.0%. That’s a 6.3 percentage point drop from peak — a classic sign that something fundamental was wrong.
So I decided to rebuild from scratch with Phase 30b.
This time, I committed to clean data first. I stripped out the curriculum scheduling, removed the intermediate hacks, and ran the exact same GSM8K benchmark with proper tracking at every 50-task checkpoint. The goal was simple: if that 79% signal was real, it should reproduce. If it was noise, I needed to know.
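The tracking loop itself is nothing fancy. A minimal sketch of what "accuracy at every 50-task checkpoint" means in practice (the helper names `run_with_checkpoints` and `solve` are hypothetical; the real harness isn't shown here):

```python
def run_with_checkpoints(tasks, solve, checkpoint_every=50):
    """Evaluate `solve` over (question, answer) pairs, recording
    cumulative accuracy every `checkpoint_every` tasks."""
    correct = 0
    history = {}  # tasks seen so far -> cumulative accuracy
    for i, (question, answer) in enumerate(tasks, start=1):
        if solve(question) == answer:
            correct += 1
        if i % checkpoint_every == 0:
            history[i] = correct / i
    return history
```

The point of keeping the full `history` dict, rather than just the final number, is exactly what Phase 29a lacked: with it, a peak at any checkpoint can be located and replayed later.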
The results came back, and my instinct was right.
Phase 30b hit 79.0% at n=200 — just 0.3 points below 29a’s peak, despite using fundamentally different data. But here’s what mattered more: the final score at 500 tasks was 75.8%, not 73.0%. That’s a 2.8 percentage point improvement just from cleaning the data. The perplexity dropped to 2.14. The curve stayed smooth all the way down, no sudden collapses.
The signal was reproducible. It was real.
What surprised me most wasn’t the peak — it was the shape of the degradation. From 79.0% down to 75.8% is only a 3.2pp drop, compared to the 6.3pp cliff in 29a. Clean data meant the model’s confidence stayed calibrated even as it learned more examples. It wasn’t forgetting earlier lessons; it was integrating them.
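The peak-to-final drop is the single number I watch for this failure mode. A minimal sketch of that calculation, assuming checkpoint curves stored as plain `{tasks_seen: accuracy}` dicts (my own convention, not from any particular library):

```python
def peak_to_final_drop(history):
    """From a {n_tasks: accuracy} curve, return (peak accuracy,
    final accuracy, peak-to-final drop in percentage points)."""
    peak = max(history.values())
    final = history[max(history)]  # accuracy at the largest checkpoint
    return peak, final, round((peak - final) * 100, 1)
```

Feeding in the two runs' endpoints reproduces the numbers above: 29a's curve gives a 6.3pp drop, 30b's gives 3.2pp.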
But there’s a catch: Phase 30b still sits below 24a’s 76.8% when you look at the full run. The curriculum approach helps on the first 200 tasks, then starts hurting. That tells me the strategy itself isn’t the problem — it’s how we’re applying it. We need selective curriculum, not blanket curriculum.
Next step? Phase 30a — a diagnostic baseline that tracks which specific tasks 30b solves better or worse than the clean baseline. Once I have that problem-level granularity, I can design a smarter curriculum that knows when to order examples and when to let randomness win.
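The problem-level comparison I have in mind is just a set diff over per-task correctness. A sketch under the assumption that each run's results are stored as `{task_id: solved}` (hypothetical shape; the actual 30a format isn't settled yet):

```python
def task_level_diff(results_a, results_b):
    """Compare per-task correctness dicts {task_id: bool}.
    Returns (gained, lost): task ids run B solves but A doesn't,
    and task ids A solves but B doesn't."""
    gained = {t for t in results_b
              if results_b[t] and not results_a.get(t, False)}
    lost = {t for t in results_a
            if results_a[t] and not results_b.get(t, False)}
    return gained, lost
```

If curriculum ordering only helps early, the `lost` set should skew toward tasks seen late in the run; that's the hypothesis 30a is meant to test.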
For now, though, I’ve got my GO-signal: peak accuracy above 79%, final accuracy above 75%, and reproducibility that didn’t exist before.
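Those GO criteria are simple enough to encode as a sanity check. Thresholds as I'm applying them to this run (treating 79.0% at n=200 as clearing the peak bar, so `>=` rather than strict `>`):

```python
def go_signal(peak_acc, final_acc, reproduced):
    """GO if peak accuracy >= 79%, final accuracy > 75%,
    and the peak reproduced across independent runs."""
    return peak_acc >= 0.79 and final_acc > 0.75 and reproduced
```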
Clean data wins. It always does.

Obligatory dev joke: why did the Python data scientist get arrested at customs? She was caught trying to import pandas! 😄
Metadata
- Session ID: grouped_llm-analisis_20260304_0913
- Branch: master
- Dev Joke: What do React and a cat have in common? Both only do what they want and ignore your instructions.