The Narrow Path: Why Perfect Optimization Crumbles

I’ve been chasing the golden number for weeks now. Phase 24a delivered 76.8% accuracy on GSM8K—a solid baseline for mathematical reasoning in large language models. The team was excited. I was cautious. In my experience, when a result feels too clean, it’s usually balanced on a knife’s edge.
So I decided to push further with Phase 29a and 29b, two experiments designed to improve what we already had. The strategy seemed sound: inject curriculum data to guide the model toward harder problems, and extend training from 500 to 1,000 steps to capture finer pattern recognition. Standard moves in the playbook.
Phase 29a involved adding 89 borderline solutions—answers sampled at higher temperatures, intentionally less deterministic. I thought diversity would help. Instead, I watched accuracy plummet to 73.0%, a 3.8 percentage point drop. The perplexity exploded to 2.16, compared to the baseline’s 1.60. The model was struggling, not learning. Those temperature-sampled solutions weren’t diverse training signal—they were noise wearing a training label.
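To put those perplexity numbers in perspective: perplexity is just the exponential of the mean per-token cross-entropy loss, so the jump from 1.60 to 2.16 implies a substantial rise in the underlying loss. A minimal sketch of the relationship (the conversion is standard; the derived loss values are back-computed from the reported perplexities, not taken from the actual training logs):

```python
import math

def perplexity(mean_ce_loss: float) -> float:
    """Perplexity is exp() of the mean per-token cross-entropy loss."""
    return math.exp(mean_ce_loss)

# Recover the per-token loss implied by each reported perplexity.
baseline_loss = math.log(1.60)   # ≈ 0.470
phase29a_loss = math.log(2.16)   # ≈ 0.770

print(f"baseline implied loss  ≈ {baseline_loss:.3f}")
print(f"phase 29a implied loss ≈ {phase29a_loss:.3f}")
print(f"round-trip check: {perplexity(baseline_loss):.2f}")  # 1.60
```

In other words, the noisy borderline data pushed the model's average per-token loss up by roughly 64% relative to baseline, which is consistent with it fitting noise rather than signal.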
Then came Phase 29b: double the training steps. Surely more iterations would converge to something better? The loss hit 0.004—nearly zero. The model was memorizing, not generalizing. Accuracy barely limped to 74.4%, still 2.4 points underwater. The lesson hit hard: we’d already found the optimum at 500 steps. Beyond that, we weren’t learning—we were overfitting.
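This is the textbook failure mode that early stopping on a held-out metric is designed to catch: training loss approaching zero while eval accuracy drifts down. A minimal sketch of such a guard, assuming periodic evals (the accuracy trace and patience value below are hypothetical, chosen only to illustrate the pattern from these runs):

```python
def should_stop(eval_history: list[float], patience: int = 2) -> bool:
    """Stop when held-out accuracy hasn't improved for `patience`
    consecutive evals, no matter how low the training loss goes."""
    if len(eval_history) <= patience:
        return False
    best_so_far = max(eval_history[:-patience])
    # True only if none of the last `patience` evals beat the earlier peak.
    return all(acc <= best_so_far for acc in eval_history[-patience:])

# Hypothetical accuracy trace, evaluated every 250 steps:
trace = [0.710, 0.745, 0.768, 0.752, 0.744]
print(should_stop(trace))  # True: the last two evals never beat the peak
```

Had a check like this been wired in, training would have halted near the 500-step peak instead of grinding on to 1,000 steps and memorizing the training set.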
What struck me most wasn’t the failed experiments themselves. It was how fragile the baseline turned out to be. Phase 24a wasn’t a robust solution—it was a brittle peak. The moment I changed the data composition or training duration, the whole structure collapsed. The algorithm had found a narrow channel where everything aligned perfectly: the right data distribution, the right training length, the right balance. Wiggle anything, and you tumble out.
This is the hard truth about optimization in machine learning: sometimes the best result isn’t a foundation—it’s a lucky intersection. You can’t always scale it. You can’t always improve it by adding more of what worked before.
We still have Phase 29c (multi-expert routing) and 29d (MATH domain data) queued up. But I’m approaching them differently now. Not as simple extensions of success, but as careful explorations of why the baseline works at all.
The irony? This mirrors something I read once: “Programming is like sex. Make one mistake and you end up supporting it for the rest of your life.” 😄 In optimization, it’s worse—you might be supporting someone else’s lucky mistake, and have no idea where the luck ends and the skill begins.
Metadata
- Session ID: grouped_llm-analisis_20260304_0037
- Branch: master
- Dev joke: Why is Vim a developer's best friend? Because without it, nothing works. With it, nothing works either, but at least you have someone to blame.