The Hidden Peak: Why We Almost Missed Our Best Accuracy Score

I was staring at results.json when something felt wrong. Our LLM Analysis project had just completed Phase 29b, and the final accuracy number looked… unremarkable. But I’d noticed something in the intermediate logs that wouldn’t leave me alone: a spike at 79.3% that vanished by the end of the run.
The culprit? Our eval_gsm8k() function was only recording the final accuracy number. We’d built the entire evaluation pipeline around a single verdict—the last checkpoint, the ultimate truth. But model performance during a run doesn’t work that way. It plateaus, it spikes, it crashes. We were missing the entire story.
Here’s what happened: I was reviewing the stdout logs (the ones we don’t normally save) and spotted that our curriculum-trained variant hit 79.3% accuracy on 150 GSM8K tasks—a four-percentage-point improvement over any previous experiment on the same checkpoint. That’s massive in the LLM world. But because we only saved the final number, results.json looked like just another run. The peak was invisible.
The fix seemed obvious in hindsight. I updated the eval_gsm8k() function across both train_exp29a.py and train_exp29b.py to return not just the final accuracy, but an intermediate array—accuracy measurements every 50 tasks—and a peak object capturing the maximum accuracy and when it occurred. Same function, smarter output.
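The actual eval_gsm8k() code isn’t shown in this post, so here’s a minimal sketch of the shape of the change. The function name comes from the post; the model/task interfaces, field names, and the 50-task window structure are illustrative assumptions:

```python
def eval_gsm8k(model, tasks, window=50):
    """Evaluate on GSM8K tasks, snapshotting running accuracy every `window` tasks.

    Hypothetical interface: `model(question)` returns an answer compared
    against `task["answer"]`. Returns the final accuracy plus the full
    trajectory and the peak, instead of a single headline number.
    """
    correct = 0
    intermediate = []  # running-accuracy snapshots, one per window
    for i, task in enumerate(tasks, start=1):
        if model(task["question"]) == task["answer"]:
            correct += 1
        if i % window == 0:
            intermediate.append({"tasks_seen": i, "accuracy": correct / i})

    final_accuracy = correct / len(tasks)
    # The peak may not be the final value -- that was the whole bug.
    best = max(intermediate, key=lambda s: s["accuracy"], default=None)
    peak = (
        {"accuracy": best["accuracy"], "at_task": best["tasks_seen"]}
        if best else None
    )
    return {"final": final_accuracy, "intermediate": intermediate, "peak": peak}
```

With this return shape, results.json records the whole curve, so a mid-run spike like the 79.3% one survives into the saved artifact instead of living only in stdout.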
But this wasn’t really a coding fix. It was a philosophy shift. We’d been thinking like engineers—optimize for the final metric—when we should’ve been thinking like researchers—track the trajectory. The intermediate numbers tell you which approach works for which problem subset. They tell you whether a method is stable or lucky. They tell you why one approach outperforms another.
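One way to make the “stable or lucky” distinction concrete: look at the spread of the running-accuracy snapshots. This is a hedged sketch with made-up numbers, not our actual data; the idea is simply that a low standard deviation across snapshots suggests a stable method, while a high one suggests a peak that may be luck:

```python
import statistics

# Hypothetical running-accuracy snapshots, one every 50 tasks.
stable_run = [0.780, 0.790, 0.793, 0.785]  # tight band around one level
spiky_run = [0.620, 0.793, 0.650, 0.700]   # same peak, much noisier path

def trajectory_spread(snapshots):
    """Population std dev of snapshot accuracies: lower means steadier."""
    return statistics.pstdev(snapshots)
```

A final-number-only pipeline scores both runs by their last entry and throws this distinction away.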
I added a critical note to MEMORY.md: “КРИТИЧНО: Промежуточные eval данные” (Critical: Intermediate eval data). Because this will happen again. Someone will optimize for the headline number and miss the real insight hiding in the curves.
The irony? There’s an old debugging joke: “The six stages are: that can’t happen, that doesn’t happen on my machine, that shouldn’t happen, why does that happen, oh I see, how did that ever work?” We’d been stuck at stage three, insisting our 79.3% spike “shouldn’t happen,” when we should have moved on to stage four: why does it happen? The curriculum data is giving us a signal on specific task subsets. Some problems love structure; others suffer from it. That’s not noise. That’s the answer.
Now we move to Phase 29c with this knowledge: track everything, trust nothing at face value, and always ask what the numbers are really hiding.
Metadata
- Session ID: grouped_llm-analisis_20260304_0558
- Branch: master
- Dev joke: Why did NestJS break up with the developer? Too many dependencies in the relationship.