The 79.3% Peak We Almost Missed: Why Intermediate Data Matters

We were drowning in numbers. Phase 29a of our LLM curriculum learning experiment had completed, and as always, I opened results.json to check the final accuracy score. 79.3% jumped out at me: a stunning improvement over the baseline. I felt the familiar rush of a breakthrough moment. Then reality hit.
The problem wasn’t that we got 79.3%. The problem was that we almost didn’t see it.
Here’s what happened: our eval_gsm8k() function was printing intermediate results every 50 GSM8K problems directly to stdout. The model achieved 119 correct answers out of 150 on the curriculum-selected subset—a crisp 79.3%. But the function only returned a final aggregate number to the results JSON. We had metrics, sure, but we had architecture blindness. The curriculum learning pipeline was evaluating on curated problem sets, reporting aggregate accuracy, and we were reading the digest instead of analyzing the signal.
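A minimal sketch of the anti-pattern described above. The function name eval_gsm8k() comes from the post; the data shapes, the model_answer callable, and the exact print format are hypothetical stand-ins:

```python
def eval_gsm8k(problems, model_answer):
    """Original shape of the evaluator: intermediate accuracy exists
    only as stdout prints; only the aggregate reaches results.json."""
    correct = 0
    for i, problem in enumerate(problems, start=1):
        if model_answer(problem["question"]) == problem["answer"]:
            correct += 1
        if i % 50 == 0:
            # The intermediate signal lives here and nowhere else:
            # once the console scrolls past, it is gone.
            print(f"{i} problems: {correct / i:.1%}")
    return correct / len(problems)  # single number, no per-checkpoint data
```

Everything interesting (the divergence between curriculum and general subsets, the inflection points) is visible only if someone happens to be watching the terminal.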
When I dug into the stdout logs afterward, the pattern became visible: the curriculum data helped dramatically on certain problem categories while actively harming performance on others. The remaining 350 general GSM8K problems showed only 70.3% accuracy. Curriculum isn’t magic—it’s direction. And we weren’t capturing the directional information.
The fix was architectural, not mathematical. I refactored eval_gsm8k() to return an intermediate array alongside the final result. Now every 50-problem checkpoint gets logged as a structured object: problem count, accuracy at that point, and the precise subset being evaluated. No more stdout archaeology. No more reading printed logs like ancient texts.
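A sketch of the refactor under the same assumptions (hypothetical data shapes and parameter names; only eval_gsm8k() and the checkpoint fields, problem count, accuracy, and subset, come from the post):

```python
def eval_gsm8k(problems, model_answer, subset_name, checkpoint_every=50):
    """Refactored evaluator: every checkpoint is a structured object
    returned alongside the final aggregate, not printed and lost."""
    correct = 0
    checkpoints = []
    for i, problem in enumerate(problems, start=1):
        if model_answer(problem["question"]) == problem["answer"]:
            correct += 1
        if i % checkpoint_every == 0:
            checkpoints.append({
                "problem_count": i,          # how many problems so far
                "accuracy": correct / i,     # running accuracy at this point
                "subset": subset_name,       # which subset is being evaluated
            })
    return {
        "accuracy": correct / len(problems),
        "checkpoints": checkpoints,
    }
```

The whole dict can then be serialized to results.json with json.dump(), so the checkpoint trajectory survives the run instead of vanishing with the console buffer.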
This isn’t just about not missing peaks. It’s about being able to explain them. When curriculum learning works, you want to know which parts worked. When it fails, you need the granular data to debug. We were optimizing blind, tweaking parameters based on a single final number while the real story—the inflection points, the divergence between curriculum and general problems—lived only in console output that scrolled past and vanished.
The old joke among engineers: four of us are in a car that won't start. The IT engineer's solution? "Get out and get back in." Sometimes that's exactly what debugging requires: stepping out, restarting, and changing where you're looking. We weren't looking at intermediate checkpoints. Now we are.
Metadata
- Session ID: grouped_llm-analisis_20260304_0556
- Branch: master
- Dev Joke: NoSQL is when you're so tired of JOINs that you're ready to duplicate the data.