When Data Beats Architecture: The Self-Generated CoT Breakthrough

I hit a wall with the expert panel system. Three months into optimizing the 18c-v3 two-phase model, every architectural tweak failed to fix a stubborn 8.6 percentage point downstream degradation. The experts trained perfectly on next-token prediction, but somehow couldn’t apply that knowledge when solving actual problems.
The hypothesis seemed obvious: the model needs a better architecture. LoRA adapters? Progressive growth? Specialized routing layers? I sketched out Phase 19 with three parallel experiments ready to run, each promising to unlock the bottleneck through structure alone.
But then I noticed something odd in data_nlp_v4.py. The math expert was trained on human-written CoT reasoning, the carefully crafted step-by-step solutions from GSM8K. Perfect training data, right? Except at inference time the model had to generate its own reasoning. The formats didn't match: training used "Problem: {q}\nSolution: {a}" (human style), while the model's own outputs followed "Question: ...\nAnswer: ...". The expert learned to predict human thinking, not self-generated reasoning.
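To make the mismatch concrete, here is a toy illustration (the question and answer are made up; the two templates are the ones quoted above):

```python
# Train-time template: human-written GSM8K-style solutions.
human_style = "Problem: {q}\nSolution: {a}"
# Inference-time template: the pattern the model actually produces on its own.
model_style = "Question: {q}\nAnswer: {a}"

# Toy example, purely illustrative.
q = "A farmer plants 3 fields with 12 rows each. How many rows in total?"
a = "3 * 12 = 36 rows in total."

print(human_style.format(q=q, a=a))
print(model_style.format(q=q, a=a))
```

Same content, different surface form, and next-token prediction is exquisitely sensitive to surface form.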
So I flipped the experiment. Instead of architectural fixes, I generated 7,473 training examples using the model’s own CoT predictions—self-distillation through a specialized module. No LoRA. No growth mechanisms. Just aligned data.
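The generation loop boils down to something like the sketch below. All names here (`generate`, `is_correct`, `build_self_distilled_dataset`) are hypothetical stand-ins for the project's actual module, not its real API; the point is the shape of the pipeline: sample the model's own CoT in its own format, filter for correct answers, and train on what survives.

```python
def build_self_distilled_dataset(questions, generate, is_correct):
    """Collect the model's own CoT outputs as training targets.

    `generate` is a stand-in for the model's inference call;
    `is_correct` checks the final answer against ground truth.
    """
    dataset = []
    for q in questions:
        prompt = f"Question: {q}\nAnswer:"   # the model's OWN format
        cot = generate(prompt)               # self-generated reasoning
        if is_correct(q, cot):               # keep only correct chains
            dataset.append({"prompt": prompt, "completion": cot})
    return dataset

# Toy usage with stub functions standing in for the real model:
dataset = build_self_distilled_dataset(
    ["2 + 2 = ?"],
    generate=lambda p: " 2 + 2 = 4. The answer is 4.",
    is_correct=lambda q, cot: "4" in cot,
)
print(len(dataset))  # 1
```

Filtering on the final answer is what keeps self-distillation from amplifying the model's own mistakes.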
The results were immediate and brutal in their clarity: the -8.6pp degradation completely vanished. Better—accuracy actually improved by 1.1 percentage points. Phase 21 hit 77.5% accuracy with just 500 training steps, a project record.
The insight cuts deep. We spent weeks optimizing how information flows through the network when the real problem was what information arrived at the gate. The architecture was never broken. The data was teaching the wrong lesson.
This completely reframed how I’m thinking about Phase 21’s follow-up work. Scaling isn’t about adding more expert modules or clever routing. It’s about ensuring every byte of training data aligns with the actual task the model will face. A simpler architecture with perfect data beats sophisticated engineering with mismatched signals every single time.
Debugging is funny that way—sometimes removing the needles from the haystack means realizing you’ve been throwing in the wrong hay. 😄
Metadata
- Session ID: grouped_llm-analisis_20260223_2210
- Branch: HEAD
- Dev Joke: Why is Bun a developer's best friend? Because without it nothing works. With it nothing works either, but at least there's someone to blame.