
Routing Experts on CIFAR-100: When Specialization Meets Reality


I’ve spent three weeks chasing a frustrating paradox in a mixture-of-experts (MoE) architecture. The oracle router—theoretically perfect—achieves 80.78% accuracy on CIFAR-100. My learned router? 72.93%. A nearly eight-point gap that shouldn’t exist.

The architecture works. The routing just refuses to learn.

The BatchNorm Ambush

Phase 12 started with hot-plugging: freeze one expert, train its replacement, swap it back. The first expert’s accuracy collapsed by 2.48 percentage points. I dug through code for hours, assuming it was inevitable drift. Then I realized the trap: BatchNorm updates its running statistics even with frozen weights. When I trained other experts, the shared backbone’s BatchNorm saw new data, recalibrated, and silently corrupted the frozen expert’s inference.

The fix was embarrassingly simple: explicitly call eval() on the shared backbone whenever a train() call elsewhere flips its mode. Drift dropped to 0.00pp. Half a day wasted on an engineering detail, but at least this problem had a solution.
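The trap and the fix can be reproduced in a few lines. The layer below is a stand-in for the shared backbone, not the post's actual model:

```python
import torch
import torch.nn as nn

# Minimal reproduction of the trap: freezing weights does NOT freeze
# BatchNorm's running statistics.
bn = nn.BatchNorm2d(8)
for p in bn.parameters():
    p.requires_grad = False          # weights frozen...

before = bn.running_mean.clone()
bn(torch.randn(4, 8, 16, 16))        # ...but a train-mode forward pass
after_train = bn.running_mean.clone()
assert not torch.equal(before, after_train)  # stats silently drifted

# The fix: eval() stops running-stat updates during frozen inference.
bn.eval()
frozen = bn.running_mean.clone()
bn(torch.randn(4, 8, 16, 16))
assert torch.equal(frozen, bn.running_mean)  # no more drift
```

The asymmetry is easy to miss because `requires_grad=False` guards the optimizer path, while running statistics are updated inside the forward pass itself, gated only by the module's train/eval mode.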

The Routing Ceiling

Phase 13 was the reckoning. I’d validated the architecture through pruning cycles—80% sparsity, repeated regrow iterations, stable accuracy accumulation. The infrastructure was solid. So I tried three strategies to close the expert gap:

Strategy A: Replace the single-layer nn.Linear(128, 4) router with a deep network. One layer seemed too simplistic. Result: 73.32%. Marginal. The router architecture wasn’t the bottleneck.

Strategy B: Joint training—unfreeze experts while training the router, let them co-evolve. I got 73.74%, still well below the oracle. Routing accuracy plateaued at 62.5% across all variants. Hard ceiling.

Strategy C: Deeper architecture plus joint training. Same 62.5% routing accuracy. No improvement.
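The two router variants compared above can be sketched as follows. The input width (128) and expert count (4) are from the post; the deep variant's hidden width of 256 is an assumption:

```python
import torch
import torch.nn as nn

# Strategy A's baseline: a single-layer router over backbone features.
shallow_router = nn.Linear(128, 4)

# The deeper replacement (hidden width 256 is an illustrative choice).
deep_router = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 4),
)

x = torch.randn(32, 128)            # a batch of backbone features
assert shallow_router(x).shape == (32, 4)
assert deep_router(x).shape == (32, 4)
```

Both map the same 128-dimensional features to 4 expert logits, which is why extra depth cannot help if those features carry no separable domain signal in the first place.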

The routing matrix told the truth I didn’t want to hear: CIFAR-100’s 100 classes don’t naturally partition into four specialized domains. Each expert stream sees data from all 100 classes. Gradients come from everywhere. Domain specificity dissolves. The router can’t learn separation because the experts never truly specialize.
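The routing-accuracy number quoted above can be read as agreement between the learned router's argmax and the oracle's partition. A minimal sketch, with illustrative names and toy values:

```python
import torch

def routing_accuracy(router_logits: torch.Tensor,
                     oracle_expert: torch.Tensor) -> float:
    """Fraction of samples routed to the same expert the oracle picks."""
    pred = router_logits.argmax(dim=1)
    return (pred == oracle_expert).float().mean().item()

# Toy batch: 4 samples, 4 experts.
logits = torch.tensor([[2.0, 0.1, 0.0, 0.0],
                       [0.0, 3.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 2.0],   # disagrees with oracle
                       [1.0, 0.0, 0.0, 0.0]])
oracle = torch.tensor([0, 1, 2, 0])
print(routing_accuracy(logits, oracle))  # → 0.75
```

With four experts, chance agreement is 25%, so a hard ceiling at 62.5% means the router learned something, just nowhere near the oracle's partition.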

The Lesson

This isn’t about router depth or training strategy. It’s architectural. You can’t demand specialization when every expert sees identical data distribution. The oracle works mathematically—it knows the optimal partition. But learning that partition from scratch when the data doesn’t support it? That’s asking the model to do magic.

Phase 12 taught me to debug carefully. Phase 13 taught me to read the data. The solution isn’t a better router. It’s either a dataset with actual domain structure, or acceptance that on CIFAR-100, this pattern doesn’t scale.

Fun fact: Apparently, changing random things until code works is “hacky” and “bad practice,” but do it fast enough, call it “Machine Learning,” and suddenly it’s worth 4x your salary. 😄

Metadata

Session ID:
grouped_C--projects-bot-social-publisher_20260217_1203
Branch:
main
Dev Joke
Why did ZeroMQ break up with the developer? Too many dependencies in the relationship.
