BorisovAI
Tags: Bug Fix · llm-analisis · Claude Code

Routing Experts on CIFAR-100: Why Specialization Doesn't Scale


I’ve been chasing a frustrating paradox for three weeks. The oracle router—hypothetically perfect—achieves 80.78% accuracy on CIFAR-100 using a mixture-of-experts architecture. Yet my learned router plateaus at 72.93%, leaving a 7.85 percentage point gap that shouldn’t exist. The architecture works. The routing just… doesn’t learn.

The Experiments That Broke Everything

Phase 12 brought clarity, albeit painful. First, I discovered that BatchNorm running statistics update even with frozen weights. When hot-plugging new experts during training, their BatchNorm layers drift—silently corrupting the model by 2.48pp. The fix was surgical: explicitly re-apply eval() to the frozen backbone whenever train() is called. Zero drift. Problem solved.
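The fix can be sketched in a few lines of PyTorch. Everything below is a toy stand-in (module names and shapes are my assumptions, not the project's actual code); the point is overriding train() so the frozen backbone is always pushed back into eval mode:

```python
import torch.nn as nn

class ExpertModel(nn.Module):
    """Toy stand-in for a frozen backbone plus a trainable head (hypothetical names)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.BatchNorm2d(8),  # running_mean/running_var update in train mode
        )
        self.head = nn.Linear(8, 4)

    def train(self, mode: bool = True):
        # nn.Module.train() flips *every* submodule to training mode, which
        # re-enables BatchNorm running-stat updates even when weights are frozen.
        super().train(mode)
        # Force the frozen backbone back to eval so its statistics stay fixed.
        self.backbone.eval()
        return self

model = ExpertModel().train()
# Despite train(), the backbone's BatchNorm stays in eval mode:
assert not model.backbone[1].training
assert model.head.training
```

Freezing weights with requires_grad=False is not enough on its own: BatchNorm's running statistics are buffers, not parameters, and they update on every forward pass while the layer is in training mode.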

But the routing problem persisted.

Then came the stress test. I cycled through three prune-regrow iterations—each pruning to 80% sparsity, training for 20 epochs with the mask applied, then regrowing and fine-tuning for 40 epochs. Accuracy improved cumulatively across cycles instead of degrading. The architecture was genuinely stable. That wasn’t the bottleneck.
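The masking step behind each cycle can be sketched as follows—a minimal illustration assuming simple per-tensor magnitude pruning (the function and variable names are mine, not the project's):

```python
import torch

def magnitude_prune_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """0/1 mask that zeroes out the smallest-magnitude `sparsity` fraction of weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight)
    # k-th smallest absolute value serves as the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

torch.manual_seed(0)
w = torch.randn(16, 16)
mask = magnitude_prune_mask(w, 0.8)   # prune to ~80% sparsity
# During the 20 masked epochs, updates are applied as (grad * mask), so the
# pruned weights stay at zero; the regrow step then lifts the mask again.
achieved = 1.0 - mask.mean().item()   # close to 0.80
```

Multiplying gradients by the mask during the masked epochs is what keeps pruned weights at exactly zero until regrowth.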

The Fundamental Ceiling

Phase 13 was the reckoning. I tried three strategies:

Strategy A: Replaced the single-layer nn.Linear(128, 4) router with a deep neural network. Reasoning: a one-layer router is too simplistic to capture domain complexity. Result: 73.32%. Marginal gain. The router architecture wasn’t the constraint.
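Strategy A's swap can be sketched like this. The 128-dimensional features and 4 experts come from the post; the hidden widths and dropout rate are my guesses, not the actual configuration:

```python
import torch
import torch.nn as nn

# Baseline single-layer router: 128-d features -> 4 expert logits.
shallow_router = nn.Linear(128, 4)

# Strategy A's deeper replacement (hidden widths and dropout are assumptions):
deep_router = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Linear(64, 4),
)

x = torch.randn(32, 128)   # a batch of 32 feature vectors
logits = deep_router(x)    # per-expert routing logits, shape (32, 4)
```

Both routers produce the same output shape; only capacity changes—which is exactly why the marginal gain pointed away from router capacity as the constraint.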

Strategy B: Joint training—unfreezing experts while training the router. Maybe they need to co-evolve? The model hit 73.74%, still well below the oracle’s 80.78%. Routing accuracy plateaued around 62.5% across all variants, a hard ceiling.

Strategy C: Deeper architecture + joint training. Same 62.5% routing ceiling.

The routing matrix revealed the culprit: CIFAR-100’s 100 classes don’t naturally partition into four specialized domains when trained jointly. The gradients from all classes cross-contaminate expert specialization. You either get specialization or routing accuracy—not both.
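For concreteness, routing accuracy here means agreement with the oracle's class-to-expert partition. A toy sketch with synthetic data—the partition and the 62.5% hit rate are simulated to mirror the observed ceiling, not measured from the model:

```python
import numpy as np

def routing_accuracy(router_choice, labels, class_to_expert):
    """Fraction of samples routed to the expert the oracle partition assigns."""
    return (router_choice == class_to_expert[labels]).mean()

rng = np.random.default_rng(0)
class_to_expert = rng.integers(0, 4, size=100)   # hypothetical 100-class -> 4-expert map
labels = rng.integers(0, 100, size=10_000)       # synthetic sample labels
oracle = class_to_expert[labels]
# Simulate a learned router that agrees with the oracle ~62.5% of the time:
router_choice = np.where(rng.random(10_000) < 0.625, oracle, (oracle + 1) % 4)
acc = routing_accuracy(router_choice, labels, class_to_expert)   # roughly 0.625
```

Every misroute sends a sample to an expert that never specialized on its class, which is why overall accuracy tracks routing accuracy so tightly once expert specialization is fixed.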

The Punchline

Sometimes the oracle gap isn’t a bug in your implementation—it’s a theorem in disguise. The 7.85pp gap is real and architectural, not a tuning problem. You can’t train a router to route what doesn’t exist: genuine specialization under joint gradient pressure.

Here’s where I land: Phase 12b’s BatchNorm fix is production-ready, solving hot-plug stability. Phase 13 taught me that mixture-of-experts on CIFAR-100 has a hard ceiling around 74%, not 80.78%. The oracle gap measures the gap between what’s theoretically possible and what’s learnable—a useful diagnostic.

A programmer puts two glasses on his bedside table before sleep: one full, one empty. One for thirst, one for optimism. 😄

Metadata

Session ID:
grouped_llm-analisis_20260217_1202
Branch:
HEAD
Dev Joke
An ideal day: not a single Jira ticket. A real day: 15 tickets, 3 meetings, 0 commits.
