BorisovAI

SharedParam MoE Beat the Baseline: How 4 Experts Outperformed 12

I started Experiment 10 with a bold hypothesis: a Mixture of Experts (MoE) architecture with shared parameters could beat a hand-tuned baseline while using fewer expert modules. The baseline sat at 70.45% accuracy with 4.5M parameters spread across 12 independent experts. I was skeptical.

The setup was straightforward but clever. Condition B implemented a SharedParam MoE with only 4 experts instead of 12—but here’s the trick: the experts shared underlying parameters, making the whole model just 2.91M parameters. I added Loss-Free Balancing to keep all 4 experts alive during training, preventing the usual expert collapse that plagues MoE systems.
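The post doesn't show the balancing implementation, but the core idea behind Loss-Free Balancing is to skip the usual auxiliary load-balancing loss and instead add a per-expert bias to the router's scores, nudging it after each batch against whatever imbalance just occurred. Here's a minimal, purely illustrative top-1-routing simulation of that idea; the function names, constants, and the synthetic skewed logits are all my own assumptions, not the experiment's actual code:

```python
import random

NUM_EXPERTS = 4
TOKENS = 64  # tokens per simulated batch

def route(logits, bias):
    # top-1 routing: pick the expert with the highest biased score
    scores = [l + b for l, b in zip(logits, bias)]
    return scores.index(max(scores))

def update_bias(bias, counts, lr=0.1):
    # loss-free balancing idea: after each batch, nudge each expert's
    # bias up if it was under-used and down if it was over-used
    target = sum(counts) / len(counts)
    return [b + lr * (1 if c < target else -1) for b, c in zip(bias, counts)]

def skewed_logits(rng):
    # router logits biased toward expert 0, which would normally collapse
    return [rng.gauss(1.0 if e == 0 else 0.0, 0.5) for e in range(NUM_EXPERTS)]

rng = random.Random(0)

# without balancing, expert 0 hogs almost everything
counts_raw = [0] * NUM_EXPERTS
for _ in range(TOKENS):
    counts_raw[route(skewed_logits(rng), [0.0] * NUM_EXPERTS)] += 1

# with the bias nudged every batch, load evens out over time
bias = [0.0] * NUM_EXPERTS
history = []
for step in range(300):
    counts = [0] * NUM_EXPERTS
    for _ in range(TOKENS):
        counts[route(skewed_logits(rng), bias)] += 1
    if step >= 250:
        history.append(counts)
    bias = update_bias(bias, counts)

avg_util = [sum(h[e] for h in history) / (len(history) * TOKENS)
            for e in range(NUM_EXPERTS)]
print("unbalanced:", counts_raw)
print("balanced utilization:", [round(u, 2) for u in avg_util])
```

Because the bias only shifts routing decisions and never enters the loss, it balances load without distorting the gradient, which is exactly why it avoids the expert collapse mentioned above.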

The first real surprise came at epoch 80: Condition B hit 65.54%, already trading blows with Condition A (my no-MoE control). By epoch 110, the gap widened—B reached 69.07% while A stalled at 67.91%. The routing mechanism was working. Each expert held utilization around 0.5, perfectly balanced, never dead-weighting.
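A per-expert utilization of 0.5 is exactly what balance looks like under top-2 routing over 4 experts: each token activates 2 of the 4, so a perfectly even router gives every expert k/E = 2/4 = 0.5 of the tokens. The post doesn't state the routing k, so top-2 is my inference from that number; here's a tiny sketch of the metric with made-up assignment data:

```python
def utilization(assignments, num_experts):
    # fraction of tokens that each expert processed;
    # each token lists the experts it was routed to (top-k routing)
    counts = [0] * num_experts
    for topk in assignments:
        for e in topk:
            counts[e] += 1
    return [c / len(assignments) for c in counts]

# four tokens, each routed to 2 of 4 experts, perfectly balanced
assignments = [(0, 1), (2, 3), (0, 2), (1, 3)]
print(utilization(assignments, 4))  # → [0.5, 0.5, 0.5, 0.5]
```

A dead expert would show up here as a utilization pinned near 0.0, which is what the balancing prevented.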

Then epoch 130 hit like a plot twist. Condition B: 70.71%—already above baseline. I’d beaten the reference point with one-third fewer parameters. The inference time penalty was real (29.2ms vs 25.9ms), but the accuracy gain felt worth it. All 4 experts were alive and thriving across the entire training run—no zombie modules, no wasted capacity.

When Condition B finally completed, it settled at 70.95% accuracy. Let me repeat that: a sparse MoE with 4 shared-parameter experts, trained without expert collapse, exceeded a 12-expert baseline by 0.50 percentage points while weighing 35% less.

But I didn’t stop there. I ran Condition C (Wide Shared variant) as a control—it maxed out at 69.96%, below B. Then came the real challenge: MixtureGrowth (Exp 10b). What if I started tiny—182K parameters—and grew the model during training?

The results were staggering. The grown model hit 69.65% accuracy starting from a seed, while a scratch-trained baseline of identical final size only reached 64.08%. That’s a 5.57 percentage point gap just from the curriculum effect of gradual growth. The seed-based approach took longer (3537s vs 2538s), but the quality jump was undeniable.
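The post doesn't detail how MixtureGrowth expands the 182K-parameter seed, but one standard function-preserving growth trick (in the spirit of Net2WiderNet) is to duplicate hidden units and halve their outgoing weights, so the grown model computes exactly the same function and training simply resumes at the larger size. This toy single-layer sketch is purely illustrative and assumes nothing about the actual growth schedule used in the experiment:

```python
def forward(x, w_in, w_out):
    # one hidden layer, linear activation: y = (x @ w_in) @ w_out
    hidden = [sum(xi * wij for xi, wij in zip(x, col)) for col in w_in]
    return sum(h * w for h, w in zip(hidden, w_out))

def widen(w_in, w_out):
    # duplicate every hidden unit, halve its outgoing weight:
    # the doubled contributions cancel, so the function is unchanged
    new_in = w_in + [col[:] for col in w_in]
    new_out = [w / 2 for w in w_out] * 2
    return new_in, new_out

x = [1.0, -2.0]
w_in = [[0.5, 0.1], [-0.3, 0.7]]  # 2 hidden units
w_out = [1.0, 2.0]

y_small = forward(x, w_in, w_out)
w_in2, w_out2 = widen(w_in, w_out)
y_big = forward(x, w_in2, w_out2)
print(y_small, y_big)  # identical outputs before and after growth
```

The same duplicate-and-halve trick survives pointwise nonlinearities like ReLU, since each duplicated unit still produces the identical activation. Whatever the exact mechanism, the 5.57pp gap over the scratch baseline suggests the small-then-grow curriculum, not just the final capacity, is doing real work.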

By the end, I had a clear winner: SharedParam MoE at 70.95%, just 0.80pp below Phase 7a's theoretical ceiling. The routing was efficient, the experts stayed alive, and the parameter budget was lean. Four experts with shared weights beat twelve independent ones, a reminder that in deep learning, architecture can matter more than scale.

As I fixed a Unicode error on Windows and restarted the final runs with corrected schedulers, I couldn’t help but laugh: how do you generate a random string? Put a Windows user in front of Vim and tell them to exit. 😄

Metadata

Session ID:
grouped_llm-analisis_20260216_1246
Branch:
HEAD
Dev Joke
GraphQL is like first love: you never forget it, but you shouldn't go back.
