Expert Collapse: When Your Mixture of Experts Forgot to Show Up

The task sounded deceptively simple: make a Mixture of Experts model actually use all of its experts instead of letting most of them fall asleep on the job. But on the llm-analysis project, “simple” rarely means straightforward.
The Problem We Were Facing
We had a model that was supposed to distribute its workload across multiple expert networks, like a team where everyone contributes. Instead, it was more like a twelve-person office where almost nobody showed up to work: out of our twelve experts, ten weren’t doing anything meaningful. They had collapsed into a dormant state, wasting compute and losing the diverse processing paths the architecture was supposed to provide.
The real kicker? We had a subtle bug hiding in plain sight. The probe_data used to compute the diversity loss wasn’t being passed through the model’s projection layer before feeding it to the experts. This meant our experts were trying to make decisions based on representations that didn’t match what the main model was actually processing. It’s like asking someone to evaluate a painting when they’re only seeing the frame.
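A minimal sketch of the bug and the fix. All the names here (`W_proj`, `expert_weights`, the `fixed` flag) are illustrative placeholders, not the actual llm-analysis code:

```python
import numpy as np

# Illustrative only: W_proj stands in for the model's projection layer,
# and the expert weight matrices are random placeholders.
rng = np.random.default_rng(0)
d = 32
W_proj = rng.normal(size=(d, d)) / np.sqrt(d)
expert_weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]

def diversity_inputs(probe_data, fixed=True):
    # Before the fix: probe_data went to the experts raw, so the
    # diversity loss was computed in a representation space the main
    # model never actually sees.
    # After the fix: the probe batch passes through the same projection
    # as the forward pass, so expert inputs are consistent.
    hidden = probe_data @ W_proj if fixed else probe_data
    return [hidden @ W for W in expert_weights]
```

The point is not the linear algebra; it is that the probe batch must travel the same path as real activations before any expert-level statistic is computed on it.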
The Three-Pronged Attack
First, we fixed that projection bug. Suddenly, the experts had consistent input representations to work with.
Then came the stability improvements. We implemented a growth cooldown mechanism: a five-epoch waiting period before the model is allowed to add new experts. Previously, the system spawned expert splits with abandon, producing ten consecutive splits in chaotic succession. With the cooldown, that explosive behavior settled into one controlled, deliberate split per growth phase.
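The cooldown itself is just bookkeeping. A sketch of the pattern, assuming a controller object that the training loop consults once per epoch (class and method names are hypothetical):

```python
# Hypothetical growth controller: a counter must drain to zero before
# the model may split an expert again.
COOLDOWN_EPOCHS = 5

class GrowthController:
    def __init__(self):
        self.cooldown = 0

    def maybe_grow(self, wants_split: bool) -> bool:
        if self.cooldown > 0:
            self.cooldown -= 1          # still cooling down: refuse growth
            return False
        if wants_split:
            self.cooldown = COOLDOWN_EPOCHS  # allow one split, then wait
            return True
        return False
```

Even if the growth heuristic screams "split!" every epoch, the controller lets through at most one split per cooldown window, which is exactly the one-split-per-phase behavior described above.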
For the expert collapse itself, we deployed entropy maximization as a load-balancing strategy. Instead of letting the router network lazily send all traffic to the same experts, we penalized imbalanced routing distributions. The results were dramatic: what started with ten dormant experts quickly transformed into a healthy state where all three active experts were genuinely contributing, with utilization rates of 84%, 79%, and 37% respectively.
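One common way to implement that pressure (a sketch, not the project’s exact loss): average the router’s softmax distribution over the batch and penalize its distance from maximum entropy, which is reached only when every expert gets an equal share of the traffic.

```python
import numpy as np

def load_balance_loss(router_logits):
    """Entropy-maximization penalty: zero when every expert receives an
    equal share of the batch, growing as routing becomes lopsided."""
    # Numerically stable softmax over the expert axis.
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    mean_p = probs.mean(axis=0)                   # average traffic per expert
    entropy = -(mean_p * np.log(mean_p + 1e-9)).sum()
    return np.log(len(mean_p)) - entropy          # max entropy minus actual
```

Adding this term (scaled by a small coefficient) to the task loss gives the router an explicit gradient away from dumping everything on its current favorites.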
Finally, we fixed the acc_history tracking to ensure our GO/NO-GO phase reports reflected reality rather than wishful thinking.
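The tracking fix matters because the phase gate reads directly from that history. A hypothetical sketch of the pattern; the names (`acc_history`, the 97% threshold, the three-epoch window) are illustrative, not the project’s actual GO/NO-GO criteria:

```python
# Hypothetical accuracy tracking feeding a GO/NO-GO phase report.
acc_history: list[float] = []

def record_epoch(accuracy: float) -> None:
    # The bug class here: forgetting to append (or appending a stale
    # value) makes every downstream report lie about progress.
    acc_history.append(accuracy)

def phase_report(threshold: float = 0.97, window: int = 3) -> bool:
    """GO if the last `window` epochs all clear the threshold."""
    tail = acc_history[-window:]
    return len(tail) == window and all(a >= threshold for a in tail)
```

With the history recorded faithfully, the report is a pure function of what actually happened during training rather than of wishful thinking.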
A Surprising Insight About Mixture Models
Here’s something that surprised me: the entropy maximization trick works because the loss landscape of mixture models is inherently prone to convergence to suboptimal local minima. When the router network first initializes, random chance might route most samples to one or two experts. Once that happens, gradients reinforce this behavior—it becomes a self-fulfilling prophecy. Adding explicit diversity pressure breaks that initial lock-in. It’s less about clever engineering and more about fighting against a fundamental tendency in neural network optimization.
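A toy simulation makes the lock-in visible. The "rich-get-richer" update below is a deliberate caricature of the real dynamics: each expert’s routing logit grows in proportion to the traffic it already receives, and the diversity term pushes back toward a uniform split. Every parameter here is invented for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def simulate(diversity_weight, steps=200, lr=0.5):
    # Two experts; a tiny asymmetry in the initial logits stands in for
    # "random chance routes most samples to one expert" at init.
    logits = np.array([0.1, 0.0])
    for _ in range(steps):
        p = softmax(logits)
        # Rich-get-richer: each logit grows with its current traffic,
        # so an early imbalance feeds on itself.
        grad = p.copy()
        # Entropy pressure: under-used experts get extra gradient,
        # over-used ones get pushed down toward the uniform share.
        grad += diversity_weight * (1.0 / len(p) - p)
        logits += lr * grad
    return softmax(logits)
```

Without the diversity term the small initial edge snowballs into near-total collapse onto one expert; with it, the same initialization settles back to an even split. That is the lock-in, and the escape from it, in miniature.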
The Results
Starting from a seed accuracy of 96.7%, after fourteen epochs with these improvements, we hit 97.1%. Not a dramatic jump, but solid—and more importantly, it came with a genuinely functional expert system beneath it. The real win was achieving Phase 1 completion with all three criteria met.
We documented everything in the phase1-moe-growth-results.md report and updated the MASTER-SUMMARY with the artifacts. The next frontier is Phase 2: replacing our current heuristic with a Schnakenberg morphogenetic field model to control exactly when and where the mixture grows new experts.
Why did the neural network go to therapy? It had too many experts telling it different things, but they weren’t listening to each other. 😄
Metadata
- Session ID: grouped_llm-analisis_20260208_1521
- Branch: HEAD
- Dev Joke: Ruby is like first love: you never forget it, but you shouldn’t go back.