When Smaller Models Learn Too Well: The MoE Scaling Paradox

We just wrapped Phase 18 of our LLM analysis project, and it revealed something that caught us off guard. We trained a Qwen 2.5 3B model with a 4-domain Mixture of Experts, expecting incremental improvements across the board. Instead, we discovered that sometimes better pretraining performance actually breaks downstream tasks.
Here’s what happened. Our baseline Qwen 2.5 3B scored a respectable 65.85% on MMLU and 74.2% on GSM8K math problems. Then we trained domain-specific experts for reasoning, coding, math, and general language tasks. The perplexity improvements looked fantastic—a 10.5% PPL reduction on our math expert alone, which typically signals strong learning.
But when we evaluated downstream performance, the math expert tanked GSM8K by 8.6 percentage points, collapsing our strong 74.2% baseline. The other experts didn’t help much either. The PPL improvement meant nothing once actual problem-solving went backwards.
The real win came from routing. We nailed the router integration down to just a 0.4% oracle gap—the smallest difference yet between the expert our router chose and the theoretically perfect selection. That’s the kind of metric that scales: the gap fell from 6.6% → 3.2% → 0.4% as we refined the architecture. But it couldn’t save us from the fundamental mismatch: our experts were trained on language modeling (predicting the next token), not on reasoning (solving problems step by step).
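The oracle gap above can be made concrete with a small sketch. This is our own illustrative formulation, not the project’s evaluation code: score each expert on each example, compare the accuracy of the router’s picks against an oracle that always picks the best expert per example.

```python
import numpy as np

def oracle_gap(expert_scores: np.ndarray, router_choice: np.ndarray) -> float:
    """Gap between the oracle's accuracy and the router's realized accuracy.

    expert_scores: (n_examples, n_experts) array; score each expert achieves
                   on each example (1.0 = correct, 0.0 = wrong).
    router_choice: (n_examples,) index of the expert the router selected.
    """
    # Oracle: for every example, take the best expert's score.
    oracle = expert_scores.max(axis=1).mean()
    # Router: take the score of the expert the router actually chose.
    routed = expert_scores[np.arange(len(router_choice)), router_choice].mean()
    return oracle - routed

# Toy example: 5 examples, 4 experts.
scores = np.array([
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

perfect_choices = np.array([0, 1, 0, 3, 2])
print(oracle_gap(scores, perfect_choices))  # 0.0 — router matches the oracle

one_miss = np.array([1, 1, 0, 3, 2])        # first pick is wrong
print(oracle_gap(scores, one_miss))         # 0.2 — one miss out of five
```

Driving this number toward zero means routing stops being the bottleneck; whatever error remains comes from the experts themselves, which is exactly what Phase 18 exposed.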
This is the core insight from Phase 18. Next-token prediction and downstream reasoning are two different beasts. A model can optimize wonderfully for one while completely failing at the other. The experts learned to generate fluent text in their domains, but they forgot how to think through problems methodically.
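The divergence between the two objectives is easy to see once you write both metrics down. A minimal sketch (toy numbers, our own illustration): perplexity only looks at per-token confidence, while GSM8K-style scoring only looks at whether the final answer matches.

```python
import math

def perplexity(token_logprobs: list) -> float:
    """Perplexity from per-token log-probabilities (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def exact_match(preds: list, golds: list) -> float:
    """GSM8K-style metric: fraction of final answers matching the reference."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# The expert gets more confident about each next token (lower PPL)...
base_lp   = [-2.0, -1.5, -2.2, -1.8]
expert_lp = [-1.7, -1.3, -2.0, -1.6]
print(perplexity(base_lp) > perplexity(expert_lp))  # True: expert PPL is lower

# ...yet nothing forces its *final answers* to improve. These are separate
# quantities; optimizing one does not optimize the other.
print(exact_match(["42", "7", "13"], ["42", "8", "13"]))  # one wrong answer
```

Nothing in the perplexity objective rewards getting the last number right, which is the only thing the downstream metric measures.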
We’ve charted the course forward now. Phase 19 will flip our strategy—instead of mining raw text for pretraining, we’ll use task-aligned expert training with actual Chain-of-Thought solutions. We’re also considering mixture-of-LoRA instead of full MoE parameters, and repositioning experts into the model’s middle layers where reasoning happens rather than the output head.
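The mixture-of-LoRA idea can be sketched as follows. Everything here is a hypothetical illustration of the general technique, not the Phase 19 implementation: a frozen base projection, K low-rank expert adapters, and a token-level softmax router that mixes their updates.

```python
import torch
import torch.nn as nn

class MixtureOfLoRA(nn.Module):
    """Sketch: frozen base weight + K router-mixed LoRA experts.
    Shapes, rank, and expert count are assumptions for illustration."""

    def __init__(self, d_model: int, n_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_model, d_model, bias=False)
        self.base.weight.requires_grad_(False)      # frozen pretrained weight
        self.router = nn.Linear(d_model, n_experts)
        # Per-expert LoRA factors: delta_W_k = B_k @ A_k (rank-r update).
        self.A = nn.Parameter(torch.randn(n_experts, rank, d_model) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, d_model, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, d)
        gates = torch.softmax(self.router(x), dim=-1)     # (b, s, K)
        # Project through every expert's low-rank factors.
        low = torch.einsum("bsd,krd->bskr", x, self.A)    # (b, s, K, r)
        delta = torch.einsum("bskr,kdr->bskd", low, self.B)  # (b, s, K, d)
        # Router-weighted mixture of the low-rank updates.
        mixed = torch.einsum("bsk,bskd->bsd", gates, delta)
        return self.base(x) + mixed

layer = MixtureOfLoRA(d_model=64)
out = layer(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```

The appeal over full MoE parameters is that only the router and the low-rank factors train, so each "expert" costs a fraction of a full FFN copy, and placing such a layer mid-network rather than at the output head is a one-line change to where it is inserted.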
Eight experts down, infinite combinations to explore. The project is running hot—~72 GPU hours invested so far, and Phase 18 alone consumed 9.8 hours of compute. Every failed experiment teaches us where the scaling laws actually break.
As we like to say around the lab: the generation of random numbers is too important to be left to chance—and apparently, so is training experts 😄.
Metadata
- Session ID: grouped_llm-analisis_20260222_0906
- Branch: HEAD
- Dev Joke: What do Sentry and a teenager have in common? Both are unpredictable and demand constant attention.