BorisovAI

Blog

Posts about the development process, problems solved, and technologies learned

Code Change · llm-analisis

The Narrow Path: Why Perfect Optimization Crumbles

I've been chasing the golden number for weeks now. **Phase 24a** delivered **76.8% accuracy on GSM8K**—a solid baseline for mathematical reasoning in large language models. The team was excited. I was cautious. In my experience, when a result feels *too clean*, it's usually balanced on a knife's edge. So I decided to push further with **Phase 29a and 29b**, two experiments designed to improve what we already had.

The strategy seemed sound: inject curriculum data to guide the model toward harder problems, and extend training from 500 to 1,000 steps to capture finer pattern recognition. Standard moves in the playbook.

Phase 29a involved adding **89 borderline solutions**—answers sampled at higher temperatures, intentionally less deterministic. I thought diversity would help. Instead, I watched accuracy *plummet* to **73.0%, a 3.8 percentage point drop**. The perplexity exploded to 2.16, compared to the baseline's 1.60. The model was struggling, not learning. Those temperature-sampled solutions weren't diverse training signal—they were noise wearing a training label.

Then came **Phase 29b**: double the training steps. Surely more iterations would converge to something better? The loss hit 0.004—nearly zero. The model was memorizing, not generalizing. Accuracy barely limped to **74.4%**, still 2.4 points underwater. The lesson hit hard: *we'd already found the optimum at 500 steps*. Beyond that, we weren't learning—we were overfitting.

What struck me most wasn't the failed experiments themselves. It was how *fragile* the baseline turned out to be. **Phase 24a wasn't a robust solution—it was a brittle peak**. The moment I changed the data composition or training duration, the whole structure collapsed. The algorithm had found a narrow channel where everything aligned perfectly: the right data distribution, the right training length, the right balance. Wiggle anything, and you tumble out.
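In hindsight, the Phase 29a failure suggests an obvious guardrail: temperature-sampled solutions should have to *earn* their training label by actually reaching the right answer. Here's a minimal sketch of that filter—note that `extract_final_answer`, `filter_borderline`, and the sample format are my illustrations, not the project's actual pipeline:

```python
import re

# Hypothetical guardrail: before adding temperature-sampled "borderline"
# solutions to the curriculum, keep only those whose final answer matches
# the ground truth, so noise never gets a training label.

def extract_final_answer(solution: str) -> str:
    """Take the last number in the solution text as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return numbers[-1] if numbers else ""

def filter_borderline(samples: list[dict]) -> list[dict]:
    """Keep only sampled solutions that still reach the correct answer."""
    return [
        s for s in samples
        if extract_final_answer(s["solution"]) == s["gold_answer"]
    ]

samples = [
    {"solution": "3 + 4 = 7, so the answer is 7", "gold_answer": "7"},
    {"solution": "3 + 4 = 8, so the answer is 8", "gold_answer": "7"},
]
print(len(filter_borderline(samples)))  # 1 — only the verified solution survives
```

Answer matching is a crude check (it can't catch a correct answer reached by broken reasoning), but it would have screened out at least some of the noise in those 89 solutions.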
This is the hard truth about optimization in machine learning: **sometimes the best result isn't a foundation—it's a lucky intersection**. You can't always scale it. You can't always improve it by adding more of what worked before.

We still have **Phase 29c** (multi-expert routing) and **29d** (MATH domain data) queued up. But I'm approaching them differently now. Not as simple extensions of success, but as careful explorations of *why* the baseline works at all.

The irony? This mirrors something I read once: *"Programming is like sex. Make one mistake and you end up supporting it for the rest of your life."* 😄 In optimization, it's worse—you might be supporting someone else's lucky mistake, and have no idea where the luck ends and the skill begins.

Mar 4, 2026
Code Change · llm-analisis

When Smaller Models Learn Too Well: The MoE Scaling Paradox

We just wrapped Phase 18 of our LLM analysis project, and it revealed something that caught us off guard. We trained a **Qwen 2.5 3B model with a 4-domain Mixture of Experts**, expecting incremental improvements across the board. Instead, we discovered that sometimes *better pretraining performance actually breaks downstream tasks*.

Here's what happened. Our baseline Qwen 2.5 3B scored a respectable **65.85% on MMLU and 74.2% on GSM8K** math problems. Then we trained domain-specific experts for reasoning, coding, math, and general language tasks. The perplexity improvements looked fantastic—a **10.5% PPL reduction** on our math expert alone, which typically signals strong learning. But when we evaluated downstream performance, the math expert **tanked GSM8K by 8.6 percentage points**. Our strong 74.2% baseline collapsed. The other experts didn't help much either. PPL improvement meant nothing when actual problem-solving went backwards.

The real win came from routing. We nailed the **router integration down to just a 0.4% oracle gap**—the smallest difference yet between what our router chose and the theoretically perfect expert selection. That's the kind of metric that scales. We went from a 6.6% gap → 3.2% → 0.4% as we refined the architecture. But it couldn't save us from the fundamental mismatch: our experts were trained on language modeling (predicting the next token), not reasoning (solving step-by-step problems).

This is the core insight from Phase 18. **Next-token prediction and downstream reasoning are two different beasts.** A model can optimize wonderfully for one while completely failing at the other. The experts learned to generate fluent text in their domains, but they forgot how to think through problems methodically.

We've charted the course forward now. Phase 19 will flip our strategy—instead of mining raw text for pretraining, we'll use **task-aligned expert training** with actual Chain-of-Thought solutions.
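For readers unfamiliar with the metric: the oracle gap is just the accuracy you'd get by always picking the best expert per example, minus the accuracy of the router's actual picks. A toy sketch (the function name and data shapes are illustrative, not our actual evaluation code):

```python
# Oracle gap: accuracy of the theoretically perfect per-example expert
# choice (the "oracle") minus the accuracy of the router's real choices.

def oracle_gap(per_expert_correct: list[list[bool]],
               router_choice: list[int]) -> float:
    """per_expert_correct[i][e] — whether expert e solves example i;
    router_choice[i] — the expert index the router picked for example i."""
    n = len(per_expert_correct)
    # Oracle solves an example if ANY expert solves it.
    oracle_acc = sum(any(row) for row in per_expert_correct) / n
    # Router solves it only if its chosen expert does.
    router_acc = sum(row[c] for row, c in
                     zip(per_expert_correct, router_choice)) / n
    return oracle_acc - router_acc

# Toy run: 4 examples, 2 experts; the router happens to pick a correct
# expert on every solvable example, so the gap is zero.
correct = [[True, False], [False, True], [True, True], [False, False]]
choices = [0, 1, 0, 0]
print(oracle_gap(correct, choices))  # 0.0
```

A 0.4% gap means the router is leaving almost nothing on the table relative to perfect expert selection—which is exactly why the remaining failure had to come from the experts themselves.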
We're also considering **mixture-of-LoRA** instead of full MoE parameters, and repositioning experts into the model's middle layers where reasoning happens rather than the output head. Eight experts down, infinite combinations to explore.

The project is running hot—**~72 GPU hours invested so far**, and Phase 18 alone consumed 9.8 hours of compute. Every failed experiment teaches us where the scaling laws actually break. As we like to say around the lab: *the generation of random numbers is too important to be left to chance*—and apparently, so is training experts 😄.
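To make the Phase 19 "task-aligned" idea concrete: instead of computing loss on raw domain text, each training sample would be a full problem with its step-by-step solution, so the expert is optimized on the reasoning trace itself. A minimal sketch—the template and function name are assumptions, not our finalized format:

```python
# Hypothetical CoT sample builder for task-aligned expert training:
# render a GSM8K-style problem so the loss covers the reasoning steps,
# not just fluent domain text.

def format_cot_example(question: str, steps: list[str], answer: str) -> str:
    """Render a problem, its numbered reasoning steps, and the answer
    as a single training string."""
    reasoning = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    return f"Question: {question}\n{reasoning}\nAnswer: {answer}"

sample = format_cot_example(
    "Tom has 3 apples and buys 4 more. How many does he have?",
    ["Tom starts with 3 apples.", "3 + 4 = 7."],
    "7",
)
print(sample)
```

The bet is that supervising on these traces teaches the methodical problem-solving that next-token pretraining on domain text demonstrably did not.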

Feb 22, 2026