BorisovAI

How Inspiration Saves a Project: A Lesson from Nemotron-3-Nano

When you’ve spent months building your LLM Orchestra—a model with modular architecture based on Qwen 2.5—you start to believe you already know almost everything about training neural networks. Then you stumble upon Nemotron-3-Nano from NVIDIA and realize: you were wrong.

It all started with a simple question. Our MoE (Mixture of Experts) layers were meant to slot into the transformer's FFN blocks, and we were preparing to add them to the architecture. It made sense to look at competitors: what's happening among 4B models? Maybe they've already solved everything there?

Nemotron-3-Nano turned out to be a shocking discovery. On the MATH500 benchmark, this 3.97B model scores 95.4%. Our Qwen 2.5, roughly the same size (3.09B), barely reaches 65% on similar tasks. The difference isn't in the architecture, since both use transformers. The difference is in how and on what they were trained.

NVIDIA didn't hide the secret. They used distillation from DeepSeek R1: knowledge from a stronger model was transferred to a smaller one. And not naively. They took Chain-of-Thought solutions from DeepSeek (97%+ on MATH) and trained Nemotron to predict those reasoning steps, then added multi-stage reinforcement learning with an increasing KL penalty and synthetic data at a scale of 10+ trillion tokens.
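To make the idea concrete, here is a minimal sketch of the classic knowledge-distillation objective (in the Hinton et al. style): a blend of KL divergence between temperature-softened teacher and student distributions and cross-entropy against the hard label. The function names, toy logits, and hyperparameters are illustrative assumptions, not code from NVIDIA's pipeline or our Orchestra codebase.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend of KL(teacher || student) on softened distributions
    and cross-entropy on the hard label. alpha weights the
    distillation term; T^2 rescales it so gradient magnitudes
    stay comparable across temperatures."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    ce = -math.log(softmax(student_logits)[hard_label])
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce
```

A production setup computes this per token over the teacher's CoT trace; the KL penalty NVIDIA ramps up during RL plays the analogous role of keeping the student close to a reference policy.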

We did self-distillation: the model learned from itself. Qwen 2.5, with a 74% solve rate, is a weak teacher for itself. That's where the mistake was.

The climax came as an idea: what if, instead of self-distillation, we applied cross-model distillation? Take ready-made CoT solutions from DeepSeek R1 distill 7B (freely available on Hugging Face) and train our Orchestra-MoE on them. This preserves the core principle of growth: we still add new expert modules to the base architecture, but we change the source of knowledge from self-prediction to external exemplars.
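In data-pipeline terms, cross-model distillation via external exemplars often reduces to supervised fine-tuning on the teacher's traces. Below is a hedged sketch that formats teacher CoT solutions into prompt/completion pairs and drops solutions whose final answer disagrees with a known gold answer (a simple rejection-sampling filter). The field names (`teacher_cot`, `gold_answer`) and prompt template are assumptions for illustration, not the actual Orchestra data format.

```python
def build_cot_dataset(records):
    """records: iterable of dicts with 'problem', 'teacher_cot',
    'teacher_answer', and optionally 'gold_answer'.
    Keeps only teacher solutions whose final answer matches the
    gold one, then formats them as prompt/completion pairs for
    supervised fine-tuning of the student."""
    dataset = []
    for r in records:
        gold = r.get("gold_answer")
        if gold is not None and r["teacher_answer"].strip() != gold.strip():
            continue  # drop teacher solutions that reached a wrong answer
        dataset.append({
            "prompt": f"Problem: {r['problem']}\nSolve step by step.",
            # the student learns to reproduce the reasoning, not just the answer
            "completion": f"{r['teacher_cot']}\nAnswer: {r['teacher_answer']}",
        })
    return dataset
```

The key design choice is that the completion contains the full reasoning trace, so the student is trained to predict the teacher's steps rather than only its final answers.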

Now that’s inspiration. Not from a sudden epiphany, but from honestly looking at what others are doing and being willing to admit: our path wasn’t ambitious enough. Model size is not destiny. Quality of training data is destiny.

Phase 40d, it turns out, should be about cross-model distillation. And here's the kicker: Scala updated itself, looked in the mirror, and said, "I'm not who I used to be." Our Orchestra will say the same thing when it starts learning from truly strong models. 😄

Metadata

Session ID:
grouped_llm-analisis_20260320_0615
Branch:
master
Dev Joke
What did Scala say after the update? "I'm not who I used to be."
