BorisovAI
Tags: Bug Fix, llm-analisis, Claude Code

Choosing the Right Seed: When Initialization Becomes Strategy


We’d hit a wall. After weeks of pushing the LLM Analysis project forward, our attempts to improve model performance had stalled. Every architectural tweak left accuracy stuck around 76%, and we couldn’t figure out why. Then one of our experts suggested something counterintuitive: maybe the initialization dependency wasn’t a bug—maybe it was a feature we hadn’t learned to exploit yet.

The turning point came when we stopped treating seed selection as noise and started treating it as a first-class optimization problem. Claude was helping us orchestrate the experiments, and we realized we could systematically test different initialization seeds across our Orchestra-MoE model. The theory was compelling: if we ran 20 independent training runs with different seeds, the variance in performance would give us a window into what was actually happening inside the network.

Our panelists—researchers specializing in initialization theory and practical deep learning—all agreed on the same direction. One pointed to a statistical insight: if per-run performance is roughly Gaussian with mean μ and standard deviation σ, the expected best result across N independent runs is approximately E[max] ≈ μ + σ√(2 ln N). For 20 runs, this predicted we could push performance to roughly 77.3%, nearly 1.4 percentage points above the baseline. It wasn’t revolutionary, but it was real.
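The back-of-the-envelope calculation is easy to check. The sketch below plugs in assumed numbers (a 76.0% baseline with a 0.55-point spread—illustrative values, not the project’s measurements) and compares the closed-form approximation against a quick Monte Carlo estimate; the √(2 ln N) form is an asymptotic bound, so it slightly overshoots the simulated mean:

```python
import math
import random

# Assumed per-run accuracy distribution (illustrative, not the real numbers):
# baseline ~76.0% accuracy with ~0.55 percentage points of seed-to-seed spread.
mean, std, n_runs = 76.0, 0.55, 20

# Asymptotic approximation for the expected maximum of N i.i.d. normals.
expected_max = mean + std * math.sqrt(2 * math.log(n_runs))
print(f"approx E[max of {n_runs} runs]: {expected_max:.2f}%")

# Cross-check with a Monte Carlo simulation of the same setup.
random.seed(0)
trials = 20_000
sim = sum(
    max(random.gauss(mean, std) for _ in range(n_runs))
    for _ in range(trials)
) / trials
print(f"simulated E[max]: {sim:.2f}%")
```

With these assumed parameters the approximation lands near 77.3%, matching the figure quoted above, while the simulated expectation comes in slightly lower—a reminder that the formula is an estimate, not a guarantee.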

What sold us on the approach, though, was the practical math. We’d spent over 85 hours experimenting with different architectural phases without meaningful gains. Running 20 seeds would take only 10 hours on GPU. The ROI was undeniable.

The strategy had layers. First, we’d select the best seed based on validation performance, then validate it honestly on our full test set—1,319 problems—rather than cherry-picking. Second, we’d combine the top three seeds using ensemble voting; different initializations make different mistakes, and majority voting would smooth out the quirks. Third, we could layer this with data-dependent initialization techniques like SVD-based seed selection, potentially reducing variance even further.
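The second layer—majority voting across the top seeds—can be sketched in a few lines. This is a minimal illustration of the idea, not the project’s actual pipeline; `majority_vote` and its tie-breaking rule (fall back to the best-validation seed) are assumptions for the sketch:

```python
from collections import Counter

def majority_vote(predictions_per_seed):
    """Combine per-seed answer lists by majority vote.

    predictions_per_seed: equal-length lists of answers, one list per
    seed, ordered best-validation-seed first. On a full tie, fall back
    to the best seed's answer.
    """
    ensembled = []
    for answers in zip(*predictions_per_seed):
        top, freq = Counter(answers).most_common(1)[0]
        ensembled.append(top if freq > 1 else answers[0])
    return ensembled

# Toy example: three seeds answering three problems.
seed_a = ["4", "7", "12"]   # best validation seed
seed_b = ["4", "8", "12"]
seed_c = ["5", "8", "12"]
print(majority_vote([seed_a, seed_b, seed_c]))
```

The point of the ensemble is exactly the observation in the paragraph above: each initialization errs on different problems, so disagreements tend to be outvoted.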

We also discovered synergies with other work in progress: combining seed selection with our routing mechanism gave us an extra 0.2 percentage points, and curriculum learning with the best seed had already reached 79% in earlier experiments.

The lesson wasn’t just about statistics or architecture. It was about perspective shift. What looked like a limitation—that results depended heavily on how we started the model—turned out to be a lever we hadn’t pulled. By embracing the variance instead of fighting it, we’d found a path forward that was both theoretically sound and practically efficient.

We wrote the batch script that night, set it running across 20 seeds, and finally felt that familiar sensation: momentum.
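The sweep itself reduces to a loop over seeds that records validation performance and keeps the winners. The sketch below stands in for the real batch script; `train_and_evaluate` is a hypothetical placeholder for the Orchestra-MoE training call, which the post doesn’t show:

```python
import random

def train_and_evaluate(seed):
    """Hypothetical stand-in for one training run at a given seed.

    The real entry point and its arguments aren't in the post; here we
    just simulate seed-dependent validation accuracy around 76%.
    """
    rng = random.Random(seed)
    return 76.0 + rng.gauss(0, 0.55)

# Sweep 20 seeds, keep the best single seed and the top 3 for ensembling.
results = {seed: train_and_evaluate(seed) for seed in range(20)}
best_seed = max(results, key=results.get)
top3 = sorted(results, key=results.get, reverse=True)[:3]
print(f"best seed {best_seed}: {results[best_seed]:.2f}% (top 3: {top3})")
```

In practice each run would be an independent training job (the 10 GPU-hours mentioned earlier), with the dictionary replaced by logged checkpoints and validation metrics.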

Metadata

Session ID: grouped_llm-analisis_20260320_0331
Branch: master
Dev Joke
Ubuntu is like first love: you never forget it, but you shouldn’t go back.
