When Perfect Routing Isn't Enough: The CIFAR-100 Specialization Puzzle

I’ve just wrapped up Experiment 13b on the llm-analysis project, and the results have left me with more questions than answers—in the best way possible.
The premise was straightforward: could a deep router with supervised training finally crack the code on specialized expert networks? I’d been chasing this idea through multiple iterations, watching single-layer routers plateau around 62–63% accuracy. So I built something more ambitious: a multi-layer routing architecture trained to explicitly learn which expert should handle which image class.
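To make the idea concrete, here is a minimal sketch of what "supervised routing" means: a small multi-layer router maps image features to expert logits and is trained with plain cross-entropy against a fixed class-to-expert assignment. Everything here is illustrative — the expert count, feature dimension, hidden width, and block assignment of classes to experts are assumptions, not the experiment's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 100   # CIFAR-100
NUM_EXPERTS = 10    # hypothetical: 10 experts, 10 classes each
FEAT_DIM = 64       # hypothetical feature dimension

# Supervised routing target: each class is assigned to one expert.
# Here, a simple block assignment (classes 0-9 -> expert 0, etc.).
class_to_expert = np.arange(NUM_CLASSES) // (NUM_CLASSES // NUM_EXPERTS)

# A two-layer ("deep") router, versus a single linear layer baseline.
W1 = rng.normal(scale=0.1, size=(FEAT_DIM, 128))
b1 = np.zeros(128)
W2 = rng.normal(scale=0.1, size=(128, NUM_EXPERTS))
b2 = np.zeros(NUM_EXPERTS)

def deep_router(x):
    """Hidden layer + ReLU, then one logit per expert."""
    h = np.maximum(x @ W1 + b1, 0.0)
    return h @ W2 + b2

def routing_loss(logits, expert_labels):
    """Cross-entropy between router logits and the target experts."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(expert_labels)), expert_labels].mean()

x = rng.normal(size=(32, FEAT_DIM))        # fake batch of image features
y = rng.integers(0, NUM_CLASSES, size=32)  # fake class labels
logits = deep_router(x)
loss = routing_loss(logits, class_to_expert[y])  # supervised routing loss
chosen = logits.argmax(axis=1)                   # hard expert assignment
```

Routing accuracy is then just how often `chosen` matches `class_to_expert[y]` — which is exactly the metric that looked so good below.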
The numbers looked promising at first. The deep router achieved 79.5% routing accuracy—a decisive 1.28× improvement over the baseline single-layer approach. That’s the kind of jump that makes you think you’re onto something. I compared it against three other strategies (pure routing, mixed, and two-phase), and this one dominated on the routing front.
Then I checked the actual CIFAR-100 accuracy.
73.15%.
That’s a gain of just 0.22 percentage points over the two-phase approach. Essentially flat. The oracle accuracy hovered around 84.5%, leaving an 11-point gap that perfect routing couldn’t bridge.
Here’s what haunted me: I could demonstrate that the router was making better decisions—selecting the right expert 4 out of 5 times. Yet those correct decisions weren’t translating into correct classifications. That paradox forced me to confront an uncomfortable truth: the problem wasn’t routing efficiency. The problem was that specialization itself might not be the solution for CIFAR-100’s complexity.
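The paradox becomes obvious with a back-of-envelope decomposition of the headline numbers. If a wrong route almost never produces a correct classification, then overall accuracy factors as P(correct route) × P(correct class | correct route) — and both factors, not just routing, cap the result. The conditional-accuracy estimate below is illustrative arithmetic under that simplifying assumption, not a measured quantity.

```python
# Headline numbers from the experiment.
routing_acc = 0.795   # router picks the "right" expert
overall_acc = 0.7315  # end-to-end CIFAR-100 accuracy
oracle_acc = 0.845    # accuracy with a perfect router

# Assuming wrong routes contribute ~0 correct classifications:
#   overall ≈ P(correct route) * P(correct class | correct route)
expert_acc_given_correct_route = overall_acc / routing_acc
print(f"{expert_acc_given_correct_route:.3f}")  # ≈ 0.920

# Even a perfect router is capped by the experts themselves:
expert_ceiling_gap = 1 - oracle_acc
print(f"{expert_ceiling_gap:.3f}")  # ≈ 0.155
```

So even when the router is right, the chosen expert only gets the class right about 92% of the time — and the oracle number says the experts, not the router, own the remaining gap.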
The expert networks were learning narrow patterns, sure. But on a general-purpose image classification task with 100 fine-grained categories, that specialization came with hidden costs—fewer examples per expert, reduced generalization, potential overfitting to routing decisions that looked good in isolation but failed downstream.
I updated the documentation, logged the experiment metrics (routing accuracy, oracle accuracy, the works), and stored the final memory state. The 12b-fix variant and 13a experiments filled in the picture, but 13b delivered the real insight: sometimes the most elegant technical solution isn’t the answer your problem actually needs.
Now I’m rethinking the whole approach. Maybe the future lies in different architectures entirely—or maybe ensemble methods with selective routing rather than hard expert assignment.
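One direction that idea suggests: instead of committing every image to a single expert, blend the top-k experts weighted by the router's own confidence. This is a hedged sketch of that selective-routing alternative, not anything from the experiment — the shapes, the value of k, and the random logits are all placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

NUM_CLASSES = 100
NUM_EXPERTS = 10
BATCH = 8

# Hypothetical per-expert class logits and router gate logits for a batch.
expert_logits = rng.normal(size=(NUM_EXPERTS, BATCH, NUM_CLASSES))
gate_logits = rng.normal(size=(BATCH, NUM_EXPERTS))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hard assignment: commit each image to one expert (the 13b setup).
hard_choice = gate_logits.argmax(axis=1)
hard_preds = expert_logits[hard_choice, np.arange(BATCH)].argmax(axis=1)

# Selective routing: blend the top-k experts by gate weight instead.
k = 3
topk = np.argsort(gate_logits, axis=1)[:, -k:]                    # (BATCH, k)
weights = softmax(np.take_along_axis(gate_logits, topk, axis=1))  # sums to 1

mixed = np.zeros((BATCH, NUM_CLASSES))
for b in range(BATCH):
    for j, e in enumerate(topk[b]):
        mixed[b] += weights[b, j] * expert_logits[e, b]
soft_preds = mixed.argmax(axis=1)
```

The appeal is that a blended prediction degrades gracefully when the gate is uncertain, instead of betting everything on one specialist.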
Why did the router walk into a bar? It had to make a decision about where to go. 😄
Metadata
- Session ID: grouped_llm-analisis_20260217_1206
- Branch: HEAD
- Dev Joke: Tip of the day: before upgrading ArgoCD, make a backup. And your résumé.