When Perfect Routing Fails: The CIFAR-100 Specialization Paradox

I’ve just wrapped up Experiment 13b on the llm-analysis project, and the results have left me questioning everything I thought I knew about expert networks.
The premise was straightforward: could a deep router with supervised training finally crack specialized expert networks for CIFAR-100? I’d been chasing this across multiple iterations, watching single-layer routers plateau around 62–63% routing accuracy. So I built something ambitious—a multi-layer routing architecture trained to explicitly learn which expert should handle which image class.
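To make "multi-layer routing architecture" concrete, here is a minimal numpy sketch of the kind of deep router I mean: an MLP over flattened CIFAR images that picks one expert per image via a hard argmax. The layer sizes and the expert count of 10 are assumptions for illustration, not the experiment's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the real experiment's sizes are not stated here.
INPUT_DIM = 32 * 32 * 3   # flattened 32x32 RGB CIFAR image
HIDDEN = 256
NUM_EXPERTS = 10          # assumed: 100 classes grouped into 10 experts

# Two hidden layers: deeper than the single-layer baseline router.
W1 = rng.normal(0, 0.01, (INPUT_DIM, HIDDEN))
W2 = rng.normal(0, 0.01, (HIDDEN, HIDDEN))
W3 = rng.normal(0, 0.01, (HIDDEN, NUM_EXPERTS))

def route(x):
    """Return the index of the expert chosen for each image in the batch."""
    h = np.maximum(x @ W1, 0.0)   # ReLU
    h = np.maximum(h @ W2, 0.0)
    logits = h @ W3
    return logits.argmax(axis=1)  # hard expert assignment

batch = rng.normal(size=(4, INPUT_DIM))
print(route(batch))  # one expert index per image
```

Training this with supervised class-to-expert labels is just cross-entropy on the final logits; the point of the sketch is only the shape of the decision.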
The numbers looked promising. The deep router achieved 79.5% routing accuracy—a decisive 1.28× improvement over the baseline. That’s the kind of jump that makes you think you’ve found the breakthrough. I compared it against three other strategies: pure routing, mixed approach, and two-phase training. This one dominated.
Then I checked the actual CIFAR-100 accuracy.
73.15%.
A gain of just 0.22 percentage points. Essentially flat.
The oracle accuracy—where we know the correct expert and route perfectly—hovered around 84.5%. That 11-point gap should have been bridged by better routing. It wasn’t. Here’s what haunted me: I could prove the router was making better decisions. Four out of five times, it selected the right expert. Yet those correct decisions weren’t translating into correct classifications.
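The gap becomes concrete with a back-of-the-envelope decomposition, assuming overall accuracy splits into a correctly-routed term and a misrouted term, and approximating accuracy on correctly-routed images by the oracle number (the misrouted accuracy below is solved for, not measured):

```python
routing_acc = 0.795   # router picks the right expert
oracle_acc = 0.845    # accuracy when routing is perfect
overall_acc = 0.7315  # observed CIFAR-100 accuracy

# overall ~= routing_acc * acc_when_routed_right
#            + (1 - routing_acc) * acc_when_misrouted
acc_when_misrouted = (overall_acc - routing_acc * oracle_acc) / (1 - routing_acc)
print(f"implied accuracy on misrouted images: {acc_when_misrouted:.1%}")
# -> implied accuracy on misrouted images: 29.1%
```

Under this simplified model, a misrouted image is classified correctly less than a third of the time: the specialists are so narrow that a wrong dispatch is nearly a guaranteed miss, which is exactly the brittleness the rest of the results point at.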
That paradox forced me to confront an uncomfortable truth: the problem wasn’t routing efficiency. The problem was specialization itself.
The expert networks were learning narrow patterns, sure. But on a general-purpose image classification task with 100 fine-grained categories, that specialization came with hidden costs—fewer training examples per expert, reduced generalization, potential overfitting to routing decisions that looked good in isolation but failed downstream.
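The data-scarcity cost is easy to quantify. CIFAR-100 has 500 training images per class; if the 100 classes are partitioned evenly across, say, 10 experts (the expert count is an assumption for illustration), each specialist trains on a tenth of what a monolithic model sees:

```python
CLASSES = 100
IMAGES_PER_CLASS = 500  # CIFAR-100 training split
NUM_EXPERTS = 10        # assumed even partitioning

total_images = CLASSES * IMAGES_PER_CLASS
classes_per_expert = CLASSES // NUM_EXPERTS
images_per_expert = classes_per_expert * IMAGES_PER_CLASS

print(total_images)      # 50000
print(images_per_expert)  # 5000
```

Each expert's 10-way problem is easier than the full 100-way one, but 5,000 images is thin ground for learning features that generalize.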
I’d been so focused on optimizing the routing mechanism that I missed the actual bottleneck. A perfectly routed system is useless if the experts themselves can’t deliver. The architecture’s ceiling was baked in from the start.
I updated the documentation, logged the metrics, and stored the final memory state. Experiment 13b delivered the real insight: sometimes the most elegant technical solution isn’t the answer your problem actually needs.
Now I’m rethinking the whole approach. Maybe the future lies in different architectures entirely—ensemble methods with selective routing rather than hard expert assignment. Or maybe CIFAR-100 just wasn’t designed for this kind of specialization.
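One version of that softer approach is to replace the hard argmax with a confidence-weighted blend of every expert's output. A minimal numpy sketch, assuming each expert emits class logits and the router emits a softmax over experts (all names and shapes here are hypothetical, not code from the experiment):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_route(router_logits, expert_logits):
    """Blend expert predictions by router confidence.

    router_logits: (batch, num_experts)
    expert_logits: (num_experts, batch, num_classes)
    """
    weights = softmax(router_logits, axis=1)           # (B, E)
    # Weighted sum over experts instead of picking one with argmax.
    blended = np.einsum("be,ebc->bc", weights, expert_logits)
    return blended.argmax(axis=1)                      # predicted class

rng = np.random.default_rng(1)
preds = soft_route(rng.normal(size=(4, 10)),
                   rng.normal(size=(10, 4, 100)))
print(preds)  # one class prediction per image
```

A misroute under this scheme degrades the prediction instead of destroying it, which is exactly the failure mode the hard-assignment numbers exposed.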
Why do Python programmers wear glasses? Because they can’t C. 😄
Metadata
- Session ID: grouped_C--projects-bot-social-publisher_20260217_1207
- Branch: main
- Dev Joke: Why did GraphQL break up with the developer? Too many dependencies in the relationship.