From 83.7% to 85%: Architecture and Optimizer Choices Matter

Chasing That Last 1.3%: When Model Architecture Meets Optimizer Reality
The CIFAR-10 accuracy sat stubbornly at 83.7%, just 1.3 percentage points shy of the 85% target. I was deep in the llm-analysis project, staring at the training curves with that peculiar frustration only machine learning developers understand—so close, yet somehow impossibly far.
The diagnosis was clear: the convolutional backbone needed more capacity. The model’s channels were too narrow to capture the complexity required for those final critical percentages. But this wasn’t just about arbitrarily increasing numbers. I needed to make the architecture configurable, allowing for flexible channel widths without redesigning the entire network each time.
First, I refactored the model instantiation to accept configurable channel parameters. This is where clean architecture pays dividends—instead of hardcoding layer dimensions, I could now scale the backbone horizontally. I widened the channels across the network, giving the model more representational power to learn those nuanced features that separate 83.7% from 85%.
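A minimal sketch of what that refactor can look like, assuming a PyTorch-style backbone (the class name, the `channels` tuple, and the layer layout here are illustrative, not the project's actual code):

```python
import torch
import torch.nn as nn

class ConvBackbone(nn.Module):
    """Sketch of a CNN whose channel widths are a constructor argument.

    Hypothetical example: the real llm-analysis model may expose this
    configuration differently.
    """
    def __init__(self, channels=(64, 128, 256), num_classes=10):
        super().__init__()
        layers, in_ch = [], 3  # CIFAR-10 images have 3 input channels
        for out_ch in channels:
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.head = nn.Linear(channels[-1], num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.mean(dim=(2, 3))  # global average pooling
        return self.head(x)

# Widening the backbone becomes a one-line config change:
narrow = ConvBackbone(channels=(32, 64, 128))
wide = ConvBackbone(channels=(64, 128, 256))
```

With dimensions flowing from a single tuple, scaling capacity no longer touches the network definition itself.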
Then came the optimizer revelation. The training script was still using Adam, the ubiquitous default for deep learning. But here’s the thing about CIFAR-10—it’s a dataset where SGD with momentum has historically outperformed Adam for achieving those final accuracy gains. The switch wasn’t arbitrary; it’s a well-known pattern in the computer vision community, yet easy to overlook when you’re in the flow of incremental improvements.
This revealed a deeper architectural issue: after growth events in the training pipeline (where the model dynamically expands), the optimizer gets rebuilt. The code was still initializing Adam in those rebuilds. I had to hunt down every instance—the primary optimizer loop, the Phase B optimizer updates—and swap them all to SGD with momentum hyperparameters. Each change felt small, but they compounded into a coherent optimization strategy.
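One way to make that consistency structural rather than a matter of vigilance is a single optimizer factory called from every rebuild site. This is a sketch under assumed hyperparameters; `build_optimizer` is a hypothetical helper, not the project's actual function:

```python
import torch

def build_optimizer(params, lr=0.1, momentum=0.9, weight_decay=5e-4):
    """Single factory for optimizer construction (hypothetical helper).

    Calling this from every rebuild site -- the initial setup, the
    post-growth rebuilds, the Phase B updates -- guarantees that no
    code path silently falls back to Adam.
    """
    return torch.optim.SGD(params, lr=lr, momentum=momentum,
                           weight_decay=weight_decay, nesterov=True)

model = torch.nn.Linear(10, 10)  # stand-in for the real model
opt = build_optimizer(model.parameters())

# After a growth event the parameter set changes, so the optimizer is
# rebuilt through the same factory instead of being re-created inline:
opt = build_optimizer(model.parameters())
```

Routing every rebuild through one function turns "hunt down every instance" into a one-time refactor.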
While I was optimizing the obvious, I spotted something lurking in the RigL sparsity implementation—the sparse training mechanism was overshooting its target sparsity levels slightly. RigL (Rigging the Lottery) uses dynamic sparse training to prune and regrow connections during training, but when the sparsity calculations drift even marginally from their targets, convergence can destabilize. I traced through the sparsity growth schedule, checking where the overshoot accumulated.
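The overshoot described above typically comes from rounding when converting a fractional sparsity target into an integer connection count. A minimal sketch of the fix, assuming a per-layer count (the helper name is hypothetical; the project's RigL schedule tracks this differently):

```python
import math

def connections_to_prune(total, target_sparsity):
    """Number of connections to remove so realized sparsity never
    exceeds the target (hypothetical helper).

    Using floor instead of round keeps the realized sparsity at or
    below the target, so per-step rounding error cannot accumulate
    into an overshoot across the schedule.
    """
    return math.floor(total * target_sparsity)

total = 1_000_003          # odd layer sizes are where rounding drift shows up
target = 0.9
pruned = connections_to_prune(total, target)
realized = pruned / total
assert realized <= target  # no overshoot
```

The same one-sided rounding applied at every prune/grow step keeps the schedule's cumulative sparsity pinned under its target.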
Here’s something fascinating about the Adam optimizer: it was introduced in 2014 by Kingma and Ba, and it became the default across industry precisely because it’s forgiving and works well across diverse problems. But this universality is also its weakness in specialized domains. For image classification on small, well-curated datasets like CIFAR-10, simpler optimizers such as SGD with momentum often achieve better final accuracies because they tend to converge to flatter minima that generalize better—a phenomenon that still fascinates researchers today.
By the end of the session, the pieces were in place: wider channels, consistent SGD with momentum, and fixed sparsity behavior. The model wasn’t fundamentally different, but it was now optimized for what CIFAR-10 actually rewards. Sometimes closing that last percentage point gap isn’t about revolutionary changes—it’s about aligning every component toward a single goal.
😄 Hunting down every optimizer instance in your codebase after switching algorithms is like playing Where’s Waldo, except Waldo is your bug and the entire technical documentation is the book.
Metadata
- Session ID:
- grouped_llm-analisis_20260208_2255
- Branch:
- HEAD
- Dev Joke
- Why don’t JavaScript developers like nature? There’s no console for debugging.