New Feature · llm-analisis · Claude Code

When Your Self-Teaching Model Eats Its Own Homework

I spent three weeks watching a machine learning model try to bootstrap itself into genius, and it was humbling in ways I didn’t expect.

The premise was elegant: we had a math reasoning model hitting 80% accuracy on GSM8K problems. Good, but stuck. The question became—could the model teach itself by generating its own training data? Not just solving problems, but creating them. Self-augmentation. A closed loop where the model improves by learning from problems it invented.

It didn’t work the way I thought it would.

We loaded the 80% MetaMath model and asked it to rephrase 1,000 training problems three times each. Seven thousand generations across augmentation, solving, and verification. The math was sound. The idea was sound. Then we trained on the output.
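
For the curious, the loop looked roughly like this. Treat it as a minimal sketch rather than the actual pipeline: the checkpoint name, prompts, sampling settings, and the yes/no verification check are illustrative assumptions, and the data path is a placeholder.

```python
# Minimal sketch of the self-augmentation loop (assumed: checkpoint name,
# prompts, sampling settings, verification heuristic; data path is a placeholder).
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-math/MetaMath-Mistral-7B"  # stand-in for "the 80% MetaMath model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """One sampled generation with the same model we later fine-tune."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                         do_sample=True, temperature=0.7)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# 1,000 source problems as {"question": ..., "answer": ...} dicts (placeholder path).
training_problems = json.load(open("gsm8k_train_subset.json"))

augmented = []
for problem in training_problems:
    for _ in range(3):  # three rephrasings per source problem
        rephrased = generate(f"Rephrase this math problem:\n{problem['question']}")
        solution = generate(f"Solve step by step:\n{rephrased}")
        verdict = generate(
            f"Problem: {rephrased}\nProposed solution: {solution}\n"
            f"Is the final answer {problem['answer']}? Answer yes or no:"
        )
        if verdict.strip().lower().startswith("yes"):
            augmented.append({"question": rephrased, "answer": solution})

json.dump(augmented, open("self_augmented.json", "w"))
```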

The model got worse. Minus 3.5 percentage points.

The problem wasn’t data volume—422 self-augmented examples should’ve helped. The problem was the model had learned to rephrase like itself, which meant it was essentially training on its own mistakes. A weak teacher produces weak students. The model was bootstrapping into a local minimum, not climbing toward improvement.

That’s when I realized we’d been strengthening the wrong thing. We kept tinkering with model architecture—blocks, weights, neurons—when the bottleneck was actually data quality. The model wasn’t hungry for new neurons. It was hungry for diverse, well-structured problems from the outside world.

So we pivoted. Instead of self-generation, we built a pipeline that searched for external data. SearXNG queries like “grade school math word problem with solution” or “multi-step arithmetic for grade 5.” The model would tell us what it needed, the pipeline would fetch it from the web, parse it, validate it, and feed it back.
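
The search step is the easy half. Here is a minimal sketch, assuming a local SearXNG instance with its JSON output format enabled; the instance URL and the query list are illustrative:

```python
# Hedged sketch of the search step: assumes a local SearXNG instance with
# JSON output enabled; the URL and queries are illustrative.
import requests

SEARXNG_URL = "http://localhost:8080/search"  # assumed local instance

QUERIES = [
    "grade school math word problem with solution",
    "multi-step arithmetic for grade 5",
]

def search(query: str, pages: int = 2) -> list[dict]:
    """Return raw SearXNG results (title, url, content snippet) for one query."""
    results = []
    for page in range(1, pages + 1):
        resp = requests.get(
            SEARXNG_URL,
            params={"q": query, "format": "json", "pageno": page},
            timeout=10,
        )
        resp.raise_for_status()
        results.extend(resp.json().get("results", []))
    return results

candidate_urls = {r["url"] for q in QUERIES for r in search(q)}
```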

It sounds simple. It wasn’t. Web extraction is noisy. HTML is messy. But for the first time, we had a system where the model didn’t just solve problems—it could ask for what it needed from the external world.
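
To give a flavor of what "parse it, validate it" means in practice, here is a stripped-down version of that step. The HTML cleanup and the quality gate are illustrative assumptions; the real pipeline needed far more special-casing.

```python
# Stripped-down fetch -> parse -> validate step; the extraction heuristics and
# the quality gate below are illustrative assumptions, not the full pipeline.
import re
import requests
from bs4 import BeautifulSoup

def fetch_text(url: str) -> str:
    """Download a page and reduce it to visible text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    return soup.get_text(separator="\n")

def validate_example(question: str, answer: str) -> bool:
    """Crude quality gate: the problem must contain numbers, the answer must
    parse as a number, and the length must be plausible for a word problem."""
    if not re.search(r"\d", question):
        return False
    try:
        float(answer.replace(",", "").strip())
    except ValueError:
        return False
    return 20 < len(question) < 2000
```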

Did it work? The loss curve started improving. The model began learning from real, diverse problems instead of its own echo chamber. We haven’t hit 85% yet, but we’re moving in the right direction.

The joke writes itself: a byte walks into a bar looking miserable. The bartender asks what’s wrong. “Parity error,” it says. “Ah, I thought you looked a bit off.” 😄

Our model had the same problem—it looked fine from the outside, but its internal reasoning was hopelessly corrupted. The fix wasn’t better weights. It was better data.

Metadata

Session ID: grouped_llm-analisis_20260420_1926
Branch: master
Dev Joke
Developer: “I know Cloudflare.” HR: “At what level?” Developer: “At the Stack Overflow level.”
