BorisovAI — Tools for the community. By the community.

I started this week convinced we’d hit the scaling ceiling. The Bot Social Publisher project was pulling Claude API for every content enrichment cycle—six LLM calls per note, throttled at 3 concurrent, burning through our daily quota by noon. Each query cost money. Each query added latency. The math didn’t work for a content pipeline that needed to process hundreds of notes daily.

Then I stumbled into the optimization rabbit hole, and the numbers became impossible to ignore.

The breakthrough was quantization. Instead of running Claude at full precision, we started experimenting with exllamav3 and Model-Optimizer to deploy Haiku locally. The math seemed insane at first—int4 quantization, 8x memory reduction, yet only 1-2% accuracy loss. On my RTX 4060, something that previously required cloud infrastructure now ran in under 200 milliseconds. No API calls. No rate limiting. No end-of-month invoice shock.

We restructured the entire enrichment pipeline around this insight. Content generation still flows through Claude CLI (claude -p "..." --output-format json), but we got aggressive about reducing calls per note. Instead of separate title generation requests, we now extract titles from the generated content itself—first line after the heading marker. Proofreading? For Haiku model, the quality already meets blog standards; skipping that call saved 33% of our token consumption overnight.

The real innovation was semantic caching. When enriching a note about Python optimization, we check: has this topic been processed in the last week? The embeddings are cached. We reuse the Wikipedia fact, the joke, even fragments of similar content. Combined with continuous batching and smarter prompt tokenization, we cut costs by 40-60% per note without sacrificing quality.

But the painful part arrived quickly. Quantized models behave differently on different hardware. A deployment that flew on NVIDIA hardware would OOM on consumer Intel Arc. We built fallback logic—if local inference fails, the pipeline immediately escalates to cloud. It’s not elegant, but it’s reliable.

What I didn’t expect was how accessible this became. A year ago, running capable LLMs locally felt experimental, fragile. Now it’s the default assumption for cost-conscious teams. The democratization is reshaping the entire economics of AI deployment. You genuinely don’t need enterprise infrastructure to scale intelligently anymore.

The real lesson: infrastructure optimization isn’t an afterthought. It’s the game itself.

An algorithm is just a word programmers use when they don’t want to explain how their code works. 😄

Running LLMs on a Shoestring: How Local Inference Changed Our Economics

Metadata