Blog
Posts about the development process, problems solved, and technologies learned
The Narrow Path: Why Perfect Optimization Crumbles
I've been chasing the golden number for weeks now. **Phase 24a** delivered **76.8% accuracy on GSM8K**—a solid baseline for mathematical reasoning in large language models. The team was excited. I was cautious. In my experience, when a result feels *too clean*, it's usually balanced on a knife's edge. So I decided to push further with **Phase 29a and 29b**, two experiments designed to improve what we already had. The strategy seemed sound: inject curriculum data to guide the model toward harder problems, and extend training from 500 to 1,000 steps to capture finer pattern recognition. Standard moves in the playbook. Phase 29a involved adding **89 borderline solutions**—answers sampled at higher temperatures, intentionally less deterministic. I thought diversity would help. Instead, I watched accuracy *plummet* to **73.0%, a 3.8 percentage point drop**. The perplexity exploded to 2.16, compared to the baseline's 1.60. The model was struggling, not learning. Those temperature-sampled solutions weren't diverse training signal—they were noise wearing a training label. Then came **Phase 29b**: double the training steps. Surely more iterations would converge to something better? The loss hit 0.004—nearly zero. The model was memorizing, not generalizing. Accuracy barely limped to **74.4%**, still 2.4 points underwater. The lesson hit hard: *we'd already found the optimum at 500 steps*. Beyond that, we weren't learning—we were overfitting. What struck me most wasn't the failed experiments themselves. It was how *fragile* the baseline turned out to be. **Phase 24a wasn't a robust solution—it was a brittle peak**. The moment I changed the data composition or training duration, the whole structure collapsed. The algorithm had found a narrow channel where everything aligned perfectly: the right data distribution, the right training length, the right balance. Wiggle anything, and you tumble out. 
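The "optimum was at 500 steps" lesson maps onto a standard guard: pick the checkpoint by held-out accuracy, not training loss (which, at 0.004, was screaming memorization). A minimal sketch with a hypothetical eval history, not our actual numbers:

```python
def early_stop_step(history, patience=2):
    """Pick the best checkpoint from (step, held_out_accuracy) pairs,
    stopping once accuracy has failed to improve `patience` evals in a row."""
    best_step, best_acc = history[0]
    since_best = 0
    for step, acc in history[1:]:
        if acc > best_acc:
            best_step, best_acc, since_best = step, acc, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # past the peak: more steps now means overfitting
    return best_step, best_acc

# Illustrative curve shaped like Phase 29b: accuracy peaks, then decays.
history = [(100, 0.70), (300, 0.75), (500, 0.768), (700, 0.75), (900, 0.744)]
```

With this curve, `early_stop_step(history)` lands on step 500, exactly the peak that more training silently walked away from.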
This is the hard truth about optimization in machine learning: **sometimes the best result isn't a foundation—it's a lucky intersection**. You can't always scale it. You can't always improve it by adding more of what worked before. We still have **Phase 29c** (multi-expert routing) and **29d** (MATH domain data) queued up. But I'm approaching them differently now. Not as simple extensions of success, but as careful explorations of *why* the baseline works at all. The irony? This mirrors something I read once: *"Programming is like sex. Make one mistake and you end up supporting it for the rest of your life."* 😄 In optimization, it's worse—you might be supporting someone else's lucky mistake, and have no idea where the luck ends and the skill begins.
When Smaller Models Learn Too Well: The MoE Scaling Paradox
We just wrapped Phase 18 of our LLM analysis project, and it revealed something that caught us off guard. We trained a **Qwen 2.5 3B model with a 4-domain Mixture of Experts**, expecting incremental improvements across the board. Instead, we discovered that sometimes *better pretraining performance actually breaks downstream tasks*. Here's what happened. Our baseline Qwen 2.5 3B scored a respectable **65.85% on MMLU and 74.2% on GSM8K** math problems. Then we trained domain-specific experts for reasoning, coding, math, and general language tasks. The perplexity improvements looked fantastic—a **10.5% PPL reduction** on our math expert alone, which typically signals strong learning. But when we evaluated downstream performance, the math expert **tanked GSM8K by 8.6 percentage points**. Our strong 74.2% baseline collapsed. The other experts didn't help much either. PPL improvement meant nothing when actual problem-solving went backwards. The real win came from routing. We nailed the **router integration down to just 0.4% oracle gap**—the smallest difference yet between what our router chose and the theoretically perfect expert selection. That's the kind of metric that scales. We went from 6.6% gap → 3.2% → 0.4% as we refined the architecture. But it couldn't save us from the fundamental mismatch: our experts were trained on language modeling (predicting the next token), not reasoning (solving step-by-step problems). This is the core insight from Phase 18. **Next-token prediction and downstream reasoning are two different beasts.** A model can optimize wonderfully for one while completely failing at the other. The experts learned to generate fluent text in their domains, but they forgot how to think through problems methodically. We've charted the course forward now. Phase 19 will flip our strategy—instead of mining raw text for pretraining, we'll use **task-aligned expert training** with actual Chain-of-Thought solutions. 
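For concreteness, the oracle gap we keep citing can be computed like this. A small sketch with made-up per-example results, where the "oracle" picks whichever expert would have been correct:

```python
def routing_metrics(router_choice, expert_correct):
    """router_choice[i] is the expert index the router picked for example i;
    expert_correct[i][e] is True if expert e solves example i.
    Returns (routed accuracy, oracle accuracy, oracle gap)."""
    n = len(router_choice)
    # Accuracy when following the router's actual choices.
    routed = sum(expert_correct[i][router_choice[i]] for i in range(n)) / n
    # Oracle: an example counts if *any* expert could have solved it.
    oracle = sum(any(row) for row in expert_correct) / n
    return routed, oracle, oracle - routed
```

Driving this gap from 6.6% down to 0.4% means the router is nearly as good as perfect hindsight; past that point, only better experts can move the end metric.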
We're also considering **mixture-of-LoRA** instead of full MoE parameters, and repositioning experts into the model's middle layers where reasoning happens rather than the output head. Eight experts down, infinite combinations to explore. The project is running hot—**~72 GPU hours invested so far**, and Phase 18 alone consumed 9.8 hours of compute. Every failed experiment teaches us where the scaling laws actually break. As we like to say around the lab: *the generation of random numbers is too important to be left to chance*—and apparently, so is training experts 😄.
FastCode: How Claude Code Accelerates Understanding Complex Codebases
Working on **Bot Social Publisher**, I recently faced a familiar developer challenge: jumping into a refactoring sprint without fully grasping the enrichment pipeline we'd built. The codebase was dense with async collectors, processing stages, and LLM integration logic. Time was tight, and manually tracing through `src/enrichment/` and `src/processing/` felt like reading tea leaves. That's when I leveraged Claude Code to do something unconventional: *understand* the codebase before rewriting it. Rather than drowning in line-by-line reads, I asked Claude to synthesize patterns across the entire architecture. Within minutes, I had a mental map—which async collectors fed into the transformer, where the ContentSelector bottleneck lived, and which API calls were load-bearing. This isn't magic. It's **systematic context extraction** that humans would spend hours reconstructing manually. The real power emerged when I combined code comprehension with focused debugging. The pipeline was making up to 6 LLM calls per note (content generation for Russian and English, separate title generation for each language, plus proofreading). Claude immediately spotted the inefficiency: we were asking for titles via separate API calls when they could be extracted from the generated content itself. It suggested collapsing the workflow to 3 calls maximum—content+title combined per language, proofreading optional. What surprised me most was how this revelation cascaded. Once Claude identified this pattern, it flagged similar redundancies: the Wikipedia enrichment cache was being hit twice, image fetching wasn't batched. Within an afternoon, we'd restructured the pipeline to respect our daily 100-query Claude CLI limit while maintaining quality. The token optimization alone meant we could process 40% more notes without hitting billing thresholds. Of course, there's a trade-off. You still need to *verify* what Claude suggests. 
Blindly accepting its recommendations would be foolish—especially with multi-language content where tone matters. But as a **scaffolding tool for architectural reasoning**, it's transformative. The broader lesson? Code comprehension is increasingly collaborative between human intuition and AI synthesis. We're moving beyond "read the source code" toward "have a conversation *about* the source code." For any engineer working in complex async systems, data pipelines, or multi-stage processing—this shift is phenomenal. By the end of our refactor, we'd eliminated redundant LLM calls, tightened enrichment caching, and shipped with higher confidence. The pipeline now handles daily digests more gracefully, respects rate limits, and produces richer content. Why do programmers prefer debugging with AI? Because sometimes the best code review comes from someone who'll never judge your variable names. 😄
When Perfect Routing Isn't Enough: The CIFAR-100 Specialization Puzzle
I've just wrapped up Experiment 13b on the llm-analysis project, and the results have left me with more questions than answers—in the best way possible. The premise was straightforward: could a **deep router with supervised training** finally crack the code on specialized expert networks? I'd been chasing this idea through multiple iterations, watching single-layer routers plateau around 62–63% accuracy. So I built something more ambitious: a multi-layer routing architecture trained to explicitly learn which expert should handle which image class. The numbers looked promising at first. The deep router achieved **79.5% routing accuracy**—a decisive 1.28× improvement over the baseline single-layer approach. That's the kind of jump that makes you think you're onto something. I compared it against three other strategies (pure routing, mixed, and two-phase), and this one dominated on the routing front. Then I checked the actual CIFAR-100 accuracy. **73.15%.** That's a gain of just 0.22 percentage points over the two-phase approach. Essentially flat. The oracle accuracy hovered around 84.5%, leaving an 11-point gap that perfect routing couldn't bridge. Here's what haunted me: I could demonstrate that the router was making *better decisions*—selecting the right expert 4 out of 5 times. Yet those correct decisions weren't translating into correct classifications. That paradox forced me to confront an uncomfortable truth: the problem wasn't routing efficiency. The problem was that **specialization itself might not be the solution** for CIFAR-100's complexity. The expert networks were learning narrow patterns, sure. But on a general-purpose image classification task with 100 fine-grained categories, that specialization came with hidden costs—fewer examples per expert, reduced generalization, potential overfitting to routing decisions that looked good in isolation but failed downstream. 
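One way to see where the accuracy leaks is to decompose end-to-end accuracy by routing outcome. If we assume, purely hypothetically, that misrouted images still get classified correctly 40% of the time, the experts' conditional accuracy falls out of the two measured numbers:

```python
routing_acc = 0.795   # measured: how often the router picks the right expert
end_acc = 0.7315      # measured: final CIFAR-100 accuracy

# end_acc = routing_acc * p_right + (1 - routing_acc) * p_wrong
p_wrong = 0.40  # assumption: accuracy when routed to the wrong expert
p_right = (end_acc - (1 - routing_acc) * p_wrong) / routing_acc
print(round(p_right, 3))  # ~0.817: even well-routed examples miss ~18% of the time
```

Under that assumption, the experts themselves cap out around 82% even when routing is correct, which is why better routing alone couldn't close the gap to the 84.5% oracle.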
I updated the documentation, logged the experiment metrics (routing accuracy, oracle accuracy, the works), and stored the final memory state. The 12b-fix variant and 13a experiments filled in the picture, but 13b delivered the real insight: sometimes the most elegant technical solution isn't the answer your problem actually needs. Now I'm rethinking the whole approach. Maybe the future lies in different architectures entirely—or maybe ensemble methods with selective routing rather than hard expert assignment. Why did the router walk into a bar? It had to make a decision about where to go. 😄
Taming Telegram: How ChatManager Brought Order to Bot Chaos
# Building ChatManager: Taming the Telegram Bot Zoo

Pavel faced a familiar problem that creeps up on every growing bot project: chaos. His voice agent had been happily managing users through SQLite, but now it needed to handle something more complex—managing which chats it actually operated in and enforcing strict permission boundaries. The `ChatManager` capability existed in a private bot, but integrating it into the production system required careful orchestration.

## The Task at Hand

The goal was straightforward in principle but thorny in execution: migrate a `ChatManager` class into the codebase, set up database infrastructure to track managed chats, wire it through the Telegram handlers, and validate everything with tests. This wasn't a greenfield project—it meant fitting new pieces into an existing system that already had its own opinions about logging, database access, and middleware patterns. Pavel started by breaking the work into five logical checkpoints. First came infrastructure: extracting the `ChatManager` class from the private bot capability and integrating it with the project's existing structured logging setup using `structlog`. The class would lean on `aiosqlite` for async SQLite operations—a deliberate choice to match the async-first architecture already in place. No synchronous database calls allowed.

## The Integration Dance

With the core class ready, the next step was database migrations. Pavel needed to create a `managed_chats` table with a proper schema—tracking chat IDs, their types (private, group, supergroup, channel), and ownership relationships. He wrote the SQL migration file cleanly, added appropriate indexes for performance, and created a validation checkpoint: after running the migration, a quick SQLite query would confirm the table existed. Then came the middleware layer. Before any handler could touch a managed chat, the bot needed to verify ownership.
Pavel created a new middleware module specifically for permission checks—a clean separation of concerns that would intercept requests and compare the user ID against the chat's owner record. The command handlers came next. A `/manage add` command would let users register chats with the bot, while the permission middleware would silently reject operations on unregistered chats. This defensive design meant no cryptic errors—just predictable behavior.

## The Educational Moment

Here's something interesting about async SQLite: most developers think of SQLite as a synchronous, single-threaded database engine, which it is. But `aiosqlite` doesn't magically make SQLite concurrent—instead, it queues operations and executes them sequentially under the hood while avoiding blocking the event loop. It's a classic asyncio pattern: you're not gaining raw parallelism, you're gaining responsiveness. The bot can now accept incoming messages while waiting for database operations to complete, rather than freezing the entire process.

## From Plan to Reality

Pavel structured his testing strategy carefully: unit tests for `ChatManager` using pytest's asyncio support would validate the core logic, integration tests would ensure the middleware played nicely with handlers, and a manual smoke test would verify the `/manage add` command worked from a real Telegram client. The beauty of this approach was its granularity. Each step had a concrete verification command—whether that was a Python import check, a migration validation query, or a test run. No guesswork, no "did it work?" uncertainty. By breaking the integration into five discrete steps with checkpoints between them, Pavel turned what could have been a chaotic refactor into a methodical progression. Each component could be reviewed and tested in isolation before moving forward. This is how large systems stay maintainable.

---

Judge: "I sentence you to debug legacy Python code written with no type hints." 😄
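The responsiveness-not-parallelism point can be shown with the standard library alone. This sketch mimics what `aiosqlite` does (one connection, operations serialized off the event loop); the `managed_chats` schema here is an illustrative guess, not the project's actual migration:

```python
import asyncio
import sqlite3

class AsyncDB:
    """Serialize SQLite work off the event loop, aiosqlite-style."""

    def __init__(self, path: str = ":memory:"):
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = asyncio.Lock()  # one operation at a time, like aiosqlite's queue

    async def execute(self, sql: str, params: tuple = ()):
        async with self._lock:  # SQLite stays effectively single-threaded
            return await asyncio.to_thread(self._run, sql, params)

    def _run(self, sql: str, params: tuple):
        cur = self._conn.execute(sql, params)
        self._conn.commit()
        return cur.fetchall()

async def main():
    db = AsyncDB()
    await db.execute(
        "CREATE TABLE managed_chats (chat_id INTEGER PRIMARY KEY, "
        "chat_type TEXT NOT NULL, owner_id INTEGER NOT NULL)"
    )
    await db.execute(
        "INSERT INTO managed_chats VALUES (?, ?, ?)", (-100123, "supergroup", 42)
    )
    return await db.execute("SELECT chat_type FROM managed_chats")
```

While a query waits inside `asyncio.to_thread`, the event loop is free to accept incoming Telegram updates, which is the whole point of the pattern.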
From Memory Module to Self-Aware Agent
# Reframing an AI Agent's Memory: From Module to Self

The **ai-agents** project was at an inflection point. The memory system worked technically—it extracted facts, deduplicated entries, consolidated knowledge, and reflected on patterns—but something felt off. The prompts treated the agent like a passive data-processing pipeline: "You are a memory-extraction module," they declared. Claude was being told *what to do with data*, not invited to *think about its own experience*. The developer saw the opportunity immediately. Why not flip the entire framing? Instead of "you are a module processing user information," make it "this is YOUR memory, YOUR thinking time, YOUR understanding growing." The shift sounds subtle in theory but transforms the agent's relationship to its own cognition in practice. First came the **prompts.py** overhaul—all five core prompts. The extraction prompt changed from impersonal instructions into something more intimate: "You are an autonomous AI agent reviewing a conversation you just had... This is YOUR memory." The deduplication prompt followed: "You are maintaining YOUR OWN memory," not *managing external data*. The consolidation prompt became introspective: "This is how you grow your understanding." Even the reflection and action prompts shifted into first-person agency, treating memory maintenance as something the agent does *for itself*, not something done *to it*. Then came the critical piece—updating the **manager.py** system prompt header. The label changed from the clinical "Long-term Memory (IMPORTANT)" to the personal "Моя память (ВАЖНО)" (My memory (IMPORTANT)). But here's where it gets interesting: the entire section architecture reframed around the agent's perspective. "Known Facts" became "Что я знаю" (What I know). "Recent Context" transformed into "Недавний контекст" (My recent context). "Workflows & Habits" shifted to "Рабочие привычки и процессы" (My working habits and processes). 
"Active Projects" remained direct but now belonged to the agent, not to some external system observing it. The philosophical move here aligns with how humans actually think about memory. We don't experience our minds as "modules processing incoming data." We experience them as *ours*—integrated, personal, evolving. By rewriting the prompts from this angle, the developer was essentially saying: "Claude, treat this memory system the way you'd treat your own thinking." **One interesting note on AI autonomy:** This kind of prompt reframing—shifting from external instruction to first-person agency—touches on a real frontier in how we design AI systems. When an agent is told it's *maintaining* versus *managing*, it subtly changes decision-making. Personal ownership breeds different behavior than mechanical processing. It's not that the underlying mechanism changes, but the agent's model of *why it's doing something* shifts from duty to self-interest. The changes were deployed cleanly, with the category marked as code_change and tags noting the technologies involved: claude (the model), ai (the domain), and python (the implementation language). By day's end, the memory system didn't just work differently—it thought differently. Now when the agent encounters something worth remembering, it's not being instructed to store it. It's deciding what *it* needs to know.
From 83.7% to 85%: Architecture and Optimizer Choices Matter
# Chasing That Last 1.3%: When Model Architecture Meets Optimizer Reality

The CIFAR-10 accuracy sat stubbornly at 83.7%, just 1.3 percentage points shy of the 85% target. I was deep in the `llm-analysis` project, staring at the training curves with that peculiar frustration only machine learning developers understand—so close, yet somehow impossibly far. The diagnosis was clear: the convolutional backbone needed more capacity. The model's channels were too narrow to capture the complexity required for those final critical percentages. But this wasn't just about arbitrarily increasing numbers. I needed to make the architecture **configurable**, allowing for flexible channel widths without redesigning the entire network each time. First, I refactored the model instantiation to accept configurable channel parameters. This is where clean architecture pays dividends—instead of hardcoding layer dimensions, I could now scale the backbone horizontally. I widened the channels across the network, giving the model more representational power to learn those nuanced features that separate 83.7% from 85%. Then came the optimizer revelation. The training script was still using **Adam**, the ubiquitous default for deep learning. But here's the thing about CIFAR-10—it's a dataset where **SGD with momentum** has historically outperformed Adam for achieving those final accuracy gains. The switch wasn't arbitrary; it's a well-known pattern in the computer vision community, yet easy to overlook when you're in the flow of incremental improvements. This revealed a deeper architectural issue: after growth events in the training pipeline (where the model dynamically expands), the optimizer gets rebuilt. The code was still initializing Adam in those rebuilds. I had to hunt down every instance—the primary optimizer loop, the Phase B optimizer updates—and swap them all to SGD with momentum hyperparameters. Each change felt small, but they compounded into a coherent optimization strategy. 
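For reference, the update SGD with momentum performs is small enough to write out by hand (shown here in the common deep-learning convention: v = μ·v + g, then w ← w − lr·v). A plain-Python sketch over flat parameter lists, not the project's actual training code:

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update over flat parameter lists."""
    # Velocity accumulates an exponentially weighted sum of past gradients.
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    # Parameters move along the velocity, not the raw gradient.
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```

The bug class described above, Adam sneaking back in after growth events, is best avoided by routing every optimizer rebuild through a single factory function, so the choice lives in exactly one place.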
While I was optimizing the obvious, I spotted something lurking in the **RigL sparsity implementation**—the sparse training mechanism was overshooting its target sparsity levels slightly. RigL (from "Rigging the Lottery") uses dynamic sparse training to prune connections during training, but when the sparsity calculations drift even marginally from their targets, it can destabilize convergence. I traced through the sparsity growth schedule, checking where the overshoot accumulated. **Here's something fascinating about the Adam optimizer:** it was introduced in 2014 by Kingma and Ba, and it became the default across industry precisely because it's forgiving and works well across diverse problems. But this universality is also its weakness in specialized domains. For image classification on small, well-curated datasets like CIFAR-10, simpler first-order optimizers with momentum often achieve better final accuracies because they tend to converge to flatter minima that generalize better—a phenomenon that still fascinates researchers today. By the end of the session, the pieces were in place: wider channels, consistent SGD with momentum, and fixed sparsity behavior. The model wasn't fundamentally different, but it was now optimized for what CIFAR-10 actually rewards. Sometimes closing that last percentage point gap isn't about revolutionary changes—it's about aligning every component toward a single goal. 😄 Hunting down every optimizer instance in your codebase after switching algorithms is like playing Where's Waldo, except Waldo is your bug and the entire technical documentation is the book. 
Authelia Authentication: From Bootstrap Scripts to Secure Credentials
# Authelia Setup: Securing the Admin Panel Behind the Scenes

The borisovai-admin project needed proper authentication infrastructure, and the developer faced a common DevOps challenge: how to manage credentials securely when multiple services need access to the same authentication system. The task wasn't just about deploying Authelia—it was about understanding where passwords live in the system and ensuring they wouldn't cause midnight incidents. The work started with a straightforward request: apply the changes to the installation scripts and push them to the pipeline. But before deployment, the developer needed to answer a practical question that often gets overlooked: *where exactly are the credentials stored, and how do we actually use them?* First, the developer examined the Authelia installation script—specifically lines 374–418 of `install-authelia.sh`. This is where the bootstrap happens. The default admin account gets created with a username that's hardcoded in every Authelia setup: **admin**. Simple, memorable, and apparently universal. But the password? That's where it gets interesting. The password isn't just sitting in a configuration file waiting to be discovered. Instead, it's derived from the Management UI's own authentication store at `/etc/management-ui/auth.json`—a pattern that creates a useful single source of truth. Both systems use the same credential, which simplifies the operations workflow. When you need to authenticate to Authelia, you're using the same password that secures the management interface itself. Inside `/etc/authelia/users_database.yml`, the actual password gets stored as an **Argon2 hash**, not plaintext. This is a critical detail because Argon2 is specifically designed to be slow and memory-intensive, making brute-force attacks computationally expensive. It's the kind of defensive measure that doesn't seem important until you're reviewing logs at 3 AM wondering if your authentication layer has been compromised. 
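Argon2 itself lives outside the Python standard library (it usually comes via `argon2-cffi`), but the memory-hard-KDF idea Authelia relies on can be sketched with stdlib `scrypt`, a related memory-hard function; the cost parameters below are illustrative, not Authelia's:

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    # Memory-hard KDF: the n/r/p cost parameters make brute force expensive.
    salt = os.urandom(16)
    digest = hashlib.scrypt(
        password.encode(), salt=salt, n=2**14, r=8, p=1, maxmem=2**25
    )
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.scrypt(
        password.encode(), salt=salt, n=2**14, r=8, p=1, maxmem=2**25
    )
    return hmac.compare_digest(candidate, digest)  # constant-time comparison
```

Same shape as the Argon2 hash in `users_database.yml`: a random salt plus a deliberately expensive digest, so a leaked file doesn't hand an attacker cheap offline guesses.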
The developer committed these changes in `e287a26` and pushed them to the pipeline, which would automatically deploy the updated scripts to the server. No manual SSH sessions required—the infrastructure as code approach meant the deployment was reproducible and auditable. What makes this work pattern valuable is the practical transparency it provides. By understanding exactly where credentials live and how they're stored, the developer created documentation that future maintainers will actually use. When someone inevitably forgets the admin password six months later, they'll know to look in `/etc/management-ui/auth.json` instead of starting a frantic password reset procedure. The lesson here isn't about Authelia specifically—it's about building systems where the authentication story is clear and consistent. Single sources of truth for passwords, transparent storage mechanisms, and infrastructure that can be reproduced reliably. That's how you avoid the scenario where nobody remembers which password works with which system. 😄 Why did the functional programmer get thrown out of school? Because he refused to take classes.
Building Trends: From Mockups to Data-Driven Analysis Engine
# Building Trend Analysis: From UI Mockup to Data Layer

The trend-analysis project needed serious architectural work. The HTML prototype was done—nice buttons, forms, the whole visual dance—but now came the real challenge: connecting it all to a backend that could actually *think*. The task was ambitious but clear: implement the complete backend data layer, versioning system, and API endpoints that would let analysts track how trends evolve and branch into deeper investigations. Starting from scratch meant understanding what already lived in the codebase and what needed to be built. The first thing I did was read through the existing `analysis_store.py` file. This was crucial. The database had a foundation—an `analyses` table and some basic query functions—but it was missing the intelligence needed for version tracking. Trends aren't static; they split, deepen, get revisited. The existing code didn't know how to handle parent-child relationships between analyses or track investigation depth. So Phase 1 began: SQL migrations. I added four new columns to the database schema: `version` (which analysis iteration is this?), `depth` (how many levels down in the investigation?), `time_horizon` (looking at the past week, month, or year?), and `parent_job_id` (which analysis spawned this one?). These weren't just decorative fields—they'd form the backbone of how the system understood analysis relationships. Next came the tricky part: rewriting the store functions. The original `save_analysis()` was simple and dumb. I modified it to accept these new parameters and compute version numbers intelligently—if you're analyzing the same trend again, it's version 2, not version 1. I also added `next_version()` to calculate what version number should come next, `find_analyses_by_trend()` to fetch all versions of a particular trend, and `list_analyses_grouped()` to organize results by parent-child relationships. The Pydantic schema updates took longer than anticipated. 
Each converter function—`_row_to_analysis_summary()`, `_row_to_version_summary()`—needed careful attention. One mistake in the field mapping, and the entire API layer would silently return wrong data. By Phase 2, I was updating the API routes themselves. The `AnalyzeRequest` schema grew to accept parent analysis IDs. The `_run_analysis()` function now computed versions dynamically. Endpoints like `get_analysis_for_trend` returned all historical versions, while `get_analyses` gained a `grouped` query parameter to visualize parent-child hierarchies. **Here's something worth knowing about relational database versioning:** Most developers instinctively reach for row-level versioning tables (essentially duplicating data), but maintaining a parent relationship in a single table with version numbers is more elegant. You get the full history without denormalization headaches, though querying hierarchical data requires careful SQL. In this case, storing `parent_job_id` let us reconstruct the entire investigation tree without extra tables. After Phase 2 wrapped up, I ran the test suite. Most tests passed. One pre-existing failure in an unrelated crawler test wasn't my problem—legacy code that nobody had bothered fixing. The new code was solid. What got shipped: a versioning system that lets analysts branch investigations, track which analyses spawned which children, and organize their work by depth and time horizon. The backend now understood that good research isn't linear—it's recursive, exploratory, and needs to remember where it came from. Next up: Phase 3, which meant the frontend would finally talk to this data layer. But that's another story. 😄 What do you get if you lock a monkey in a room with a typewriter for 8 hours? A regular expression.
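The single-table versioning scheme is easy to demonstrate with `sqlite3`. The column and function names below mirror the ones mentioned in the post, but the schema itself is a simplified sketch, not the project's actual migration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE analyses (
        job_id INTEGER PRIMARY KEY,
        trend TEXT NOT NULL,
        version INTEGER NOT NULL,
        depth INTEGER NOT NULL DEFAULT 0,
        time_horizon TEXT,
        parent_job_id INTEGER REFERENCES analyses(job_id)
    )"""
)

def next_version(conn, trend):
    # Re-analyzing the same trend bumps the version instead of duplicating data.
    (v,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) + 1 FROM analyses WHERE trend = ?",
        (trend,),
    ).fetchone()
    return v

def save_analysis(conn, trend, depth=0, time_horizon=None, parent_job_id=None):
    version = next_version(conn, trend)
    cur = conn.execute(
        "INSERT INTO analyses (trend, version, depth, time_horizon, parent_job_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (trend, version, depth, time_horizon, parent_job_id),
    )
    return cur.lastrowid, version
```

Saving the same trend twice yields versions 1 and 2, and a child analysis carries its `parent_job_id`, so the whole investigation tree can be reconstructed from one table.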