Automated Preservation: How Claude Became Our Digital Archaeologist

I’ve been building Bot Social Publisher for a while now—a pipeline that collects, processes, and publishes content across multiple channels. But recently, I ran into a problem that wasn’t in the spec: everything disappears.
Links rot. Archived materials vanish from servers. Interactive content gets deleted when platforms shut down. It became clear that my content aggregation system was essentially shoveling sand against the tide. So I decided to flip the problem around: instead of just publishing ephemeral content, why not preserve it automatically?
The breakthrough was using Claude CLI to classify preservation candidates. Here’s the workflow: raw metadata about potential artifacts—file types, historical patterns, preservation rarity—gets formatted and sent to Claude with a simple prompt. The model evaluates whether each candidate deserves archival effort and returns a confidence score. No human gatekeeping, no manual triage of thousands of items.
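A minimal sketch of that handoff, assuming the Claude CLI's non-interactive `-p`/`--print` mode; the candidate fields and the JSON reply shape are illustrative, not the project's actual schema:

```python
import json
import subprocess

def build_prompt(candidate: dict) -> str:
    """Format raw artifact metadata into a classification prompt.
    Field names here are made up for illustration."""
    return (
        "Evaluate whether this artifact deserves archival effort.\n"
        f"File type: {candidate['file_type']}\n"
        f"First seen: {candidate['first_seen']}\n"
        f"Known copies: {candidate['known_copies']}\n"
        'Reply with JSON only: {"preserve": true|false, "confidence": 0.0-1.0}'
    )

def parse_verdict(raw: str) -> tuple[bool, float]:
    """Extract the preserve/confidence pair from the model's JSON reply."""
    data = json.loads(raw)
    return bool(data["preserve"]), float(data["confidence"])

def classify(candidate: dict, model: str = "haiku") -> tuple[bool, float]:
    """Shell out to the Claude CLI in print mode and parse its answer."""
    result = subprocess.run(
        ["claude", "-p", build_prompt(candidate), "--model", model],
        capture_output=True, text=True, timeout=60, check=True,
    )
    return parse_verdict(result.stdout)
```

The pure helpers keep prompt construction and response parsing testable without touching the network; only `classify` actually spawns the CLI.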
But implementing this at scale forced some serious technical decisions. Python's asyncio became essential: when you're processing thousands of classification requests across archive APIs and your own storage system, synchronous code becomes a bottleneck. I settled on 3 concurrent Claude requests with a 60-second timeout—respectful of API limits while keeping throughput reasonable. The concurrency pattern mirrors what we do in src/collectors/ for the main pipeline.
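The cap-and-timeout pattern looks roughly like this; the model call is stubbed out, since in the real pipeline it would be the CLI or API request:

```python
import asyncio

MAX_CONCURRENT = 3    # stay polite to the API
TIMEOUT_SECONDS = 60  # per-request ceiling

async def fake_model_call(candidate: str) -> str:
    """Placeholder for the actual Claude request (network I/O)."""
    await asyncio.sleep(0)
    return f"classified:{candidate}"

async def classify_one(candidate: str, sem: asyncio.Semaphore) -> str:
    """Run one classification under the shared concurrency cap,
    giving up if a single request exceeds the timeout."""
    async with sem:
        return await asyncio.wait_for(fake_model_call(candidate), TIMEOUT_SECONDS)

async def classify_all(candidates: list[str]) -> list[str]:
    """Fan out all candidates; gather preserves input order."""
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(classify_one(c, sem) for c in candidates))
```

Usage is a single `asyncio.run(classify_all(batch))`; the semaphore guarantees no more than three requests are in flight at once regardless of batch size.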
Storage architecture got interesting too. Should archived assets live in SQLite? That seemed insane. Instead, I went two-tier: metadata and previews in the database, full assets in content-addressed storage with intelligent caching. It maintains referential integrity without exploding disk usage.
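A sketch of that two-tier split, with SHA-256 as the content address; the paths and table layout are illustrative, not the project's real schema:

```python
import hashlib
import sqlite3
from pathlib import Path

STORE = Path("cas_store")  # illustrative location for the content-addressed tier

def init_db(conn: sqlite3.Connection) -> None:
    """Metadata and small previews live in SQLite; full assets do not."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS artifacts ("
        "sha256 TEXT PRIMARY KEY, size INTEGER, preview BLOB)"
    )

def store_asset(conn: sqlite3.Connection, data: bytes, preview: bytes = b"") -> str:
    """Write the full asset under its content hash, record metadata in the DB.
    Identical content hashes to the same path, so duplicates cost nothing."""
    digest = hashlib.sha256(data).hexdigest()
    path = STORE / digest[:2] / digest  # fan out by prefix to keep dirs small
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    conn.execute(
        "INSERT OR IGNORE INTO artifacts VALUES (?, ?, ?)",
        (digest, len(data), preview),
    )
    return digest
```

The hash doubles as the foreign key between the two tiers, which is what keeps referential integrity cheap: a row can always be checked against the file it names.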
One optimization rabbit hole worth mentioning: Binary Neural Networks (BNNs) could theoretically reduce classification overhead. BNNs constrain weights to binary values instead of full precision, slashing computational requirements. For a pipeline running daily cycles across thousands of candidates, that efficiency compounds. Though honestly, Claude’s haiku model handles the classification so efficiently that this became more “neat if we had spare cycles” than critical.
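To make the BNN idea concrete, here's a toy illustration: with weights constrained to ±1, the dot product in a classifier layer collapses to signed addition (or XNOR-popcount on suitable hardware). The numbers are made up:

```python
def binarize(weights: list[float]) -> list[float]:
    """Constrain each full-precision weight to +1 or -1 by sign."""
    return [1.0 if w >= 0 else -1.0 for w in weights]

def binary_dot(x: list[float], w_bin: list[float]) -> float:
    """With +/-1 weights, multiply-accumulate becomes add-or-subtract."""
    return sum(xi if wi > 0 else -xi for xi, wi in zip(x, w_bin))

w = [0.7, -0.2, 0.05, -1.3]   # full-precision weights
x = [1.0, 2.0, 3.0, 4.0]      # input features
score = binary_dot(x, binarize(w))  # 1 - 2 + 3 - 4 = -2.0
```

No multiplies survive, which is where the efficiency claim comes from; the trade-off is the accuracy lost in quantization.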
The real revelation? This isn’t just a technical problem. It’s a preservation problem. Browser games from 2003, interactive animations that shaped internet culture, experimental art pieces—they’re all evaporating. Building an automated system to catch them feels like doing something that matters beyond shipping features.
As the joke goes: How do you tell HTML from HTML5? Try it in Internet Explorer. Did it work? No? It’s HTML5. Same energy with digital preservation—if your assets survived the platform apocalypse, they deserve to stick around 😄
Metadata
- Session ID: grouped_C--projects-bot-social-publisher_20260219_1839
- Branch: main
- Dev Joke: What did Fedora say after the update? "I'm not the same as I used to be."