Tunnels, Timeouts, and the Night the Infrastructure Broke

The borisovai-admin project had outgrown its single-server phase. What started as a cozy little control panel now needed to orchestrate multiple machines across different networks, punch through firewalls, and do it all with a clean web interface. The task was straightforward on paper: build a tunnel management system. Reality, as always, had other ideas.
The Tunnel Foundation
I started by integrating frp (fast reverse proxy) into the infrastructure. It is a lightweight reverse proxy well suited to getting past NAT and firewalls without the overhead of heavier solutions. The backend needed a proper face, so I built tunnels.html, a clean UI that lists active connections and offers controls for creating and destroying tunnels. On the server side, five new API endpoints in server.js handle the tunnel lifecycle. Nothing fancy, but functional.
The real work came in the installation automation. I created install-frps.sh to bootstrap the frp server (frps), and frpc-template, from which a client configuration is generated for each machine. Then came the small but crucial detail: adding a “Tunnels” navigation link throughout the admin panel. Tiny feature, massive usability improvement.
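Generating a per-machine client config from a template can be sketched roughly as follows. This is a minimal Python illustration, not the project's actual frpc-template, whose contents are not shown here. The template body, the render_frpc_config helper, the "-ssh" proxy naming, and the port numbers are all assumptions; the field names follow the TOML client format used by recent frp releases (older releases use an INI format instead).

```python
from string import Template

# Hypothetical template: the real frpc-template is not shown in the post.
# Keys follow frp's TOML client format; older frp versions use INI instead.
FRPC_TEMPLATE = Template("""\
serverAddr = "$server_addr"
serverPort = $server_port

[[proxies]]
name = "${machine}-ssh"
type = "tcp"
localIP = "127.0.0.1"
localPort = $local_port
remotePort = $remote_port
""")


def render_frpc_config(machine, server_addr, remote_port,
                       local_port=22, server_port=7000):
    """Fill the template for one machine and return the config text.

    All parameter names are illustrative. 7000 is frps's usual bind port,
    but the project's real port is an assumption here.
    """
    for port in (remote_port, local_port, server_port):
        if not 1 <= port <= 65535:
            raise ValueError(f"port out of range: {port}")
    return FRPC_TEMPLATE.substitute(
        machine=machine,
        server_addr=server_addr,
        server_port=server_port,
        local_port=local_port,
        remote_port=remote_port,
    )
```

Calling render_frpc_config("contabo-sm-139", "203.0.113.10", 6022) returns the text of a config that exposes the machine's local port 22 on port 6022 of the frp server. Giving each machine its own remote port keeps the tunnels from colliding.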
When Your Load Balancer Becomes Your Enemy
Everything hummed along until large files started vanishing mid-download through GitLab. The culprit? Traefik’s default timeouts were too short for long transfers: anything taking more than a few minutes was severed by the reverse proxy. This wasn’t a bug in Traefik; I had simply left the defaults in place for a workload they were never meant for.
I rewrote the Traefik setup with surgical precision: readTimeout raised to 600 seconds, a dedicated serversTransport for GitLab traffic, and a new configure-traefik.sh script to generate both settings dynamically. After that, even 500 MB archives downloaded flawlessly.
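For orientation, here is a rough Python sketch of the two config fragments such a script might emit. It is not the actual configure-traefik.sh. The entry point name "websecure", the transport name "gitlab-transport", and the 600-second responseHeaderTimeout are assumptions; only the 600-second readTimeout comes from the setup described above. The key paths (entryPoints.<name>.transport.respondingTimeouts in the static config, http.serversTransports.<name>.forwardingTimeouts in the dynamic config) are Traefik's own.

```python
def to_yaml(node, indent=0):
    """Render nested dicts of scalars as YAML text.

    Deliberately tiny: it handles only dicts and scalar values, which is
    all this sketch needs. It is not a general YAML emitter.
    """
    out = []
    pad = "  " * indent
    for key, value in node.items():
        if isinstance(value, dict):
            out.append(f"{pad}{key}:")
            out.append(to_yaml(value, indent + 1))
        else:
            out.append(f"{pad}{key}: {value}")
    return "\n".join(out)


def static_fragment(read_timeout_s=600, entry_point="websecure"):
    """Static config: raise the entry point's read timeout.

    The entry point name is an assumption; use whatever the install defines.
    """
    return {"entryPoints": {entry_point: {"transport": {
        "respondingTimeouts": {"readTimeout": f"{read_timeout_s}s"}}}}}


def dynamic_fragment(transport="gitlab-transport",
                     response_header_timeout_s=600):
    """Dynamic config: a dedicated serversTransport for the GitLab backend.

    The transport name and timeout value are hypothetical.
    """
    return {"http": {"serversTransports": {transport: {
        "forwardingTimeouts": {
            "responseHeaderTimeout": f"{response_header_timeout_s}s"}}}}}
```

Printing to_yaml(static_fragment()) and to_yaml(dynamic_fragment()) gives the two fragments. The GitLab service still has to opt in to the transport through its loadBalancer.serversTransport setting, which is what keeps the longer timeouts scoped to GitLab instead of every backend.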
The Documentation Moment
While deep in infrastructure tuning, I realized the docs/ folder had become a maze. I reorganized it into logical sections: agents/, dns/, plans/, setup/, troubleshooting/. Each folder owned its domain. I also created machine-specific configurations under config/contabo-sm-139/ with complete Traefik, systemd, Mailu, and GitLab settings, then updated upload-single-machine.sh to handle deploying these configurations to new servers.
Here’s the Thing About Traefik
Traefik markets itself as the “edge router for microservices”: lightweight, modern, cloud-native. What the marketing doesn’t mention is how opinionated it is about timing. A single misconfigured timeout cascades through your entire infrastructure. It’s not complexity; it’s precision. Get it right, and everything sings. Get it wrong, and users call you wondering why their downloads time out.
The Payoff
By the end of the evening, the infrastructure had evolved from a single point of failure into a scalable multi-machine setup. New servers could be provisioned with minimal manual intervention. The tunnel management UI gave users visibility and control. Documentation became navigable. Sure, Traefik had taught me a harsh lesson about timeouts, but the system was now robust enough to actually scale.
The next phase? Enhanced monitoring, SSO integration, and better observability for network connections. But first—coffee.
😄 Dev: “I understand Traefik.” Interviewer: “At what level?” Dev: “StackOverflow tabs open at 3 AM on a Friday level.”
Metadata
- Session ID: grouped_C--projects-bot-social-publisher_20260208_2248
- Branch: main
- Dev Joke: What does the PM say when everything is broken? “Let’s discuss it at the next standup.”