When Your GPU Runs Out of Memory: Lessons from Voice Agent Model Loading

I was debugging why our Voice Agent project kept failing to load the UI-TARS model, and the logs were telling a frustratingly incomplete story. The vLLM container would start, respond to health checks, but then mysteriously stop mid-initialization. Classic infrastructure debugging scenario.
The culprit? An RTX 4090 Laptop GPU with 16GB of VRAM, only 5.4GB of it actually free. UI-TARS 7B in float16 precision needs roughly 14GB just to load, and even with aggressive gpu_memory_utilization=0.9 tuning, the math didn’t work. The container logs would cut off right at “Starting to load model…” — the killer detail that revealed the truth. The inference server never actually became ready; it was stuck in a memory allocation loop.
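The arithmetic is worth making explicit. A rough back-of-the-envelope sketch (the overhead allowance and helper names are my own assumptions, not vLLM internals):

```python
def fp16_weight_gb(num_params_b: float) -> float:
    """Rough weight footprint: 2 bytes per parameter in float16."""
    return num_params_b * 1e9 * 2 / 1024**3

def fits(num_params_b: float, free_vram_gb: float,
         gpu_memory_utilization: float = 0.9, overhead_gb: float = 1.0) -> bool:
    """Weights plus a rough allowance for KV cache and activations must fit
    inside the fraction of free VRAM the engine is allowed to claim."""
    budget = free_vram_gb * gpu_memory_utilization
    return fp16_weight_gb(num_params_b) + overhead_gb <= budget

print(fp16_weight_gb(7))  # ~13.0 GB for the weights alone
print(fits(7, 5.4))       # False: nowhere close with 5.4GB free
print(fits(2, 5.4))       # True: the 2B variant squeaks in
```

No amount of `gpu_memory_utilization` tuning changes the left side of that inequality; it only shrinks the budget you multiply against.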
What made this tricky was that the health check endpoint /health returns a 200 response before the model finishes loading. So the orchestration layer thought everything was fine while the actual inference path was completely broken. I had to dig into the full vLLM startup sequence to realize the distinction: endpoint availability ≠ model readiness.
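One way to make “model readiness” concrete is to poll an endpoint that only reports the model once it is actually loaded, such as the OpenAI-compatible `/v1/models` listing, instead of `/health`. A minimal sketch, assuming vLLM’s OpenAI-compatible server; the polling helper and its defaults are my own, not part of vLLM:

```python
import json
import time
import urllib.request

def model_listed(models_json: dict, model_id: str) -> bool:
    """True once the target model appears in the /v1/models listing --
    a stronger signal than /health, which can answer 200 before weights load."""
    return any(m.get("id") == model_id for m in models_json.get("data", []))

def wait_until_ready(base_url: str, model_id: str, timeout_s: float = 600.0) -> bool:
    """Poll /v1/models until the model is listed or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if model_listed(json.load(resp), model_id):
                    return True
        except OSError:
            pass  # server not accepting connections yet; keep waiting
        time.sleep(5)
    return False
```

The timeout matters: a 7B model can legitimately take minutes to load, so the orchestrator has to distinguish “still loading” from “stuck.”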
The fix involved three decisions:
First, switch to a smaller model. Instead of UI-TARS 7B-SFT, we’d use the 2B-SFT variant — still capable enough for our use case but fitting comfortably in the available VRAM. Sometimes the right solution isn’t heroic; it’s just choosing a different tool.
Second, be explicit about what “ready” means. I replaced the naive /health check with a readiness probe and proper timeout windows, ensuring the orchestrator waits for genuine model loading completion, not just socket availability.
Third, make memory constraints visible. I added gpu_memory_utilization configuration as a first-class parameter in our docker-compose setup, with clear comments explaining the tradeoff: higher utilization = better throughput but increased OOM risk on resource-constrained hardware.
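The compose change looked roughly like this — a sketch, with the image tag and model id as illustrative placeholders (the `--gpu-memory-utilization` flag itself is real vLLM):

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest   # tag illustrative
    command:
      - --model
      - ui-tars-2b-sft               # placeholder model id
      # Higher utilization = more room for KV cache and better throughput,
      # but a higher OOM risk on memory-constrained or shared GPUs.
      - --gpu-memory-utilization
      - "0.85"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

Keeping the tradeoff comment next to the number means the next person tuning it sees the risk before they bump it.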
The broader lesson here is that GPU memory is a hard constraint, not a soft one. You can’t incrementally load a model; either it fits or it doesn’t. Unlike CPU memory with paging, exceeding VRAM capacity doesn’t degrade gracefully — it just stops.
This is why many production systems now include memory profiling in their CI/CD pipelines, catching model-to-hardware mismatches before they hit real infrastructure.
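A lightweight version of such a gate, assuming you track each deployment target’s measured free VRAM in config (the target names, numbers, and helper are hypothetical):

```python
# Hypothetical CI check: fail fast when a model/hardware pairing cannot work.
DEPLOY_TARGETS_GB = {"rtx4090-laptop": 5.4}  # measured free VRAM per target

def check_deployment(model_params_b: float, target: str,
                     bytes_per_param: int = 2, overhead_gb: float = 1.0,
                     utilization: float = 0.9) -> None:
    """Raise (non-zero exit in CI) if the model cannot fit the target's budget."""
    budget = DEPLOY_TARGETS_GB[target] * utilization
    need = model_params_b * 1e9 * bytes_per_param / 1024**3 + overhead_gb
    if need > budget:
        raise SystemExit(
            f"{model_params_b}B model needs ~{need:.1f} GB "
            f"but {target} only has {budget:.1f} GB usable")

check_deployment(2.0, "rtx4090-laptop")  # passes; 7.0 here would fail the build
```

Five lines of arithmetic in CI would have saved an afternoon of container-log archaeology.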
There are only 10 kinds of people in this world: those who know binary and those who don’t. 😄
Metadata
- Session ID: grouped_C--projects-ai-agents-voice-agent_20260225_1417
- Branch: main
- Dev Joke: Why is Ubuntu a developer's best friend? Because without it, nothing works. With it, nothing works either, but at least you have someone to blame.