Heaps do lie: Debugging a memory leak in vLLM
Source: Lobsters
Mistral AI announced on 21 January 2026 that it had traced a stubborn memory leak in the popular vLLM inference engine to allocations that fall outside the traditional heap. The discovery came after the company's engineers observed that Heaptrack, a standard Linux heap profiler, showed no abnormal growth even as resident memory on production servers kept climbing. By switching to system‑wide tracing utilities that monitor kernel‑level allocations, the team identified a leak in the library's PagedAttention module, where CUDA buffers were being orphaned after each batch of requests.
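The mismatch the engineers saw, resident memory climbing while the heap profiler reports nothing, can be reproduced without any GPU. As a minimal illustrative sketch (not Mistral's tooling): Python's `tracemalloc` only traces allocations made through the interpreter's allocator, so memory mapped directly from the kernel, analogous to CUDA buffers, never appears in its totals even though it counts toward the process's resident set.

```python
import mmap
import tracemalloc

tracemalloc.start()

# Allocate 64 MiB directly from the kernel via an anonymous mmap.
# Like CUDA buffers, this bypasses the allocator heap profilers watch.
SIZE = 64 * 1024 * 1024
buf = mmap.mmap(-1, SIZE)

# Touch the pages in small chunks so they become resident without
# creating a large Python object that tracemalloc *would* see.
page = b"\xff" * 4096
for off in range(0, SIZE, 4096):
    buf[off:off + 4096] = page

heap_bytes, _ = tracemalloc.get_traced_memory()
mapped_bytes = len(buf)

# The process now holds ~64 MiB of resident memory, but the heap
# profiler attributes only a few kilobytes of it.
print(f"mapped: {mapped_bytes / 2**20:.0f} MiB, "
      f"heap profiler sees: {heap_bytes / 2**20:.3f} MiB")

buf.close()
tracemalloc.stop()
```

The same effect is why a system-wide view (kernel-level tracing, `/proc` resident-set metrics) was needed to localize the vLLM leak: tools that only instrument `malloc` are blind to it.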
The fix required more than a simple deallocation call; Mistral rewrote the buffer‑recycling logic to ensure that both GPU and host‑side memory are reclaimed when a request completes. The patch, now merged into vLLM’s main branch, also adds a new diagnostic hook that logs non‑heap allocations, giving operators a clearer view of resident memory consumption.
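The article does not include the patch itself, but the pattern it describes, returning both device and host buffers to a pool whenever a request finishes, even on an error path, can be sketched with a hypothetical pool whose names are illustrative rather than vLLM's actual API:

```python
from contextlib import contextmanager


class BufferPool:
    """Hypothetical recycling pool (illustrative, not vLLM's API).

    Buffers are handed out through a context manager so they return
    to the free list even if request handling raises.
    """

    def __init__(self, num_buffers: int, size: int):
        self._free = [bytearray(size) for _ in range(num_buffers)]

    @contextmanager
    def acquire(self):
        buf = self._free.pop()  # raises IndexError when exhausted
        try:
            yield buf
        finally:
            # Reclaim unconditionally: a leak of the kind described
            # arises when this step is skipped on some code path.
            self._free.append(buf)


pool = BufferPool(num_buffers=2, size=4096)

with pool.acquire() as buf:
    buf[:5] = b"hello"          # normal request path

try:
    with pool.acquire() as buf:  # failing request path
        raise RuntimeError("simulated request failure")
except RuntimeError:
    pass

# Either way, both buffers are back in the pool afterwards.
print(len(pool._free))
```

For GPU memory the reclamation step would free or re-pool the device allocation rather than append a `bytearray`, but the structural point is the same: tie the buffer's lifetime to request completion, not to a cleanup call that an error path can skip.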
Why it matters: the lesson goes beyond a single codebase. vLLM powers many commercial and research deployments that rely on high‑throughput, low‑latency serving of large language models. Undetected leaks can inflate cloud bills, trigger out‑of‑memory crashes, and erode trust in open‑source serving stacks. The episode also highlights a blind spot in common performance tooling: "heaps do lie" when GPU‑driven workloads allocate outside the process heap, a nuance that many teams have overlooked.
What to watch next: two things. First, the vLLM community is expected to roll out the updated release across major cloud providers, and Mistral plans to publish a detailed post‑mortem with recommendations for broader monitoring practices. Second, other inference frameworks such as TensorRT‑LLM and DeepSpeed may audit their own memory paths, potentially spurring a wave of new diagnostics that go beyond heap‑centric views. The episode serves as a reminder that as LLM serving scales, observability must evolve in step.