KV Cache Compression Hits 900,000× Reduction, Beats TurboQuant and Per‑Vector Shannon Limit
vector-db
| Source: HN |
A team of researchers from the University of Copenhagen and the Nordic Institute for AI has unveiled a new key‑value (KV) cache compression technique that claims a 900,000‑fold size reduction relative to Google’s TurboQuant, surpassing the per‑vector Shannon limit that has defined the field for months. The method, described in a pre‑print released yesterday, treats the KV cache as a single sequence rather than a collection of independent vectors, applying probabilistic entropy coding combined with block‑diagonal rotations to strip away redundancy across tokens.
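The per‑vector Shannon limit only constrains coders that treat each vector independently; once tokens are coded jointly, correlation between neighbouring tokens becomes exploitable. A minimal sketch of that idea, not the authors' actual pipeline: the synthetic data, the delta transform, and zlib standing in for a probabilistic entropy coder are all our assumptions.

```python
import numpy as np
import zlib

rng = np.random.default_rng(0)
d, n = 64, 512  # head dimension and token count (illustrative sizes)

# Keys drift slowly from token to token, so consecutive vectors are
# highly correlated -- the redundancy a per-vector coder cannot see.
keys = np.cumsum(rng.normal(scale=0.05, size=(n, d)), axis=0).astype(np.float32)

def quantize(x, step=0.05):
    return np.round(x / step).astype(np.int16)

# Per-vector baseline: entropy-code each token's quantized vector on its own.
per_vector = sum(len(zlib.compress(quantize(k).tobytes())) for k in keys)

# Whole-sequence coding: delta-encode across tokens, then entropy-code once.
q = quantize(keys)
deltas = np.diff(q, axis=0, prepend=np.zeros((1, d), dtype=q.dtype))
joint = len(zlib.compress(deltas.tobytes()))

print(per_vector, joint)  # the joint payload is substantially smaller
```

The delta step is lossless relative to the quantized values (a cumulative sum recovers them exactly), yet the joint stream compresses far below what per‑vector coding achieves, because each delta carries only the small token‑to‑token change.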
The KV cache stores attention keys and values for every token generated by a transformer, and its size grows linearly with context length. Reducing this memory footprint is the most effective lever for extending context windows, cutting GPU memory, and lowering inference latency. TurboQuant, which we dissected on 4 April, pushed per‑vector compression to 3 bits with no measurable loss, but it still hit the Shannon bound for each vector in isolation. By compressing the cache as a whole, the new approach sidesteps that bound, achieving compression ratios that would make even a 32‑bit float representation appear wasteful. Early benchmarks on a 70‑billion‑parameter model show a 28 % speed‑up in decoding and a five‑fold acceleration in pre‑fill, while maintaining perplexity within 0.1 % of the uncompressed baseline.
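For a sense of scale, the linear growth described above can be put into numbers with a back‑of‑envelope calculation. The layer, head, and dimension counts below are illustrative assumptions for a 70‑billion‑parameter‑class decoder with grouped‑query attention, not the configuration of the benchmarked model.

```python
# Hypothetical 70B-class model dimensions (assumptions, not from the paper).
layers, kv_heads, head_dim = 80, 8, 128

def kv_cache_bytes(seq_len, bits_per_value=16):
    # Keys + values, for every layer, every KV head, every cached token.
    values = 2 * layers * kv_heads * head_dim * seq_len
    return values * bits_per_value / 8

ctx = 128_000
fp16_gb = kv_cache_bytes(ctx, 16) / 1e9  # uncompressed fp16 cache
q3_gb = kv_cache_bytes(ctx, 3) / 1e9     # TurboQuant-style 3 bits per value
print(f"fp16: {fp16_gb:.1f} GB, 3-bit: {q3_gb:.1f} GB")
```

Under these assumptions a 128k‑token context costs roughly 42 GB in fp16, which is why per‑value bit width, and now cross‑token coding, is the lever everyone is pulling.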
The breakthrough matters because it could make multi‑kilotoken contexts feasible on a single GPU, opening the door to richer long‑form generation, more accurate retrieval‑augmented systems, and cheaper inference for cloud providers. However, the authors acknowledge that the gains hinge on an efficient decompression pipeline; the current implementation adds a modest CPU overhead that must be amortised over long generations.
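The amortisation point is easy to quantify with a toy cost model: a one‑time decompression setup cost spread across the generation, plus a small per‑token cost. Both numbers below are made‑up placeholders, not measurements from the paper.

```python
# Toy amortisation model (illustrative figures, not benchmarks).
setup_ms, per_token_ms = 50.0, 0.02

def overhead_per_token(tokens):
    # Fixed setup cost divided across the generation, plus marginal cost.
    return setup_ms / tokens + per_token_ms

for n in (100, 1_000, 10_000):
    print(n, round(overhead_per_token(n), 3))
```

The fixed cost dominates short generations but fades toward the marginal per‑token cost as output length grows, which is why the authors frame the overhead as acceptable only for long generations.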
Watch for the full paper at the upcoming NeurIPS conference, for an open‑source release of the compression library, and for reactions from Google and OpenAI, who have both invested heavily in KV‑cache efficiency. If the technique scales to production workloads, it may redefine the economics of large‑language‑model serving across the Nordic AI ecosystem and beyond.