Google's TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x
Source: Mastodon
Google Research unveiled TurboQuant, a training‑free compression algorithm that slashes the memory footprint of large language models (LLMs) by up to six times. The technique quantises the key‑value (KV) cache – the working memory that stores intermediate activations during inference – to just three bits per entry while preserving the model's original accuracy. The extreme reduction is achieved without retraining: a two‑step process first applies PolarQuant to the cache's floating‑point values, then refines them with a learned residual mapping.
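The details of PolarQuant and the residual mapping aren't spelled out here, but the basic idea of squeezing floating‑point cache entries into three bits can be illustrated with plain uniform quantization. The sketch below is a generic, hypothetical example – not Google's actual scheme – mapping float values to eight integer levels (2³ = 8) and back:

```python
import numpy as np

def quantize_3bit(x):
    """Uniform 3-bit quantization: map floats onto 8 evenly spaced levels."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 7  # 7 intervals between the 8 representable levels
    codes = np.round((x - lo) / scale).astype(np.uint8)  # integer codes 0..7
    return codes, lo, scale

def dequantize_3bit(codes, lo, scale):
    """Reconstruct approximate floats from the 3-bit codes."""
    return codes.astype(np.float32) * scale + lo

# Stand-in for a slice of KV-cache activations.
rng = np.random.default_rng(0)
kv = rng.standard_normal(16).astype(np.float32)

codes, lo, scale = quantize_3bit(kv)
recon = dequantize_3bit(codes, lo, scale)
# Rounding error is bounded by half a quantization step (scale / 2).
# Storing 3 bits instead of 16 already cuts memory ~5x before any
# further tricks like the residual refinement the article describes.
```

In practice the hard part – and presumably what TurboQuant's two‑step design addresses – is keeping that rounding error from degrading model accuracy at such low bit widths, which naive uniform quantization alone does not guarantee.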
The breakthrough matters because KV‑cache memory has become the dominant bottleneck for serving LLMs at scale. By cutting that demand, TurboQuant can lower cloud‑infrastructure costs, reduce latency, and shrink the energy budget of inference workloads. The compression also opens a path for on‑device deployment of more capable models, a trend highlighted earlier this month when Apple demonstrated how Google’s Gemini can be distilled into smaller on‑device variants. For hardware vendors, the shift could accelerate demand for specialised accelerators that handle ultra‑low‑bit arithmetic, while cloud providers may see a competitive edge in offering cheaper, faster LLM APIs.
What to watch next: Google plans to integrate TurboQuant into its Vertex AI platform later this year, and early benchmark results are expected at the upcoming ICLR conference. Third‑party frameworks such as Hugging Face and PyTorch are already exploring support for the three‑bit format, which could speed broader adoption. Industry analysts will be watching whether the algorithm's accuracy‑preservation claim holds across diverse model families and real‑world workloads, and whether rivals release comparable compression schemes. If TurboQuant lives up to its promise, the economics of generative AI could shift dramatically, making powerful language models accessible to a wider range of applications and developers.