DeepSeek V4 Reduces KV Cache by 90% at 1 Million Tokens, Despite Potential Risks from Aggressive Compression
Source: Mastodon
DeepSeek V4 reduces KV cache by 90% at 1M tokens. Aggressive compression may increase failure risks.
DeepSeek V4 cuts its key-value (KV) cache by 90% at 1 million tokens, a drastic reduction from its predecessor, DeepSeek V3.2. This matters because the KV cache is a major driver of memory use at long context lengths, so shrinking it makes long-context inference cheaper to serve. As we reported on April 26, DeepSeek V4 was unveiled with new, low-cost AI models, and this latest update further extends those efficiency gains.
The aggressive compression used to achieve this reduction may, however, increase the risk of "needle in a haystack" failures, where the model struggles to retrieve a specific piece of information buried in a long context. Even so, the efficiency gains are substantial: DeepSeek V4 reportedly requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek V3.2. These savings matter most at inference time, where per-token memory and compute determine how cheaply million-token contexts can be served.
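To give a rough sense of scale, the sketch below estimates KV cache size at a 1-million-token context for a hypothetical dense transformer and applies the reported 90% reduction. The layer count, KV head count, head dimension, and FP16 storage are illustrative assumptions for a generic GQA-style model, not DeepSeek V4's actual architecture or compression scheme.

```python
# Back-of-the-envelope KV cache sizing (illustrative assumptions only,
# not DeepSeek V4's real configuration or compression method).

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence.

    Standard attention stores one K and one V vector per token,
    per layer, per KV head, at `bytes_per_elem` (2 for FP16/BF16).
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to show the order of magnitude.
    baseline = kv_cache_bytes(num_layers=60, num_kv_heads=8, head_dim=128,
                              seq_len=1_000_000)
    compressed = int(baseline * 0.10)  # the reported 90% reduction

    gib = 1024 ** 3
    print(f"Uncompressed KV cache at 1M tokens: {baseline / gib:,.1f} GiB")
    print(f"After a 90% reduction:              {compressed / gib:,.1f} GiB")
```

Even under these toy assumptions, a tenfold smaller cache is the difference between a cache that spills across several accelerators and one that fits comfortably on far fewer, which is where the cost savings for serving long contexts come from.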
As the AI landscape continues to evolve, it will be essential to monitor how DeepSeek V4's compression approach affects its performance in real-world applications. With the launch of DeepSeek V4, the company has set a new standard for efficient million-token context inference, and its open-source approach under Apache 2.0 licensing is likely to attract significant attention from developers and researchers. As the industry adapts to these advancements, we can expect to see further innovations and improvements in AI model efficiency.