DeepSeek tests “sparse attention” to slash AI processing costs
Source: Mastodon
DeepSeek announced that it is field-testing a new "fine-grained sparse attention" mechanism that, according to the company, halves the cost of its public API for long-form inputs. The technique, a long-standing research idea that trims the number of token-to-token interactions computed during inference, has been re-engineered by DeepSeek to select tokens dynamically and at a much finer granularity than in earlier sparse-transformer models. Early benchmarks shared on Hugging Face show a 60-75% reduction in compute time for sequences longer than 2,000 tokens, and the firm has already cut pricing for the affected endpoint by roughly 50%.
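DeepSeek has not published implementation details for the trial, but the general idea behind this family of techniques can be sketched. In a top-k variant of sparse attention, each query attends only to its k highest-scoring keys rather than to all n tokens. The snippet below is a minimal illustrative sketch in NumPy, not DeepSeek's mechanism or kernel; the top-k selection rule, the function name and the parameter values are assumptions made for illustration.

```python
import numpy as np

def sparse_attention(Q, K, V, top_k=64):
    """Minimal top-k sparse attention sketch (illustrative, not DeepSeek's kernel).

    Each query attends only to its top_k highest-scoring keys, so the
    softmax and value mixing involve top_k terms per query instead of n.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n_q, n_k) raw attention logits
    n_k = scores.shape[-1]
    k = min(top_k, n_k)
    # Find each query's k-th largest logit and mask everything below it.
    kth = np.partition(scores, n_k - k, axis=-1)[:, n_k - k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    # Softmax over the surviving entries only (exp(-inf) contributes 0).
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: 2,048 tokens, 64-dim heads, each query keeps 64 keys.
rng = np.random.default_rng(0)
n, d = 2048, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = sparse_attention(Q, K, V, top_k=64)
print(out.shape)  # (2048, 64)
```

Note that this toy version still scores every query-key pair before discarding most of them, so it only makes the selection rule explicit; the actual compute savings come from specialised kernels that never evaluate the pruned pairs.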
The move matters because inference cost remains the biggest barrier to widespread deployment of large language models. Google's recent KV-cache compression and TurboQuant algorithms cut memory and compute costs dramatically, but they still rely on dense attention over the full context. DeepSeek's approach promises comparable savings without sacrificing the quality of long-range dependencies, potentially democratising access to high-capacity models for startups, researchers and enterprises that previously could not afford the per-token fees.
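To see where the savings come from, consider just the query-key scores in a single attention layer: dense attention evaluates n² pairs, while a scheme that keeps k keys per query evaluates roughly n·k once selection is done. A back-of-the-envelope comparison, with illustrative figures rather than DeepSeek's measured numbers:

```python
# Back-of-the-envelope count of query-key pairs scored per attention layer.
# Illustrative only; real savings depend on kernels and selection overhead.
for n in (2_048, 8_192, 32_768):
    k = 64                      # keys kept per query in a top-k scheme (assumed)
    dense = n * n               # every query scores every key
    sparse = n * k              # every query scores only k selected keys
    print(f"n={n:>6}: dense={dense:>13,}  sparse={sparse:>11,}  "
          f"ratio={dense / sparse:,.0f}x")
```

The raw pair-count ratio grows with sequence length, but end-to-end gains are much smaller because token selection, feed-forward layers and memory traffic still cost the same, which is consistent with the 60-75% end-to-end reductions reported above being far below the raw ratio.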
As we reported on 25 March, DeepSeek hired 17 specialists to integrate its DeerFlow 2.0 framework, signalling a broader push to optimise both training and serving pipelines. The sparse‑attention trial is the latest step in that strategy.
What to watch next: DeepSeek plans to release a production-ready version of the model by Q3, accompanied by a peer-reviewed paper detailing the algorithmic innovations. Industry observers will be watching for independent benchmark results, how cloud providers price the new endpoint, and whether rivals such as OpenAI or Anthropic accelerate their own sparsity research in response. The outcome could reshape the economics of AI services across the Nordic tech ecosystem and beyond.