Why Token Counting in Multi-LLM Systems Is Harder Than You Think
Source: Dev.to
A team of engineers building an adaptive context‑window manager for multi‑LLM applications has uncovered a hidden complexity: counting tokens accurately across different models is far from trivial. The problem emerged when the component tried to trim prompts on the fly to stay within each provider’s context limits while preserving the semantic core of a conversation. The engineers discovered that token counts diverge not only because Claude, Gemini, GPT‑5 and Llama use distinct tokenizers, but also because the data format itself inflates token usage. Repeated JSON keys, nested objects and whitespace can add dozens of tokens per request, a cost that compounds at scale.
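To see how the data format alone inflates counts, one can serialize the same records three ways and compare. The counter below is a crude stand-in (splitting on words, punctuation and whitespace runs), not any provider's real tokenizer, so the absolute numbers are illustrative; only the relative inflation matters:

```python
import json
import re

def rough_token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer (tiktoken, SentencePiece, etc.):
    # counts word runs, punctuation marks, and whitespace runs separately.
    # Per-model counts will differ; this only illustrates relative inflation.
    return len(re.findall(r"\s+|\w+|[^\w\s]", text))

records = [{"user_id": i, "name": f"user{i}", "score": i * 10} for i in range(50)]

pretty = json.dumps(records, indent=2)                 # whitespace-heavy
compact = json.dumps(records, separators=(",", ":"))   # no extra whitespace
# Tabular form: keys stated once in a header, rows hold only values
tabular = "user_id,name,score\n" + "\n".join(
    f"{r['user_id']},{r['name']},{r['score']}" for r in records
)

for label, payload in [("pretty", pretty), ("compact", compact), ("tabular", tabular)]:
    print(label, rough_token_count(payload))
```

Even under this rough heuristic, the pretty-printed JSON costs more than the compact form, and both cost more than the tabular layout, because the JSON keys and structural punctuation are repeated for every record.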
The issue matters because token‑based pricing is now the primary expense driver for production‑grade AI services. Mis‑estimating token counts leads to unexpected bills, added latency from throttling and, in the worst cases, request failures when a model’s context window is exceeded. Observability tools for LLM pipelines still struggle to surface these hidden overheads, focusing on CPU, GPU and queue metrics rather than the “soft” token budget. Open‑source utilities such as token‑counter and Cognio’s free calculator have begun to address the problem, but they still rely on per‑model tokenizers and cannot account for format‑induced inflation.
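A defensive trimming loop along these lines is one way to stay inside a provider's window. The `estimate_tokens` callback and the safety margin below are assumptions for illustration, not any provider's API; in practice the estimator would wrap each model's own tokenizer:

```python
from collections import deque

def fit_to_window(messages, estimate_tokens, limit, margin=0.1):
    """Drop the oldest messages until the estimated total fits the window.

    `estimate_tokens` is a per-message estimator (hypothetical; in practice
    a per-model tokenizer). The safety `margin` hedges against estimator
    error, since undercounting risks a hard request failure at the provider.
    """
    budget = int(limit * (1 - margin))
    kept = deque()
    total = 0
    # Walk newest-first so the most recent turns survive trimming.
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if total + cost > budget:
            break
        kept.appendleft(msg)
        total += cost
    return list(kept), total

# Usage with a rough ~4-characters-per-token heuristic (an assumption):
msgs = ["a" * 40, "b" * 40, "c" * 40]
kept, total = fit_to_window(msgs, lambda m: len(m) // 4, limit=25, margin=0.2)
print(kept, total)
```

Trimming whole messages newest-first is only one policy; a production component would also weigh semantic importance, as the article's engineers set out to do.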
The discovery is prompting a wave of experimentation with more compact payload formats. A recent whitepaper on “TOON vs JSON in High‑Scale LLM Systems” shows that schema‑first, binary‑compatible representations can shave up to 30 % of token overhead compared with conventional JSON, while also simplifying parsing for LLMs. Industry watchers will be looking for standardised token‑counting libraries that abstract away tokenizer quirks, and for broader adoption of TOON‑style formats in SDKs and cloud APIs. If these solutions mature, they could tighten cost predictability, improve latency and make multi‑model orchestration a more reliable building block for the next generation of AI products.
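The core idea behind TOON-style tabular encodings can be sketched briefly: declare the keys once in a header instead of repeating them in every object. The function below is a simplified illustration of that idea only, not a spec-conformant TOON encoder (the real format defines quoting, nesting and delimiter rules omitted here):

```python
def to_tabular(name, rows):
    """Render a uniform list of dicts in a TOON-style tabular form.

    Keys are declared once in a header line; each row then carries only
    values. This de-duplication of keys is where the token savings over
    conventional JSON come from.
    """
    keys = list(rows[0])
    header = f"{name}[{len(rows)}]{{{','.join(keys)}}}:"
    lines = ["  " + ",".join(str(r[k]) for k in keys) for r in rows]
    return "\n".join([header] + lines)

print(to_tabular("users", [{"id": 1, "n": "a"}, {"id": 2, "n": "b"}]))
```

The savings grow with the number of rows, since the per-object key and brace overhead of JSON is paid once rather than per record.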