I Benchmarked 4 LLMs With Real Token Costs — The Most Expensive One Scored the Lowest
agents benchmarks claude gemini gpt-4
Source: Dev.to
A developer‑run benchmark released this week compared four leading large language models (OpenAI’s GPT‑4.1, Anthropic’s Claude, Google’s Gemini and Meta’s Llama‑2) using the actual cost of the tokens each model consumed while executing a suite of AI‑agent tasks. The test measured success rates on planning, tool use and problem solving, then divided those scores by the actual dollars spent, computed from each model’s per‑1,000‑token rate. The result was stark: GPT‑4.1, the model with the highest per‑token price, delivered the lowest cost‑adjusted performance, while the cheaper Gemini and Claude variants beat it on a per‑dollar basis.
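The cost‑adjusted metric described above (success rate divided by actual token spend) can be sketched in a few lines. Note that the model names, prices and scores below are placeholder values for illustration, not the benchmark's actual figures:

```python
# Illustrative sketch of cost-adjusted scoring: benchmark score divided
# by the dollars actually spent on tokens. All numbers are hypothetical
# placeholders, NOT the article's real benchmark data.

def cost_adjusted_score(success_rate: float,
                        tokens_used: int,
                        price_per_1k_tokens: float) -> float:
    """Return score per dollar: success rate / actual token spend."""
    dollars_spent = (tokens_used / 1000) * price_per_1k_tokens
    return success_rate / dollars_spent

# Hypothetical runs: (success rate, tokens consumed, $ per 1K tokens)
runs = {
    "pricey_model": (0.90, 120_000, 0.03),   # higher raw score, costly tokens
    "cheap_model":  (0.82, 150_000, 0.005),  # lower raw score, cheap tokens
}

for name, (rate, tokens, price) in runs.items():
    print(f"{name}: {cost_adjusted_score(rate, tokens, price):.2f} score/$")
```

With these placeholder numbers, the cheaper model wins on a per‑dollar basis despite a lower raw success rate, which is the pattern the benchmark reports.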
The experiment matters because enterprises are moving from experimental pilots to production‑scale AI agents, and token bills are becoming a decisive factor in model selection. As we reported on 6 April, Qwen‑3.6‑Plus recently broke the 1‑trillion‑token‑per‑day barrier, underscoring how quickly token volumes can balloon. When real‑world workloads are priced, the cheapest model is not automatically the worst; efficiency gains can offset raw capability gaps. The benchmark also highlights a growing transparency gap: providers disclose pricing but rarely publish per‑token performance data, leaving customers to infer cost‑effectiveness through ad‑hoc tests like this one.
Looking ahead, three developments could reshape the calculus. First, OpenAI and other vendors have hinted at tiered pricing and “pay‑as‑you‑go” discounts that may narrow the gap. Second, the industry’s push toward open‑source, high‑throughput models (exemplified by the token‑processing feats of Qwen‑3.6‑Plus) could deliver cheaper alternatives without sacrificing capability. Third, advances in model‑specific prompting and tool integration, such as the real‑time AI pipelines demonstrated on Apple’s M3 Pro, may boost the effective output of lower‑priced models. Stakeholders should monitor pricing announcements, emerging open‑source releases, and tooling improvements to ensure they are not overpaying for marginal gains.