Ranking Local LLMs by Cost Efficiency: A Study Using GPU Energy and 8 Ollama Models
| Source: Dev.to | Original article
Researchers rank local LLMs by cost per correct answer, using GPU energy measurements.
A new approach to evaluating local Large Language Models (LLMs) ranks them by cost per correct answer. The GPU energy consumed during a benchmark run is measured and divided by the number of correct answers, which gives an efficiency figure that can be compared across models. This method rewards models that answer accurately at a lower energy cost, rather than models that simply generate more tokens.
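The metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the original study's code: the `RunResult` type, the function names, the sample figures, and the electricity price are all hypothetical, and it assumes the GPU energy for each run has already been measured in joules.

```python
from dataclasses import dataclass

JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6 million joules


@dataclass
class RunResult:
    model: str            # hypothetical model label
    energy_joules: float  # GPU energy measured over the whole benchmark run
    correct: int          # number of correct answers in that run


def joules_per_correct(run: RunResult) -> float:
    """Energy spent per correct answer; infinite if nothing was correct."""
    if run.correct == 0:
        return float("inf")
    return run.energy_joules / run.correct


def cost_per_correct(run: RunResult, price_per_kwh: float) -> float:
    """Convert energy per correct answer into money at an assumed price."""
    return joules_per_correct(run) / JOULES_PER_KWH * price_per_kwh


def rank_by_efficiency(runs: list[RunResult]) -> list[RunResult]:
    """Most efficient first: lowest energy per correct answer."""
    return sorted(runs, key=joules_per_correct)


# Illustrative numbers only, not measurements from the article.
runs = [
    RunResult("model-b", energy_joules=18_000, correct=15),
    RunResult("model-a", energy_joules=36_000, correct=40),
    RunResult("model-c", energy_joules=5_000, correct=0),
]
ranking = [r.model for r in rank_by_efficiency(runs)]
```

In this made-up example, model-a uses the most total energy but still ranks first, because it spends 900 J per correct answer against 1,200 J for model-b. A model with no correct answers sorts last regardless of how little energy it used.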
This development matters because it addresses a key concern for teams processing large volumes of data: cost. As local LLMs improve, with models like MiMo resolving 71% of real queries without needing a frontier fallback, the cost calculus changes. For teams handling millions of tokens per month, running a local model can now be more cost-effective than relying on API calls.
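To make the changed cost calculus concrete, here is a rough sketch of a local-first setup with a frontier fallback. Only the 71% resolve rate comes from the article; the per-query prices, the function names, and the assumption that every query is tried locally before any fallback are illustrative.

```python
def blended_cost_per_query(local_cost: float,
                           api_cost: float,
                           local_resolve_rate: float) -> float:
    """Average cost per query when every query runs locally first and
    only the unresolved share is sent on to a paid API."""
    fallback_share = 1.0 - local_resolve_rate
    return local_cost + fallback_share * api_cost


def local_first_is_cheaper(local_cost: float,
                           api_cost: float,
                           local_resolve_rate: float) -> bool:
    """Local-first wins when the blended cost is below the API-only cost,
    which simplifies to local_cost < local_resolve_rate * api_cost."""
    return blended_cost_per_query(local_cost, api_cost,
                                  local_resolve_rate) < api_cost


# 0.71 is the resolve rate quoted in the article; both prices are assumed.
blended = blended_cost_per_query(local_cost=0.0002,
                                 api_cost=0.002,
                                 local_resolve_rate=0.71)
```

Under these assumed prices the blended cost is about 0.00078 per query against 0.002 for API-only. The break-even condition shows why the resolve rate matters: the higher it is, the more the local model can cost per query and still come out ahead.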
As the landscape of local LLMs continues to evolve, with models like Gemma 3 being run through Ollama, tested, and ranked, users can expect more efficient and cost-effective options to emerge. The AI Leaderboard, which compares and ranks over 300 AI models, will likely play a key role in helping users navigate this landscape and make informed decisions about which models to adopt.