Can You Outscore AI? The Real IQ Scores of ChatGPT, Gemini, and Claude in 2026
Source: Mastodon
A new benchmark released this week quantifies the “IQ” of the three leading conversational models—OpenAI’s ChatGPT‑4.5, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3.5—by subjecting each to a suite of standardized intelligence tests that include verbal reasoning, quantitative puzzles, and pattern‑recognition items. The results, compiled by the independent analytics firm AI‑Metrics, show average scores of 138 for ChatGPT, 141 for Gemini, and 136 for Claude, each edging higher than the figures reported in the last quarterly round‑up.
The rise reflects the rapid cadence of model upgrades announced at the recent PyTorch Conference Europe and ICLR 2026, where developers highlighted larger context windows, more efficient transformer kernels, and expanded training corpora. By integrating semantic caching—an approach we covered in our April 3 “Top LLM Gateways” piece—these systems can retrieve and synthesize information with fewer inference steps, translating into better performance on abstract reasoning tasks. The incremental gains also underscore a broader trend: as compute allocations shift, exemplified by OpenAI’s recent resource reallocation (see our April 3 OpenAI report), firms are squeezing more capability out of existing hardware rather than relying solely on raw scaling.
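To make the semantic-caching idea concrete, here is a minimal sketch of how a gateway might reuse a prior response when a new prompt is semantically close to one it has already answered. This is an illustrative toy, not AI-Metrics' methodology or any vendor's implementation: the character-frequency `embed()` stand-in and the 0.9 similarity threshold are assumptions made for the example; a production system would call a real embedding model and use a vector index.

```python
# Minimal semantic-cache sketch (illustrative assumptions: toy embed(),
# 0.9 cosine threshold). Not any specific gateway's implementation.
import math


def embed(text: str) -> list[float]:
    # Stand-in embedding: a normalized character-frequency vector.
    # A real system would call an embedding model here instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []  # (embedding, response)

    def get(self, query: str) -> str | None:
        # Return the cached response most similar to the query,
        # but only if it clears the similarity threshold.
        q = embed(query)
        best, best_sim = None, self.threshold
        for vec, response in self.entries:
            sim = cosine(q, vec)
            if sim >= best_sim:
                best, best_sim = response, sim
        return best

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


cache = SemanticCache()
cache.put("What is the capital of France?", "Paris.")
# A near-duplicate phrasing hits the cache and returns "Paris."
print(cache.get("what's the capital of france"))
```

On a hit, the gateway returns the stored answer and skips a full model invocation, which is the mechanism behind the "fewer inference steps" claim above.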
The scores matter for two reasons. First, higher IQ-style metrics correlate with stronger problem-solving and code-generation abilities, narrowing the gap between AI systems and human experts in fields such as data analysis and scientific research. Second, the approaching theoretical ceiling of standardized tests raises questions about the limits of current evaluation methods and the risk of overestimating genuine understanding versus pattern memorization.
Looking ahead, the next quarter will reveal whether the upcoming Gemini 2.0 and Claude 4 releases can breach the 150‑point threshold that AI‑Metrics predicts as the practical ceiling for current test formats. Observers will also watch how OpenAI’s next‑generation model, hinted at in its compute‑ceiling briefing, performs under the same battery, and whether new multi‑modal assessments emerge to capture capabilities beyond traditional IQ paradigms.