Google DeepMind just scored 85% on ARC-AGI-2 — the most difficult general reasoning benchmark in AI.
benchmarks deepmind gemini google reasoning
Source: Mastodon
Google DeepMind’s Gemini 3 model has cracked the ARC‑AGI‑2 benchmark with 85 percent accuracy, shattering the previous high‑water mark of 54 percent set by competing systems. The result, announced after the “Deep Think” upgrade rolled out on 12 February 2026, marks the first time an AI has comfortably outperformed the average human score of roughly 60 percent on this test of fluid, abstract reasoning.
ARC‑AGI‑2, created by the ARC Prize Foundation, is deliberately engineered to foil simple pattern‑matching tricks; it demands that models extrapolate from sparse examples, compose multi‑step chains of thought and generalise across domains. Its predecessor, ARC‑AGI‑1, served as a stepping stone, and a successor, ARC‑AGI‑3, is on the way, but ARC‑AGI‑2 has long been regarded as the hardest of the series. Gemini 3’s leap suggests that scaling, combined with sophisticated chain‑of‑thought prompting, can now bridge gaps that previously required human‑level insight.
The breakthrough matters for several reasons. First, it narrows the performance gap between today’s narrow AI and the broader, flexible reasoning once thought exclusive to humans, nudging the field closer to the long‑standing AGI ambition. Second, the result validates DeepMind’s strategy of iterative model upgrades, reinforcing its lead in the competitive race that includes OpenAI, Anthropic and emerging European labs. Third, the achievement raises fresh safety questions: as models become adept at open‑ended problem solving, the risk of unintended behaviours and misuse escalates, echoing DeepMind’s own recent research on AI’s potential negative societal impacts.
What to watch next: DeepMind is already previewing Gemini 3.1 Pro, which early tests credit with a 77 percent score on ARC‑AGI‑2 and near‑perfect results on ARC‑AGI‑1, hinting that ceilings may rise further still. The AI community will be monitoring upcoming benchmark releases, especially ARC‑AGI‑3, and regulatory bodies are likely to intensify scrutiny of models that demonstrate human‑level reasoning capabilities. The coming months could determine whether this performance jump translates into practical, responsibly deployed technology or fuels a new wave of competitive escalation.