Even GPT-5.2 Can't Count to Five: Zero-Error Horizons in Trustworthy LLMs
Source: Hacker News
A new study posted to arXiv this week shows that even the latest OpenAI model, GPT‑5.2, stumbles on elementary arithmetic, prompting the researchers to propose a new benchmark called the Zero‑Error Horizon (ZEH). The authors measured how far a language model can go on a suite of synthetic tasks (counting, parity checks, and simple logical puzzles) without making a single mistake. GPT‑5.2’s ZEH collapsed at the five‑bit parity test: the model produced an error as soon as it was asked whether a five‑digit binary string contained an even number of ones. The paper argues that traditional accuracy scores mask such worst‑case failures, because a model that gets 99.9% of examples right can still be unreliable in safety‑critical applications.
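The evaluation described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual harness: it assumes the model is exposed as a string‑to‑string function, and the `flaky_model` stub is hypothetical, wired to mimic the reported collapse at five bits.

```python
from itertools import product

def parity_prompts(n):
    """Yield (binary_string, correct_answer) pairs for every n-bit string.
    The answer is "even" when the string contains an even number of ones."""
    for bits in product("01", repeat=n):
        s = "".join(bits)
        yield s, "even" if s.count("1") % 2 == 0 else "odd"

def zero_error_horizon(model, max_bits=8):
    """Largest n such that the model answers every parity query of
    length <= n correctly; returns 0 if it errs already at one bit."""
    horizon = 0
    for n in range(1, max_bits + 1):
        if all(model(s) == answer for s, answer in parity_prompts(n)):
            horizon = n
        else:
            break  # first length with any error ends the horizon
    return horizon

# Hypothetical stand-in for an LLM call: exact up to 4 bits, then it
# always guesses "even", so its horizon collapses at the 5-bit test.
def flaky_model(bits):
    if len(bits) <= 4:
        return "even" if bits.count("1") % 2 == 0 else "odd"
    return "even"

print(zero_error_horizon(flaky_model))  # prints 4
```

The key design point of the metric is the exhaustive `all(...)` check: one wrong answer at any length caps the horizon, which is exactly why a 99.9%-accurate model can still score a ZEH of zero.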
The finding matters for two reasons. First, it exposes a blind spot in current evaluation practices, which prioritize average performance over guaranteed correctness, a gap that becomes critical as LLMs are embedded in medical triage, financial advice, and autonomous systems. Second, the ZEH framework offers a concrete, task‑agnostic yardstick for “trustworthiness” that developers can integrate into model‑testing pipelines, complementing existing robustness checks such as adversarial prompting and calibration metrics.
The research builds on earlier work from the Data Processing Club and the Quantile‑RK error‑analysis literature, extending the conversation about numerical reasoning limits first highlighted by Zhang et al. (2024). Looking ahead, the community will watch for adoption of ZEH in model cards and regulatory audits, and for follow‑up studies that push the horizon beyond five‑bit tasks to real‑world datasets. If the metric gains traction, it could reshape how providers certify LLMs for high‑stakes deployments, nudging the industry toward models that are not just powerful, but provably error‑free within defined bounds.