LLMs Do Not Grade Essays Like Humans
Source: arXiv
A new arXiv pre‑print (2603.23714v1) shows that large language models (LLMs) still fall short of human graders when scoring essays. The authors compared raw LLM scores against human marks across a multilingual test set and found systematic mismatches: short or underdeveloped responses that stay on topic are consistently overrated, while well‑crafted essays are penalised for minor language errors. The models appear to apply a literal, rubric‑free logic rather than the nuanced judgment human raters bring to the task.
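Human–machine agreement in essay scoring is commonly reported with quadratic weighted kappa (QWK), which penalises large score disagreements more than near misses. The sketch below is illustrative only: the paper's actual metric, scale, and data are not stated in this summary, so the 1–5 scale and the sample ratings are assumptions.

```python
# Quadratic weighted kappa (QWK): a standard agreement metric for
# ordinal ratings such as essay scores. Illustrative only -- the
# rating scale and sample data below are hypothetical.
from itertools import product

def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """Agreement between two raters on an ordinal scale; 1.0 = perfect."""
    n = max_rating - min_rating + 1
    # Observed confusion matrix between the two raters.
    obs = [[0] * n for _ in range(n)]
    for x, y in zip(a, b):
        obs[x - min_rating][y - min_rating] += 1
    total = len(a)
    # Marginal histograms give the chance-expected matrix.
    hist_a = [sum(row) for row in obs]
    hist_b = [sum(obs[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i, j in product(range(n), repeat=2):
        w = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement weight
        num += w * obs[i][j]
        den += w * hist_a[i] * hist_b[j] / total
    return 1.0 - num / den

# Hypothetical scores on a 1-5 scale: perfect agreement yields 1.0.
human = [3, 4, 2, 5, 3, 4, 1, 2]
llm = [4, 4, 3, 4, 3, 5, 2, 2]
print(round(quadratic_weighted_kappa(human, llm, 1, 5), 3))
```

In published automated-essay-scoring work, a QWK near or above the human–human agreement level is the usual bar for claiming parity.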
The study joins a growing body of work probing AI’s role in assessment. Earlier research on German student essays reported similar gaps between open‑source and proprietary LLMs and human raters, highlighting both the promise of multidimensional evaluation and the danger of hidden bias. A separate analysis of scoring processes underscored that, unlike human grading, which follows explicit rubrics, LLMs generate scores from opaque internal patterns that are difficult to audit.
The timing matters for two reasons. First, educational technology firms are courting schools and testing agencies with “AI‑graded” solutions, touting speed and cost savings. If the underlying models reward brevity or penalise stylistic variance, students could be unfairly advantaged or disadvantaged, eroding trust in digital assessment. Second, the findings raise regulatory questions: many jurisdictions are drafting standards for algorithmic transparency in education, and this paper provides concrete evidence that current LLMs may not meet those thresholds.
What to watch next includes efforts to fine‑tune LLMs on domain‑specific rubrics, the emergence of hybrid human‑AI grading pipelines, and policy debates at upcoming education conferences. Industry players are likely to release updated models that claim rubric alignment, while researchers will test whether those claims hold up under the same rigorous cross‑human comparison. The next few months will reveal whether AI can move from “fast but fuzzy” to a reliable partner in essay evaluation.
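One form a hybrid human–AI pipeline could take is confidence-gated routing: accept a machine score only when independent model passes agree, and escalate everything else to a human rater. This is a minimal hypothetical sketch, not a design from the paper; all names, the two-pass setup, and the tolerance threshold are assumptions.

```python
# Hypothetical hybrid grading router: the machine score is kept only
# when two independent LLM scoring passes agree within a tolerance;
# disagreements are escalated to a human rater.
from dataclasses import dataclass
from typing import Optional

@dataclass
class GradeDecision:
    score: Optional[float]  # final machine score, or None if escalated
    needs_human: bool
    reason: str

def route(score_a: float, score_b: float, tolerance: float = 0.5) -> GradeDecision:
    """Average two model passes if they agree; otherwise flag for a human."""
    if abs(score_a - score_b) <= tolerance:
        return GradeDecision((score_a + score_b) / 2, False, "passes agree")
    return GradeDecision(None, True, "pass disagreement; escalate to human")

print(route(3.0, 3.2))   # accepted machine score
print(route(2.0, 4.0))   # escalated to human
```

The tolerance would in practice be tuned against human–human disagreement rates on the same rubric, so the machine is held to at least the human consistency bar.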