LLM 'benchmark' as a 1v1 RTS game where models write code controlling the units
benchmarks open-source
Source: Lobsters
A new open‑source benchmark called **LLM Skirmish** pits large language models against each other in a 1‑vs‑1 real‑time strategy (RTS) duel in which each model generates the JavaScript that drives its nine units. The test builds on the Screeps API, a sandbox where player‑written code runs continuously inside a game world, and restricts each unit to simple `move()` and `pew()` commands. Each model first faces a human‑written baseline bot for ten rounds, then competes in a round‑robin tournament of ten games per opponent, with ASCII snapshots of the board recorded after every tick.
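The article names only the `move()` and `pew()` actions, so the sketch below is a hedged illustration of the kind of per‑tick control code a model might emit: the unit object shape, the `planTick` function, and the attack range are all assumptions, not the benchmark's actual API.

```javascript
// Hypothetical unit shape assumed here: { id, x, y, hp }.
// Chebyshev distance suits a grid where diagonal steps cost the same as straight ones.
function distance(a, b) {
  return Math.max(Math.abs(a.x - b.x), Math.abs(a.y - b.y));
}

// One decision per friendly unit per tick: fire ("pew") at the nearest
// enemy if it is in range, otherwise step one tile toward it ("move").
function planTick(myUnits, enemyUnits, range = 3) {
  return myUnits.map((unit) => {
    // Pick the closest enemy as this unit's target.
    const target = enemyUnits.reduce((best, e) =>
      distance(unit, e) < distance(unit, best) ? e : best
    );
    if (distance(unit, target) <= range) {
      return { unit: unit.id, action: "pew", target: target.id };
    }
    return {
      unit: unit.id,
      action: "move",
      dx: Math.sign(target.x - unit.x),
      dy: Math.sign(target.y - unit.y),
    };
  });
}
```

In the benchmark, logic like this would run once per tick; the model's real task is to write (and, across rounds, refine) such a policy from the ASCII board snapshots it receives.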
The benchmark is designed to surface a model’s ability to perform in‑context reasoning, adapt to dynamic feedback, and manage computational cost when generating executable code. Unlike static question‑answer tests, LLM Skirmish forces the AI to anticipate opponent moves, allocate resources, and iteratively refine its strategy under strict latency constraints. Early results show that newer instruction‑tuned models such as Claude 3.5 and GPT‑4o outperform older, larger models, echoing the performance hierarchy observed in the LLM Buyout Game Benchmark we covered on 31 March 2026.
The benchmark matters for two reasons. First, the ability to write and run code on the fly is a core use case for AI‑assisted software development, and the benchmark offers a concrete, reproducible metric for that capability. Second, the cost‑efficiency signal—how many API calls and compute cycles a model consumes to win—directly informs enterprises weighing the trade‑off between model size and operational expense, a concern highlighted by the recent Claude Code cost‑inflation bug.
Looking ahead, the community plans to expand the arena with larger maps, additional unit types, and multi‑agent cooperation scenarios. Researchers will also integrate reinforcement‑learning loops that let models learn from their own game logs, potentially blurring the line between code generation and autonomous agent training. The next release, slated for Q2 2026, promises a leaderboard that could become the de facto standard for measuring strategic, code‑writing AI.