AI evaluation has crossed a critical cost threshold that is reshaping who can participate in the field. The Holistic Agent Leaderboard (HAL) recently spent approximately $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,828 before caching. Exgentic's $22,000 sweep across agent configurations found a 33× cost spread on identical tasks, highlighting scaffold choice as a primary cost driver. Meanwhile, UK-AISI scaled agentic steps into the millions to study inference-time compute. In scientific ML, The Well costs about 960 H100-hours to evaluate one new architecture and 3,840 H100-hours for a full four-baseline sweep. While compression techniques have been proposed for static benchmarks, new agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction, and efforts to improve reliability through repeated runs further multiply costs.
Making Static LLM Benchmarks Cheaper
The cost problem emerged before agents. When Stanford's CRFM released HELM in 2022, the paper's per-model accounting showed API costs ranging from $85 for OpenAI's code-cushman-001 to $10,926 for AI21's J1-Jumbo, and 540 to 4,200 GPU-hours for open models, with BLOOM and OPT at the high end. Across HELM's 30 models and 42 scenarios, the combined API spend and GPU compute came to roughly $100,000.
Perlitz et al.'s analysis of EleutherAI's Pythia checkpoints exposed another cost sink: developers pay for evaluation repeatedly during model development. Pythia released 154 checkpoints for each of 16 models, 2,464 checkpoints in total, to study training dynamics. Running the LM Evaluation Harness across all of those checkpoints turns evaluation into a multiplier on training; for small models, evaluation can surpass pretraining itself and become the dominant compute line item across the entire development cycle.
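A back-of-envelope sketch makes the multiplier concrete. The per-checkpoint evaluation cost and the small-model pretraining budget below are illustrative assumptions, not Pythia's measured numbers:

```python
# Back-of-envelope: evaluation as a multiplier on training.
# The two *_GPU_HOURS figures are illustrative assumptions.

CHECKPOINTS_PER_MODEL = 154           # Pythia's released checkpoints per model
EVAL_GPU_HOURS_PER_CHECKPOINT = 2.0   # assumed cost of one harness pass
PRETRAIN_GPU_HOURS_SMALL_MODEL = 300  # assumed budget for a small model

eval_hours = CHECKPOINTS_PER_MODEL * EVAL_GPU_HOURS_PER_CHECKPOINT
print(f"eval per model:     {eval_hours:,.0f} GPU-hours")
print(f"pretrain per model: {PRETRAIN_GPU_HOURS_SMALL_MODEL:,.0f} GPU-hours")
print(f"eval/train ratio:   {eval_hours / PRETRAIN_GPU_HOURS_SMALL_MODEL:.1f}x")
```

Under these assumptions the ratio already exceeds 1.0; any additional benchmark in the harness pushes it higher.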
Perlitz et al. then investigated how much of HELM actually carried the rankings. The result was striking: a 100× to 200× reduction in compute preserved nearly the same ordering, with larger reductions still useful for coarse grouping. Flash-HELM turned this into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. Much of HELM's compute was confirming rankings that the field could have inferred much more cheaply.
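The coarse-to-fine idea fits in a few lines. This is a minimal sketch in the spirit of Flash-HELM, not the paper's actual procedure; `evaluate(model, items) -> accuracy` is a placeholder for whatever harness you run:

```python
import random

def coarse_to_fine(models, items, evaluate, coarse_frac=0.01, keep_top=4, seed=0):
    """Rank everything on a tiny subsample, then spend full-resolution
    compute only on the leaders (sketch; Flash-HELM differs in detail)."""
    rng = random.Random(seed)
    subsample = rng.sample(list(items), max(1, int(len(items) * coarse_frac)))
    # Cheap pass: every model sees ~1% of the benchmark.
    coarse = sorted(models, key=lambda m: evaluate(m, subsample), reverse=True)
    # Expensive pass: only the top candidates see the full item set.
    fine = {m: evaluate(m, items) for m in coarse[:keep_top]}
    ranked_top = sorted(fine, key=fine.get, reverse=True)
    return ranked_top + coarse[keep_top:]  # full ordering, mostly cheap
```

Most models never touch the full benchmark, which is exactly where the 100× to 200× savings come from.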
Other research reached the same conclusion from different angles. tinyBenchmarks compressed MMLU from 14,000 items to 100 anchor items at about 2% error using Item Response Theory. The Open LLM Leaderboard collapsed from 29,000 examples to 180. Anchor Points showed that as few as 1 to 30 examples could rank-order 87 language-model/prompt pairs on GLUE, and follow-up work routinely cut dataset sizes by 90%. Static benchmarks had a weakness you could exploit: model differences often concentrate in a small subset of items, so ranking can survive aggressive subsampling.
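A minimal sketch of anchor-style subsampling, assuming a matrix of historical model-by-item outcomes is available; tinyBenchmarks fits an IRT model rather than the plain clustering shown here:

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_anchors(results, n_anchors=100, seed=0):
    """Cluster items by how past models scored on them, then keep one
    representative item per cluster (sketch of the anchor-point idea).
    `results` is an (n_models, n_items) 0/1 matrix of past outcomes."""
    item_profiles = results.T  # one row per item
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=seed).fit(item_profiles)
    anchors, weights = [], []
    for c in range(n_anchors):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(item_profiles[members] - km.cluster_centers_[c], axis=1)
        anchors.append(members[np.argmin(dists)])        # medoid-like representative
        weights.append(len(members) / results.shape[1])  # weight = cluster mass
    return np.array(anchors), np.array(weights)
```

A new model's benchmark score is then estimated as the weighted mean of its accuracy on just the anchor items.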
That advantage weakened sharply once benchmarks moved from static predictions to agents.
Agent Evals Are Messier
A thorough public accounting of agent evaluation comes from the Holistic Agent Leaderboard (Kapoor et al., ICLR 2026). HAL runs standardized agent harnesses across nine benchmarks covering coding, web navigation, science tasks, and customer service, with shared scaffolds and centralized cost tracking. The headline cost: $40,000 for 21,730 rollouts across 9 models and 9 benchmarks. By April 2026, the leaderboard had grown to 26,597 rollouts. Ndzomga's independent reproduction landed in the same range: $46,000 across 242 agent runs.
Behind that aggregate, the cost of a single benchmark run varies by four orders of magnitude across HAL tasks, and by three orders of magnitude within some individual benchmarks. In HAL's cost distribution, each bar spans the minimum-to-maximum cost across configurations on a single benchmark, and several bars cross the $1,000-per-run threshold; a "run" is one full agent evaluation across all tasks. The within-benchmark spread comes from varying the model, the scaffold, and the token budget.
Behind these numbers is a blunt pricing fact. Claude Opus 4.1 charges $15 per million input tokens and $75 per million output. Gemini 2.0 Flash charges $0.10 and $0.40, a two-order-of-magnitude spread on input alone. Agent benchmarks rarely benchmark "the model" in isolation; they benchmark a model × scaffold × token-budget product, and small scaffold choices can multiply costs 10×.
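The arithmetic is simple enough to sketch. The prices are the ones quoted above; the token counts are assumptions chosen to resemble a long agent run, where long contexts get re-sent at every step:

```python
# Per-run cost from token counts and per-million-token prices.

PRICES = {                     # (input, output) USD per million tokens
    "claude-opus-4.1":  (15.00, 75.00),
    "gemini-2.0-flash": (0.10, 0.40),
}

def run_cost(model, input_tokens, output_tokens):
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1e6

# Hypothetical run: 50M input tokens, 2M output tokens across trajectories.
for model in PRICES:
    print(model, f"${run_cost(model, 50e6, 2e6):,.2f}")
# claude-opus-4.1  $900.00
# gemini-2.0-flash $5.80
```

The same trajectory, priced through two models, differs by more than 150×, before any scaffold effects.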
Worse, higher spend does not reliably buy better results. On Online Mind2Web, Browser-Use with Claude Sonnet 4 cost $1,577 for 40% accuracy, while SeeAct with GPT-5 Medium hit 42% for $171. The HAL paper notes "a 9× difference in cost despite just a two-percentage-point difference in accuracy." On GAIA, an HAL Generalist with o3 Medium cost $2,828 for 28.5% accuracy, while a different agent hit 57.6% for $1,686. Across 6 SOTA agents on 300 enterprise tasks, CLEAR finds that "accuracy-optimal configurations cost 4.4 to 10.8× more than Pareto-efficient alternatives" with comparable real-world performance.
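The Pareto framing reduces to a short filter. A minimal sketch, assuming each configuration is a (name, cost, accuracy) tuple; this is an illustration of the concept, not CLEAR's implementation:

```python
def pareto_frontier(configs):
    """Return configurations not dominated on (cost, accuracy): no other
    config is both cheaper and at least as accurate."""
    frontier = []
    for name, cost, acc in sorted(configs, key=lambda c: (c[1], -c[2])):
        if not frontier or acc > frontier[-1][2]:
            frontier.append((name, cost, acc))
    return frontier

# The two Online Mind2Web points quoted above:
runs = [("Browser-Use + Sonnet 4", 1577, 0.40),
        ("SeeAct + GPT-5 Medium", 171, 0.42)]
print(pareto_frontier(runs))  # only the $171 / 42% configuration survives
```

Run on those two points, the $1,577 configuration is dominated outright: it is both more expensive and less accurate.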
The static-era toolkit should have helped, but it has only gone so far. Ndzomga's mid-difficulty filter, which selects tasks with 30 to 70% historical pass rates, achieves a 2× to 3.5× reduction while preserving rank fidelity under scaffold and temporal shifts. That is useful, but it falls far short of the 100× to 200× gains available for static benchmarks. When each item is a multi-turn rollout with its own variance, the expensive object is no longer the question but the long trajectory behind it, and subsampling tasks does nothing to shorten any single trajectory.
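The filter itself is short. A sketch, assuming per-task historical pass rates are available; the 30-70% band follows the rule quoted above:

```python
def mid_difficulty_filter(pass_rates, low=0.30, high=0.70):
    """Keep tasks whose historical pass rate falls in [low, high], the band
    where models actually disagree. `pass_rates` maps task_id -> fraction
    of past runs that solved the task."""
    return [t for t, rate in pass_rates.items() if low <= rate <= high]
```

Tasks every model solves (rate near 1.0) or no model solves (rate near 0.0) carry almost no ranking signal, which is why dropping them preserves rank fidelity while cutting rollout counts.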
Some Evals Are Just Training
Some benchmarks escape the API-cost framing altogether because their evaluation protocol trains models from scratch. The Well bundles 16 scientific machine-learning datasets spanning biological systems, fluid dynamics, magnetohydrodynamics, supernova explosions, viscoelastic instability, and active matter, totaling 15 TB. Using the paper's headline 16-dataset grid, the protocol leaves little room to economize: train each baseline model for 12 hours on a single H100, try five learning rates per (model, dataset) pair, repeat across four architectures and 16 datasets. That headline-grid sweep consumes 3,840 H100-hours, or roughly $9,600 under typical cloud pricing. A single new architecture still costs about 960 H100-hours.
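Spelled out as code, the headline grid is a product of four factors. The H100 hourly rate below is an assumption standing in for "typical cloud pricing"; the other numbers are the paper's protocol as described above:

```python
# The Well's headline-grid sweep, as arithmetic.

HOURS_PER_RUN = 12        # one (model, dataset, learning-rate) training run
LEARNING_RATES = 5
ARCHITECTURES = 4
DATASETS = 16
H100_USD_PER_HOUR = 2.50  # assumed cloud rate

full_sweep = HOURS_PER_RUN * LEARNING_RATES * ARCHITECTURES * DATASETS
one_model = HOURS_PER_RUN * LEARNING_RATES * DATASETS
print(f"full sweep: {full_sweep:,} H100-hours "
      f"(${full_sweep * H100_USD_PER_HOUR:,.0f})")
print(f"one new architecture: {one_model:,} H100-hours")
# full sweep: 3,840 H100-hours ($9,600)
# one new architecture: 960 H100-hours
```

None of the factors can be subsampled away without changing what the benchmark measures: the training run is the evaluation.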