DailyGlimpse

Frontier AI Models Flounder on ARC-AGI-3: Three Reasoning Fails Revealed

AI
May 3, 2026 · 1:26 AM

A new analysis from the ARC Prize Foundation exposes systematic reasoning failures in the latest AI models, including OpenAI's GPT-5.5 and Anthropic's Opus 4.7, when confronted with the ARC-AGI-3 benchmark. Although each run cost thousands of dollars, both models scored below 1%, while untrained humans solved the same tasks with ease.

The foundation reviewed 160 game replays and reasoning traces from the models, identifying three distinct error patterns that prevented them from grasping the underlying mechanics of interactive, turn-based environments.

1. Missing the Big Picture

Both models frequently detected local effects—like recognizing that a specific action rotates an object—but failed to integrate these observations into a coherent world model. For instance, Opus 4.7 understood that ACTION3 rotated a container and that ACTION5 poured paint, but never connected the dots to realize it needed to align the bucket before dipping it to reproduce a target image.
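What "integrating local effects into a world model" means can be made concrete with a toy sketch. The state, action names, and effects below are illustrative assumptions, not the actual ARC-AGI-3 mechanics: once each action's local effect is known, a world model is just those effects composed, and a simple breadth-first search over action sequences yields the "align, then pour" plan the model never found.

```python
# Hypothetical sketch (not ARC Prize code): compose known local action
# effects into a world model, then search it for a plan.
from collections import deque

# Assumed local effects: ACTION3 rotates the container by 90 degrees,
# ACTION5 pours paint only when the container is aligned (rotation == 0).
def apply(state, action):
    rotation, poured = state
    if action == "ACTION3":
        return ((rotation + 90) % 360, poured)
    if action == "ACTION5":
        return (rotation, poured or rotation == 0)
    return state

def plan(start, goal, actions=("ACTION3", "ACTION5"), max_depth=8):
    """Breadth-first search over composed action effects;
    returns the shortest action sequence reaching the goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        if len(path) == max_depth:
            continue
        for a in actions:
            nxt = apply(state, a)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [a]))
    return None

# Container starts rotated 180 degrees with no paint poured:
# the planner finds "rotate, rotate, pour".
print(plan((180, False), (0, True)))
```

The point of the sketch is that each individual effect is trivial; the difficulty the models showed was in treating those effects as a composable model rather than isolated observations.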

2. False Analogies from Training Data

The models often mistook novel game environments for familiar ones from their training data. GPT-5.5 interpreted a puzzle requiring key combinations as the arcade game Breakout, leading to wasted actions on nonexistent bricks and paddles. Such baseless assumptions, which a human would quickly discard, completely derailed progress.

3. Solving Level 1 Does Not Equal Understanding

Even when a model solved an initial level, it often did so based on a flawed theory that happened to work for that simple case. The success reinforced the wrong assumption, which then carried over to subsequent levels. Opus 4.7 once solved a level by mistakenly believing that clicking teleported characters, and then blindly clicked its way through later stages, unable to recover.
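The discipline the three failure patterns have in common is hypothesis maintenance: a theory that happens to fit one level must still be discarded the moment later evidence contradicts it. A minimal sketch of that loop, with an invented hidden rule and invented candidate theories purely for illustration:

```python
# Hypothetical sketch (not ARC Prize code): keep every theory consistent
# with the evidence so far, instead of locking onto the first one that
# happens to work on an early level.

# Hidden rule the agent must discover (an assumption for illustration):
# the state increases by 2 on each ACTION_A.
def hidden_step(state, action):
    return state + 2 if action == "ACTION_A" else 0

# Candidate theories the agent entertains about ACTION_A.
THEORIES = {
    "adds_1": lambda s: s + 1,
    "adds_2": lambda s: s + 2,   # the correct theory
    "doubles": lambda s: s * 2,  # also fits some single observations
}

def surviving_theories(observations, theories=THEORIES):
    """Discard any theory contradicted by an observed (before, after) pair."""
    alive = dict(theories)
    for before, after in observations:
        alive = {name: f for name, f in alive.items() if f(before) == after}
    return alive

# Gather evidence by acting in the environment.
obs = []
state = 1
for _ in range(3):
    nxt = hidden_step(state, "ACTION_A")
    obs.append((state, nxt))
    state = nxt

print(sorted(surviving_theories(obs)))
```

Note that from state 2, "adds_2" and "doubles" make identical predictions, which is exactly how a wrong theory can survive an easy first level; only further, more varied evidence separates them. Opus 4.7's failure was skipping the discard step; GPT-5.5's was never converging on a survivor.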

Different Flaws, Same Result

Opus 4.7 tended to lock onto a single false theory early and never let go, while GPT-5.5 struggled to commit to any theory, cycling through multiple wrong hypotheses. As Greg Kamradt of the ARC Prize Foundation put it, "The difference comes down to compression. Opus compressed its observations into a confident but wrong theory. GPT-5.5 had difficulty compressing at all."

The findings underscore challenges for real-world AI agents, which must navigate unfamiliar environments, form and test hypotheses, and adapt to new information—skills that remain elusive for even the most advanced models.