ARC-AGI-3 Drops and Frontier Models Are Back to Square One

5 min read · 1 source · explainer
├── "ARC-AGI-3 is a meaningful and valuable measure of progress toward AGI"
│  ├── ARC Prize team (arcprize.org) → read

ARC-AGI-3 is positioned as the first interactive reasoning benchmark for AI agents, designed so that no amount of inference compute can substitute for genuine reasoning. The benchmark's evolution from ARC-AGI-1 through 3 deliberately resets the scoreboard each time models game the previous version, reinforcing the thesis that scaling compute is not scaling intelligence.

│  ├── @jwpapi (Hacker News) → view

Calls ARC-AGI a 'very good estimation of AGI' because it gives humans and AI the same input and measures results on equal footing. Expresses puzzlement at why people push back against the benchmark, arguing that knowing whether AI matches human reasoning on novel tasks is exactly the question worth answering.

│  └── @vessenes (Hacker News) → view

Praises the puzzle design and scoring system, arguing that models trained to do well on ARC-AGI will be 'genuinely much more useful' in practice. Distinguishes ARC-AGI-3 favorably from earlier iterations like ARC-AGI-1, where brute-force compute could substitute for reasoning.

├── "The benchmark's human baseline and methodology have serious flaws"
│  ├── @Tiberium (Hacker News) → view

Cites analysis from @scaling01 calling out multiple issues with ARC-AGI-3, including that the human baseline is defined as 'the second-best first-run human by action count' from a self-selected population of enthusiasts who signed up voluntarily. This methodology choice inflates perceived human performance and makes the AI-vs-human gap appear larger than it may actually be.

│  └── @Zedseayou (Hacker News) → view

As an actual human tester for the benchmark, reports that incentives ($5 per game solved) pushed for solve speed over minimizing action count, despite instructions to minimize actions. This firsthand account suggests the human baseline data may not reflect what the benchmark claims to measure, since testers optimized for speed rather than efficiency.

├── "These tasks measure learned familiarity with visual-spatial puzzles, not general intelligence"
│  ├── @Real_Egor (Hacker News) → view

Argues that someone who grew up playing video games would pass these tests at near 100% while a grandmother who never used a computer would fail completely — just like an LLM. This suggests the benchmark measures familiarity with a specific class of interactive spatial puzzles rather than innate reasoning ability, undermining its claim to test general intelligence.

│  └── @lukev (Hacker News) → view

Questions the benchmark's relevance to AGI entirely, arguing it only measures LLM ability in a certain class of games. Notes that humans vary in their performance on these tasks too, and that there exist classes of games where AI already outperforms humans — so task-specific comparisons don't establish or disprove general intelligence.

└── "Comparing AI to human cognition via benchmarks is a category error"
  └── @BeetleB (Hacker News) → view

Invokes the classic analogy from a 1990s AI researcher: 'It's silly to say airplanes don't fly because they don't flap their wings.' Argues that defining AGI as closing the gap between AI and human learning on specific tasks misunderstands what intelligence means — AI may achieve general capability through fundamentally different mechanisms than human cognition, making direct performance comparisons misleading.

What Happened

François Chollet and Mike Knoop have released ARC-AGI-3, the third iteration of the Abstraction and Reasoning Corpus — the benchmark that has become the de facto measuring stick for whether AI systems can actually *think* or merely pattern-match at scale. The announcement landed on Hacker News with nearly 500 points, reigniting the perennial debate about what "intelligence" means when applied to large language models.

ARC-AGI has a simple premise that turns out to be devastatingly hard for AI. In the first two versions, each task presents a few input-output grid examples demonstrating some visual transformation rule (rotations, color mappings, symmetry operations, object manipulations) and asks the system to apply that rule to a new input. Humans average around 85% on these tasks; they're the kind of puzzles a bright 10-year-old can solve in seconds. ARC-AGI-3 keeps the core premise but moves it into interactive environments, and like its predecessors it is designed so that no amount of money thrown at inference compute can substitute for genuine reasoning ability.
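The grid format is easy to picture in code. Below is a minimal sketch, assuming the JSON-style train/test layout the ARC community uses for grid tasks; the color-map rule and the toy solver are invented for illustration and are far simpler than real tasks:

```python
# Minimal sketch of an ARC-style grid task: grids are lists of lists of
# color indices (0-9). The hypothetical solver below infers a cell-wise
# color substitution from the train pairs and applies it to the test
# input. Real tasks compose far richer rules (rotation, symmetry,
# object manipulation); this only illustrates the data shape.

task = {
    "train": [
        {"input": [[1, 2], [2, 1]], "output": [[3, 4], [4, 3]]},
        {"input": [[2, 2], [1, 1]], "output": [[4, 4], [3, 3]]},
    ],
    "test": [{"input": [[1, 1], [2, 2]]}],
}

def infer_color_map(pairs):
    """Learn a color substitution consistent with every train pair."""
    mapping = {}
    for pair in pairs:
        for in_row, out_row in zip(pair["input"], pair["output"]):
            for a, b in zip(in_row, out_row):
                if mapping.setdefault(a, b) != b:
                    raise ValueError("rule is not a simple color map")
    return mapping

def apply_color_map(grid, mapping):
    return [[mapping[c] for c in row] for row in grid]

mapping = infer_color_map(task["train"])          # {1: 3, 2: 4}
prediction = apply_color_map(task["test"][0]["input"], mapping)
# prediction == [[3, 3], [4, 4]]
```

The point of the benchmark is precisely that this kind of hand-coded rule does not generalize: each task hides a different rule, so the solver itself has to be discovered from a few examples.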

The benchmark's evolution tells a story about the cat-and-mouse game between AI capabilities and evaluation. ARC-AGI-1, which ran from 2019 through 2024, saw steady progress, culminating in OpenAI's o3 model scoring 75.7% on the semi-private evaluation set in December 2024, and 87.5% in a high-compute configuration. Those results made headlines, but Chollet was quick to note that o3 was spending orders of magnitude more compute per task than it costs a human to solve the same problem.

Why It Matters

ARC-AGI-2 was the first reset. By introducing strict per-task compute budgets and an entirely new task set to prevent data contamination, it knocked frontier models back down dramatically. Models that had been approaching human-level scores on ARC-AGI-1 dropped to single digits. The message was clear: scaling inference compute is not the same as scaling intelligence.

ARC-AGI-3 pushes this thesis further. Rather than one-shot grid puzzles, the new version presents interactive game environments: an agent is dropped into an unfamiliar game with no instructions and must discover the rules through its own actions, composing what it learns to reach a goal. Efficiency is built into the scoring as well. Performance is measured partly by how few actions an agent needs (which is why the human baseline is defined in action counts), testing whether AI can reason *efficiently* about novel problems rather than search its way to answers.
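Since ARC-AGI-3 is framed as interactive, its evaluation looks less like one-shot grading and more like an episode loop: act, observe, repeat, scored on efficiency. A hypothetical sketch of that loop; the `env`/`agent` interfaces and the toy classes are invented stand-ins, not the real ARC-AGI-3 API:

```python
# Hypothetical sketch of the interaction loop an interactive benchmark
# implies: an agent acts in an unknown environment and is scored partly
# on how few actions it needs to solve the game.

def run_episode(env, agent, max_actions=1000):
    """Play one game; return (solved, actions_used)."""
    obs = env.reset()
    for actions_used in range(1, max_actions + 1):
        action = agent.act(obs)          # agent chooses a move from the state
        obs, solved = env.step(action)
        if solved:
            return True, actions_used    # fewer actions means better efficiency
    return False, max_actions

# Toy illustration: an environment that is solved after three steps.
class ToyEnv:
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += 1
        return self.state, self.state >= 3

class ToyAgent:
    def act(self, obs):
        return "advance"

solved, actions = run_episode(ToyEnv(), ToyAgent())
# solved is True, actions == 3
```

Scoring by action count is what the human-baseline criticisms above are about: if testers optimize for speed instead of minimizing actions, the baseline in this loop's terms is measured wrong.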

This matters because the AI industry has largely converged on a strategy of throwing more compute at reasoning. OpenAI's o-series models, Anthropic's extended thinking, Google's Gemini with chain-of-thought — they all bet that longer inference-time computation produces better reasoning. And on many benchmarks, it does. But ARC-AGI-3 asks a different question: can your model solve a problem it has literally never seen before, with a budget that rules out exhaustive search?

The benchmark community is genuinely split on what this means. One camp — call them the scaling optimists — argues that ARC-AGI keeps moving the goalposts and that each iteration simply measures how quickly benchmarks expire, not whether models lack intelligence. They point to the rapid progress on ARC-AGI-1 as evidence that ARC-AGI-3 will also fall within a year or two. The other camp, which includes Chollet himself, argues that the pattern of scores cratering on each new version proves that previous "solutions" were memorization and brute force, not generalization. If a system truly understood the underlying reasoning, a harder version of the same kind of task shouldn't cause a catastrophic drop.

Neither side is entirely wrong. The scaling optimists are correct that AI capabilities are improving rapidly and that static benchmarks have a shelf life. But the Chollet camp has a point that keeps surviving empirical contact: every time we think a model has learned to reason, a slightly different version of the same problem reveals it was approximating rather than understanding.

What This Means for Your Stack

If you're building applications that depend on LLMs handling genuinely novel situations — anomaly detection, complex debugging, architectural decisions with unusual constraints — ARC-AGI-3 should calibrate your expectations. The models are spectacular at pattern-matching within their training distribution and increasingly good at chain-of-thought reasoning on problems that resemble their training data. They are still poor at the kind of fluid intelligence that ARC tests: seeing a few examples of a rule and correctly extrapolating it to a case that shares no surface-level features with the examples.

Practically, this means your prompt engineering and system design need to handle the gap. Don't assume that a model that aces your test cases will generalize to edge cases it hasn't seen. Build evaluation harnesses that specifically test out-of-distribution reasoning. If you're using AI for code generation, the model will nail the common patterns and struggle with the weird architectural corner case that doesn't match anything on Stack Overflow — which is, of course, exactly the situation where you most need help.
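One lightweight way to build that kind of harness is a paired pass-rate check: score the model on cases it already passes, then again on rule-preserving variants with different surface features. A sketch, where `model_fn`, `make_variant`, and `check` are placeholders for your own model call, perturbation, and grading logic, and the toy "memorizing" model is invented to show the failure mode:

```python
import random

# Sketch of an out-of-distribution regression check: re-score the model
# on rule-preserving variants of cases it already passes. A large gap
# between the two accuracies flags memorization rather than reasoning.

def ood_gap(model_fn, cases, make_variant, check, seed=0):
    """Return (in-distribution accuracy, accuracy on perturbed variants)."""
    rng = random.Random(seed)
    in_acc = sum(check(model_fn(c), c) for c in cases) / len(cases)
    variants = [make_variant(c, rng) for c in cases]
    out_acc = sum(check(model_fn(v), v) for v in variants) / len(variants)
    return in_acc, out_acc

# Toy demo: a "model" that memorized its test cases aces them, but fails
# variants that shift the numbers while keeping the doubling rule.
memorized = {2: 4, 3: 6}
cases = [{"x": 2, "y": 4}, {"x": 3, "y": 6}]
in_acc, out_acc = ood_gap(
    model_fn=lambda c: memorized.get(c["x"], -1),
    cases=cases,
    make_variant=lambda c, rng: {"x": c["x"] + 10, "y": (c["x"] + 10) * 2},
    check=lambda out, c: out == c["y"],
)
# in_acc == 1.0, out_acc == 0.0
```

The perturbation is the hard part in practice: it has to change surface form (names, values, ordering) while provably preserving the underlying rule, which is exactly what ARC-style tasks do by construction.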

For teams evaluating AI capabilities for production use, ARC-AGI-3 is a better signal than MMLU, HumanEval, or most other popular benchmarks — not because grid puzzles are relevant to your domain, but because the benchmark's design controls for memorization and compute scaling in ways that other benchmarks don't. If a model scores well on ARC-AGI-3 under compute constraints, it's demonstrating something closer to genuine generalization.

The compute constraint angle also has implications for cost. If the industry's answer to harder reasoning is always "spend 10x more on inference," the economics of AI-powered features change dramatically. A model that needs $5 of compute to reason through a novel problem is a very different product proposition than one that needs $0.05. ARC-AGI-3's per-task budget forces researchers to optimize for efficiency, which is ultimately what makes AI reasoning economically viable in production.
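The arithmetic is stark even at modest volume. The daily task count below is made up; the per-task costs are the illustrative figures from the text:

```python
# Back-of-envelope unit economics for inference-heavy reasoning.
# Costs are kept in integer cents to avoid float rounding.

tasks_per_day = 10_000
cheap_cents, expensive_cents = 5, 500   # $0.05 vs $5.00 per solved task

cheap_daily = cheap_cents * tasks_per_day / 100        # $500.0 per day
expensive_daily = expensive_cents * tasks_per_day / 100  # $50,000.0 per day
ratio = expensive_daily / cheap_daily                  # 100.0x
```

At a 100x cost ratio, the same feature flips from a rounding error on the infrastructure bill to the dominant line item, which is why the benchmark's efficiency pressure matters commercially, not just scientifically.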

Looking Ahead

ARC-AGI-3 won't be the final word — Chollet and Knoop have been clear that the benchmark will keep evolving to stay ahead of AI progress. But the broader trajectory it reveals is worth watching: after years of benchmark scores going up and to the right, we have a class of evaluation that consistently resets frontier models to near-zero. Whether that's a fundamental limitation of current architectures or a temporary plateau that new techniques will overcome is the most important open question in AI research today. For developers, the practical takeaway is simpler: build your systems assuming the model will fail on the hardest cases, and you'll ship better products than the teams that assume the next model version will be smart enough.

Hacker News 484 pts 320 comments

ARC-AGI-3

→ read on Hacker News
Tiberium · Hacker News

https://x.com/scaling01 has called out a lot of issues with ARC-AGI-3, some of them (directly copied from tweets, with minimal editing):
- Human baseline is "defined as the second-best first-run human by action count". Your "regular people" are people who signed up …

vessenes · Hacker News

I’m not a Chollet booster. Well, I might be a little bit of one in that I admire his persistence. I really like these puzzles. There’s a lot to them both in design and scoring — models trained to do well on these are going to be genuinely much more useful, so I’m excited about it. As opposed to -1 an…

BeetleB · Hacker News

> As long as there is a gap between AI and human learning, we do not have AGI.

Back in the 90's, Scientific American had an article on AI - I believe this was around the time Deep Blue beat Kasparov at chess. One AI researcher's quote stood out to me: "It's silly to say airplanes …

jwpapi · Hacker News

This is a very good estimation of AGI. We give humans and AI the same input and measure the results. Kudos to ARC for creating these games. I really wonder why so many people fight against this. We know that AI is useful, we know that AI is researchful, but we want to know if they are what we vaguely …

Real_Egor · Hacker News

I'll probably be the skeptic here, but:
- Take a person who grew up playing video games. They'll pass these tests 100% without even breaking a sweat.
- BUT, put a grandmother who has never used a computer in front of this game, and she'll most likely fail completely. Just like an LLM. As …
