AI Research Pulse

ARC-AGI-3 benchmark exposes frontier reasoning chasm (<1% vs human 100%)

Key Questions

What is the ARC-AGI-3 benchmark result for frontier models?

Frontier models score under 1% on ARC-AGI-3's ultra-hard interactive games, while humans solve them at 100%, exposing a wide gap in general reasoning ability.

What is the prize for ARC-AGI-3?

A $2M prize incentivizes solving the benchmark, which ties together evaluations of agents, memory, and verification.

How does ARC-AGI-3 relate to agent evals?

It tests core reasoning for AGI, connecting to MiroEval for multimodal agents and ADeLe for performance prediction.

In short: frontier models score under 1% on the ultra-hard games; a $2M prize is on offer; the benchmark links agent, memory, and verification evaluations; and the results echo François Chollet's critique that LLMs curve-fit to training data rather than generalize.

Updated Apr 8, 2026