ARC-AGI-3 benchmark exposes frontier reasoning chasm (<1% vs human 100%)
Key Questions
What is the ARC-AGI-3 benchmark result for frontier models?
Frontier models score under 1% on the ultra-hard games, while humans solve them at 100%, exposing a wide reasoning chasm.
What is the prize for ARC-AGI-3?
A $2M prize incentivizes solving the benchmark, which ties together agent, memory, and verification evaluations.
How does ARC-AGI-3 relate to agent evals?
It tests the core reasoning needed for AGI, connecting to MiroEval for multimodal agents and to ADeLe for performance prediction.
Ultra-hard games: frontier models <1% vs. humans 100%; $2M prize; links agent, memory, and verification evals; echoes Chollet's critiques of LLM curve-fitting vs. generalization.
Updated Apr 8, 2026