AI Research Pulse

ARC-AGI-3 benchmark exposes frontier reasoning chasm (<1% vs human 100%)

Key Questions

What is the ARC-AGI-3 benchmark result for frontier models?

Frontier models score under 1% on ARC-AGI-3's ultra-hard interactive games, while humans solve them at 100%, exposing a wide gap in general reasoning ability.

What is the prize for ARC-AGI-3?

A $2M prize incentivizes solving the benchmark, which ties together evaluations of agents, memory, and verification.

How does ARC-AGI-3 relate to agent evals?

It tests core reasoning for AGI, connecting to MiroEval for multimodal agents and ADeLe for performance prediction.

In short: frontier models score under 1% on the ultra-hard games; a $2M prize is on offer; the benchmark links agent, memory, and verification evaluations; and the results echo François Chollet's critique that LLMs curve-fit to training data rather than generalize.

Updated Apr 8, 2026