**Chollet ARC-AGI-3: Unsaturated Agentic AI Benchmark [developing]**
Key Questions
What is the performance of frontier models on ARC-AGI-3?
Frontier models score under 1% on ARC-AGI-3, an agentic AI benchmark that remains unsaturated. The benchmark tests adaptive skills and planning. François Chollet's critique contrasts curve-fitting with symbolic approaches.
How does Chollet view past AGI progress signals?
Chollet treats past jumps in performance as potential signals of AGI progress, but emphasizes how challenging ARC-AGI-3 is: the benchmark remains unsaturated. It is framed as contrasting gains from scaling with true generalization.
What flaw does Microsoft’s Universal Verifier expose?
Microsoft’s Universal Verifier paper reveals hidden problems in agent benchmarks, such as the difficulty of verifying agent actions. This calls the reliability of agent evaluations into question. The issue is described as common to every agent benchmark.
Summary: frontier models score under 1% on ARC-AGI-3; Chollet's curve-fitting vs. symbolic critique; adaptive skills and planning; past jumps as AGI signals; Microsoft's Universal Verifier exposes agent eval flaws.