Generative AI Pulse

**ARC-AGI-3 benchmark: all models <1%**

Key Questions

What are the top scores on ARC-AGI-3 benchmark?

Gemini 3.1 Pro scored 0.37%, GPT-5.4 scored 0.26%, and Claude 4.6 scored 0.25%, compared to 100% for humans. On this dynamic evaluation, every model remains below 1%, pointing to a scaling plateau.
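To put the gap in perspective, here is a quick back-of-the-envelope calculation using the scores quoted above; this is a minimal sketch of the arithmetic only, not any official scoring tooling:

```python
# Back-of-the-envelope comparison of the reported ARC-AGI-3 scores.
# The percentages are the figures quoted above; everything else is illustrative.
scores = {
    "Gemini 3.1 Pro": 0.37,
    "GPT-5.4": 0.26,
    "Claude 4.6": 0.25,
}
HUMAN_SCORE = 100.0

for model, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    ratio = HUMAN_SCORE / score  # how many times higher the human score is
    print(f"{model}: {score:.2f}% (humans ~{ratio:.0f}x higher)")
```

Even the leading model trails human performance by more than two orders of magnitude, which is what the scaling-plateau framing refers to.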

What benchmarks are involved alongside ARC-AGI-3?

Related benchmarks include ViGoR, HippoCamp, Vision2Web, AIRS, and TBSP. Reproductions are pending for Gemma4, Qwen3.6, Llama4, and GLM-5V. Known pitfalls such as benchmark gaming and training-data leaks are noted; see the sketch below for one common leak check.
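As background for the leak pitfall, a common heuristic is to measure n-gram overlap between benchmark items and a training corpus. The sketch below is a minimal, hypothetical illustration of that idea; the function names, sample strings, and the 8-gram granularity are assumptions, not part of any tooling named above.

```python
# Minimal sketch of an n-gram overlap check, a common heuristic for spotting
# benchmark leakage into training data. All names and data here are hypothetical.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Word-level n-grams of a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

# Hypothetical usage: a high overlap fraction suggests the item may have leaked.
item = "move the blue key to the locked door before the timer runs out"
corpus = ["walkthrough: move the blue key to the locked door before the timer"]
print(f"overlap: {contamination_score(item, corpus):.0%}")
```

Real contamination audits run over far larger corpora with indexed lookups, but the overlap fraction is the core signal.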

What is Kaggle's role in AI evaluation?

Kaggle launched Benchmarks Resource Grants to support AI evaluation efforts. The initiative addresses the growing need for reliable benchmarks such as ARC-AGI-3 and signals increased investment in evaluation infrastructure.

What concerns exist around AI safety in benchmarks?

Papers like SDEval assess safety in multimodal LLMs, and evaluations of Kimi K2.5 reveal dual-use capabilities. Reconstruction Evaluation checks AI-written papers for hallucinations. Together, these efforts highlight safety risks that dynamic evaluations must account for.

How reliable are current AI model rankings?

Ongoing debates question what counts as the best AI model amid benchmark gaming and data leaks. Foundation models for structured data are emerging. ARC-AGI-3 underscores how far current systems remain from general intelligence.

In brief: on this dynamic evaluation, Gemini 3.1 Pro scored 0.37%, GPT-5.4 0.26%, and Claude 4.6 0.25%, versus 100% for humans, suggesting a scaling plateau. Related benchmarks: ViGoR, HippoCamp, Vision2Web, AIRS, TBSP. Reproductions pending: Gemma4, Qwen3.6, Llama4, GLM-5V. Pitfalls: benchmark gaming and leaks.
