Generative AI Pulse

**ARC-AGI-3 benchmark: all models <1%**

Key Questions

What are the top scores on ARC-AGI-3 benchmark?

Gemini 3.1 Pro scored 0.37%, GPT-5.4 scored 0.26%, and Claude 4.6 scored 0.25%, compared to 100% for humans. On this dynamic evaluation, every model remains below 1%, pointing to a scaling plateau.
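To put the gap in perspective, here is a quick back-of-the-envelope calculation using the scores quoted above; this is a minimal sketch of the arithmetic only, not any official scoring tooling:

```python
# Back-of-the-envelope comparison of the reported ARC-AGI-3 scores.
# The percentages are the figures quoted above; everything else is illustrative.
scores = {
    "Gemini 3.1 Pro": 0.37,
    "GPT-5.4": 0.26,
    "Claude 4.6": 0.25,
}
HUMAN_SCORE = 100.0

for model, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    ratio = HUMAN_SCORE / score  # how many times higher the human score is
    print(f"{model}: {score:.2f}% (humans ~{ratio:.0f}x higher)")
```

Even the leading model trails human performance by more than two orders of magnitude, which is what the scaling-plateau framing refers to.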

What benchmarks are involved alongside ARC-AGI-3?

Related benchmarks include ViGoR, HippoCamp, Vision2Web, AIRS, and TBSP. Reproductions are pending for Gemma4, Qwen3.6, Llama4, and GLM-5V. Known pitfalls such as benchmark gaming and training-data leaks are noted; see the sketch below for one common leak check.
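As background for the leak pitfall, a common heuristic is to measure n-gram overlap between benchmark items and a training corpus. The sketch below is a minimal, hypothetical illustration of that idea; the function names, sample strings, and the 8-gram granularity are assumptions, not part of any tooling named above.

```python
# Minimal sketch of an n-gram overlap check, a common heuristic for spotting
# benchmark leakage into training data. All names and data here are hypothetical.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Word-level n-grams of a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

# Hypothetical usage: a high overlap fraction suggests the item may have leaked.
item = "move the blue key to the locked door before the timer runs out"
corpus = ["walkthrough: move the blue key to the locked door before the timer"]
print(f"overlap: {contamination_score(item, corpus):.0%}")
```

Real contamination audits run over far larger corpora with indexed lookups, but the overlap fraction is the core signal.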

What is Kaggle's role in AI evaluation?

Kaggle launched Benchmarks Resource Grants to support AI evaluation efforts. The initiative addresses the growing need for reliable benchmarks such as ARC-AGI-3 and signals increased investment in evaluation infrastructure.

What concerns exist around AI safety in benchmarks?

Papers like SDEval assess safety in multimodal LLMs, and evaluations of Kimi K2.5 reveal dual-use capabilities. Reconstruction Evaluation checks AI-written papers for hallucinations. Together, these efforts highlight safety risks that dynamic evaluations must account for.

How reliable are current AI model rankings?

Ongoing debates question what counts as the best AI model amid benchmark gaming and data leaks. Foundation models for structured data are emerging. ARC-AGI-3 underscores how far current systems remain from general intelligence.

In brief: on this dynamic evaluation, Gemini 3.1 Pro scored 0.37%, GPT-5.4 0.26%, and Claude 4.6 0.25%, versus 100% for humans, suggesting a scaling plateau. Related benchmarks: ViGoR, HippoCamp, Vision2Web, AIRS, TBSP. Reproductions pending: Gemma4, Qwen3.6, Llama4, GLM-5V. Pitfalls: benchmark gaming and leaks.
