LLM Insight Tracker

New benchmark challenges: PRBench/RealChart2Code/Qworld/CUA/ARC-AGI-3/Terminal 2.0 + GEditBench v2 + MESNA + MLPerf v6.0 + YC-Bench + Vision2Web/VideoZeroBench + BizGenEval/VOID + HF top papers + agent metrics + VLM limits + base LLM math gen fails + llmtester/ClawArena/LIBERO-Para

Key Questions

What does PRBench test in AI agents?

PRBench evaluates end-to-end reproduction of physics research papers. As of March 2026, it exposes agent failures on these physics reproduction tasks.

What is ARC-AGI-3's performance level?

ARC-AGI-3 scores are very low: 0.37% for top models such as GPT-5.4. Among the new benchmarks, it is a hard test of generalization.

What limitation do VLMs show per recent studies?

VLMs ignore visual details in favor of semantic anchors. They prioritize words over pure visuals.

What is BizGenEval?

BizGenEval is a benchmark for commercial visual content generation. It systematically tests generation quality.

How do base LLMs perform on generalization-focused math?

According to @GaryMarcus and @fchollet, base LLMs fail zero-shot on generalization-focused math without TTA. On that view, they lack fluid intelligence.

What is the VOID benchmark?

VOID tests video object and interaction deletion. It evaluates multimodal editing capabilities.

What are key new agent benchmarks mentioned?

New benchmarks include CUA, Terminal 2.0, GEditBench v2, MESNA (GPT-5.4 at 150 IQ), and MLPerf v6.0. They measure agent ROI and human-in-the-loop (HITL) metrics.

What does Agentic-MME reveal about multimodal agents?

Agentic-MME benchmarks multimodal agent capabilities. It shows that VLMs are limited on visual detail relative to semantics.

PRBench: agents fail at physics paper reproduction; ARC-AGI-3 (0.37%), Terminal 2.0, YC-Bench, MESNA (GPT-5.4 at 150 IQ); Vision2Web, VideoZeroBench, BizGenEval; MLPerf v6.0, Codex, Gemma, llmtester; agent metrics (ROI, GCR, Stuck, HITL) and ClawArena evolving environments; FIPO > o1; Agentic-MME; LIBERO-Para (VLA paraphrase); VLMs favor semantics over visuals; @GaryMarcus and @fchollet: base LLMs show zero fluid math ability.

Updated Apr 8, 2026