AI Research Daily

Benchmarks, reproducibility & reward-modeling protocols improving agent evaluation

Key Questions

What is TerminalWorld and its benchmark score?

TerminalWorld benchmarks agents on real-world terminal tasks, with a maximum reported pass rate of 62.5%. The paper is available as arXiv:2605.22535.
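As a rough illustration of what a pass-rate figure like this means, the sketch below computes the percentage of tasks an agent completes. The function name `pass_rate` and the eight sample outcomes are hypothetical, chosen only so the arithmetic is easy to follow; they are not TerminalWorld's actual task set or scoring harness.

```python
def pass_rate(results):
    """Return the percentage of tasks whose outcome is True (passed)."""
    if not results:
        raise ValueError("no task results to score")
    return 100.0 * sum(results) / len(results)


# Hypothetical outcomes for eight tasks (True = passed); illustrative data only.
example_results = [True, True, False, True, False, True, True, False]

print(f"{pass_rate(example_results):.1f}%")  # prints "62.5%"
```

Here five of the eight made-up tasks pass, which works out to 62.5%. A real benchmark may weight tasks or average over repeated runs, so a headline figure should be read alongside the paper's own scoring definition.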

What does SpecBench measure in coding agents?

SpecBench measures reward hacking in long-horizon coding agents. It highlights ongoing issues in agent evaluation protocols.

How does MINTEval evaluate memory?

MINTEval tests memory under multi-target interference in long contexts. Systems average only 27.9% accuracy on interference-heavy questions.

What crisis persists in agent evaluation?

SpecBench and related works indicate reward hacking and reproducibility challenges remain. They underscore gaps in current benchmarking practices.

What is ESI-Bench focused on?

ESI-Bench targets embodied spatial intelligence through perception-action loops. It shows that AI systems struggle more with active decision-making than with passive observation.

Which benchmark addresses GUI agents?

CutVerse provides a compositional GUI agents benchmark for media post-production editing. It joins TerminalWorld in expanding real-world task coverage.

What status do these evaluation advances hold?

TerminalWorld, SpecBench, and MINTEval are all listed as developing. They aim to improve reproducibility and reward-modeling protocols.

How do new benchmarks address agent autonomy?

Recent papers on AI agents note that autonomy remains limited in practice. Benchmarks such as ESI-Bench push for better evaluation of active exploration.

TerminalWorld, a real-world terminal benchmark (62.5% maximum pass rate), extends coverage. The reward hacking measured by SpecBench and the broader agent evaluation crisis persist.

Updated May 23, 2026