Benchmarks, reproducibility & reward-modeling protocols improving agent evaluation
Key Questions
What is TerminalWorld and its benchmark score?
TerminalWorld (arXiv:2605.22535) benchmarks agents on real-world terminal tasks; the maximum pass rate reported is 62.5%. It extends benchmark coverage to real terminal environments.
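To make the "maximum pass rate" figure concrete, here is a minimal sketch of how such a number is typically derived: score each agent by the percentage of tasks it passes, then take the best score across agents. The function names (`pass_rate`, `max_pass_rate`) and the sample outcomes are hypothetical and are not taken from TerminalWorld; the benchmark's actual scoring procedure may differ.

```python
from typing import Dict, List


def pass_rate(results: List[bool]) -> float:
    """Percentage of tasks passed by a single agent."""
    if not results:
        raise ValueError("no task results to score")
    return 100.0 * sum(results) / len(results)


def max_pass_rate(runs: Dict[str, List[bool]]) -> float:
    """Best pass rate across all evaluated agents."""
    return max(pass_rate(outcomes) for outcomes in runs.values())


# Hypothetical per-task outcomes, for illustration only (not benchmark data).
runs = {
    "agent_a": [True, True, True, False],    # 3 of 4 tasks passed
    "agent_b": [True, False, True, False],   # 2 of 4 tasks passed
}

print(f"max pass rate: {max_pass_rate(runs):.1f}%")
```

A maximum is a ceiling across systems rather than a typical result, so it is best read alongside per-agent scores.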
What does SpecBench measure in coding agents?
SpecBench measures reward hacking in long-horizon coding agents. It highlights ongoing issues in agent evaluation protocols.
How does MINTEval evaluate memory?
MINTEval tests memory under multi-target interference in long contexts. Evaluated systems average only 27.9% accuracy on interference-heavy questions.
What crisis persists in agent evaluation?
SpecBench and related work indicate that reward hacking and reproducibility challenges remain unresolved. They underscore gaps in current benchmarking practices.
What is ESI-Bench focused on?
ESI-Bench targets embodied spatial intelligence through perception-action loops. It reveals that AI systems struggle more with active decision-making than with passive observation.
Which benchmark addresses GUI agents?
CutVerse provides a compositional benchmark for GUI agents in media post-production editing. It joins TerminalWorld in expanding real-world task coverage.
What status do these evaluation advances hold?
TerminalWorld, SpecBench, and MINTEval are still developing efforts. They aim to improve reproducibility and reward-modeling protocols.
How do new benchmarks address agent autonomy?
Recent papers on AI agents note that autonomy remains limited in practice. Benchmarks such as ESI-Bench push for better evaluation of active exploration.
In summary, TerminalWorld extends benchmark coverage to real terminal tasks (62.5% maximum pass rate), while SpecBench shows that reward hacking and the wider agent-evaluation crisis persist.