Agent evaluation & traceability (AutoResearchBench, HorizonBench, ClawBench)

Key Questions

What does QUEST enable for deep research agents?

QUEST trains frontier deep research agents using fully synthetic tasks to overcome data limitations. It targets long-horizon gaps in agent evaluation and traceability.

How does Claw-Anything assess personal assistants?

Claw-Anything benchmarks always-on assistants with expanded access to user digital environments. It forms part of broader efforts like ClawBench for agent traceability.

What does AutoResearch AI aim to automate?

AutoResearch AI surveys and advances AI-powered automation for scientific discovery and research workflows. Related work includes AlphaProof Nexus solving Erdős problems.

How does system scaling differ from model scaling in agents?

"From Model Scaling to System Scaling" argues that progress in agentic AI depends on the system built around the model, not only on parameter counts, and points to harnesses such as CheetahClaws. This addresses evaluation gaps in long-horizon tasks.

What evidence supports AI math proof capabilities?

Google DeepMind’s AlphaProof Nexus and Axiom Math have produced formal proofs published in peer-reviewed journals. These demonstrate progress on complex problems like Erdős challenges beyond standard benchmarks.

AHE/Web2BigTable coding/web; ARA/Claw-Eval/SKILLFLOW; ProgramBench repo 0%; AcademiClaw students; MolmoAct2 deploy; T^2PO RL; PhysicianBench EHR; STABLEVAL stable; From Context skills.

New: From Model Scaling to System Scaling (CheetahClaws harness); QUEST deep research agents (synthetic tasks); Claw-Anything always-on assistants; AutoResearch AI survey; AlphaProof Nexus solves Erdős problems; Axiom Math formal proofs published.

Long-horizon gaps persist.
