AI Research Digest

Agentic systems, online learning, verification & security

Key Questions

What defines the agentic systems focus in this highlight?

It covers advances in theorem proving, self-verification, tool use, RL from verifiable rewards (RLVR), and security benchmarks for agents. Key themes include online learning, verifiable environments, and reward hacking mitigation.

How does Spreadsheet-RL improve agent performance?

It uses reinforcement learning to boost LLM agents from 12% to 23.4% on SpreadsheetBench tasks. This demonstrates gains in realistic, long-horizon agent workflows.
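To put the two figures in perspective, the short sketch below works out the absolute and relative improvement. The function name `improvement` is purely illustrative, and the inputs are simply the scores as quoted above; nothing here comes from the Spreadsheet-RL work itself.

```python
def improvement(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100.0
    return absolute, relative


# Scores as quoted in the digest for SpreadsheetBench, in percent.
abs_gain, rel_gain = improvement(12.0, 23.4)
print(f"absolute gain: {abs_gain:.1f} points")
print(f"relative gain: {rel_gain:.0f}%")
```

The gain is about 11.4 percentage points, or roughly a 95% relative improvement, so the reported score nearly doubles.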

What new benchmarks evaluate computer-use and GUI agents?

OSWorld and OpenCUA assess computer-use agents, OpenComputer provides verifiable software worlds, CutVerse covers GUI editing, and π-Bench evaluates proactive assistants. Together they highlight tool-use gaps, with reported figures of 26-54%.

How does OpenAI's geometry work advance autonomous discovery?

An OpenAI reasoning model autonomously disproved a 1946 planar unit distance conjecture by Erdős. This strengthens the case for AI as a co-discoverer in mathematics.

What methods support self-verification and reward assignment in agents?

The key methods are 30B-A3B self-verification, DelTA for discriminative token-level credit assignment, and CEPO for contrastive RLVR self-distillation. They target sparse rewards, while SpecBench measures reward hacking.

How do EnvFactory and Agent S scale tool-use agents?

EnvFactory synthesizes executable environments and pairs them with robust RL, while Agent S provides an Agent-Computer Interface (ACI) framework. Both improve scalability for complex agent tasks.

What security and verification concerns are addressed?

Gary Marcus highlights memory issues, while PopuLoRA uses LLM population self-play and IndusAgent focuses on anomaly detection. OpenComputer provides verifiable software worlds to reduce risks.

How does AVSD improve RL from verifiable rewards?

AVSD uses adaptive-view self-distillation and dense token rewards via privileged views to handle sparse outcome rewards. It enhances credit assignment in long-horizon agent training.

OProver agentic theorem proving; 30B-A3B self-verification; Gary Marcus on memory issues; tool-use gaps (26-54%).

New: EnvFactory (environment synthesis + RL); OSWorld/OpenCUA (computer agents); OpenComputer (verifiable worlds); CEPO (contrastive RLVR self-distillation); Agent S (ACI framework); CHI-Bench (clinical agents); CutVerse (GUI agents benchmark); SpecBench (reward hacking); PopuLoRA (LLM population self-play reasoning); OpenAI geometry breakthroughs (autonomous math discovery); IndusAgent (MLLM anomaly detection); π-Bench (proactive agents); DelTA (token-level RLVR credit assignment).

New: Spreadsheet-RL (RL agents improve from 12% to 23.4% on SpreadsheetBench); AVSD (dense token rewards via privileged views).

Updated May 24, 2026