Agentic AI momentum: DeepMind traps/CMU CAID/Qwen Trace2Skill/OpenClaw + Claude Computer/CUA/M2.7 + self-improving agents/@rasbt coding blocks + multi-agent math/MemFactory + PRBench + new benches + Meta FAIR math + AlphaEvolve/SKILL0 + @hardmaru automation + weekly papers + World Action Models/VLA robustness + streaming video baselines + autoresearch/test-time adaptation + Cog-DRIFT/SkillX/ClawArena/OpenWorldLib/Stanford multi-agent critique
Key Questions
What is Cog-DRIFT and how does it work?
Cog-DRIFT lets models learn from zero-reward examples in RLVR by reformulating hard problems into MCQ/cloze formats. This breaks the exploration barrier on problems where pass@64 = 0 under standard RLVR, which would otherwise yield no learning signal at all.
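The reformulation idea can be sketched in a few lines. This is a hypothetical illustration, not Cog-DRIFT's actual implementation: `reformulate_to_mcq` turns a free-form problem (where sampled answers never match, so reward is always 0) into a multiple-choice question whose verifiable reward is simply whether the chosen letter is correct.

```python
import random

def reformulate_to_mcq(problem, answer, distractors, rng=random.Random(0)):
    """Hypothetical sketch: recast an unsolved free-form problem
    (pass@k = 0) as MCQ so RLVR can assign a nonzero reward."""
    options = distractors + [answer]
    rng.shuffle(options)
    labels = "ABCD"
    prompt = problem + "\nChoose one:\n" + "\n".join(
        f"{labels[i]}. {opt}" for i, opt in enumerate(options)
    )
    correct = labels[options.index(answer)]
    return prompt, correct

def mcq_reward(model_choice, correct_label):
    # Verifiable reward: 1.0 if the chosen letter matches, else 0.0.
    return 1.0 if model_choice.strip().upper() == correct_label else 0.0
```

A cloze variant would instead mask part of a worked solution and reward exact recovery of the masked span; either way the point is that the reformulated task admits a checkable answer with non-trivial success probability.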
What is OpenWorldLib?
OpenWorldLib is a unified codebase and definition for advanced world models, as shared by @_akhaliq.
How does the Stanford paper view multi-agent systems?
The Stanford paper challenges the assumption that adding more agents always yields better results, arguing that well-designed single agents can match or exceed multi-agent setups.
What is World Action Models' advantage over VLAs?
World Action Models generalize better than Vision-Language-Action models (VLAs) in robustness studies.
What does PRBench reveal about agentic AI?
PRBench exposes failure modes of agentic systems when reproducing physics results.
What are key components of coding agents according to @rasbt?
@rasbt outlines the building blocks of coding agents, with reasoning among the core components.
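A common skeleton for such agents is a reason-act loop: the model reasons over the history, either calls a tool (run code, read a file) or emits a final answer, and each observation is appended to the context. The sketch below is a generic illustration of that pattern, not @rasbt's actual code; the `llm` and `tools` interfaces are assumed for the example.

```python
def run_agent(task, llm, tools, max_steps=8):
    """Minimal coding-agent loop (hypothetical sketch).
    `llm` maps a prompt to either {"final": ...} or
    {"tool": name, "args": kwargs}; `tools` maps names to callables."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        step = llm("\n".join(history))   # reasoning + proposed action
        if step.get("final"):            # agent decides it is done
            return step["final"]
        name, args = step["tool"], step["args"]
        observation = tools[name](**args)  # e.g. execute code, read a file
        history.append(f"Called {name}({args}) -> {observation}")
    return None                          # step budget exhausted
```

The `max_steps` cap is the usual guard against the agent looping forever; production agents add richer components (planning, memory, sandboxing) on top of this skeleton.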
What is Agentic-MME?
Agentic-MME evaluates what agentic capabilities bring to multimodal intelligence.
What new benchmarks are emerging for agentic AI?
New benchmarks include LIBERO-Para (paraphrase robustness for VLAs), SkillX (skill knowledge bases), ClawArena, and streaming-video baselines.
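A paraphrase-robustness benchmark in the LIBERO-Para style boils down to one comparison: success rate on canonical task instructions versus success rate on paraphrases of the same tasks. The sketch below assumes a simple `policy(instruction, task) -> bool` interface and an episode format invented for illustration; it is not LIBERO-Para's actual API.

```python
def paraphrase_robustness(policy, episodes):
    """Hypothetical sketch: measure how much a policy's success rate
    drops when task instructions are paraphrased."""
    orig = [policy(e["instruction"], e["task"]) for e in episodes]
    para = [policy(p, e["task"])
            for e in episodes for p in e["paraphrases"]]
    rate = lambda outcomes: sum(outcomes) / len(outcomes)
    return {
        "original": rate(orig),       # success on canonical wording
        "paraphrased": rate(para),    # success on reworded instructions
        "gap": rate(orig) - rate(para),  # robustness gap (0 = robust)
    }
```

A gap near zero means the policy keys on task semantics rather than surface wording, which is what these robustness benches are probing.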
Additional Topics
AlphaEvolve, SKILL0, @hardmaru on automation; @rasbt's coding-agent building blocks; @zainhasan6 on RL scaling; @_akhaliq shares Signals, Agentic-MME, Streaming Video, Token Warping, and OpenWorldLib; Stanford pushes back on multi-agent hype in favor of single agents; World Action Models outperform VLAs; PRBench exposes physics-reproduction failures; agent traps (Meta-Harness, @omarsar0); Vision2Web, OpenClaw, Terminal, CAID; METR, @GaryMarcus; Weston on RL; FIPO; @Suuraj on autoresearch; Cog-DRIFT reformulates zero-reward hard problems (pass@64 = 0) in RLVR into MCQ/cloze formats; SkillX, ClawArena, and LIBERO-Para for test-time agents.