AI Preprint Pulse · Apr 11 Daily Digest
Agent Evaluation Benchmarks
- 🔥 ClawBench: Can AI Agents Complete Everyday Online Tasks? New paper introducing ClawBench, a benchmark evaluating AI agents on everyday...

Created by Valerie Flynn
Daily top AI arXiv papers with abstracts and relevance notes
Hot trend in autoregressive LLM inference: aggressive techniques for parallel decoding and multi-token generation.
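One popular route to parallel decoding is draft-and-verify (speculative) decoding. As a rough illustration of the idea, here is a minimal toy sketch with hypothetical character-level "models" (the real techniques operate on token distributions and acceptance sampling, not exact matching):

```python
def speculative_decode(target_next, draft_next, prompt, k=3, max_new=6):
    """Toy draft-and-verify loop: a cheap draft model proposes k tokens
    per step; the target model verifies them left to right and keeps the
    agreeing prefix, so several tokens can land per step while the output
    stays identical to plain autoregressive decoding with the target."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) draft proposes k tokens autoregressively
        ctx, draft = out[:], []
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) target verifies, accepting the longest agreeing prefix
        accepted = 0
        for t in draft:
            if target_next(out) != t:
                break
            out.append(t)
            accepted += 1
        # 3) on the first disagreement, emit the target's own token
        if accepted < len(draft):
            out.append(target_next(out))
    return "".join(out[: len(prompt) + max_new])

# Hypothetical deterministic "models": the target alternates a/b,
# the draft always guesses 'b' (so it is only right after an 'a').
target_next = lambda ctx: "b" if ctx[-1] == "a" else "a"
draft_next = lambda ctx: "b"
print(speculative_decode(target_next, draft_next, "a"))  # -> abababa
```

The key property this sketch preserves is losslessness: the output matches what the target model alone would have generated, and speedup comes from how often the draft's guesses are accepted.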
New preprint rethinks poor generalization in reasoning SFT via a conditional analysis of the roles of optimization, data, and model capability. Essential read for understanding SFT bottlenecks.
Even in agentic search, core tech like BM25 (1994) + monoT5 (2020) lets a 20B agent rival GPT-5 (2025). SIGIR 2026 paper "Revisiting Text Ranking in Deep Research" argues traditional IR remains highly competitive.
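For context on why BM25 remains a strong first-stage retriever, here is a minimal sketch of the classic Okapi BM25 scoring formula on a toy corpus (whitespace tokenization and the example documents are illustrative simplifications, not the paper's setup):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with classic Okapi BM25:
    term-frequency saturation (k1) plus document-length normalization (b),
    weighted by inverse document frequency."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # document frequency: in how many docs does each term appear?
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = [
    "agentic search with traditional ranking",
    "bm25 ranking for text retrieval",
    "neural rerankers refine bm25 candidates",
]
scores = bm25_scores("bm25 ranking", docs)
print(scores.index(max(scores)))  # -> 1 (the doc matching both terms)
```

In the two-stage pipeline the paper revisits, a fast lexical scorer like this retrieves candidates, and a neural cross-encoder such as monoT5 reranks only that shortlist.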
New preprint spotlight:
New preprint SEVerA introduces verified synthesis methods for self-evolving agents, targeting reliable self-improvement vital for AI safety. Join the discussion.
New preprint 'The Depth Ceiling' spotlights the limits of large language models in discovering latent planning structures, urging deeper scrutiny of LLM planning depth.
DeonticBench launches as a new benchmark for reasoning over rules, probing LLMs on deontic obligations and permissions, which is vital for AI alignment.
MegaTrain introduces full-precision training of 100B+ parameter large language models on a single GPU, slashing barriers to massive model development.
Free-Range Gaussians introduces a core idea: instead of predicting Gaussians on pixel- or voxel-aligned grids, the Gaussians live freely in 3D space for...
Fresh arXiv preprint 'Learning to Retrieve from Agent Trajectories' spotlights trajectory-based retrieval for improving agents.
New preprint introduces Action Images, an approach to end-to-end policy learning via multiview video generation.
ThinkTwice introduces a unified approach to jointly optimizing large language models for reasoning and self-refinement.