AI Research Pulse · Mar 19 Daily Digest
Agent Evaluation Benchmarks
- 🔥 SWE-CI: evaluates agent capabilities on software maintenance tasks such as static bug fixing...

Created by Christopher Malcolm
High‑impact AI research summaries across core ML, applied AI, and safety/policy
V-Co takes a closer look at visual representation alignment via co-denoising. Paper: https://t.co/yFmatjr2xS.
InCoder-32B emerges as a code foundation model for industrial scenarios, while LLMs prove increasingly capable of automating programming tasks, including security-related ones. Together these signal rising momentum in bridging code generation and secure applications.
New book The Emerging Science of Machine Learning Benchmarks garners 35 points on Hacker News, spotlighting the evolving science behind ML benchmark design.
Multimodal generative models show remarkable progress in single-modality video and audio synthesis, yet a new arXiv paper advances diffusion models for truly joint audio-video generation.
This paper reframes prompt choice as a per-query decision problem for LLMs, using a learned offline proxy reward to score query-prompt pairs for scalable optimization.
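The selection loop implied above can be sketched in a few lines: score each (query, prompt) pair with an offline proxy reward and pick the argmax. The toy overlap-based scorer below is purely illustrative, standing in for the paper's learned proxy model; the function names are assumptions, not the paper's API.

```python
# Hypothetical sketch of per-query prompt selection via a proxy reward.
# A real proxy would be a trained offline reward model, not this heuristic.

def proxy_reward(query: str, prompt: str) -> float:
    """Stand-in proxy: Jaccard word overlap between query and prompt."""
    q, p = set(query.lower().split()), set(prompt.lower().split())
    return len(q & p) / (len(q | p) or 1)

def select_prompt(query: str, candidate_prompts: list[str]) -> str:
    """Pick the candidate prompt with the highest proxy score for this query."""
    return max(candidate_prompts, key=lambda p: proxy_reward(query, p))

prompts = [
    "Answer the math question step by step.",
    "Summarize the following code briefly.",
]
print(select_prompt("explain this code snippet", prompts))
# → Summarize the following code briefly.
```

Because the proxy is scored offline, candidate prompts can be ranked for many queries in batch without any LLM calls at selection time.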
Rapid gains in AI coding reliability: OpenAI's GPT-5.4 mini scores 54.4% on SWE-Bench Pro (up from 45.7% for GPT-5 mini) and runs 2x faster.
A new cognitive framework offers a fresh lens on AGI progress measurement, drawing strong interest with 58 points on Hacker News.
Emerging trend in AI agent diagnostics and verification:
Emerging AI papers push multimodal and agent boundaries:
Diffusion models run in a reflexive System 1 mode, hobbled by fixed, content-agnostic sampling schedules; the paper argues this rigidity, born from the curse of state, curbs intrinsic generative optimality.
Major agentic coding leaps this week:
Cognitive science reveals why AI systems don't truly learn autonomously, offering a sharp theoretical critique relevant to AI alignment discussions. The Hacker News thread drew 62 points.
AttnRes counters residual dilution in deep LLMs by selectively aggregating prior layers via learnable softmax attention.
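The AttnRes idea as summarized above can be illustrated with a minimal sketch: instead of a plain residual sum, prior layer outputs are aggregated with learnable softmax weights. Shapes, names, and initialization below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of softmax-weighted aggregation over prior layer outputs,
# standing in for AttnRes-style residual aggregation (details assumed).
rng = np.random.default_rng(0)
depth, d_model = 4, 8
prior_outputs = [rng.standard_normal(d_model) for _ in range(depth)]

# One learnable logit per prior layer; trained end-to-end in practice,
# randomly initialized here for illustration.
logits = rng.standard_normal(depth)
weights = np.exp(logits) / np.exp(logits).sum()  # softmax over prior layers

# Weighted aggregation replaces the uniform residual accumulation that
# dilutes early-layer signal in very deep stacks.
aggregated = sum(w * h for w, h in zip(weights, prior_outputs))
print(aggregated.shape)  # (8,)
```

The softmax keeps the mixture a convex combination, so the aggregated state stays on the same scale as any single layer's output regardless of depth.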
Encoder-decoder architectures fight back against decoder-only LLMs in multilingual NLP:
MR-Search introduces meta-RL with self-reflection for complex information seeking, yielding up to 19.3% benchmark gains via episode learning and multi-turn...
SGTR trains models to recognize their own text, reversing and preventing emergent misalignment (EM) harms. Miles Brundage spotlights this promising alignment defense.