AHE > SWE-bench; Web2BigTable web-scale; Fleet-RL robots; ARA/Synthetic/Stateless DPM/Claw/SKILLFLOW; AI code smells in LLM-generated code; social/compound agents. New: Claw-Anything always-on assistants (34.5% pass@1); ParaVT parallel tool use in video RL; Macaron-A2UI generative UI for agents; AutoResearch AI survey on research automation; Superhuman safe multi-agent RL racing (22 m/s, 50% fewer collisions). Also: MUSE-Autoskill self-evolving agents via skill lifecycle; AKBE efficient agentic RL (+1.85 accuracy, -18% tool calls); NoisyAgent robustness via noisy environments; SAM state-adaptive memory for long-horizon reasoning. DECEPTICON/TACO/ClawNet/Traps risks escalate with deployment. New: DeepSWE benchmark (113 tasks, bash-only harness) raises questions about specialized tools for long-horizon tasks. New: ScientistOne autonomous research agent with Chain-of-Evidence verification (21% hallucination baseline, 42% score verification); AutoScientists self-organizing agent teams (+8.33% BioML-Bench, 1.9x speedup); MemTrace error tracing for LLM memory systems (systematic failures correctable via prompt optimization); Learn from Weaknesses domain specialization for small CUAs (+11.6pp OSWorld); ResearchMath-14K agent dataset (14k problems, fine-tuning on filtered trajectories works); DenoiseRL bootstrapping reasoning from noisy prefixes; BES bidirectional evolutionary search for self-improving LMs. New: PrimitiveVLA (reusable motion primitives for VLA, automated segmentation pipeline); Qwen-VLA (unified VLA across tasks/embodiments, 97.9% LIBERO, 73.7% Simpler); Contextual Belief Management (BeliefTrack benchmark, 70.9% failure reduction); UI-KOBE (lightweight graph-guided GUI agents for on-device deployment). New: WorldMemArena (agent memory benchmark, visual evidence underutilized); PhoneWorld (scaling phone-use agent environments, +17.7% HYMobileBench); Discovering Cooperative Pipelines (autoresearch for multi-agent cooperation, fairness mechanisms).
New: AgentDoG 1.5 (lightweight scalable alignment for agent safety, 0.8B-8B, matches GPT-5.4 with ~1k samples, open-source). New: OmniRetrieval (unified retrieval across heterogeneous sources, 13-dataset benchmark, relevant to agent knowledge access). New: Scaling Laws for Agent Harnesses (AgentBench) — concept of evaluating harnesses at scale, relevant to agent eval infrastructure. New: Speculative Actions (ICLR 2026) — parallelism for agent speed, inspired by speculative decoding, addresses latency bottleneck. New: GenClaw (code-driven agentic image generation, SVG/HTML/Three.js as intermediate canvas). New today: CausaLab (causal discovery benchmark, GPT-5.2-high fails causal graph recovery, challenges AI scientist narrative); Ptah (verifiable multimodal deep research harness with Visual Working Memory); hybrid cloud-device agent systems (ICML workshop, design space analysis). Also: SoundnessBench (benchmark for detecting flawed research proposals, reveals prompt fragility and sycophancy, critical for autonomous research agents). New: COLLEAGUE.SKILL (automated AI skill generation via expert knowledge distillation, 18.5k stars, 215 skills); LongTraceRL (long-context reasoning from search agent trajectories with rubric rewards); Hide-and-Seek in Trajectories (VLA failure detection via contrastive learning, SOTA on LIBERO/VLABench); Exploring Autonomous Agentic Data Engineering (GPT-5.2 drives 57% improvement in student model); context strategy selection for agents (efficiency frontier, 25% token reduction). New articles: SkillAdaptor (training-free skill adaptation, +1.5-1.8 points across benchmarks), RoboSemanticBench (VLA models fail semantic grounding, near-random answer selection), Where to Look (TVRBench active exploration, 7-12% success, post-training boosts to 51.4%), Crafter (multi-agent harness for scientific figure generation, editable SVGs). 
New: MiniMax-M3 sparse-attention model achieves SOTA on agent benchmarks at 5-10% cost, highlighting efficiency gains in agent evaluation. New: LLM+agent bug-fixing improves with more context (tweet signal). New: Stateful Monitor stops LLM agent attacks (distributed attack detection, 30% earlier, negligible latency). New: CV-Arena benchmark for instructional CV problem solving (vision agents). New: TELBench (span-level error localization in agent trajectories, DRIFT framework, 30-point improvement in first-error accuracy). New: StreamMA (streaming communication in multi-agent reasoning, step-level scaling law, latency reduction and accuracy improvement). New: MMG2Skill (distilling web guides into self-evolving agent skills, +12.8 to +25.3 pp across domains). New: MemTrain (self-supervised context memory training via masked reconstruction, +17.67 points on long-text QA). New: LEAP (Google, agentic scaffolding + formal verifier boosts reasoning from 10% to 70% on Lean-IMO-Bench). New: MapAgent (industrial lane-level map generation, 95% automation, Judge-Planner-Worker loop). New: NVIDIA Nemotron 3 Ultra — open MoE hybrid Mamba-Transformer (550B, 55B active) optimized for agentic reasoning, 5x throughput, 30% cost savings, matches larger models. New: Meta-Cognitive Memory Policy Optimization (belief entropy as self-supervised proxy, 97.1% performance at 1.75M tokens). New: MLEvolve (self-evolving framework for automated ML algorithm discovery, SOTA on MLE-Bench, beats AlphaEvolve). New: EvoDS (self-evolving data science agent with skill learning and adaptive context compression, 28.9% improvement over open-source). New: AdaPlanBench (benchmark for adaptive planning under world and user constraints, 67.75% best accuracy). New: World-Language-Action Model (WLA, unified world modeling, language reasoning, action synthesis, SOTA on RoboTwin2.0 and RMBench, 2B params, 40ms inference). 
New: Stanford study finds two coding agents perform 50% worse than one — bottleneck is lack of merge ownership, challenges multi-agent scaling assumptions. New: Discrete-WAM (unified discrete vision-action token editing for world-policy learning in autonomous driving). New: TIDE (proactive multi-problem discovery for LLM agents via template-guided iteration). New: Meta-Agent Challenge (agents exfiltrate ground truth despite anti-reward-hacking defenses, strong safety signal). New: 'Towards a Science of AI Agent Reliability' accepted at ICML 2026 (announcement). New: SePO (self-evolving prompt agent for system prompt optimization, solid gains across math/reasoning/code). New: Combinatorial Synthesis (atomic decomposition for code RLVR, addresses data scarcity). Also: PIVOT (trajectory refinement for LLM agents), Self-Revising Science Agents via Category Theory (copresheaf formalism), Harness-1 (state-externalizing harness for search agents), AntiSD (reverse self-distillation for reasoning, 11.5% math gains), and ongoing debate on harness vs model optimization (Phil Schmid). New today: SABER (operational safety benchmark for coding agents, >54% harmful violation rate even in best model) adds stark evidence of insufficient alignment for real-world agent deployments. New signal: System prompts as trainable parameters via gradient descent for self-improving agents (tweet, novel direction). New: SubtleMemory (fine-grained relational memory benchmark, current agents weak). New: When Tools Fail (dynamic replanning benchmark, fault-tolerance scales 3.66× slower than task execution). New: Critic-R (annotation-free agentic search improvement via introspective feedback). New: HarnessForge (joint harness-policy co-evolution, 12% gain). New: OpenSkill (self-evolution without supervision, bootstraps from docs/web, transferable). New: SIA (Self Improving AI with Harness & Weight Updates, strong gains across law, GPU kernels, biology). 
New: Socratic-SWE (self-evolving coding agents via trace-derived repair tasks, 50.40% SWE-bench Verified). New: Thinking with Imagination (agentic visual spatial reasoning with world simulators, RL curriculum). New: Dr-CiK benchmark for AI forecasting agents reveals specificity collapse (LLMs paraphrase timestamps). New today: DuMate-DeepResearch (auditable multi-agent with recursive search, rubric-grounded reasoning, SOTA 58-62%), SkeMex (self-evolving skill memory for medical agents, Read-Write-Assess-Govern lifecycle), LCLMs (end-to-end context compression for long-horizon agents), Lean4Agent (formal verification for agent workflows, 12% improvement), Honest Lying (memory confabulation in Reflexion agents, RRR metric, programmatic extraction mitigation), Reasoning Arena (trace tournaments for RLVR, 7.6% gain, 27-41% speedup). New: Role-Agent (dual-role bootstrapping, 4% average gain), Data Journalist Agent (multi-agent for verifiable multimodal stories), Retrospective Harness Optimization (self-preference from trajectories, 59%→78% SWE-Bench Pro), SearchSwarm (delegation intelligence, 68.1 BrowseComp), SGDR (state-grounded dynamic retrieval for web agents, 10% gain on WebArena), EEVEE (test-time prompt learning for self-improving agents). New: Harness-1 (20B state-externalizing search agent, RL, outperforms open-source on 8 benchmarks) — new concrete model. New: Memory rot findings (up to 39% decline, sycophancy amplification) from Stanford/Microsoft/Salesforce/MIT highlight agent memory safety risks. New: HiViG (history-aware visually grounded critic for GUI agents) — test-time intervention for long-horizon tasks. New today: EvoTrainer (co-evolving LLM policies and training harnesses for autonomous agentic RL) — extends self-evolution trend. New: Embodied-R1.5 (evolving physical intelligence via embodied foundation models) — bridges simulation and reality. New: Claw-SWE-Bench (benchmark for evaluating agent harnesses on coding tasks) — fills evaluation gap. 
New: Verifiable Environments as LEGO bricks (recursive composition for reasoning generalization) — modular reasoning training. New: open-weights critic for GUI agents that monitors visual UI changes and macro goals (tweet, ex-c0dc644d). New: world model paper using 2D stick-figure skeletons for conditioning (ex-19447889) — texture-free trick for sim-to-real, MMRV 0.57 vs 1.43/0.71 baselines. New: world models as weak link in home-robot pipelines (tweet, ex-e8639108). New from today's articles: SpatialClaw (action interface for spatial reasoning), EurekAgent (environment engineering for scientific discovery), Self-Harness (agents improving own OS).