Agentic AI momentum: DeepMind traps/CMU CAID/Qwen Trace2Skill/OpenClaw + Claude Computer/CUA/M2.7 + self-improving agents/@rasbt coding blocks + multi-agent math/MemFactory + PRBench + new benches + Meta FAIR math + AlphaEvolve/SKILL0 + @hardmaru automation + weekly papers + World Action Models/VLA robustness + streaming video baselines + autoresearch/test-time adaptation + Cog-DRIFT/SkillX/ClawArena/OpenWorldLib/Stanford multi-agent critique
Key Questions
What is Cog-DRIFT and how does it work?
Cog-DRIFT lets models learn from zero-reward examples in RLVR by reformulating hard problems into MCQ/cloze formats. This breaks the exploration barrier on problems where pass@64 = 0 under standard RLVR, which would otherwise yield no learning signal at all.
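The reformulation idea can be sketched in a few lines. This is a hypothetical illustration, not Cog-DRIFT's actual implementation: `reformulate_to_mcq` turns a free-form problem (where sampled answers never match, so reward is always 0) into a multiple-choice question whose verifiable reward is simply whether the chosen letter is correct.

```python
import random

def reformulate_to_mcq(problem, answer, distractors, rng=random.Random(0)):
    """Hypothetical sketch: recast an unsolved free-form problem
    (pass@k = 0) as MCQ so RLVR can assign a nonzero reward."""
    options = distractors + [answer]
    rng.shuffle(options)
    labels = "ABCD"
    prompt = problem + "\nChoose one:\n" + "\n".join(
        f"{labels[i]}. {opt}" for i, opt in enumerate(options)
    )
    correct = labels[options.index(answer)]
    return prompt, correct

def mcq_reward(model_choice, correct_label):
    # Verifiable reward: 1.0 if the chosen letter matches, else 0.0.
    return 1.0 if model_choice.strip().upper() == correct_label else 0.0
```

A cloze variant would instead mask part of a worked solution and reward exact recovery of the masked span; either way the point is that the reformulated task admits a checkable answer with non-trivial success probability.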
What is OpenWorldLib?
OpenWorldLib is a unified codebase and definition for advanced world models, as shared by @_akhaliq.
How does the Stanford paper view multi-agent systems?
The Stanford paper challenges the assumption that adding more agents always yields better results, arguing that well-designed single agents can match or exceed multi-agent setups.
What is World Action Models' advantage over VLAs?
World Action Models generalize better than Vision-Language-Action models (VLAs) in robustness studies.
What does PRBench reveal about agentic AI?
PRBench exposes failure modes of agentic systems when reproducing physics results.
What are key components of coding agents according to @rasbt?
@rasbt outlines the building blocks of coding agents, with reasoning among the core components.
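A common skeleton for such agents is a reason-act loop: the model reasons over the history, either calls a tool (run code, read a file) or emits a final answer, and each observation is appended to the context. The sketch below is a generic illustration of that pattern, not @rasbt's actual code; the `llm` and `tools` interfaces are assumed for the example.

```python
def run_agent(task, llm, tools, max_steps=8):
    """Minimal coding-agent loop (hypothetical sketch).
    `llm` maps a prompt to either {"final": ...} or
    {"tool": name, "args": kwargs}; `tools` maps names to callables."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        step = llm("\n".join(history))   # reasoning + proposed action
        if step.get("final"):            # agent decides it is done
            return step["final"]
        name, args = step["tool"], step["args"]
        observation = tools[name](**args)  # e.g. execute code, read a file
        history.append(f"Called {name}({args}) -> {observation}")
    return None                          # step budget exhausted
```

The `max_steps` cap is the usual guard against the agent looping forever; production agents add richer components (planning, memory, sandboxing) on top of this skeleton.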
What is Agentic-MME?
Agentic-MME evaluates what agentic capabilities bring to multimodal intelligence.
What new benchmarks are emerging for agentic AI?
New benchmarks include LIBERO-Para (paraphrase robustness for VLAs), SkillX (skill knowledge bases), ClawArena, and streaming-video baselines.
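A paraphrase-robustness benchmark in the LIBERO-Para style boils down to one comparison: success rate on canonical task instructions versus success rate on paraphrases of the same tasks. The sketch below assumes a simple `policy(instruction, task) -> bool` interface and an episode format invented for illustration; it is not LIBERO-Para's actual API.

```python
def paraphrase_robustness(policy, episodes):
    """Hypothetical sketch: measure how much a policy's success rate
    drops when task instructions are paraphrased."""
    orig = [policy(e["instruction"], e["task"]) for e in episodes]
    para = [policy(p, e["task"])
            for e in episodes for p in e["paraphrases"]]
    rate = lambda outcomes: sum(outcomes) / len(outcomes)
    return {
        "original": rate(orig),       # success on canonical wording
        "paraphrased": rate(para),    # success on reworded instructions
        "gap": rate(orig) - rate(para),  # robustness gap (0 = robust)
    }
```

A gap near zero means the policy keys on task semantics rather than surface wording, which is what these robustness benches are probing.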
Additional Topics
AlphaEvolve, SKILL0, @hardmaru on automation; @rasbt's coding-agent building blocks; @zainhasan6 on RL scaling; @_akhaliq shares Signals, Agentic-MME, Streaming Video, Token Warping, and OpenWorldLib; Stanford pushes back on multi-agent hype in favor of single agents; World Action Models outperform VLAs; PRBench exposes physics-reproduction failures; agent traps (Meta-Harness, @omarsar0); Vision2Web, OpenClaw, Terminal, CAID; METR, @GaryMarcus; Weston on RL; FIPO; @Suuraj on autoresearch; Cog-DRIFT reformulates zero-reward hard problems (pass@64 = 0) in RLVR into MCQ/cloze formats; SkillX, ClawArena, and LIBERO-Para for test-time agents.