Agent Scaling Pitfalls: Degradation, Vulns, Evals & Fixes
Key Questions
What pitfalls occur in agent scaling?
Agents suffer degradation, vulnerabilities like injection/poison, and poor generalization. Stanford research shows single agents outperform multi-agents in many cases.
What evaluations measure agent trustworthiness?
ClawArena and Claw-Eval provide trustworthy evals. AgentHazard (73%) and AgentSocialBench test leaks and privacy risks.
How does ThinkTwice improve reasoning?
ThinkTwice jointly optimizes LLMs for reasoning and self-refinement. It addresses base LLM flops without test-time adaptation (TTA).
What is Cog-DRIFT in RLVR?
Cog-DRIFT breaks exploration barriers in reinforcement learning via verifiable rewards (RLVR). It enhances agent reasoning scalability.
What security measures protect agents?
Security sandboxes and AgentSocialBench evaluate privacy. Papers cover traps, injections, and noisy supervision robustness.
What frameworks survey agent harnesses?
A survey reviews 22 agent harness systems. It highlights orchestration gaps and workflows vs. agents debate.
How do APO and DSPy/OPRO aid optimization?
APO uses DSPy/OPRO for automated prompt optimization. Self-Execution improves coding via agent trajectories.
What benchmarks test agentic skills?
Agentic skills use wild benchmarks; LightThinker++ manages memory. Holos scales multi-agents for web tasks.
Claude Sonnet 4.5 emotion vectors; OpenClaw chaos/safety; WSJ fails; Agent Traps/injection/poison; AgentHazard 73%/AgentSocialBench leaks; ClawArena/Claw-Eval trustworthy evals; agentic skills wild benchmarks; base LLMs generalization flops (no TTA); Stanford single > multi-agents; Cog-DRIFT RLVR; ThinkTwice reasoning/self-ref; Self-Execution coding; Learning from Agent Trajectories; SkillX/FileGram/LightThinker++; agent harness survey (22 systems); learnable TTA; noisy supervision; APO (DSPy/OPRO); security sandboxes; 6 layers/orch gap; workflows vs agents; Holos/Neuro-Symbolic/SSD/CORAL/Omni-SimpleMem/Raschka/Cyara.