Agent Scaling Pitfalls: Degradation, Vulns, Evals & Fixes

Key Questions

What pitfalls occur in agent scaling?

Agents suffer degradation, vulnerabilities like injection/poison, and poor generalization. Stanford research shows single agents outperform multi-agents in many cases.

What evaluations measure agent trustworthiness?

ClawArena and Claw-Eval provide trustworthy evals. AgentHazard (73%) and AgentSocialBench test leaks and privacy risks.

How does ThinkTwice improve reasoning?

ThinkTwice jointly optimizes LLMs for reasoning and self-refinement. It addresses base LLM flops without test-time adaptation (TTA).

What is Cog-DRIFT in RLVR?

Cog-DRIFT breaks exploration barriers in reinforcement learning via verifiable rewards (RLVR). It enhances agent reasoning scalability.

What security measures protect agents?

Security sandboxes and AgentSocialBench evaluate privacy. Papers cover traps, injections, and noisy supervision robustness.

What frameworks survey agent harnesses?

A survey reviews 22 agent harness systems. It highlights orchestration gaps and workflows vs. agents debate.

How do APO and DSPy/OPRO aid optimization?

APO uses DSPy/OPRO for automated prompt optimization. Self-Execution improves coding via agent trajectories.

What benchmarks test agentic skills?

Agentic skills use wild benchmarks; LightThinker++ manages memory. Holos scales multi-agents for web tasks.

Claude Sonnet 4.5 emotion vectors; OpenClaw chaos/safety; WSJ fails; Agent Traps/injection/poison; AgentHazard 73%/AgentSocialBench leaks; ClawArena/Claw-Eval trustworthy evals; agentic skills wild benchmarks; base LLMs generalization flops (no TTA); Stanford single > multi-agents; Cog-DRIFT RLVR; ThinkTwice reasoning/self-ref; Self-Execution coding; Learning from Agent Trajectories; SkillX/FileGram/LightThinker++; agent harness survey (22 systems); learnable TTA; noisy supervision; APO (DSPy/OPRO); security sandboxes; 6 layers/orch gap; workflows vs agents; Holos/Neuro-Symbolic/SSD/CORAL/Omni-SimpleMem/Raschka/Cyara.

Sources (64)

Updated Apr 8, 2026

Agent Scaling Pitfalls: Degradation, Vulns, Evals & Fixes

Key Questions

What pitfalls occur in agent scaling?

What evaluations measure agent trustworthiness?

How does ThinkTwice improve reasoning?

What is Cog-DRIFT in RLVR?

What security measures protect agents?

What frameworks survey agent harnesses?

How do APO and DSPy/OPRO aid optimization?

What benchmarks test agentic skills?

ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

@omarsar0: NEW paper on multi-agents from Stanford. More agents, better results, right? Not so fast. This pa...

@EliasEskin reposted: 🚨Cog-DRIFT: Breaking the Exploration Barrier in RLVR RLVR has pushed LLM reason...

Agent Harness for Large Language Model Agents: A Survey[v1] | Preprints.org

Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies

Can LLMs Learn to Reason Robustly under Noisy Supervision?

LightThinker++: From Reasoning Compression to Memory Management

FileGram: Grounding Agent Personalization in File-System Behavioral Traces

Holos: Scalable LLM Multi-Agent System for Agentic Web

Engineering Storefronts for Agentic Commerce – O’Reilly

Your AI Agent Depends on Six Layers — Here's Which Ones Won't Last

AI Agents | Description, Examples, and Security! | Snyk Learn

AI Agent Testing Frameworks: How to Validate AI Systems

From Alchemy to Math: The Dawn of Automated Prompt Engineering | Level Up Coding

AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks

AgentHazard Benchmark Finds Computer-Use Agents Fail Safety Tests at High Rates – MegaOne AI

Neuro-Symbolic Dual Memory for Long-Horizon LLM Agents

AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

Agentic AI could amplify data breaches through system-wide leaks

@_akhaliq: Signals Trajectory Sampling and Triage for Agentic Interactions paper: https://t.co/XPfBucLx0i htt...

Managing Your AI Agent Ecosystem

Karpathy’s No‑Code AI Memory System

9 Open Agents That Improve Themselves

AI Agents of the Week: Papers You Should Know About

STOP GUESSING: 6 Leaderboards/Benchmarks You Need Before Choosing an LLM (2026)

@rosstaylor90: 🌶️ One more spicy take while I am jet lagged and less inhibited than usual: We expect agents to be ...

Wall Street Just Bet $285 Billion on AI Agents. The Best One Barely Works.

Gemini 4 Explained: Multi-Million Context, Agentic AI & The Real Truth

Optimizing LLM Agent Workflows: Static to Dynamic Graphs

ByteRover: Agent-Native Hierarchical LLM Memory

@jaseweston: 🧮 Reasoning over Mathematical Objects 🧮 Our 70-page(!) paper is out on arXiv, as covered by several...

@roydanroy: Gemini has been posting its solutions directly to https://t.co/fqfl9BoXzj. Everyone is still in the ...

Omni-SimpleMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory

@_akhaliq reposted: Vision2Web Evaluating coding agents on 193 real-world tasks across static, inte...

Anthropic Says It Has Identified Vectors Relating To Different Emotions Within Its AI Models

How to Manage LLM Memory Effectively for Smarter AI Agents

The Evolution of Tool Use in LLM Agents: From Single-Tool Call to ...

[PDF] Adaptive Theory of Mind for LLM-based Multi-Agent Coordination

What are Agent Skills and How Agents Use Them?

@diptanu: Sandbox infrastructure for automation of RL environments has a different set of priorities than infr...

@_akhaliq: Do Phone-Use Agents Respect Your Privacy? paper: https://t.co/1yGuE9cpl6 https://t.co/gZWQoTPASr

@_akhaliq reposted: Terminal Agents Suffice for Enterprise Automation ServiceNow research shows ter...

@emollick: New report from us: Can you prompt inject your way to an “A”? As LLMs increasingly are used as judg...

AI Agents and Agentic Protocols for Telecom Networks

How to Build AI Agents: From Prompts to Autonomous Systems

RAG Is Dead. Just Give Your Agent a File System.

Reasoning Shift: How Context Silently Shortens LLM Reasoning

Stop Calling LLMs Like APIs (Do This Instead) | Rynaut - The Agentic Architect

Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants

The 5 Types of AI Agent Memory Every Developer Needs to Know (Part 1) - DEV Community

ARC Engine: Deterministic and Verifiable LLMs

Securing the Cognitive Layer: A Survey on Security Threats, Defenses, and Privacy-Preserving Architectures for LLM-IoT Integration

The Architecture of AI Agent Traps

LLM Consensus Matches or Outperforms the Best AI Models in Expert Evaluation Without Performance Degradation - The Providence Journal

@omarsar0: NEW paper from Google DeepMind The biggest threat to AI agents isn't a smarter attacker. It's the w...

@omarsar0: Self-organizing agents work if built correctly.

@_akhaliq: FIPO Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization paper: https://t.co/5G...

@minchoi: This paper is wild. New paper says even rational users can spiral into delusions from sycophantic c...

@_akhaliq: LongCat-Next Lexicalizing Modalities as Discrete Tokens paper: https://t.co/gKUZvc4KQ0 https://t.c...

From Paraphrases to Diagnostics: A Fine-Grained Framework for LLM Auditing (EACL 2026)

@lennysan reposted: wow. @Microsoft just appointed a CVP whose entire job is OpenClaw. https://t.co/...

@omarsar0 reposted: NEW Stanford &amp; MIT paper on Model Harnesses. Changing the harness around a ...

Clawing Back on Security: Challenges with Agentic AI Systems

MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models

@omarsar0 reposted: NEW Stanford & MIT paper on Model Harnesses. Changing the harness around a ...