Agentic coding: Cursor/Claude Code regressions + metrics

Key Questions

What issues are plaguing Claude Code recently?

Claude Code is facing what users describe as a memory crisis: issue #44257 reports ignored configuration and race conditions involving CLAUDE.md, which degrade performance. Enterprise developers also report declining reliability in debugging and in multi-file, system-level tasks, and say shallow reasoning is eroding their trust.

What is GLM-5.1 and its key achievement?

GLM-5.1 is a 744B-parameter open-source LLM from Zhipu AI that scores a reported state-of-the-art 58.4% on SWE-Bench Pro for long-horizon agentic coding. It reportedly outscores models such as Opus 4.6 and GPT-5.4 and is built to sustain extended work sessions on the scale of an 8-hour workday.

What is OpenBrowser-AI?

OpenBrowser-AI is an open-source tool that connects AI agents to browsers over the raw Chrome DevTools Protocol (CDP), with no abstraction layer in between. The LLM writes Python that runs in a persistent namespace, so state carries over between steps. The project reports 100% task completion and a 59% cost reduction.
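The persistent-namespace idea can be sketched in a few lines of Python. This is a minimal illustration, not OpenBrowser-AI's actual code: the `PersistentSession` class and the `cdp_message` helper are hypothetical names invented here, and no browser connection is made. The only real protocol detail used is that a CDP command is a JSON object with `id`, `method`, and `params` fields, with `Page.navigate` as an example method.

```python
import io
import json
from contextlib import redirect_stdout


class PersistentSession:
    """Hypothetical helper: runs LLM-written snippets in one shared namespace."""

    def __init__(self):
        # A single dict is reused for every snippet, so variables persist.
        self.namespace = {"__name__": "agent_session"}
        self._next_id = 0
        # Expose a message builder to the snippets (assumed helper, not a real API).
        self.namespace["cdp_message"] = self.cdp_message

    def cdp_message(self, method, params=None):
        """Build a raw CDP command as a JSON string, with an incrementing id."""
        self._next_id += 1
        return json.dumps(
            {"id": self._next_id, "method": method, "params": params or {}}
        )

    def run(self, code):
        """Execute a snippet in the shared namespace and return what it printed."""
        buffer = io.StringIO()
        with redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


session = PersistentSession()
# Step 1: the agent stores a value.
session.run("target_url = 'https://example.com'")
# Step 2: a later snippet reuses it without redefining it.
output = session.run("print(cdp_message('Page.navigate', {'url': target_url}))")
print(output)
```

In a real agent the JSON string would be sent over the browser's DevTools WebSocket, and the snippets would need sandboxing, since `exec` runs arbitrary code. The sketch shows only why a persistent namespace lets each step build on the previous one.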

What improvements does Cursor 3 offer?

Cursor 3 launches a unified workspace for AI coding agents, providing a 1.84x speedup over previous versions. It shifts focus from autocomplete to broader agentic coding capabilities.

What is the rivalry between Copilot and Windsurf?

GitHub Copilot CLI introduces Rubber Duck in experimental mode, which is reported to boost AI performance by nearly 75%. Windsurf counters with Codemaps and SWE-1.5, which it claims runs 14x faster than Claude and is available for free.

Why are enterprise developers questioning Claude Code's reliability?

GitHub feedback and user reports describe declining effectiveness on complex engineering tasks such as debugging and multi-file changes. Critics also say that shallow reasoning in Claude apps is eroding trust.

What is the Agent Reading Test?

The Agent Reading Test is a benchmark that evaluates how well AI coding agents read web content. An agent is pointed at the test and receives a score. The project drew 42 points on Hacker News.

How does GLM-5.1 support long-horizon coding?

GLM-5.1's developer guide describes optimization over 600+ iterations for agentic coding. The model performs well on extended tasks, which positions it as a leading open-source option for software engineering.

In brief: Claude Code's memory crisis (CLAUDE.md issue #44257: ignored configs and race conditions) is compounded by critiques of trust erosion and shallow reasoning in Claude apps; GLM-5.1 (744B) reaches a state-of-the-art 58.4% on SWE-Bench Pro for long-horizon coding; OpenBrowser-AI, an open-source CDP agent, reports 100% task completion at 59% lower cost; Cursor 3 claims a 1.84x speedup; and the Copilot and Windsurf rivalry continues.
