Software Tech Radar

Research advances: benchmarks, self-improving agents, memory/comms

Key Questions

What is ClawArena?

ClawArena benchmarks AI agents in evolving information environments, testing adaptability in the spirit of Stanford's efficiency studies.

What are GLM-5.1's benchmark achievements?

GLM-5.1, a 744B agentic engineering model available on Hugging Face, tops open-source models and ranks #3 globally on SWE-Bench Pro and Terminal-Bench.

What is Gemma4's performance?

Gemma4, a 26B MoE model, scores 89% on AIME and performs strongly with Hermes; it is available under Apache 2.0 with mobile tooling.

What advances in self-improving agents?

Qwen3.6 processes 1T tokens per day; Cog-DRIFT enables RLVR training from zero-reward examples; PageIndex implements vectorless RAG.
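
The "vectorless RAG" idea can be sketched briefly: instead of embedding chunks into a vector store, build a tree index of document sections and let a reasoning step navigate it toward the relevant leaf. This is a minimal illustrative sketch of that pattern, not PageIndex's actual API; the `relevance` scorer stands in for an LLM's "which section looks relevant?" judgment.

```python
# Toy vectorless retrieval: navigate a section tree instead of
# querying a vector store. All names here are illustrative.
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)

def relevance(query: str, node: Node) -> int:
    """Placeholder for an LLM 'which section is relevant?' call:
    count query words that prefix-match a word in the section."""
    tokens = set(re.findall(r"\w+", (node.title + " " + node.text).lower()))
    words = [w for w in query.lower().split() if len(w) > 2]
    return sum(any(t.startswith(w) for t in tokens) for w in words)

def retrieve(query: str, node: Node) -> Node:
    """Walk the tree greedily toward the most relevant leaf section."""
    while node.children:
        node = max(node.children, key=lambda c: relevance(query, c))
    return node

doc = Node("Handbook", children=[
    Node("Setup", text="Install the toolchain and configure paths."),
    Node("Deployment", children=[
        Node("Rollbacks", text="Use blue-green deploys to roll back fast."),
        Node("Scaling", text="Autoscale workers on queue depth."),
    ]),
])

print(retrieve("how do I roll back a deploy", doc).title)  # -> Rollbacks
```

The design point is that retrieval cost scales with tree depth rather than corpus size, and no embedding index has to be kept in sync with the documents.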

What is the Agent Reading Test?

It benchmarks how well AI coding agents read web content, providing scores for comparison.
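
A benchmark that "provides scores for comparison" typically reduces to a harness like the following hypothetical sketch: each item pairs page content with a question and a gold answer, and an agent's score is its exact-match accuracy. The items and the toy agent below are invented for illustration and are not the Agent Reading Test's actual contents.

```python
# Hypothetical scoring harness for a reading benchmark:
# score = fraction of (page, question) items answered correctly.
from typing import Callable

Item = tuple[str, str, str]  # (page_text, question, gold_answer)

def score(agent: Callable[[str, str], str], items: list[Item]) -> float:
    """Exact-match accuracy over benchmark items."""
    correct = sum(
        agent(page, q).strip().lower() == gold.lower()
        for page, q, gold in items
    )
    return correct / len(items)

def naive_agent(page: str, question: str) -> str:
    """Toy baseline 'agent': always return the last word of the page."""
    return page.split()[-1]

items = [
    ("The API rate limit is 100", "What is the rate limit?", "100"),
    ("Responses stream over SSE", "What transport is used?", "HTTP"),
]
print(score(naive_agent, items))  # -> 0.5
```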

What comms and memory research?

An LLM Wiki entry covers JEPA variations; an idea from Karpathy would replace RAG; Agentic-MME evaluates multimodal agent capabilities.
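
The core JEPA idea behind those variations can be sketched in a few lines: encode a context view and a target view, then train a predictor to match the target's *embedding* rather than reconstructing raw inputs. This is a toy conceptual illustration under assumed shapes, not any specific JEPA variant; in practice the target encoder is an EMA copy with gradients stopped.

```python
# Toy joint-embedding predictive loss: predict the target's latent
# from the context's latent, and compare in embedding space.
import numpy as np

rng = np.random.default_rng(0)
W_ctx = rng.normal(size=(8, 4))   # context encoder (trainable in real JEPA)
W_tgt = rng.normal(size=(8, 4))   # target encoder (frozen / EMA copy)
W_pred = rng.normal(size=(4, 4))  # predictor: context latent -> target latent

def jepa_loss(context: np.ndarray, target: np.ndarray) -> float:
    z_ctx = context @ W_ctx            # embed the visible context
    z_tgt = target @ W_tgt             # embed the masked-out target
    z_hat = z_ctx @ W_pred             # predict target latent from context
    return float(np.mean((z_hat - z_tgt) ** 2))  # latent-space MSE

x = rng.normal(size=(2, 8))  # two toy "patches" of one input
loss = jepa_loss(x[:1], x[1:])
print(loss >= 0.0)  # nonnegative scalar loss -> True
```

Comparing in latent space is what distinguishes this family from reconstruction-based objectives: the model never has to generate pixels or tokens, only predict representations.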

What other benchmarks and tools?

OpenWorldLib, CORAL, and ByteRover, plus context-engineering guides for LLMs.

What is Qwen 3.6 Plus's feat?

Qwen 3.6 Plus is the first model to break 1T tokens processed in a single day, and it excels on Opus tasks after 90M tokens.

Topics: ClawArena and Stanford efficiency studies; GLM-5.1 SOTA on SWE-Bench Pro and Terminal-Bench; Gemma4 26B MoE (89% AIME, Hermes); Qwen3.6; PageIndex; LLM Wiki; Agent Reading Test; Cog-DRIFT RLVR; OpenWorldLib; CORAL; ByteRover; LeCun's JEPA; OpenRouter Fusion.

Sources (37)
Updated Apr 8, 2026