Long-horizon agents, benchmarks, and unstable long-context safety
Key Questions
What benchmarks show degradation at long contexts?
LMEB, daVinci, SWE-CI, Claw-Eval, and AMA-Bench show more than 50% performance degradation at a context length of 100K tokens, which points to instability in long-horizon agents. Independent evaluations and reproductions are urgently needed.
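To make the "more than 50% degradation" figure concrete, here is a minimal sketch of one way to compute it, assuming degradation is measured relative to a short-context baseline score (the source does not say whether the figure is relative or absolute). The function name and the example scores are hypothetical and are not results from any of the benchmarks above.

```python
def relative_degradation(short_score: float, long_score: float) -> float:
    """Fraction of the short-context score that is lost at long context."""
    if short_score <= 0:
        raise ValueError("short_score must be positive")
    return (short_score - long_score) / short_score


# Hypothetical scores for illustration only (not reported results).
example = relative_degradation(short_score=0.80, long_score=0.36)
exceeds_half = example > 0.5  # True when more than half the baseline is lost
```

Under this definition, a model that scores 0.80 on short inputs and 0.36 at 100K tokens has lost a little over half of its baseline performance.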
What is Cog-DRIFT?
Cog-DRIFT addresses the zero-reward pitfall in reinforcement learning with verifiable rewards (RLVR) by using a curriculum for hard problems, which breaks exploration barriers in LLM reasoning. Reproducing its results and evaluating it independently are priorities.
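One way to see why zero reward is a pitfall is sketched below, assuming a group-normalized advantage of the kind used in GRPO-style training. The source does not say which RLVR algorithm Cog-DRIFT targets, so this is an illustration of the general problem and not Cog-DRIFT's method; `group_advantages` and the reward lists are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled answers (assumed scheme)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A problem so hard that every sampled answer is wrong: all rewards are zero,
# so every advantage is zero and the policy receives no learning signal.
hard = group_advantages([0.0, 0.0, 0.0, 0.0])

# An easier variant where some samples succeed: advantages are nonzero,
# so there is something to learn from.
easier = group_advantages([0.0, 1.0, 0.0, 1.0])
```

A curriculum that reaches hard problems through variants where at least some samples succeed avoids the all-zero case shown in `hard`.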
Why do agent skills perform differently with curated vs. unfiltered retrieval?
Agent skills shine with curated toolboxes in demos but fail with unfiltered retrieval. This reveals gaps in real-world robustness. Curated setups mask underlying weaknesses.
What is ThinkTwice?
ThinkTwice jointly optimizes LLMs for reasoning and self-refinement, which improves performance on complex tasks. It is one of several recent agent advances, alongside GLM-5.1.
What is Claw-Eval?
Claw-Eval aims to provide trustworthy evaluation of autonomous agents and addresses reliability in long-horizon benchmarks. Reproductions and further evaluations are urgently needed.
What does GLM-5.1 achieve on SWE-Bench?
GLM-5.1, an open-source LLM, beats Opus 4.6 and GPT-5.4 on SWE-Bench Pro. It supports coding tasks that span an 8-hour workday. This marks a resurgence in open-source AI from China.
What safety issues arise in long-context scenarios?
Long-context safety remains unstable: examples include the Claude leak and concerns about models such as Kimi K2.5. Monitoring, hardening, and cache eviction strategies (KV/HISA) are critical. Dual-use capabilities also raise alignment worries.
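The source does not describe how KV or HISA eviction work. As a generic illustration of what KV-cache eviction means, the sketch below keeps the first few cached positions plus a window of the most recent ones under a fixed budget. This sink-plus-recent-window policy is a simple stand-in chosen for illustration and is not HISA; the function name and parameters are hypothetical.

```python
def positions_to_keep(cache_len: int, budget: int, n_sink: int = 4) -> list[int]:
    """Return the cached token positions retained under a fixed budget."""
    if budget < 0 or n_sink < 0:
        raise ValueError("budget and n_sink must be non-negative")
    if budget >= cache_len:
        return list(range(cache_len))  # nothing needs to be evicted
    n_sink = min(n_sink, budget)
    n_recent = budget - n_sink
    sinks = list(range(n_sink))  # earliest positions
    recent = list(range(cache_len - n_recent, cache_len))  # latest positions
    return sinks + recent


# Illustrative values: a cache of 10 positions trimmed to a budget of 6.
kept = positions_to_keep(cache_len=10, budget=6, n_sink=2)
```

Any policy of this kind discards the middle of the context, which is one reason eviction has to be evaluated together with long-context safety behavior.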
What are the urgent priorities for long-horizon agents?
Priorities include evaluations and reproductions for Cog-DRIFT, agent skills, Claw-Eval, and AMA-Bench; KV/HISA eviction; monitoring; and hardening. Systems such as Holos, for scalable multi-agent web tasks, are emerging. The situation is still developing.
LMEB, daVinci, SWE-CI, Claw-Eval, and AMA-Bench show more than 50% degradation at 100K tokens. Cog-DRIFT applies a curriculum to the zero-reward problem in RLVR. Agent skills work well with curated toolboxes but fail with unfiltered retrieval. Other items include ThinkTwice, GLM-5.1, the Claude leak, Gemma4, and Holocene. Urgent: evaluations and reproductions (Cog-DRIFT, skills, Claw, AMA), eviction (KV/HISA), monitoring, and hardening.