Long-horizon agents, benchmarks, and unstable long-context safety

Key Questions

What benchmarks show degradation at long contexts?

Benchmarks such as LMEB, daVinci, SWE-CI, Claw-Eval, and AMA-Bench show performance degradation of more than 50% at 100K-token contexts, which points to instability in long-horizon agents. Independent evals and repros are urgently needed.
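The digest does not say how "degradation" is measured. One common reading, assumed here, is the relative drop from a short-context baseline score to the score at 100K tokens. The function name and the example scores below are hypothetical and are not taken from any of the listed benchmarks.

```python
def relative_degradation(short_score: float, long_score: float) -> float:
    """Fraction of the short-context score lost at long context.

    Assumption: degradation = (short - long) / short. The source does not
    define the metric, so treat this as one plausible convention.
    """
    if short_score <= 0:
        raise ValueError("short_score must be positive")
    return (short_score - long_score) / short_score


# Hypothetical scores: 0.80 at short context, 0.36 at 100K tokens.
drop = relative_degradation(0.80, 0.36)
print(f"{drop:.0%}")
```

Under this convention, a result above 0.5 would correspond to the "over 50%" figure quoted above.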

What is Cog-DRIFT?

Cog-DRIFT addresses the zero-reward pitfall in RLVR using a curriculum for hard problems. It breaks exploration barriers in LLM reasoning. Reproducibility and evaluations are priorities.
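To make the zero-reward pitfall concrete, here is a minimal sketch and not Cog-DRIFT's actual method, which the digest does not describe. It assumes a group-baseline setup in which each rollout's advantage is its reward minus the group mean, so a problem where every rollout scores 0 contributes no learning signal. The helper names and the filtering rule are hypothetical stand-ins for whatever curriculum the paper uses.

```python
from statistics import mean


def group_advantages(rewards):
    """Advantage of each rollout relative to the group mean (assumed baseline)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def has_learning_signal(rewards):
    """True if at least one rollout differs from the group mean."""
    return any(a != 0 for a in group_advantages(rewards))


def pick_trainable(pass_records):
    """Keep problems whose rollouts are neither all failures nor all successes.

    pass_records maps a problem id to a list of 0/1 verifiable rewards.
    """
    return [pid for pid, rewards in pass_records.items()
            if has_learning_signal(rewards)]


records = {
    "hard": [0, 0, 0, 0],    # every rollout fails: all advantages are zero
    "medium": [0, 1, 0, 0],  # mixed outcomes: usable signal
    "easy": [1, 1, 1, 1],    # every rollout succeeds: also no signal
}
print(pick_trainable(records))
```

A curriculum in this spirit would present easier variants of the "hard" problem until some rollouts succeed, rather than dropping it outright.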

Why do agent skills perform differently with curated vs. unfiltered retrieval?

Agent skills shine with curated toolboxes in demos but fail with unfiltered retrieval. This reveals gaps in real-world robustness. Curated setups mask underlying weaknesses.

What is ThinkTwice?

ThinkTwice jointly optimizes LLMs for reasoning and self-refinement, improving performance on complex tasks. It belongs to the same wave of agent advances as GLM-5.1.

What is Claw-Eval?

Claw-Eval aims for trustworthy evaluation of autonomous agents. It addresses reliability in long-horizon benchmarks. Repros and further evals are urgent.

What does GLM-5.1 achieve on SWE-Bench?

GLM-5.1, an open-source LLM, is reported to beat Opus 4.6 and GPT-5.4 on SWE-Bench Pro. It is described as sustaining an 8-hour workday of coding tasks. This marks a resurgence in open-source AI from China.

What safety issues arise in long-context scenarios?

Unstable long-context safety covers incidents such as the Claude Code source leak and concerns raised about models such as Kimi K2.5. Monitoring, hardening, and eviction strategies (KV/HISA) are critical. Dual-use capabilities raise alignment worries.
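As a rough illustration of what a KV eviction strategy can look like, the sketch below keeps a fixed number of leading entries plus a sliding window of the most recent ones and evicts everything in between. This is a generic policy chosen for illustration; the digest does not specify which eviction schemes it has in mind, and HISA is not modeled here. The class and parameter names are hypothetical.

```python
from collections import deque


class SlidingWindowKVCache:
    """Toy KV cache: pinned prefix plus a window of the most recent entries."""

    def __init__(self, max_entries: int, keep_prefix: int = 0):
        if not 0 <= keep_prefix <= max_entries:
            raise ValueError("keep_prefix must be between 0 and max_entries")
        self.keep_prefix = keep_prefix
        self.prefix = []  # earliest entries, never evicted
        # deque with maxlen drops the oldest entry on overflow
        self.recent = deque(maxlen=max_entries - keep_prefix)

    def append(self, position: int, kv) -> None:
        if len(self.prefix) < self.keep_prefix:
            self.prefix.append((position, kv))
        else:
            self.recent.append((position, kv))

    def positions(self):
        """Token positions currently held, oldest first."""
        return [p for p, _ in self.prefix] + [p for p, _ in self.recent]


cache = SlidingWindowKVCache(max_entries=4, keep_prefix=1)
for pos in range(10):
    cache.append(pos, kv=f"kv{pos}")  # placeholder for real key/value tensors
print(cache.positions())  # [0, 7, 8, 9]
```

Any policy of this kind discards context, so whatever is evicted (for example, an early safety instruction) is worth tracking when evaluating long-context behavior.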

What are the urgent priorities for long-horizon agents?

Priorities include evals and repros for Cog-DRIFT, agent skills, Claw-Eval, and AMA-Bench; KV/HISA eviction; monitoring; and hardening. Systems such as Holos, aimed at scalable multi-agent web tasks, are emerging. The situation is still developing.

LMEB, daVinci, SWE-CI, Claw-Eval, and AMA-Bench show >50% degradation at 100K tokens; Cog-DRIFT (curriculum for the RLVR zero-reward pitfall); agent skills (good with curated toolboxes, failing with unfiltered retrieval); ThinkTwice, GLM-5.1, the Claude leak, Gemma 4, Holocene, etc. Urgent: evals and repros (Cog-DRIFT, skills, Claw-Eval, AMA-Bench), eviction (KV/HISA), monitoring, hardening.

Sources (42)

Updated Apr 8, 2026

@omarsar0 reposted: Agent skills look great in demos. Hand them a curated toolbox, and they shine. ...

@EliasEskin reposted: Thrilled to share Cog-DRIFT 🎉🎉 Breaking the zero-reward pitfall for hard problem...

@EliasEskin reposted: 📢 Excited to share our new work on Cog-DRIFT! The core idea is quite simple yet...

ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Learning to Retrieve from Agent Trajectories

Localizing, Scaling, and Controlling Policy Circuits in Language Models

@EliasEskin reposted: 🚨Cog-DRIFT: Breaking the Exploration Barrier in RLVR RLVR has pushed LLM reason...

AI joins the 8-hour work day as GLM ships 5.1 open source LLM, beating Opus 4.6 and GPT-5.4 on SWE-Bench Pro

@adiyossLC reposted: 🚨New paper🚨 Self-Execution Simulation Improves Coding LLMs Current reasoning LL...

Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies

@Miles_Brundage reposted: 🚨New paper! How safe and aligned is Kimi K2.5? We found concerning dual-use ca...

Holos: Scalable LLM Multi-Agent System for Agentic Web

@_akhaliq: Agentic-MME What Agentic Capability Really Brings to Multimodal Intelligence? paper: https://t.co/...

Simple self-distillation improves code generation in large language models

CORAL: Multi-agent evolution for LLM discovery

Google's Gemma 4 Runs Frontier AI On A Single GPU

LLMs: Improving Latent Generalization via CoT

Black Hat USA 2025 | Universal and Context-Independent Triggers for Precise Control of LLM Outputs

ByteRover: Agent-Native Hierarchical LLM Memory

The Dawn of Gemma 4 Efficiency is King

Gemma 4: Byte for byte, the most capable open models

@jaseweston: 🧮 Reasoning over Mathematical Objects 🧮 Our 70-page(!) paper is out on arXiv, as covered by several...

Google And Nvidia Launch Gemma 4 AI Models For Data Centres And Edge Devices

Google battles Chinese open-weights models with Gemma 4

Google Gemma 4: The Open-Source AI Model Changing the Game | Stork.AI

Google Gemma 4 Developer Guide: Benchmarks & Local Setup | Lushbinary

@ClementDelangue reposted: Gemma 4 26B MoE (4B active) on a single RTX 4090: - 162 t/s decode - 8,400 t...

Efficient Context Expansion: Techniques Compared | NanoGPT

Claude Code Source Leak: 7 Agent Architecture Lessons

@Miles_Brundage reposted: Today, I'm releasing the first eval meant to test whether frontier models will h...

Everything That Happened in AI Today Thursday, April 2, 2026

@omarsar0: Can an AI agent run a startup for a year without going bankrupt? Turns out most can't. New benchma...

@_akhaliq reposted: Terminal Agents Suffice for Enterprise Automation ServiceNow research shows ter...

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

HippoCamp: Benchmarking Contextual Agents on Personal Computers

@omarsar0: Self-organizing agents work if built correctly.

@omarsar0: Most devs think that adding more agents to a planning system should help. The math says otherwise. ...

@CharlesVardeman reposted: Excited about our new paper: AI Agent Traps AI agents inherit every vulnerabil...

[AINews] The Claude Code Source Leak - Latent.Space

@omarsar0 reposted: NEW Stanford & MIT paper on Model Harnesses. Changing the harness around a ...

MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models
