AI Repo & Hardness

June 11, 2026

AI Repo & Hardness · Jun 11 Daily Digest

Self-Improving Agent Harnesses

🔥 Self-Harness: Introduces harnesses that rewrite themselves from run data rather than remaining fixed...

June 10, 2026

SGDR Enables Dynamic Skill Reuse for Web Agents

Static task-level skill retrieval falls short for web agents as page states evolve during execution. SGDR introduces state-grounded dynamic retrieval...

Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

arxiv.org

June 10, 2026

SearchSwarm Trains Delegation Intelligence for Long-Horizon Agents

New arXiv paper introduces SearchSwarm to tackle finite context limits in agentic LLMs via smart task delegation to subagents.

Defines delegation...

SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

arxiv.org

June 10, 2026

New Theoretical Tools for RL in LLMs

Two recent papers strengthen the foundations of RL post-training for LLMs by refining trust-region control and credit assignment.

Divergence...

Rethinking the Divergence Regularization in LLM RL

arxiv.org

June 10, 2026

Latent and Graph Memories Slash Multimodal Token Costs

Latent and graph-based memories are unlocking efficient long-context multimodal reasoning.

Latent Memory replaces each evidence item with one...

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

arxiv.org

June 10, 2026

Self-Improving Agent Harnesses

Agent scaffolds are shifting from static wrappers to learnable artifacts that rewrite themselves based on run outcomes. This turns manual upkeep into compounding gains for long-horizon systems.

June 10, 2026

Role-Agent: Dual-Role LLM Evolution Boosts Agent Performance

Role-Agent turns a single LLM into both agent and environment. It uses World-In-Agent state-prediction rewards and Agent-In-World failure-driven task retrieval to drive bootstrapped co-evolution, delivering gains of more than 4% on benchmarks.

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

arxiv.org

June 10, 2026

EEVEE Brings Test-Time Prompt Learning to Real-World Agent Streams

EEVEE is the first multi-dataset test-time prompt learning framework for LLM agents, deploying a router to cluster heterogeneous inputs and...

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

arxiv.org

June 10, 2026

Learned Relay Representations Let Diffusion Models Plan Ahead

Diffusion models can move beyond greedy per-step decisions with Learned Relay Representations (LRRs), enabling them to plan for the future.

June 10, 2026

New Benchmarks Signal Maturing Agent Evaluation

Two fresh benchmarks emphasize long-horizon rigor for agents and world models.

Workflow-GYM tests professional GUI workflows; even top models reach...

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

arxiv.org

June 10, 2026

RHO: Self-Optimizing LLM Agent Harnesses via Past Trajectories

RHO enables agents to self-improve their harness using only prior trajectories: it picks challenging tasks, generates rollouts, applies...

Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

arxiv.org

June 10, 2026

AI Repo & Hardness · Jun 10 Daily Digest

Inference Optimization Releases

🔥 End-to-End Context Compression at Scale: Presents Latent Context Language Models (LCLMs) with...

arxiv.org

End-to-End Context Compression at Scale

June 9, 2026

Two Fresh Paths to Sharper LLM Agents

Two new arXiv papers highlight complementary techniques for boosting agent performance via skill and trajectory refinement.

Bayesian-Agent models...

Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

arxiv.org

June 9, 2026

Benchmarks Shift Focus to Agent Exploration Skills

Two fresh benchmarks move beyond binary task success to probe deeper agent capabilities.

SWE-Explore isolates repository navigation, requiring...

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

arxiv.org

June 9, 2026

Memory Becomes the New AI Frontier

Memory mechanisms are rapidly evolving from simple agent stores to sophisticated world-model and context-compression systems.

Agentmemory delivers...

github.com

README.md - rohitg00/agentmemory

June 9, 2026

Agent Framework Stats: CrewAI, LangGraph & Top Contenders

Live GitHub data reveals clear leaders among agent frameworks.

LangChain tops at 138.8K stars; CrewAI follows with 53K stars and 14.9M monthly PyPI...

Agent Frameworks Compared | CrewAI, LangGraph ...

madebyagents.com

June 9, 2026

Scrutiny on AI Evaluation Methods Spurs Better Alternatives

Human questionnaires mischaracterize LLMs: lexical cues trigger desirable responses that realistic queries do not.
Demographic prompts shift...

Human Psychometric Questionnaires Mischaracterize LLM Behavior

arxiv.org

June 9, 2026

AI Repo & Hardness · Jun 09 Daily Digest

Evaluation Optimization Failure Modes

🔥 Gradient Dilution in LLM Judges: arXiv:2605.26046 documents 59% task-focus drop and Spearman rho...

June 8, 2026

AI Repo & Hardness · Jun 8 Daily Digest

Trending AI GitHub Repositories

🔥 Weekly Explosive Growth Lists: 5 GitHub Repos That Exploded This Week and Top 10 AI GitHub Repos This Week...

June 8, 2026

Failure Modes in Multi-Objective LLM Judge Prompt Optimization

Textual gradient methods for LLM judges face unique challenges when optimizing across multiple criteria simultaneously, unlike numerical multi-task...

When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

arxiv.org

June 8, 2026

US AI Safety Regulations Accelerating
