Agent memory trustworthiness and interpretability gaps
Key Questions
What does the H2HMem benchmark evaluate?
H2HMem assesses agents' multimodal memory of human-human interactions, covering both dyadic and multi-party scenarios. Its evaluation dimensions include recall, reasoning, and application. The benchmark directly targets retrieval integrity.
How does H2HMem address memory degradation in agents?
H2HMem measures performance on recall and reasoning tasks, so a drop in those scores is how degradation is detected. Tracking this supports durable agent state over time. The signal's focus is on trustworthy memory mechanisms.
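As a rough illustration of detecting degradation from task scores, the sketch below compares mean recall accuracy in early sessions against later ones. It is not H2HMem's scoring code: the function names, the window size, and the 0.05 threshold are all hypothetical, and the scores are made-up inputs rather than benchmark results.

```python
from statistics import mean


def degradation(scores, window=2):
    """Return early-minus-late mean accuracy over per-session scores.

    `scores` is a hypothetical list of recall accuracies (0.0 to 1.0),
    one per session, in chronological order. A positive result means
    the later sessions scored lower than the earlier ones.
    """
    if len(scores) < 2 * window:
        raise ValueError("need at least 2 * window sessions")
    early = mean(scores[:window])   # baseline: first `window` sessions
    late = mean(scores[-window:])   # current: last `window` sessions
    return early - late


def is_degraded(scores, threshold=0.05, window=2):
    """Flag degradation when the drop exceeds an assumed threshold."""
    return degradation(scores, window) > threshold


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    history = [0.9, 0.9, 0.7, 0.7]
    print(is_degraded(history))
```

A fixed threshold is the simplest choice here; a real evaluation would more likely account for run-to-run variance before flagging a drop.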
What prior work informs agent memory trustworthiness research?
Signals include SkeMex, SubtleMemory, and ICML 2026 reliability papers. These address retrieval integrity and persistent memory needs. Microsoft Build sessions on memory are also referenced.
New signals today: SGDR (state-grounded dynamic retrieval for web agents; 10.6% gain on WebArena; code released), MemDreamer (hierarchical graph memory with agentic retrieval for long video), One Token per Multimodal Evidence (latent memory compression), and EEVEE (test-time prompt learning for self-improving agents).
Previous signals include the H2HMem multimodal memory benchmark, SkeMex, Bayesian-Agent, ICLR 2026 (MCIF benchmark), SubtleMemory, STRIDE, Offloading Score, the ICML 2026 agent reliability paper, Meta-Cognitive Memory Policy Optimization, and the Microsoft Build session on persistent memory.
The focus across these signals is retrieval integrity, memory degradation, and durable agent state.