LLM Innovation Tracker

4h ago

Learning to Retrieve from Agent Trajectories

Agentic memory advance: a paper on learning to retrieve from agent trajectories, a key technique for improving memory in embodied and deployed agents.

arxiv.org

4h ago

MedGemma 1.5 Technical Report Released

The MedGemma 1.5 Technical Report is now available. A key read for medical-domain LLMs.

arxiv.org

MedGemma 1.5 Technical Report

4h ago

MegaTrain: Full-Precision Training of 100B+ LLMs on a Single GPU

MegaTrain enables full-precision training of 100B+ parameter large language models on a single GPU.

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

arxiv.org

4h ago

New Benchmarks Spotlight Agentic LLM Reliability Gaps

Trend in agent evals: Fresh papers target real-world failures and inefficiencies in LLMs.

Wild settings benchmark for LLM agentic skill usage

How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

arxiv.org

4h ago

Anthropic's interpretability advances uncover LLM persona drifts and latent deception

Persona drift revealed: LLMs shift behavioral modes mid-conversation via "Assistant Axis," especially with emotionally vulnerable users.

4h ago

AI Integrations Accelerate: Grok Powers X Features, Atlassian Embeds Agents in Confluence

Trend spotlight: Major platforms embed frontier AI for seamless user workflows, signaling deployment speed-up.

X deploys Grok for global...

X is rolling out automatic translation and photo editing powered by Grok

techcrunch.com

4h ago

Claude Mythos: Cyber Master Key Sparks Terror and New Safety Tech

Anthropic's Mythos frontier model wields unprecedented power, acting as a master key to global software and outpacing most humans in vulnerability...

4h ago

Marc Andreessen: AI Ends Security Through Obscurity

Marc Andreessen charts AI's security revolution timeline:

Computers ran on flawed 'security through obscurity' until now
Every AI-discovered flaw...

4h ago

China's AI Demand: ~2.8M H100e Matches Supply-Side Estimate

A demand-side analysis estimates that China's AI ecosystem needs ~2.8 million H100-equivalent GPUs, nearly identical to the supply-side tally of ~2.7M.

Key...

How Much Compute Does China Have? (Part 2)

substack.com

4h ago

Flow Map Language Models: Future of Non-Autoregressive Text Generation

Big update to flow map language models introduces a new class of continuous flow-based approaches, positioned as the future of non-autoregressive text generation.

4h ago

Cog-DRIFT: RLVR Learns from Hard Zero-Reward Examples via Proximal Development

Cog-DRIFT enables models to learn from zero-reward examples in RLVR, breaking the exploration barrier.
RLVR stalls on hard problems with no...

4h ago

Gemma 4 Delivers GPT-5 Performance on Phones

Gemma 4 hits GPT-5-level performance, which was SOTA just 8 months ago, and runs entirely on your phone. Demis Hassabis reposts the breakthrough hype.

4h ago

Video-MME-v2 Elevates Video Understanding Benchmarks

Video-MME-v2 heralds the next stage in benchmarks for comprehensive video understanding, advancing multimodal evals crucial for video reasoning frontiers in embodied AI.

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

arxiv.org

4h ago

ThinkTwice: Jointly Optimizing LLMs for Reasoning and Self-Refinement

ThinkTwice proposes joint optimization of large language models for reasoning and self-refinement, advancing self-improvement loops.

ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

arxiv.org

4h ago

8h ago