# Advancements in Techniques to Train, Compress, and Accelerate Large Language and Diffusion Models: The Latest Breakthroughs
The rapid evolution of large language models (LLMs) and diffusion models continues to redefine what is achievable in artificial intelligence. As models grow ever larger, the challenge shifts from raw capability to making them practical, efficient, and accessible across diverse environments—from cloud servers to resource-constrained edge devices. Recent innovations now blend sophisticated hardware-aware optimization, automated architecture discovery, advanced compression techniques, and novel training paradigms, collectively pushing the boundaries of scalable AI deployment.
## 1. Cutting-Edge Model Compression and Acceleration Strategies
### Quantization and Low-Rank Decompositions Meet New Hardware Innovations
**Quantization** remains foundational for reducing model size and inference latency. A notable breakthrough is the **Sparse-BitNet** approach, which quantizes weights to **ternary values** ({-1, 0, +1}, about **1.58 bits per weight**, since log₂ 3 ≈ 1.58) so that the zero level inherently introduces sparsity within weights. This synergy allows models to operate efficiently on low-resource devices **without significant accuracy degradation**, enabling widespread deployment on smartphones and embedded systems.
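To make the idea concrete, here is a minimal sketch of ternary quantization in the style popularized by BitNet b1.58: scale by the mean absolute weight, round to the nearest of {-1, 0, +1}, and store one scale per tensor. This illustrates the general recipe only; the actual Sparse-BitNet scheme may differ.

```python
import numpy as np

def ternary_quantize(W: np.ndarray, eps: float = 1e-8):
    """Quantize weights to {-1, 0, +1} with a per-tensor scale.

    Absmean scheme: divide by the mean absolute value, then round and
    clip to the three ternary levels. Weights near zero map to 0,
    which yields sparsity as a free by-product.
    """
    scale = np.abs(W).mean() + eps
    Wq = np.clip(np.round(W / scale), -1, 1)
    return Wq.astype(np.int8), float(scale)

def dequantize(Wq: np.ndarray, scale: float) -> np.ndarray:
    return Wq.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(256, 256)).astype(np.float32)
Wq, s = ternary_quantize(W)
sparsity = float((Wq == 0).mean())                     # fraction of zeroed weights
err = float(np.abs(W - dequantize(Wq, s)).mean())      # mean reconstruction error
```

Because each weight needs fewer than 2 bits plus a shared scale, matrix multiplies reduce to additions and sign flips on supporting hardware, and the zeros can be skipped entirely by sparse kernels.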
**Low-rank techniques** like **NOBLE** have gained prominence for their ability to decompose large matrices within transformer layers into smaller, more manageable components. This decomposition accelerates training and inference, especially for massive models, while reducing memory footprint. Such methods are pivotal for deploying **large-scale LLMs** in environments with limited computational capacity.
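The core low-rank idea can be sketched with a truncated SVD, which gives the best rank-r approximation of a weight matrix in Frobenius norm. This is a generic illustration of low-rank factorization, not the specific NOBLE algorithm.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Approximate W (d_out x d_in) as A @ B with A: d_out x r, B: r x d_in.

    One large matmul becomes two much cheaper ones, and parameter count
    drops from d_out * d_in to r * (d_out + d_in).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(1)
# synthetic weight that is approximately rank 16 plus small noise
W = rng.normal(size=(512, 16)) @ rng.normal(size=(16, 512)) \
    + 0.01 * rng.normal(size=(512, 512))
A, B = low_rank_factorize(W, rank=16)
params_full = W.size                 # 512 * 512 = 262,144
params_lowrank = A.size + B.size     # 2 * 512 * 16 = 16,384
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```

When transformer weight matrices are close to low rank (as they often are after training), the compression is nearly free in accuracy: here a 16x parameter reduction with a tiny relative error.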
### Novel KV Caching and Compact Attention Mechanisms
Recently introduced methods such as **Klein KV** have revolutionized **key-value (KV) caching** by integrating it directly into the model architecture. As detailed by the @bfl_ml team, Klein KV **reduces memory overhead** during inference, facilitating **long-context processing** critical for tasks like extended reasoning, multimodal understanding, and dialogue generation.
Complementary to this, the development of **attention matching** and **dynamic KV compression** techniques allows models to **compress key-value pairs on-the-fly**, maintaining high performance with minimal computational costs. These advances are crucial for **real-time applications** on edge devices where latency and resource constraints are paramount.
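One common on-the-fly compression policy keeps a short window of recent tokens plus the "heavy hitters" that have accumulated the most attention mass. The sketch below illustrates that generic eviction pattern; it is not Klein KV or any other specific published method.

```python
import numpy as np

def compress_kv(keys, values, attn_scores, budget: int, recent: int = 8):
    """Shrink a KV cache to at most `budget` entries.

    Always keeps the `recent` most recent tokens, then fills the rest of
    the budget with older tokens ranked by accumulated attention score
    (an illustrative heavy-hitter heuristic).
    """
    n = keys.shape[0]
    if n <= budget:
        return keys, values
    recent_idx = np.arange(n - recent, n)
    older = np.arange(n - recent)
    # rank older tokens by how much attention they have received so far
    keep_older = older[np.argsort(attn_scores[older])[::-1][: budget - recent]]
    keep = np.sort(np.concatenate([keep_older, recent_idx]))
    return keys[keep], values[keep]

rng = np.random.default_rng(2)
K = rng.normal(size=(64, 32))   # 64 cached tokens, head dim 32
V = rng.normal(size=(64, 32))
scores = rng.random(64)         # stand-in for accumulated attention mass
Kc, Vc = compress_kv(K, V, scores, budget=16)
```

The budget bounds memory regardless of context length, which is exactly what makes long-context inference viable on memory-constrained devices.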
### Hardware-Aware Model Evolution: ShinkaEvolve
The **ShinkaEvolve** framework, showcased by Robert Lange and Sakana AI Labs, introduces an **automated architecture discovery** process leveraging evolutionary algorithms. By tailoring transformer architectures to specific hardware profiles, ShinkaEvolve accelerates the creation of **hardware-optimized models** that are not only smaller but also faster and more energy-efficient. This approach significantly **reduces manual tuning efforts** and enables **rapid deployment** in diverse environments.
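The evolutionary loop behind hardware-aware architecture search can be sketched in a few lines: score candidate architectures with a fitness that rewards an accuracy proxy and penalizes exceeding a device's latency budget, then repeatedly select and mutate. The fitness function and search space here are toy placeholders, not ShinkaEvolve's actual objective.

```python
import random

def fitness(arch, latency_budget_ms: float = 5.0) -> float:
    """Toy hardware-aware fitness: reward capacity, penalize architectures
    whose estimated latency exceeds the device budget (illustrative only)."""
    depth, width = arch
    est_accuracy = 1 - 1 / (depth * width) ** 0.5   # diminishing returns on size
    est_latency = 0.01 * depth * width              # crude latency model
    penalty = max(0.0, est_latency - latency_budget_ms)
    return est_accuracy - 0.5 * penalty

def mutate(arch, rng: random.Random):
    depth, width = arch
    if rng.random() < 0.5:
        depth = max(1, depth + rng.choice([-1, 1]))
    else:
        width = max(8, width + rng.choice([-8, 8]))
    return (depth, width)

def evolve(generations: int = 200, pop_size: int = 16, seed: int = 0):
    rng = random.Random(seed)
    pop = [(rng.randint(2, 12), rng.choice(range(8, 129, 8)))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                      # keep the fittest half
        pop = parents + [mutate(rng.choice(parents), rng)   # refill with mutants
                         for _ in parents]
    return max(pop, key=fitness)

best = evolve()
```

Swapping in a fitness measured on the real target device (profiled latency, energy, memory) is what turns this generic loop into a hardware-aware search.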
## 2. Enhanced Pretraining, Fine-tuning, and Continual Learning Paradigms
### Handling Long Sequences and Instant Prefill
Emerging techniques like **FlashPrefill** are transforming how models handle **long-sequence processing**. By enabling **instantaneous prefill** of extended contexts, these methods drastically **speed up dialogue systems, multimodal tasks, and real-time applications**. Such speedups are critical as models are tasked with understanding and generating across **extended contexts** without latency bottlenecks.
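The prefill phase itself is easy to picture: instead of feeding the prompt token by token, one batched causally masked attention pass fills the entire KV cache at once. The sketch below contrasts the two and shows they produce identical outputs; it illustrates the general prefill concept, not FlashPrefill's specific optimizations.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prefill(Q, K, V):
    """Process the whole prompt in one causally masked attention pass."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # block future positions
    scores[mask] = -1e30
    return softmax(scores) @ V

def decode_one_by_one(Q, K, V):
    """Reference: the same computation done incrementally, token by token."""
    outs = []
    for t in range(Q.shape[0]):
        s = Q[t] @ K[: t + 1].T / np.sqrt(Q.shape[1])
        outs.append(softmax(s[None, :]) @ V[: t + 1])
    return np.vstack(outs)

rng = np.random.default_rng(5)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
```

The batched pass turns n small sequential matmuls into one large parallel one, which is why prompt ingestion is compute-bound and highly optimizable while decoding remains memory-bound.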
### Continual, Modular, and Lifelong Learning
Recent efforts focus on **robust online adaptation**, allowing models to **learn continuously** from streaming data **without catastrophic forgetting**. Modular architectures—such as combining LoRA modules with other adapters—support **incremental updates**, making models more adaptable to **new tasks and environments**.
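A LoRA module makes this modularity concrete: the base weight stays frozen, and each task adds only a small low-rank update. A minimal sketch of the standard LoRA forward pass:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight plus a trainable low-rank update:
    y = x W^T + (alpha / r) * x (B A)^T.

    Only A and B are trained, so each new task costs
    O(r * (d_in + d_out)) parameters instead of a full copy of W.
    """
    def __init__(self, W: np.ndarray, r: int = 4, alpha: int = 8, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                    # frozen base weight
        self.A = rng.normal(0, 0.01, size=(r, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, r))                 # trainable, zero-initialized
        self.scale = alpha / r

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x @ self.W.T + (x @ self.A.T) @ self.B.T * self.scale

rng = np.random.default_rng(3)
layer = LoRALinear(rng.normal(size=(64, 32)))
x = rng.normal(size=(4, 32))
y = layer.forward(x)
```

Because B starts at zero, a freshly attached adapter leaves the model's behavior unchanged, which is precisely what makes incremental, non-destructive updates safe: new tasks begin from the frozen model and diverge only as A and B are trained.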
The concept of **generalist priors** like **V_{0.5}** guides reinforcement learning (RL) processes to facilitate **lifelong skill acquisition**, especially in environments with **sparse rewards**. Simultaneously, **RL-based fine-tuning**—using techniques like **BandPO**—helps **align models with human preferences and safety constraints**, ensuring safer deployment in sensitive applications.
### Reinforcement Learning for Alignment and Safety
Innovations such as **BandPO**, which combines **trust-region methods with ratio clipping**, promote **stable and safe RL updates**. These are particularly important for **refining diffusion and language models** used in high-stakes scenarios, where model controllability and safety are paramount.
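The ratio-clipping mechanism referenced here is the heart of PPO's clipped surrogate objective, which bounds each policy update inside an implicit trust region. A minimal sketch of that standard loss (BandPO's exact formulation is not specified in public sources, so this shows only the well-known PPO building block):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO clipped surrogate loss (to be minimized).

    The probability ratio pi_new / pi_old is clipped to
    [1 - eps, 1 + eps]; taking the min removes any incentive to move
    the policy further than the clip range in a single update.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```

With a ratio of 2 and a positive advantage, the objective is capped at the clipped value 1.2, so the gradient stops pushing the policy further; small, in-range updates pass through unchanged. This is what makes the update "stable and safe" in the trust-region sense.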
## 3. Automating Architecture and Model Evolution
The advent of **ShinkaEvolve** signifies a shift toward **automatic, hardware-aware architecture search**. By employing **evolutionary strategies**, it identifies transformer variants optimized for specific **size, speed, and accuracy trade-offs**. This automation reduces manual engineering efforts and accelerates **tailored model deployment**, paving the way for more **resource-efficient AI systems**.
### Tree Search Distillation with PPO
Adding to the landscape, **Tree Search Distillation** utilizing **Proximal Policy Optimization (PPO)** introduces a **policy-guided distillation** approach. By employing **tree search algorithms** within a reinforcement learning framework, models can **distill knowledge more effectively**, especially in complex decision-making tasks. This technique enhances **sample efficiency** and **performance robustness**, particularly in multimodal and reasoning-intensive applications.
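The distillation step in such a pipeline typically trains the fast student policy against a target distribution derived from the search, e.g. normalized visit counts, as in AlphaZero-style training. The sketch below shows that generic cross-entropy objective; it is a hypothetical illustration of the pattern, not the specific Tree Search Distillation algorithm.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def search_distillation_loss(student_logits, visit_counts, temperature: float = 1.0):
    """Cross-entropy between a search-derived target policy and the student.

    The target is the (temperature-adjusted) normalized visit-count
    distribution from a tree search; minimizing this distills the
    search's decisions into the cheap student policy.
    """
    target = visit_counts ** (1.0 / temperature)
    target = target / target.sum(axis=-1, keepdims=True)
    logp = np.log(softmax(student_logits) + 1e-12)
    return -np.mean(np.sum(target * logp, axis=-1))
```

A student whose logits favor the search's preferred action incurs a lower loss than one that favors a rarely visited action, which is how the expensive search signal improves sample efficiency for the student.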
## 4. Supporting Techniques and Hardware Trends
### Routing, Prompt Steering, and Training-Free Refinement
Recent developments like **ReMix** leverage **dynamic routing** to **select and combine modules** (e.g., LoRA adapters) during inference, boosting **model versatility and speed**. Meanwhile, **prompt steering** methods such as **Prism-Δ** enable **precise control over model responses** through **differential subspace steering**, improving safety, relevance, and alignment with user intent.
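The dynamic-routing pattern can be sketched as a small gating network that scores each adapter per input, keeps the top-k, and mixes their outputs by renormalized weight. This is the generic mixture-of-adapters idea, not ReMix's actual router.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def route(x, router_W, adapters, top_k: int = 2):
    """Select the top-k adapters per input by gating score and blend
    their outputs, weighted by the renormalized scores."""
    scores = softmax(x @ router_W)               # (batch, n_adapters)
    top = np.argsort(scores, axis=-1)[:, -top_k:]
    out = np.zeros_like(x)
    for b in range(x.shape[0]):
        w = scores[b, top[b]]
        w = w / w.sum()                          # renormalize over selected adapters
        for k, a in zip(top[b], w):
            out[b] += a * adapters[k](x[b])
    return out

rng = np.random.default_rng(4)
d, n_adapters = 16, 4
router_W = rng.normal(size=(d, n_adapters))
# each "adapter" is a tiny residual transformation of the hidden state
mats = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_adapters)]
adapters = [lambda v, M=M: v + v @ M for M in mats]
x = rng.normal(size=(3, d))
y = route(x, router_W, adapters)
```

Since only k of the adapters run per input, compute stays nearly flat as more skills are added, which is the source of the versatility-with-speed trade-off described above.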
**Training-free image refinement** approaches, exemplified by **h-Transform**, facilitate **real-time multimodal pipeline improvements** without additional training, significantly **reducing deployment overhead**.
### Hardware Advances and On-Device AI
The deployment of **high-performance edge SoCs** equipped with **fast KV compression**, **optimized tensor cores**, and **dedicated AI accelerators** makes **on-device training and inference** increasingly feasible. These hardware innovations are vital for **embodied agents**, **robotics**, and **privacy-sensitive applications** where **latency, bandwidth, and data privacy** are critical considerations.
## 5. The Current Landscape and Future Outlook
The convergence of these innovations marks a **paradigm shift** in how large models are trained, compressed, and deployed. Techniques like **Klein KV** and **ShinkaEvolve** exemplify a move toward **hardware-aware optimization and automated architecture discovery**, drastically reducing manual effort and resource consumption.
Simultaneously, advances such as **FlashPrefill**, **training-free refinement**, and **policy-guided distillation (Tree Search Distillation with PPO)** are making models **faster, more adaptable, and safer** in real-world scenarios. The integration of **dynamic routing**, **prompt steering**, and **on-device AI hardware** ensures that models can operate efficiently **locally**, opening pathways for **wider adoption across industries**.
### Implications and Future Directions
Looking ahead, the field is poised for a landscape where **automated, hardware-adaptive, and resource-efficient models** become the norm. This will enable **wider deployment of multimodal, embodied, and personalized AI systems**—from autonomous robots to intelligent assistants—**breaking down computational barriers** and **enhancing AI accessibility**.
As these techniques mature, we can anticipate **more seamless integration of AI into everyday devices**, **improved safety and controllability**, and **greater sustainability** through reduced energy consumption. The ongoing synergy between **hardware innovations** and **algorithmic breakthroughs** promises to **accelerate AI's transformative impact** across all sectors.
---
**In summary**, recent breakthroughs exemplify an ecosystem where **model compression, hardware-aware search, advanced training paradigms, and innovative inference techniques** coalesce to produce **more efficient, adaptable, and safer AI systems**—bringing us closer to a future where large models are as practical as they are powerful.