AI Research & Policy Brief

Surge in agentic training methods and RL-driven tool use

Key Questions

What is Nemotron 3 Super?

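Nemotron 3 Super is an open, efficient Mixture-of-Experts (MoE) model built on a hybrid Mamba-Transformer architecture and designed for agentic reasoning. The hybrid design pairs Mamba state-space layers with Transformer attention layers, while MoE routing activates only a subset of experts per token. It is one of several releases that reflect the surge in agentic training methods.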
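The routing idea behind a Mixture-of-Experts layer can be illustrated with a minimal sketch. This is generic top-k routing on toy scalar "experts", not Nemotron 3 Super's actual router; the function names, the value k=2, and the toy experts are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so they sum to 1 over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Hypothetical toy experts: each one is just a scalar function.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]

# Experts 1 and 3 tie for the highest router score, so each gets weight 0.5.
y = moe_forward(3.0, experts, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
```

Only two of the four experts run for this input, which is where the efficiency of sparse MoE models comes from.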

What is Lightning OPD?

Lightning OPD is an efficient post-training method for large reasoning models based on offline on-policy distillation. It is intended to make distillation from large teacher models effective. Related work re-examines the phenomenology of on-policy distillation and the recipes used for it.
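As a rough illustration of the signal that on-policy distillation typically optimizes, the sketch below computes a per-token reverse KL between student and teacher distributions along a sequence assumed to have been sampled from the student. It is a generic textbook formulation under stated assumptions, not Lightning OPD's recipe; the brief does not describe how the offline variant obtains or stores teacher outputs, and the function names here are hypothetical.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one token position.

    eps guards against log(0); both inputs are probability lists
    over the same vocabulary.
    """
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )

def distillation_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL over a student-sampled sequence.

    Each argument is a list of per-position distributions. The sequence is
    assumed to have been generated by the student (the on-policy part).
    """
    kls = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(kls) / len(kls)

# Toy two-token vocabulary, two positions.
student = [[0.9, 0.1], [0.5, 0.5]]
teacher = [[0.5, 0.5], [0.5, 0.5]]
loss = distillation_loss(student, teacher)
```

The loss is zero only where the student already matches the teacher, so the gradient concentrates on positions where the student's own samples diverge from what the teacher would do.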

What is SPPO?

SPPO (Sequence-Level PPO) applies PPO at the level of whole sequences and targets long-horizon reasoning tasks. It is one of the RL advances behind agentic training and, more broadly, RL-driven tool use.
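To make "sequence-level" concrete, here is a minimal sketch of a PPO-style clipped objective in which the importance ratio is computed once per sequence rather than once per token. This is one plausible reading of the term, not SPPO's published objective; the function names, the single scalar advantage, and clip_eps=0.2 are assumptions.

```python
import math

def sequence_ratio(new_logps, old_logps):
    """Importance ratio for a whole sequence.

    The ratio of sequence probabilities equals exp of the difference
    of summed token log-probabilities.
    """
    return math.exp(sum(new_logps) - sum(old_logps))

def clipped_objective(new_logps, old_logps, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate, applied to one sequence-level ratio."""
    r = sequence_ratio(new_logps, old_logps)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, r))
    return min(r * advantage, clipped * advantage)

# Unchanged policy: the ratio is 1, so the objective equals the advantage.
same = clipped_objective([-1.0, -2.0], [-1.0, -2.0], advantage=0.5)

# Policy moved far toward a positively rewarded sequence: the gain is capped.
capped = clipped_objective([-0.1, -0.2], [-0.5, -0.6], advantage=1.0)
```

Because the per-token differences are summed before exponentiation, long sequences can push the ratio outside the clip range quickly; some sequence-level methods length-normalize the ratio for that reason, which this sketch does not do.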

How does KV Packet improve LLM efficiency?

KV Packet provides recomputation-free, context-independent KV caching for LLMs, reducing memory use and latency. This helps agentic workflows that run over long contexts, and it is part of the efficiency gains in agent training.
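The appeal of context-independent caching can be shown with a toy sketch: if the cached entry for a text chunk does not depend on what surrounds it, the entry can be computed once and reused in any prompt. The class below is purely illustrative and does not reproduce KV Packet's mechanism; in a real Transformer, keys and values normally depend on preceding tokens, which is the problem such methods have to solve. ChunkKVCache and fake_encode are hypothetical names.

```python
import hashlib

class ChunkKVCache:
    """Toy cache keyed by chunk content, independent of position or context."""

    def __init__(self, encode_fn):
        self._encode = encode_fn   # stand-in for computing keys/values
        self._store = {}
        self.misses = 0            # number of times encode_fn actually ran

    def get(self, chunk):
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encode(chunk)
        return self._store[key]

def fake_encode(chunk):
    """Hypothetical placeholder for real key/value tensors: one pair per word."""
    return [(len(tok), sum(map(ord, tok))) for tok in chunk.split()]

cache = ChunkKVCache(fake_encode)

# Two prompts share chunks in a different order and different surroundings.
prompt_a = ["system rules", "doc one"]
prompt_b = ["doc one", "system rules", "doc two"]

kv_a = [cache.get(c) for c in prompt_a]
kv_b = [cache.get(c) for c in prompt_b]
```

Across both prompts the encoder runs three times for five chunk lookups; the shared chunks are served from the cache without recomputation.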

What is MM-WebAgent?

MM-WebAgent is a hierarchical multimodal web agent for webpage generation, advancing web navigation and tool use. It uses multimodal inputs to handle complex tasks and is one example of the agentic surge.

What is the shift from P(y|x) to P(y) in RL pretraining?

Research on Reinforcement Learning in Pre-train Space investigates moving from modeling the conditional probability P(y|x) to modeling the marginal probability P(y). The work applies RL directly in the pretraining space, with the aim of improving agent capabilities.
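For reference, the conditional and marginal quantities are linked by the law of total probability. How the cited work defines or estimates P(y) is not stated in this brief, so the identity below is only the textbook relationship, with x ranging over prompts or contexts as an assumption.

```latex
P(y) = \sum_{x} P(x)\, P(y \mid x)
```

Read this way, P(y|x) scores an output for one given prompt, while P(y) scores the output averaged over all prompts that could have produced it.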

What risks are associated with RAGEN and evals?

RAGEN highlights collapse risks in agentic systems at a time when evaluations are still evolving. Releases and benchmarks such as Gemma, DMax, SkillClaw, and GrandCode figure in these evaluations. On the training side, methods such as Parcae (looped scaling) and RAD-2 (gen-disc RL) address scaling challenges.

What are examples of reward modeling in agentic training?

Examples include RationalRewards, which targets vision tasks; C2, a rubric-based reward model (RM); and memory-enhanced dynamic reward shaping. Teacher-student SFT is used to improve reasoning. Together with the SemaClaw harness, these support RL-driven tool use.

Nemotron 3 Super MoE Mamba-Transformer; Lightning OPD/SPPO/TIP; Parcae looped scaling; RL pretrain P(y); SemaClaw harness; RationalRewards vision; RAD-2 gen-disc RL; KV Packet KV cache; MM-WebAgent hier web gen; teacher-student SFT reasoning; C2 rubric RM; Gemma/DMax/SkillClaw/GrandCode; RAGEN collapse risks amid evals.

Sources (13)
Updated Apr 18, 2026