Core LLM inference & training advances

Key Questions

What advances improve LLM inference speed and efficiency?

ScheduleFree+ trains 31% faster, EAGLE 3.1 has been merged into vLLM, and DSpark delivers a 60-85% speedup in DeepSeek's production deployment via confidence-scheduled speculative decoding. NVIDIA NVFP4 provides a 1.66x gain over FP8, while Jet-Long reaches 1.39x FA2 throughput on long contexts.
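To make the speculative-decoding idea behind several of these entries concrete, here is a minimal sketch of greedy draft-and-verify. It is not the EAGLE or DSpark algorithm: the toy `draft_next` and `target_next` functions, the fixed draft length `k`, and the sequential verification loop are all assumptions made for illustration (real systems verify the drafted tokens in one batched forward pass of the target model).

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Draft k tokens cheaply, keep the longest run the target agrees with,
    then append one token chosen by the target itself."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposal (sequentially here, for clarity).
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: use the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target adds one bonus token.
    accepted.append(target_next(ctx))
    return accepted


# Hypothetical toy models: the target counts upward mod 10,
# the draft agrees except right after the token 3.
def target_next(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


tokens = speculative_step([0], draft_next, target_next, k=4)
```

In this toy run the draft is wrong only on its fourth token, so one verification step still yields four output tokens. Wasted draft work grows when mismatches happen early, which is the kind of waste a confidence-based schedule would aim to reduce.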

Which new quantization and compression methods enable on-device deployment?

NanoQuant achieves 25.8x compression of Llama2-70B to run on 8GB GPUs using sub-1-bit techniques. BiSCo-LLM supports sub-2-bit weights with minimal perplexity degradation. NVIDIA Kimi-K2.7-Code NVFP4 matches INT4 quality on agentic benchmarks while optimizing for Blackwell GPUs.
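The NanoQuant figure can be sanity-checked with simple arithmetic. The sketch below assumes a 16-bit baseline and a nominal 70 billion parameters; neither is stated in the digest. It also counts weights only, ignoring activations and the KV cache.

```python
def compressed_size_gb(n_params: float, baseline_bits: int, ratio: float) -> float:
    """Weight storage in GB (10**9 bytes) after compressing by `ratio`."""
    baseline_bytes = n_params * baseline_bits / 8
    return baseline_bytes / ratio / 1e9


def effective_bits_per_weight(baseline_bits: int, ratio: float) -> float:
    """Average number of bits each weight occupies after compression."""
    return baseline_bits / ratio


# Assumed inputs: 70e9 parameters stored in 16 bits, compressed 25.8x.
size_gb = compressed_size_gb(70e9, 16, 25.8)
bits = effective_bits_per_weight(16, 25.8)
print(f"{size_gb:.2f} GB, {bits:.2f} bits per weight")
```

Under those assumptions the weights come to roughly 5.4 GB at about 0.62 bits each, which is consistent with both the sub-1-bit description and the 8GB GPU target.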

How do diffusion-based LLMs advance generation capabilities?

NVIDIA Nemotron-Labs-Diffusion unifies autoregressive, diffusion, and self-speculation modes for 6x tokens per forward pass and 4x throughput. NVIDIA TwoTower diffusion LLM reaches 2.42x throughput at 98.7% quality with 8% lower pretraining cost. dOPSD improves reasoning via on-policy self-distillation from denoising trajectories.

What attention and context extensions support longer sequences?

HiLS Attention enables 64x context extrapolation with full-attention performance via hierarchical sparse selection. SubQ reaches 12M context, and ReContext supports long-context reasoning without retraining. Jet-Long uses dynamic bifocal RoPE for efficient extension.
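As a rough illustration of chunk-level sparse selection, the sketch below scores each chunk of keys by its mean key and attends only within the top-scoring chunks. This is a generic stand-in, not the HiLS method: HiLS is described as learning chunk selection end to end, whereas the mean-key score, the function names, and the `chunk_size` and `top_k` parameters here are assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_chunks(query: Vector, keys: List[Vector],
                  chunk_size: int, top_k: int) -> List[int]:
    """Return the indices of the top_k chunks whose mean key best matches the query."""
    scores = []
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        mean_key = [sum(col) / len(chunk) for col in zip(*chunk)]
        scores.append((dot(query, mean_key), start // chunk_size))
    best = sorted(scores, reverse=True)[:top_k]
    return sorted(idx for _, idx in best)


def sparse_attention(query: Vector, keys: List[Vector], values: List[Vector],
                     chunk_size: int, top_k: int) -> List[float]:
    """Softmax attention restricted to the tokens inside the selected chunks."""
    chosen = select_chunks(query, keys, chunk_size, top_k)
    positions = [i for c in chosen
                 for i in range(c * chunk_size, min((c + 1) * chunk_size, len(keys)))]
    logits = [dot(query, keys[i]) / math.sqrt(len(query)) for i in positions]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, positions)) / total
            for d in range(dim)]


keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
values = [[1.0], [1.0], [2.0], [2.0], [3.0], [3.0]]
out = sparse_attention([1.0, 0.0], keys, values, chunk_size=2, top_k=1)
```

With `top_k` fixed, the softmax covers `top_k * chunk_size` tokens regardless of sequence length, while chunk scoring grows with the number of chunks.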

Which training optimizers and frameworks reduce costs?

M+Adam from Anima Anandkumar's group separates mantissa and exponent for low-precision training. OmniOpt benchmarks 100+ optimizers across model sizes. CoFrGeNets replace transformer components with continued fraction ladders for fewer parameters and faster training.
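For readers unfamiliar with the mantissa and exponent split that M+Adam's description refers to, the sketch below shows the decomposition itself using Python's standard `math.frexp` and `math.ldexp`. It does not reproduce the optimizer: how M+Adam treats the two parts is not described in this digest, and the `quantize_mantissa` helper is a hypothetical illustration of why handling them separately can matter at low precision.

```python
import math


def split(x: float) -> tuple:
    """Return (mantissa, exponent) with x == mantissa * 2**exponent
    and 0.5 <= abs(mantissa) < 1 for nonzero x."""
    return math.frexp(x)


def quantize_mantissa(x: float, bits: int) -> float:
    """Round only the mantissa to `bits` fractional bits and keep the exponent exact.
    Hypothetical helper, not part of M+Adam."""
    mantissa, exponent = math.frexp(x)
    scale = 1 << bits
    rounded = round(mantissa * scale) / scale
    return math.ldexp(rounded, exponent)


m, e = split(6.0)                    # 6.0 == 0.75 * 2**3
approx = quantize_mantissa(5.5, 2)   # mantissa 0.6875 rounds to 0.75
```

Because the exponent is kept exact in this helper, rounding the mantissa changes the value by a bounded relative amount whatever its magnitude.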

What interpretability breakthroughs aid monitoring of model internals?

Anthropic's J-space global workspace in Claude reveals a conscious-access bottleneck via Jacobian lens analysis. This allows viewing hidden thinking and modulating it in real time. The approach supports monitoring of agentic reasoning and goals.

How do new models balance quality and efficiency on benchmarks?

VibeThinker-3B matches DeepSeek V3.2 performance, and Gemma 4 12B and DiffusionGemma offer speedups on consumer GPUs. On MMLU-Pro, Inference Looping adds 2.6% and NITP adds 5.7%. On the serving side, vLLM v0.22 incorporates speculative decoding.

What synthetic data and robustness methods improve training?

CommonSyn generates diversified commonsense reasoning data via a two-stage generation method. Distributionally Robust RLHF/DPO provides theoretical convergence guarantees and OOD improvements on reasoning tasks. On-policy distillation recovers performance after domain SFT: IF-eval fell from 85% to 45% after SFT and was restored to 83%.
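For context on the DPO entry, the sketch below computes the standard DPO loss for one preference pair and then takes the worst average loss across prompt groups. The worst-group step is a simple, assumed form of distributional robustness chosen for illustration; it is not the objective from the cited work, and the group names and log-probability values are made up.

```python
import math
from typing import Dict, List, Tuple

# (policy_chosen, policy_rejected, ref_chosen, ref_rejected) sequence log-probs
Pair = Tuple[float, float, float, float]


def dpo_loss(pair: Pair, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    pol_c, pol_r, ref_c, ref_r = pair
    margin = beta * ((pol_c - ref_c) - (pol_r - ref_r))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def worst_group_loss(groups: Dict[str, List[Pair]],
                     beta: float = 0.1) -> Tuple[str, float]:
    """Return the prompt group with the highest mean DPO loss, and that loss."""
    means = {name: sum(dpo_loss(p, beta) for p in pairs) / len(pairs)
             for name, pairs in groups.items()}
    worst = max(means, key=means.get)
    return worst, means[worst]


# Hypothetical data: the policy prefers the chosen answer on in-distribution
# prompts but prefers the rejected answer on shifted prompts.
groups = {
    "in_dist": [(-1.0, -3.0, -2.0, -2.0)],
    "shifted": [(-3.0, -1.0, -2.0, -2.0)],
}
name, loss = worst_group_loss(groups)
```

Optimizing the worst group rather than the overall mean puts the training signal on the prompt distribution where the policy is weakest. That is the general intuition behind robustness to prompt distribution shift.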

Inference Looping (+2.6% MMLU-Pro); ScheduleFree+ (31% faster); EAGLE 3.1 merged into vLLM; NITP (+5.7% MMLU-Pro); Gemma 4 12B; SubQ (12M ctx); DiffusionGemma (4x faster on consumer GPUs); DeepSeek V4 Compressed Sparse Attention; vLLM v0.22 speculative decoding; NVIDIA NVFP4 (1.66x over FP8); ProbMoE; VibeThinker-3B matches DeepSeek V3.2; VIMPO; OpenAI Jalapeño chip; NVIDIA TwoTower diffusion LLM (2.42x throughput, 98.7% quality, 8% pretraining cost, open-source); Sber GFusion; Flying Serving; Wiola SLM; ET Edge Transformer; MIPI/MIPU framework.

M+Adam: low-precision training optimizer (mantissa-exponent separation) from Anima Anandkumar's group; signal for reduced training costs.

ReContext: long-context reasoning without retraining.

dOPSD: on-policy self-distillation for diffusion language models; improves reasoning via teacher privilege from the denoising trajectory, with gains in math and code.

NanoQuant: sub-1-bit quantization via low-rank binary factorization and ADMM; 25.8× compression of Llama2-70B to run on an 8GB GPU; practical breakthrough for on-device deployment.

Anthropic's 'global workspace' (J-space) in Claude: a Jacobian lens reveals a conscious-access bottleneck; can see what Claude is thinking but not saying, and modulate it. Major interpretability breakthrough for real-time monitoring of agentic reasoning and hidden goals.

Nemotron-Labs-Diffusion: tri-mode LM (AR, diffusion, self-speculation) from NVIDIA; 6x tokens per forward pass, 4x throughput on SPEED-Bench; self-speculation outperforms MTP; speed-of-light analysis shows 76.5% more tokens per forward pass.

CommonSyn: synthetic data for diversified commonsense reasoning (ACL 2026); two-stage generation method; addresses the lack of diverse commonsense training data.

OmniOpt: unified taxonomy and large-scale benchmark for modern optimizers; covers 100+ methods, evaluates 24+ from 60M to 1B params; fills a gap in systematic comparison for training efficiency.

NVIDIA Puzzle-75B-A9B compression: joint structural search across heterogeneous MoE pruning, Mamba pruning, distillation, RL, quantization, and MTP head; doubles interactive server throughput on a single 8xB200 node while holding quality on reasoning, coding, and agentic benchmarks.

HiLS Attention: hierarchical sparse attention with end-to-end learned chunk selection; achieves full-attention performance while enabling 64x context extrapolation; lightweight continued pretraining conversion.

NVIDIA Kimi-K2.7-Code NVFP4: 1T param MoE quantized to NVFP4; matches INT4 on agentic/coding benchmarks (SWE-bench 74.3%, Terminal-Bench 72.5%); practical deployment optimization for Blackwell GPUs.

DSpark: confidence-scheduled speculative decoding with semi-autoregressive generation; reduces waste from parallel drafters; DeepSeek production deployment with 60-85% speedup.

Distributionally Robust RLHF/DPO: addresses prompt distribution shift with theoretical convergence guarantees; shows OOD improvements on reasoning tasks; practical robustness method for deployment.

New: On-policy distillation for recovery after domain SFT: IF-eval 85%→45%→83%; actionable technique for maintaining post-training behavior.

New: Jet-Long: dynamic bifocal RoPE for efficient long-context extension; tuning-free, 1.39x FA2 throughput, strong on RULER/HELMET-RAG; directly addresses the short-context fidelity trade-off for agentic workflows.

New: CoFrGeNets: IBM Research replaces transformer attention/FFN with continued fraction ladders; fewer parameters, faster training/inference, matches GPT2-xl and Llama-3.2B on GLUE/perplexity; plug-and-play with existing pipelines; early-stage but promising for efficiency.

New: BiSCo-LLM: lookup-free binary spherical coding for sub-2-bit LLM weights; Qwen3-8B shows ~0.5 perplexity and ~2% accuracy degradation at extreme compression; practical for on-device/agentic deployment.

Sources (21)

Updated Jul 11, 2026

Diffusion LLM Learns to Be Its Own Draft Model: NVIDIA Releases Tri-Mode Open Weights

BiSCo-LLM: Lookup-Free Binary Spherical Coding for ...

CoFrGeNets replace the 'bones' of transformer-based models

Jet-Long: Efficient Long-Context Extension with Dynamic Bifocal RoPE

@BhavinJawade: Recovery with Retention While revisiting the On-Policy Distillation blogpost from Thinking Machines...

Distributionally Robust Reinforcement Learning with ...

DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation

Hierarchical Sparse Attention Done Right: Toward Infinite Context Modeling

@omarsar0: Banger compression paper from NVIDIA. (bookmark it) Bigger MoE models keep winning on quality, but...

nvidia/Kimi-K2.7-Code-NVFP4

@_akhaliq reposted: OmniOpt A unified taxonomy and large-scale benchmark for modern optimizers, cov...

Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding

Synthetic Data Generation for Training Diversified ...

A global workspace in language models

Efficient Sub-1-bit Quantization of Large Language Models

dOPSD: On-Policy Self-Distillation for Diffusion Language Models

@AnimaAnandkumar reposted: I’ll present our work M+Adam: Low-Precision Training via Mantissa–Exponent Optim...

The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

Synthetic vs. Curated Data for LLMs: 2026 Guide

ET: An Energy Efficient Edge Transformer Architecture

Wiola SLM Architecture Emerges from First Principles - AI Herald