Core ML efficiency/interpretability (Recurrent TF, DeepSeek-V4, ShadowPEFT)

Key Questions

What efficiency breakthroughs are reported for LLMs?

MiniMax-M3 and NVIDIA Nemotron 3 Ultra achieve strong results at lower cost, the former through sparse attention and the latter through a hybrid Mamba-Transformer design. ESPO delivers over 20% token savings, and OCTOPUS delivers near-optimal KV cache compression.

How do small models compare to frontier models on reasoning tasks?

VibeThinker-3B matches DeepSeek V3.2 and Gemini 3 Pro on reasoning benchmarks. Gemma 4 12B approaches 26B MoE performance while remaining laptop-ready.

What advances address KV cache and memory bottlenecks?

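VaSE (4x, training-free) and TurboQuant (5x) compress the KV cache with minimal accuracy loss, while HyLo reports a 90% KV-cache reduction. For long contexts, HOLA reports 16x length extrapolation using a compressive recurrent state plus a small exact cache, and HiLS Attention reports 64x via hierarchical sparse attention.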
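
To put the compression ratios reported in this digest in concrete terms, the following back-of-the-envelope sketch computes the KV-cache size of a dense transformer and applies each ratio. The model shape (32 layers, 8 KV heads, head dimension 128, fp16, 32,768 tokens) is a hypothetical example, not a figure from any of the cited papers; only the 4x, 5x, and 90% ratios come from the notes.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Uncompressed KV-cache size: one key and one value vector
    per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


def compressed_sizes(baseline_bytes, remaining_fraction):
    """Apply each method's remaining-size fraction to the baseline."""
    return {name: baseline_bytes * frac
            for name, frac in remaining_fraction.items()}


# Hypothetical model shape (an assumption for illustration only).
baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                          seq_len=32_768)

# Fraction of the cache that remains after compression, per the digest.
reported = {
    "VaSE (4x)": 1 / 4,
    "TurboQuant (5x)": 1 / 5,
    "HyLo (90% reduction)": 1 - 0.90,
}

sizes = compressed_sizes(baseline, reported)

if __name__ == "__main__":
    gib = 1024 ** 3
    print(f"baseline: {baseline / gib:.2f} GiB")
    for name, size in sizes.items():
        print(f"{name}: {size / gib:.2f} GiB")
```

The sketch treats every method as a uniform size reduction. In practice eviction, quantization, and architectural changes trade off accuracy and latency differently, so the numbers are only a rough guide to memory headroom.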

What role does on-policy distillation play in efficiency?

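CollectionLoRA distills 50 effect LoRAs into one through multi-teacher on-policy distillation, which addresses parameter interference and deployment overhead. TrOPD stabilizes on-policy distillation for reasoning models by partitioning tokens into trust regions and outliers. OPRD eliminates sampling variance and reports a 1.44x speedup and a 54% memory reduction.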

How are diffusion and autoregressive models evolving for efficiency?

DiffusionGemma offers 4x faster text generation, while JetSpec achieves 9.64x speedup via parallel tree speculative decoding. LoopMDM enables depth-scaling without extra parameters.
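
As rough intuition for where speculative-decoding speedups come from, the sketch below implements the standard expected-throughput estimate for plain chain speculative decoding. It is not JetSpec's parallel tree method. It assumes each drafted token is accepted independently with probability `alpha`, the draft proposes `gamma` tokens per step, and one draft forward pass costs `draft_cost` times a target forward pass; all three parameter names are illustrative.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification step,
    assuming i.i.d. acceptance with probability alpha and gamma drafted tokens.
    Closed form of the geometric sum 1 + alpha + ... + alpha**gamma."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Expected wall-clock speedup over plain autoregressive decoding.
    Each step costs gamma draft passes plus one target pass."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost + 1.0)


if __name__ == "__main__":
    for alpha in (0.6, 0.8, 0.9):
        print(alpha, round(expected_speedup(alpha, gamma=4, draft_cost=0.05), 2))
```

Under these assumptions the speedup is capped at `gamma + 1` tokens per step, which is one reason tree-structured and parallel drafting schemes are explored.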

What interpretability methods are emerging for core ML?

'Explaining Attention with Program Synthesis' offers a symbolic interpretability method. 'Why Larger Models Learn More' gives a mechanistic explanation based on capacity, interference, and rare-task retention, validated with OLMo models from 4M to 4B parameters.

What hardware and energy considerations affect deployment?

Carmack makes a hardware-level cost argument for NAND flash in AI inference, based on deterministic access patterns. A local LLM energy benchmark shows that 3-shot prompting improves efficiency without accuracy loss.

How do optimizers and quantization impact training efficiency?

OmniOpt provides a comprehensive taxonomy and benchmark of modern optimizers. OrbitQuant enables data-agnostic quantization for diffusion transformers with cross-modality transfer.

Diffusion spec 3x; Meta OMT-LLaMA; SGS 7B>671B; Signpost/Hier mem/RoundPipe/DORA/LenVM/LeCun JEPA; MSA 100M mem; efficient-transformers LVLM mem issues.
New: Flow map language models (FMLM) outperform discrete diffusion in few-step generation; adversarial flow distillation for video; OCTOPUS extreme KV cache compression (near-optimal, zero latency); RT-Lynx activation sparsity for diffusion transformers (1.55x speedup); EDGE-OPD on-policy distillation; MEMO modular memory model for continual learning (outperforms RAG); Stanford HAI scaling law method for one-shot LLM training.
Also: Gemini Embedding 2 native multimodal embedding (SOTA on multiple benchmarks); AKBE efficient agentic RL; SAM state-adaptive memory.
Non-AR boosts accelerate.
New: Queue & AI paper introduces 'variance wedge' concept for AI workflow slowdowns; relevant to deployment efficiency.
New: DenoiseRL efficient RL from noisy prefixes; BES bidirectional evolutionary search for self-improvement.
New: How LoRA Remembers? (parametric memory law for LoRA, phase transition at p>0.5, MemFT optimization).
New: Thinking Before Constraining (unified decoding framework, up to 27% accuracy gains over natural generation); Why Larger Models Learn More (mechanistic explanation: capacity, interference, rare-task retention, validated with OLMo 4M-4B).
New: CollectionLoRA (distilling 50 effect LoRAs into one via multi-teacher on-policy distillation, addresses parameter interference and deployment overhead).
New: DiffusionBlocks (Sakana AI): block-wise training via diffusion, new paradigm for scalable neural network training.
New today: Trajectory concurrent multi-LoRA training stack (2.81x throughput gain, no accuracy regression, vLLM multi-LoRA inference).
Also: LoopMDM (looped diffusion language models, depth-scaling without extra parameters, flexible compute scaling).
Also: SAERL (SAE-guided RL post-training data engineering, +3% accuracy, -20% training steps) adds data-centric efficiency.
New: dMoE (dLLMs with learnable block experts, reduces activated experts from 69.5 to 14.6, 99% performance).
New: Learn from your own latents (theoretical proof that latent prediction beats token prediction for sample complexity, Random Hierarchy Model).
Also: TurboQuant open-source 5x KV cache memory reduction (Tether AI, production-ready).
New: MiniMax-M3 sparse-attention model beats GPT-5.5 and Gemini 3.1 Pro on agent benchmarks at 5-10% cost; major efficiency breakthrough.
Also conceptual signal: @jon_barron argues autoregression and diffusion are not a false dichotomy, citing diffusion forcing.
New articles: On the Scaling of PEFT (million personal models, MinT infrastructure for adapter management), NITP (Next Implicit Token Prediction, +5.7% MMLU-Pro, low-overhead pre-training improvement), ESPO (early-stopping PPO, >20% token savings, improves math reasoning).
New: Agentix (NSDI '26): efficient serving engine for LLM agents as general programs, 4-15x throughput gain over vLLM by treating programs as first-class citizens and preempting calls based on program context.
New: Oryx: dynamic attention & recurrent hybrid LLM, ties 90% of parameters, balancing long-context retrieval and generation speed.
New: GPU kernel performance forecasting via calibrated LLM surrogates (12k kernel dataset, selective prediction, enables efficient evolutionary search for self-improving agents).
New: VaSE (value-aware stochastic KV cache eviction for reasoning models, 4x compression, training-free).
New: Decoupled Residual Denoising Diffusion Models (unified I2I translation, data-efficient, CVPR).
New: HyLo hybrid architecture (upcycles existing LLMs with MLA+linear blocks, 90% KV-cache reduction, 32x context extension).
New: TrOPD (stable on-policy distillation for reasoning models, partitions tokens into trust regions and outliers).
New: Gemma 4 12B (encoder-free multimodal, laptop-ready, near 26B MoE performance); notable release.
New: NVIDIA Nemotron 3 Ultra: hybrid Mamba-Transformer, NVFP4 quantization, LatentMoE, multi-token prediction, Multi-Teacher On-Policy Distillation, 5x throughput on agentic tasks.
New signal: @mmbronstein reposted MUX: latent continuous reasoning from text CoT, new direction for reasoning efficiency.
New: NF-CoT (latent reasoning with normalizing flows) preserves KV-cache and tractable likelihood, improves code-gen pass rates at lower cost.
New: OPRD (On-Policy Representation Distillation) eliminates sampling variance, 1.44x speedup, 54% memory reduction.
New: Fine-tuning vs ICL study (ACL 2026): fine-tuning converges to equal proficiency, ICL shows variability; both generalize similarly OOD.
New: MLEvolve (self-evolving framework for automated ML algorithm discovery, SOTA on MLE-Bench); relevant to core ML automation.
New: AdaCodec (predictive visual code for video MLLMs, 1/7 token budget, 5.7x TTFT reduction).
New: Flash-WAM (modality-aware distillation for world action models, 23x speedup).
New: Video2LoRA (parametric video internalization, 1500x token reduction).
New: Combinatorial Synthesis (atomic decomposition for code RLVR, scaling data generation).
Also: AntiSD (reverse self-distillation, 11.5% math gains, 2-10x fewer steps), Do Transformers Really Need Three Projections? (K/V sharing, 50% cache reduction, <3% perplexity loss).
New: Stateful Encoders (VLMs with visual memory via cross-attention and stop-gradient, consistent gains across backbones, practical for multi-image reasoning).
New: New Sleep Paradigm for LLM Memory Consolidation (two-stage sleep with distillation and dreaming, biologically inspired, addresses forgetting).
New: Compress-Distill (reasoning trace compression, 7.6x training speedup, 96% accuracy retention).
New: ADASPEC (multilingual speculative decoding, 2.3x speedup over EAGLE-2, self-synthesized training data).
New: On-policy self-distillation conditioning on own outputs outperforms vanilla on-policy RL (tweet).
New: Forgetting in LMs: self-generated replay with BOS token reduces forgetting to near zero, 38x efficiency.
New today: LCLMs (end-to-end context compression at scale, targets long-horizon agents, architecture search, 350B token pre-training), Reasoning Arena (trace tournaments for RLVR, 7.6% gain, 27-41% speedup), FNS tokenizer (hybrid subword-character resolution via loss augmentation, practical for agentic workflows).
New: DRPO (smooth quadratic divergence regularization for LLM RL), FlowTracer (attention-induced information flow for token-level credit assignment, ICML 2026).
New: Harness-1 (20B state-externalizing search agent) also contributes to efficiency via reduced context overhead.
New: DiffusionGemma (4x faster text generation, open diffusion model); significant speedup for text generation.
New: Redesign MoE routers with manifold power iteration; potentially impactful for efficiency.
New: Verifiable Environments as LEGO bricks: recursive composition for reasoning generalization.
New today: Adaptive Multi-Resolution Procedural Knowledge Compression for LLMs: multi-resolution compression for flexible compute-memory trade-offs.
New from today's articles: HarnessBridge (bidirectional controller for token efficiency), SG-OPD (sign-gated on-policy distillation), MaxProof (generative-verifier RL for math proof scaling).
Also new: Nightjar (dynamic adaptive speculative decoding, addresses KV cache overhead under high load).
New: N-GRPO (embedding-level neighbor mixing for exploration in GRPO, improves math reasoning).
New: Demystifying Hidden-State Recurrence (Switchable Latent with Switch-GRPO, latent computation with curriculum learning).
New: SMEPilot CPU inference optimization (ex-6cff8086); FastContext 4B repo exploration model (ex-1496e78c).
New today: @StanfordHAI tweet on new scaling law method to reduce training demands (ex-1099a237); relevant to efficiency.
New: Only Loop Once for efficient test-time computation scaling (reduces latency/KV-cache in Looped Transformers).
New: VibeThinker-3B small model matches DeepSeek V3.2 and Gemini 3 Pro on reasoning (Parametric Compression-Coverage Hypothesis).
New: @jeremyphoward repost signals return of critics (value functions) in RL training for long-horizon tasks, challenging GRPO.
New: VibeThinker-3B also matches Opus 4.5 in programming; reinforces small-model reasoning trend.
New: How Efficient Attention Shapes Hybrid LLMs (ex-f25db1d6): Large-Window Laziness and NoPE boost long-context performance.
New: Explaining Attention with Program Synthesis (ex-36f308c5): symbolic interpretability method.
New: V-Zero (answer-label-free on-policy distillation for visual reasoning, 5-10x speedup), CAVEWOMAN (input compression backfires, output compression saves cost).
New: JetSpec (parallel tree speculative decoding, 9.64x speedup, breaks scaling ceiling).
New: Error-Conditioned Neural Solvers (10x gain on turbulent Kolmogorov flow) adds core physics-informed ML improvement.
New practical observation: @rasbt reports local 30B MoE models (Qwen-Code, Codex, Claude Code) hitting ~40 tok/sec on Mac/DGX Spark, matching GPT 5.5 subscription speed; Claude Code uses 2x tokens vs Codex, highlighting harness efficiency differences.
New: ELDR (expert-locality-aware decode routing for PD-disaggregated MoE serving, 5.9-13.9% TPOT reduction) adds practical MoE efficiency.
New: Multimodal Continuous Reasoning via AMVL (asymmetric mutual variational learning, +10.83 on BLINK) improves multimodal reasoning efficiency.
New today: low-bit quantization inflates reasoning tokens (CoT Token Inflation Ratio, QAT helps, mitigations not solved).
New: HOLA (compressive recurrent state + small exact cache, sub-full-attention perplexity, 16x length extrapolation) adds practical efficiency.
New: Survey on efficient attention mechanisms provides overview.
New: 'meek models' paper on bounded vs unbounded metrics challenges democratization narratives.
New: OrbitQuant (data-agnostic quantization for DiTs, W2A4 usable quality, cross-modality transfer).
New: MIPI/MIPU (monotonic inference policies as real objective for LLM RL, addresses training-inference mismatch).
New: OmniOpt (comprehensive optimizer survey and benchmark) adds reference for training efficiency.
New: Carmack's NAND flash for AI inference (hardware-level cost argument, deterministic access patterns) adds provocative system-level efficiency perspective.
New: Local LLM energy benchmark (3-shot prompting improves efficiency without accuracy loss) adds practical deployment insight.
New today: HiLS Attention (hierarchical sparse attention, 64x length extrapolation, 90% retrieval accuracy) adds practical long-context efficiency.

Status: Climaxing. Efficiency breakthroughs come from MiniMax-M3, Nemotron 3 Ultra, and a range of KV cache and attention optimizations. The return of critics and value functions challenges GRPO dominance. Small models matching frontier models on reasoning, as with VibeThinker-3B, is a notable trend. The energy cost paper (H8) adds a system-level perspective.
