LLM Training & Efficiency Breakthroughs
Key Questions
What efficiency gains does Orthrus hybrid AR-diffusion provide?
It achieves up to 7.8x faster parallel decoding by combining autoregressive and diffusion approaches. This improves inference speed for large language models.
How does Muon optimizer compare to Adam according to recent analysis?
Muon outperforms Adam due to better handling of curvature, with advantages amplified by data imbalance. The work offers a curvature-based perspective on optimizer performance.
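For context on what separates Muon's update from Adam's: Muon takes the momentum matrix of a 2D weight and replaces it with an approximately orthogonalized version, computed by a Newton-Schulz iteration. The sketch below illustrates that step only and is not the curvature analysis from the work above. It swaps in the classic cubic Newton-Schulz step for the tuned polynomial that practical Muon implementations use, and the helper names (`matmul`, `transpose`, `orthogonalize`) and the example `momentum` matrix are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(m, steps=30):
    """Approximate the semi-orthogonal polar factor of m (assumes rows <= cols).

    Uses the cubic Newton-Schulz step X <- 1.5*X - 0.5*(X X^T) X, which pushes
    every singular value of X toward 1 once they all lie in (0, 1].
    """
    # Frobenius normalization guarantees all singular values are <= 1.
    norm = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_y)]
             for row_x, row_y in zip(x, xxt_x)]
    return x


# Hypothetical 2x3 momentum matrix, for illustration only.
momentum = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]
update = orthogonalize(momentum)
gram = matmul(update, transpose(update))  # close to the 2x2 identity
```

After orthogonalization every direction of the update has roughly the same scale, whereas Adam rescales each coordinate independently.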
What is CPPO and its benefit for LLM RL?
CPPO introduces position-weighted token-level trust regions for reinforcement learning in LLMs. It enables smarter, more targeted policy updates during training.
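One way to picture a position-weighted token-level trust region is a PPO-style clipped objective whose clip range changes with token position. The sketch below is an assumption-laden illustration, not CPPO's actual formulation: the linear schedule, its direction (wider for early tokens, tighter for late ones), and the names `position_weighted_clip_loss`, `eps_start`, and `eps_end` are all hypothetical.

```python
def position_weighted_clip_loss(ratios, advantages, eps_start=0.3, eps_end=0.1):
    """PPO-style clipped surrogate loss with a per-position clip range.

    ratios[t]     : new_policy_prob / old_policy_prob for token t
    advantages[t] : advantage estimate for token t
    The clip range moves linearly from eps_start (first token) to eps_end
    (last token). This schedule is an assumption made for illustration.
    """
    n = len(ratios)
    total = 0.0
    for t, (ratio, adv) in enumerate(zip(ratios, advantages)):
        frac = t / (n - 1) if n > 1 else 0.0
        eps = eps_start + (eps_end - eps_start) * frac
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Standard PPO pessimistic bound, applied token by token.
        total += min(ratio * adv, clipped * adv)
    return -total / n


# Two tokens with the same ratio and advantage are clipped differently
# because they sit at different positions.
loss = position_weighted_clip_loss([1.5, 1.5], [1.0, 1.0])
```

With a single clip range both tokens would contribute equally. Here the later token is held to a tighter region, which is the kind of targeting the description suggests.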
What does DiffusionGemma offer for generation speed?
The 26B MoE model from Google AI uses text diffusion to achieve up to 4x faster generation. It provides an open model alternative for efficient inference.
How does End-to-End Context Compression perform at scale?
It achieves a 1:16 compression ratio with adaptive expansion while preserving performance. The method reduces context length requirements for large models.
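As a rough worked example of what a 1:16 ratio means for context length, the snippet below computes the compressed length for a given input. It assumes the ratio applies uniformly and ignores adaptive expansion, which would add tokens back where needed. `compressed_length` is a hypothetical helper, not part of the method.

```python
import math


def compressed_length(context_tokens, ratio=16):
    """Number of compressed slots for a context at 1:ratio compression.

    Assumes uniform compression, rounded up to whole slots.
    """
    return math.ceil(context_tokens / ratio)


print(compressed_length(32_000))  # 2000
print(compressed_length(100))     # 7
```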
What does the paper on finding optimal tokenizers suggest?
It challenges the assumption that tokenization should be treated as fixed, suggesting that better tokenizer selection could improve efficiency. The reported results indicate that the choice of tokenizer affects both model performance and training.
What is DELM and its cost reduction benefit?
DELM enables decentralized LLMs with shared context, reducing costs by up to 50%. It supports collaborative inference across distributed systems.
How does OPRD improve on-policy distillation?
OPRD addresses sampling variance and black-box teacher issues via representation-level distillation. It delivers 1.44x faster training and 54% less memory usage.
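To make the two figures concrete, the arithmetic below converts them into wall-clock time and memory for a baseline run. It assumes "1.44x faster" is a throughput ratio (time divided by 1.44) and "54% less memory" is a reduction relative to the baseline. The baseline numbers and function names are hypothetical.

```python
def time_after_speedup(baseline_hours, speedup=1.44):
    """Wall-clock time if throughput improves by the given factor."""
    return baseline_hours / speedup


def memory_after_reduction(baseline_gb, reduction=0.54):
    """Memory footprint after a fractional reduction from the baseline."""
    return baseline_gb * (1.0 - reduction)


# Hypothetical baseline: a 100-hour run that peaks at 80 GB.
hours = time_after_speedup(100.0)         # about 69.4 hours
memory_gb = memory_after_reduction(80.0)  # about 36.8 GB
```

Under this reading, a 1.44x speedup cuts training time by about 31%, not 44%.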
Darwin merging; δ-mem; Lighthouse; KV eviction; DiLoCo; ReFTA; CEPO/RLVR; HRM-Text (1000x token efficiency); ScheduleFree+ (31% faster); Orthrus hybrid AR-diffusion (7.8x parallel decode); MoE semantic-space router; OScaR quantization; Gated DeltaNet-2; SPD self-distillation (16% gain); dGRPO; Shard asymmetric KV compression (10x).
New: LongTraceRL (RLVR with rubric rewards, positive-only); OmniOPD logit-free on-policy distillation (+28.64% math); Trust Region On-Policy Distillation (principled fix); ThoughtFold (56% token reduction, no accuracy loss); MemTrain (17.67 point gain).
Also: 'Language Models Need Sleep'; 'Improving Frozen LLMs via Inference Looping'; LT2 linear-time looped transformers; DAR timestep-adaptive routing; RT-Lynx activation sparsification; Stanford HAI scaling law estimation reduction; EDGE-OPD; MEMO modular memory; EAGLE 3.1; Cassandra self-speculative decoding; 'Less is More: Early Stopping Rollout'; 'AgingBench'; DenoiseRL; MemTrace; BES; DiffusionBlocks; HRBench; SAE-guided post-training; Parallax local linear attention with Muon; BeliefTrack; FluxMem; LLM introspection.
New: Bonnie Li's talk on scaling RL compute for LLMs covers sigmoid scaling curves, ceiling vs. efficiency tradeoffs, train-inference discrepancy, and async RL with adaptive sampling. Combinatorial Synthesis scales code RLVR via atomic decomposition and recombination, addressing data scarcity with principled generation. LLMCodec applies video compression techniques (U420 format) to weight compression.
New: OPRD (On-Policy Representation Distillation) addresses OPD sampling variance and black-box teacher issues via representation-level distillation, achieving 1.44x faster training and 54% less memory use, and closing the AIME/AIMO gap.
New: Compress-Distill studies post-hoc compression of reasoning traces before distillation, showing a 2-7.6x speedup and up to 96% accuracy retention.
New: TRD (Trajectory-Refined Distillation) fixes prefix failure in OPD. End-to-End Context Compression works at scale (1:16 ratio, adaptive expansion). Muon's advantage over Adam is explained from a curvature perspective (NDS; data imbalance amplifies the advantage). FOD#155 synthesizes the sleep papers for continual learning.
New: FlowTracer traces attention-induced information flow for token-level credit assignment in RL (ICML 2026). DiffusionGemma (26B MoE) offers up to 4x faster generation via text diffusion.
New: CPPO introduces a position-weighted token-level trust region for LLM RL. DELM's decentralized LLMs with shared context reduce costs by up to 50%.
New: Switchable Latent states with Switch-GRPO enable recurrent latent computation, trained with a visible-to-latent curriculum.
New: The Finding Optimal Tokenizers paper challenges fixed tokenization, potentially improving efficiency.