Scaling Laws and RL Training Efficiency Advances

Key Questions

What insights does Bonnie Li's talk provide on RL scaling?

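The talk describes sigmoid-like scaling curves for RL compute, in which performance climbs over a middle range of compute and then saturates, and it covers adaptive sampling for RL compute. Separately, Combinatorial Synthesis (ADR) is reported to enable RLVR scaling with consistent gains. Together these give practitioners a basis for deciding how much RL compute to spend.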
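
To make the shape of a sigmoid-like scaling curve concrete, here is a minimal sketch that assumes performance follows a logistic function of log10(compute). The functional form and the parameter names (ceiling, midpoint, steepness) are illustrative assumptions, not the parameterization used in the talk.

```python
import math

def sigmoid_scaling(compute, ceiling, midpoint, steepness):
    """Hypothetical saturating curve: logistic in log10(compute).

    ceiling   -- asymptotic performance at very large compute (assumed)
    midpoint  -- compute at which performance reaches ceiling / 2 (assumed)
    steepness -- how sharply performance rises per decade of compute (assumed)
    """
    x = math.log10(compute) - math.log10(midpoint)
    return ceiling / (1.0 + math.exp(-steepness * x))

def marginal_gain(compute, factor=2.0, **params):
    """Performance gained by multiplying compute by `factor`."""
    return sigmoid_scaling(compute * factor, **params) - sigmoid_scaling(compute, **params)

# Illustrative values only; they are not taken from any reported experiment.
params = dict(ceiling=0.9, midpoint=1e20, steepness=2.0)

if __name__ == "__main__":
    for c in (1e18, 1e19, 1e20, 1e21, 1e22):
        perf = sigmoid_scaling(c, **params)
        gain = marginal_gain(c, **params)
        print(f"compute={c:.0e}  perf={perf:.3f}  gain_from_2x={gain:.4f}")
```

Under this assumed form, doubling compute pays off most near the midpoint and very little once the curve has saturated, which is the kind of reasoning a fitted curve supports when budgeting RL compute.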

How predictable is post-RL performance from pretraining?

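Research on a chess testbed ("Understanding Reasoning from Pretraining to Post-Training") shows that post-RL performance is predictable from pretraining loss. The slope of the RL reward curve scales linearly with the number of pretraining tokens. This links the pretraining and post-training phases.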
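
A linear relationship of this kind can be checked and extrapolated with an ordinary least-squares line. The sketch below uses made-up numbers purely for illustration; the variable names and data points are assumptions and do not come from the chess study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic example data (NOT from the paper):
# pretraining tokens in billions vs. measured slope of the RL reward curve.
pretrain_tokens_b = [1.0, 2.0, 4.0, 8.0]
rl_reward_slope = [0.011, 0.021, 0.039, 0.082]

a, b = fit_line(pretrain_tokens_b, rl_reward_slope)

def predict_reward_slope(tokens_b):
    """Extrapolate the fitted line to a new pretraining budget."""
    return a * tokens_b + b

if __name__ == "__main__":
    print(f"fitted slope per billion tokens: {a:.5f}, intercept: {b:.5f}")
    print(f"predicted RL reward slope at 16B tokens: {predict_reward_slope(16.0):.4f}")
```

If the linear relationship holds, a fit on small pretraining runs gives a rough forecast of how quickly reward will improve during RL for a larger run, before that run is trained.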

What techniques improve distilled RL for LLM post-training?

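The Distilled RL for LLM post-training work proposes reverse importance sampling and cross-family distillation to address the tension between RL and OPD. Separately, ByteDance's EdgeBench proposes a log-sigmoid scaling law based on agent environment-interaction time. Both target inefficiencies in post-training.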

What do scaling laws for hypernetworks reveal?

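Scaling laws for hypernetworks show steeper exponents than LoRA on out-of-distribution (OOD) generalization, with the new MegaWikiQA dataset supporting evaluation. In related scaling-law work, Sham Kakade's ICML talk shows that quadratic models predict LLM pretraining optimization, yielding batch-size scaling laws and practical schedulers.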
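
Comparing "steeper exponents" usually means fitting a power law to each method and comparing the fitted exponent. The sketch below assumes the form error = a * x^(-alpha), where x is whatever scale variable the study uses (not specified in this summary), and fits alpha by linear regression in log-log space. The data are synthetic and the method labels are placeholders.

```python
import math

def fit_power_law(xs, errors):
    """Fit error = a * x**(-alpha) via least squares on log-transformed data.

    Returns (a, alpha). A larger alpha means error falls faster with scale.
    """
    log_x = [math.log(x) for x in xs]
    log_e = [math.log(e) for e in errors]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_e = sum(log_e) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxe = sum((u - mean_x) * (v - mean_e) for u, v in zip(log_x, log_e))
    slope = sxe / sxx
    return math.exp(mean_e - slope * mean_x), -slope

# Synthetic curves generated from known exponents (illustration only).
scales = [1e3, 1e4, 1e5, 1e6]
errors_method_a = [2.0 * x ** -0.30 for x in scales]  # stands in for the steeper method
errors_method_b = [2.0 * x ** -0.15 for x in scales]  # stands in for the shallower method

_, alpha_a = fit_power_law(scales, errors_method_a)
_, alpha_b = fit_power_law(scales, errors_method_b)

if __name__ == "__main__":
    print(f"alpha (method A) = {alpha_a:.3f}")
    print(f"alpha (method B) = {alpha_b:.3f}")
```

A steeper exponent implies the gap between the two methods widens as scale grows, even if they start from similar error at small scale.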

What memorization limit applies to LLM pretraining?

An ICML 2026 paper establishes a memorization capacity limit of roughly 3.6 bits per parameter. Separately, UltraX extends pretraining data editing with insertion via programmatic supervision. Both inform data-efficiency strategies.
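
As a back-of-the-envelope illustration of what a per-parameter capacity figure implies, the sketch below multiplies the approximate 3.6 bits/parameter value by a parameter count and converts the result to bytes. The function names and model sizes are illustrative assumptions; only the 3.6 figure comes from the text above.

```python
BITS_PER_PARAM = 3.6  # approximate capacity figure quoted above

def capacity_bits(n_params, bits_per_param=BITS_PER_PARAM):
    """Total memorization capacity in bits under a fixed per-parameter limit."""
    return n_params * bits_per_param

def capacity_gigabytes(n_params, bits_per_param=BITS_PER_PARAM):
    """Same capacity expressed in gigabytes (1 GB = 1e9 bytes)."""
    return capacity_bits(n_params, bits_per_param) / 8 / 1e9

if __name__ == "__main__":
    # Hypothetical model sizes chosen only to show the arithmetic.
    for n_params in (125e6, 1e9, 7e9):
        print(f"{n_params:.3g} params -> {capacity_gigabytes(n_params):.3f} GB")
```

For example, a model with 1 billion parameters would have a capacity of about 3.6e9 bits, or roughly 0.45 GB, under this limit.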

Bonnie Li's talk on scaling RL compute provides practical insights on sigmoid-like scaling curves and adaptive sampling. Combinatorial Synthesis (ADR) enables RLVR scaling with consistent gains. Understanding Reasoning from Pretraining to Post-Training, which uses a chess testbed, shows that post-RL performance is predictable from pretraining loss and that the slope of the RL reward curve scales linearly with pretraining tokens.

Distilled RL for LLM post-training proposes reverse importance sampling and cross-family distillation, addressing the tension between RL and OPD. ByteDance's EdgeBench proposes a log-sigmoid scaling law based on agent environment-interaction time.

Sham Kakade's ICML talk shows that quadratic models predict LLM pretraining optimization, yielding batch-size scaling laws and practical schedulers. An ICML 2026 paper establishes a memorization capacity limit of roughly 3.6 bits per parameter. UltraX extends pretraining data editing with insertion via programmatic supervision. Scaling laws for hypernetworks show steeper exponents than LoRA for OOD generalization, supported by the new MegaWikiQA dataset. These developments matter for practitioners scaling RL-based training and fine-tuning.
