Training Optimization & Efficiency

Key Questions

What gradient efficiency gains were reported?

@syhw shared a method that improves gradient efficiency by 20-40% and enables reasoning generalization from code to math without math-specific training.

What is Tsinghua CausalMix?

CausalMix, from Tsinghua, applies conditional average treatment effect (CATE) estimation to data mixture optimization, so that the resulting mixture choices are transferable.
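The summary does not describe how CausalMix estimates CATE, so the following is only a generic illustration of the quantity itself: the difference in expected outcome with and without a "treatment", conditioned on a covariate. The framing (treatment = a data source is included in the mixture, outcome = a downstream score, covariate = a model-size bucket), the `RUNS` data, and the function name are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (covariate_group, treated, outcome).
# "treated" means a given data source was included in the training mixture;
# "outcome" is a downstream benchmark score. All values are made up.
RUNS = [
    ("small", True, 0.42), ("small", True, 0.46),
    ("small", False, 0.40), ("small", False, 0.40),
    ("large", True, 0.70), ("large", True, 0.74),
    ("large", False, 0.60), ("large", False, 0.64),
]

def cate_by_group(runs):
    """Estimate CATE(x) = E[Y | T=1, X=x] - E[Y | T=0, X=x] for each group x."""
    treated, control = defaultdict(list), defaultdict(list)
    for group, is_treated, outcome in runs:
        (treated if is_treated else control)[group].append(outcome)
    # Only groups observed under both conditions have a defined estimate.
    return {g: mean(treated[g]) - mean(control[g])
            for g in treated if g in control}

effects = cate_by_group(RUNS)
for group, effect in sorted(effects.items()):
    print(f"{group}: estimated effect of including the source = {effect:+.2f}")
```

In this toy data the estimated effect differs by group, which is the kind of heterogeneity a conditional estimate exposes and a single average would hide.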

How does on-policy distillation aid recovery?

Domain-specific SFT can degrade general capabilities, and on-policy distillation can restore them. In the reported example, IF-eval fell from 85% to 45% after SFT and recovered to 83% after on-policy distillation, regaining 38 of the 40 lost points.
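The summary does not state which loss is used. One common formulation of on-policy distillation, assumed here, scores sequences sampled from the student with the teacher and minimizes the per-token reverse KL divergence, KL(student || teacher). The sketch below computes that quantity on made-up logits; the function names and toy values are illustrative only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the vocabulary at one token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))

def distillation_loss(student_seq, teacher_seq):
    """Mean per-token reverse KL over the positions of a student-sampled sequence."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token)

# Toy logits for a 2-token sequence over a 3-word vocabulary (made-up values).
# In practice both models would score a sequence sampled from the student.
student_seq = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher_seq = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]

loss = distillation_loss(student_seq, teacher_seq)
print(f"mean per-token reverse KL: {loss:.4f}")
```

The loss is zero only where the student already matches the teacher, so the training signal concentrates on positions where the student's own samples diverge from the teacher's behavior.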

What is CWM in world model training?

CWM (Code World Model) and related world model training approaches, shared by @syhw, operate at mid-training scale and use agentic data such as bash commands to improve agentic RL training.

What does the mechanistic interpretability paper reveal?

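The paper "Towards Mechanistically Understanding Why Memorized Knowledge Fails to Generalize in Large Language Model Finetuning" reports that knowledge memorized during fine-tuning often fails to generalize, and that self-patching can recover 58-75% of those failures.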

What guarantees does Trust Region Policy Distillation provide?

Trust Region Policy Distillation (TOP-D) stabilizes on-policy distillation, with theoretical guarantees, by breaking large policy updates into smaller, safer steps.
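The TOP-D algorithm itself is not described here. As a generic illustration of the trust-region idea of keeping each policy update small, the sketch below shrinks a proposed update with a backtracking line search until the KL divergence between the old and new distributions falls under a bound. The function names, the `max_kl` bound, and the toy logits are assumptions for illustration, not the paper's method.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def trust_region_step(logits, update, max_kl=0.01, shrink=0.5, max_tries=20):
    """Scale `update` down until KL(old || new) <= max_kl.

    Returns the accepted logits and the scale that was applied.
    """
    old = softmax(logits)
    scale = 1.0
    for _ in range(max_tries):
        candidate = [l + scale * u for l, u in zip(logits, update)]
        if kl(old, softmax(candidate)) <= max_kl:
            return candidate, scale
        scale *= shrink  # step was too large: try a smaller one
    return list(logits), 0.0  # reject the update if no scale satisfies the bound

# Toy policy over 3 actions and a deliberately large proposed update.
new_logits, scale = trust_region_step([0.0, 0.0, 0.0], [3.0, 0.0, -3.0])
print(f"accepted scale: {scale}")
print(f"KL after step: {kl(softmax([0.0, 0.0, 0.0]), softmax(new_logits)):.5f}")
```

The full-size update would move the policy far from where it started; the line search accepts only the fraction of it that stays inside the KL bound.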

What improvement does Self-Guided Test-Time Training deliver?

Self-Guided Test-Time Training yields a 15% relative improvement on LongBench-v2 for long-context processing without retraining.
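A relative improvement is measured against the baseline score, not in absolute points. The summary gives no baseline for LongBench-v2, so the numbers below are hypothetical and only show how the two readings differ.

```python
def relative_improvement(baseline, improved):
    """Gain expressed as a fraction of the baseline score."""
    return (improved - baseline) / baseline

# Hypothetical scores: the source reports only the 15% relative figure.
baseline_score = 40.0
improved_score = 46.0

absolute_gain = improved_score - baseline_score
relative_gain = relative_improvement(baseline_score, improved_score)
print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {relative_gain:.0%}")
```

With this hypothetical baseline, a 15% relative improvement corresponds to 6 absolute points; a different baseline would give a different absolute gain for the same relative figure.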

What is Jet-Long's contribution to long-context models?

Jet-Long extends context length efficiently using a dynamic bifocal variant of RoPE (rotary position embedding). It applies zero-shot, with no retraining required.
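The "dynamic bifocal" modification is not described here. For background, the sketch below implements standard RoPE only: each pair of dimensions is rotated by an angle proportional to the token position, which makes the query-key dot product depend on the relative offset between positions. The vectors and positions are made-up values.

```python
import math

def rope(vec, position, base=10000.0):
    """Apply standard rotary position embedding to an even-length vector."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # rotation angle for pair (i, i+1)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (made-up values).
q = [0.5, -1.0, 2.0, 0.3]
k = [1.5, 0.2, -0.7, 1.1]

# Both pairs of positions are 2 apart, so the attention scores should match.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
print(f"{score_near:.6f} vs {score_far:.6f}")
```

Long-context extension methods based on RoPE generally work by changing how these rotation angles are computed for positions beyond the trained range, which is why they can sometimes be applied without retraining.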

@syhw reports a method that improves gradient efficiency by 20-40% and enables reasoning to generalize from code to math without math training. Tsinghua's CausalMix uses CATE estimation for transferable data mixture optimization. OmniOpt provides a unified taxonomy and benchmark for optimizers. SAO also contributes to training efficiency for agentic RL.

On distillation, the "Denser ≠ Better" paper challenges on-policy self-distillation, and dOPSD uses denoising trajectories for dense supervision. On-policy distillation also works as a recovery technique: after domain-specific SFT degrades general capabilities, it can recover behavior (IF-eval 85%→45%→83%). Trust Region Policy Distillation (TOP-D) stabilizes on-policy distillation with theoretical guarantees. The MIPI/MIPU paper from ACL 2026 addresses the mismatch between training and inference engines.

On long context, Jet-Long provides long-context extension with dynamic bifocal RoPE (zero-shot, no retraining), and Self-Guided Test-Time Training gives a 15% relative improvement on LongBench-v2.

New: CWM (Code World Model) and related world model training approaches shared by @syhw, with mid-training scale and agentic data. New: a mechanistic interpretability paper on knowledge generalization failure in fine-tuning, where self-patching recovers 58-75% of failures.

KronQ quantization targets extreme compression, reaching 7.93 perplexity with a 2-bit LLaMA-3-70B.
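For a sense of what 2-bit compression implies, each weight is stored as one of only four codes. The sketch below shows plain uniform round-to-nearest quantization of a single weight group; it is not the KronQ method, whose details are not given here, and the weights are made-up values.

```python
def quantize_2bit(weights):
    """Uniform round-to-nearest quantization of one weight group to 4 levels."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 3 or 1.0  # 2 bits give codes 0..3; avoid a zero scale
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map 2-bit codes back to approximate weight values."""
    return [lo + c * scale for c in codes]

# Toy weight group (made-up values).
weights = [-0.9, -0.2, 0.1, 0.3, 0.9]
codes, scale, lo = quantize_2bit(weights)
restored = dequantize(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"max reconstruction error: {max_error:.3f}")
```

With only four levels per group the rounding error is large relative to the weights, which is why methods aimed at this regime are judged by how little perplexity they give up.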

Sources (8)

Updated Jul 17, 2026

LLM Insight Tracker

@_akhaliq reposted: Thx @_akhaliq for sharing 🚀🚀 Tri-branch DiT && Joint Cross-Modal Attent...

@omarsar0 reposted: NEW research from Microsoft. It's on scaling distribution-matching RL to large ...

Self-Guided Test-Time Training for Long-Context LLMs

Trust Region Policy Distillation

Towards Mechanistically Understanding Why Memorized Knowledge Fails to Generalize in Large Language Model Finetuning

@syhw reposted: @cwolferesearch Biased rec, CWM: https://t.co/KNVNHFdZ76, like ECHO/PaW, but at ...

Jet-Long: Efficient Long-Context Extension with Dynamic Bifocal RoPE

@BhavinJawade: Recovery with Retention While revisiting the On-Policy Distillation blogpost from Thinking Machines...
