RL Scaling for LLMs and Agents
Key Questions
What is DORA in RL scaling?
DORA is an asynchronous RL method for MoE architectures, reported to cut training time by 80%. It aligns with approaches such as GLM-5's async RL for efficient LLM and agent scaling.
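DORA's internals aren't spelled out here, but the asynchronous pattern it exemplifies is standard: rollout workers generate trajectories from a possibly stale policy snapshot while the learner updates weights, so neither side idles waiting on the other. Below is a minimal, self-contained toy of that pattern; the scalar "policy", reward, and update rule are illustrative stand-ins, not DORA's API.

```python
import queue
import random
import threading

# Bounded buffer decoupling generation from training.
rollouts: "queue.Queue[tuple[list[int], float]]" = queue.Queue(maxsize=64)
weights_lock = threading.Lock()
weights = {"bias": 0.0}  # toy "policy": a single scalar parameter
stop = threading.Event()

def rollout_worker() -> None:
    # Samples with whatever snapshot of the weights is currently published,
    # which may lag the learner by a few steps (the async tolerance).
    while not stop.is_set():
        with weights_lock:
            bias = weights["bias"]
        actions = [1 if random.random() < 0.5 + 0.1 * bias else 0 for _ in range(8)]
        reward = float(sum(actions))  # toy reward: count of 1-actions
        try:
            rollouts.put((actions, reward), timeout=0.1)
        except queue.Full:
            continue  # re-check stop flag instead of blocking forever

def learner(steps: int) -> None:
    for _ in range(steps):
        batch = [rollouts.get() for _ in range(4)]
        # Toy "policy gradient": nudge the bias toward higher-reward behavior.
        grad = sum(r - 4.0 for _, r in batch) / len(batch)
        with weights_lock:
            weights["bias"] += 0.01 * grad  # publish without pausing rollouts
    stop.set()

workers = [threading.Thread(target=rollout_worker) for _ in range(2)]
for w in workers:
    w.start()
learner(steps=50)
for w in workers:
    w.join()
print("final bias:", round(weights["bias"], 3))
```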
What RL improvements did Meta FAIR achieve?
Meta FAIR's pretrain self-improve RL boosts factuality by 36%. It draws on techniques from the GRPO playbook, such as distillation, LoRA, and judges, for better alignment.
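Since the playbook centers on GRPO, the core of that algorithm (as published by DeepSeek) is worth showing: each prompt gets a group of sampled completions, and each completion's advantage is its reward standardized against its own group, removing the need for a learned critic. The judge scores below are made-up numbers for illustration.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: standardize each completion's reward
    against the other completions sampled for the same prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts x 4 sampled completions, scored by a judge in [0, 1].
rewards = torch.tensor([[0.2, 0.9, 0.4, 0.5],
                        [0.1, 0.1, 0.8, 0.0]])
print(grpo_advantages(rewards))  # above-group-average completions get positive weight
```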
What is Compute Optimal Tokenization?
Compute Optimal Tokenization brings the tokenizer into LLM scaling laws. It challenges the fixed-tokenizer assumption made in prior scaling work, improving efficiency; the work was reposted by Luke Zettlemoyer.
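The paper itself isn't reproduced here, but the intuition reduces to simple arithmetic: under the standard C ≈ 6ND approximation, a fixed compute budget fixes the token count D, so the tokenizer's compression rate decides how much raw text those tokens cover. A back-of-envelope sketch, with made-up bytes-per-token figures:

```python
def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    # Standard approximation: C ~= 6 * N * D  =>  D ~= C / (6 * N).
    return compute_flops / (6.0 * n_params)

def text_coverage_bytes(compute_flops: float, n_params: float,
                        bytes_per_token: float) -> float:
    # Same token budget, different tokenizer: more bytes per token
    # means more raw text seen for the same compute.
    return tokens_for_budget(compute_flops, n_params) * bytes_per_token

C, N = 1e23, 7e9  # illustrative compute budget and model size
for name, bpt in [("coarse tokenizer", 4.5), ("fine-grained tokenizer", 3.2)]:
    print(f"{name}: {text_coverage_bytes(C, N, bpt):.2e} bytes of training text")
```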
What is the Length Value Model?
The Length Value Model enables scalable value pretraining by modeling length at the token level, addressing the long-sequence challenges of RL for LLMs.
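One plain reading of token-level length value pretraining, which the sketch below assumes: attach a small value head to the LM trunk and train it, at every position, to predict how many tokens remain before the sequence ends. The head shape and MSE objective here are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class LengthValueHead(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.proj = nn.Linear(hidden, 1)  # one scalar "remaining length" per token

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) from a shared LM trunk.
        return self.proj(hidden_states).squeeze(-1)

batch, seq_len, hidden = 2, 16, 64
head = LengthValueHead(hidden)
states = torch.randn(batch, seq_len, hidden)

# Target at each position: tokens remaining after it (seq_len-1, ..., 0).
target = torch.arange(seq_len - 1, -1, -1).float().expand(batch, -1)
loss = nn.functional.mse_loss(head(states), target)
loss.backward()
print("length-value loss:", float(loss))
```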
What is Edit-R1?
Edit-R1 develops reasoning reward models for image-editing tasks, using them to provide precise rewards for RL training and enhancing multimodal RL applications.
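Edit-R1's model itself isn't available here, but the plumbing it implies is straightforward: a reward model scores (source image, instruction, edited image) triples, and those scalars drive the RL update just as text rewards do. A stub sketch, with every name, path, and score hypothetical:

```python
from dataclasses import dataclass

@dataclass
class EditSample:
    source_path: str   # original image
    instruction: str   # natural-language edit request
    edited_path: str   # candidate edited image

def judge_reward(sample: EditSample) -> float:
    # Stub: a real reasoning reward model would inspect both images and the
    # instruction, reason step by step, then emit a scalar score in [0, 1].
    return 0.5

def group_rewards(samples: list[EditSample]) -> list[float]:
    # One scalar per candidate edit; these can feed a group-relative
    # (GRPO-style) baseline exactly as in the text-only setting.
    return [judge_reward(s) for s in samples]

candidates = [
    EditSample("cat.png", "make the cat wear a red hat", f"candidate_{i}.png")
    for i in range(4)
]
print(group_rewards(candidates))
```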
At a glance:
- DORA async RL: 80% training-time savings, MoE
- Meta FAIR pretrain self-improve RL: 36% factuality gain
- Symbolic-MoE (ICML 2026): +8.15%, 2x faster
- GRPO playbook: distillation, LoRA, judges
- Themis: multilingual code reward model
- Noisy DPO: semi-supervised preference learning
- Length Value Model: token-level length pretraining
- Edit-R1: RL reward models for image editing
- Compute Optimal Tokenization: tokenizer-aware scaling
- Aligns with DeepSeek GRPO, ASI-EVOLVE, and GLM-5 async RL