Verifiable reasoning scaling & agent harnesses
Key Questions
What achievements does SU-01 30B demonstrate?
SU-01 30B reaches gold-medal performance on IMO and IPhO through reverse-perplexity SFT, two-stage verifiable RL, and test-time scaling.
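The source does not describe the RL reward or the test-time scaling procedure. As a hedged illustration only, the sketch below assumes a binary answer-matching reward and majority voting (self-consistency) over sampled answers. These are common generic choices and are not confirmed as SU-01's recipe; `normalize`, `verifiable_reward`, and `majority_vote` are hypothetical names.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Canonicalize a final-answer string before comparison (illustrative only)."""
    return answer.strip().lower().replace(" ", "")


def verifiable_reward(predicted: str, reference: str) -> float:
    """Binary reward: 1.0 if the normalized answer matches the reference, else 0.0."""
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


def majority_vote(samples: list[str]) -> str:
    """Test-time scaling by self-consistency: return the most common normalized answer."""
    counts = Counter(normalize(s) for s in samples)
    # most_common(1) returns [(answer, count)]; ties go to the first-seen answer
    return counts.most_common(1)[0][0]


samples = ["42", " 42", "41", "42 ", "7"]
print(majority_vote(samples))           # 42
print(verifiable_reward("42", " 42 "))  # 1.0
```

A checkable reward like this is what makes RL "verifiable": the signal comes from comparing against a known answer rather than from a learned judge.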
What is Code as Agent Harness?
It is an LLM agent framework that uses Plan-Execute-Verify cycles to make agents more reliable on complex reasoning tasks.
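A minimal sketch of a Plan-Execute-Verify loop, assuming the planner, executor, and verifier are plain callables. All names here (`plan_execute_verify`, `Attempt`, `PEVResult`) are hypothetical and not taken from the Code as Agent Harness work. The loop passes the attempt history back to the planner and stops when verification passes or the round budget runs out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Attempt:
    plan: str
    output: object
    passed: bool


@dataclass
class PEVResult:
    success: bool
    attempts: list[Attempt] = field(default_factory=list)


def plan_execute_verify(
    plan: Callable[[list[Attempt]], str],
    execute: Callable[[str], object],
    verify: Callable[[object], bool],
    max_rounds: int = 3,
) -> PEVResult:
    """Run Plan -> Execute -> Verify until verify() passes or max_rounds is reached."""
    result = PEVResult(success=False)
    for _ in range(max_rounds):
        step = plan(result.attempts)  # planner can revise based on earlier failures
        try:
            output = execute(step)
        except Exception as exc:  # an execution error counts as a failed attempt
            output = exc
        passed = not isinstance(output, Exception) and verify(output)
        result.attempts.append(Attempt(step, output, passed))
        if passed:
            result.success = True
            break
    return result


# Toy usage: the "plan" is a Python expression; the first candidate is off by one.
candidates = ["sum(range(10))", "sum(range(11))"]
toy_plan = lambda history: candidates[len(history)]
toy_execute = lambda code: eval(code, {"__builtins__": {}, "sum": sum, "range": range})
toy_verify = lambda value: value == 55

res = plan_execute_verify(toy_plan, toy_execute, toy_verify)
print(res.success, len(res.attempts))  # True 2
```

The first attempt yields 45 and fails verification; the second yields 55 and passes. The `eval` call with stripped builtins only stands in for a real sandbox and is not safe for untrusted code.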
How does Anti-Self-Distillation use PMI?
Anti-Self-Distillation for Reasoning RL uses Pointwise Mutual Information (PMI) to prevent overfitting and improve reasoning generalization.
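The source does not say how PMI enters the training objective, so this sketch shows only the quantity itself. PMI between events x and y is log[p(x, y) / (p(x) p(y))]; for a token y_t of a reasoning trace and a conditioning context x, it equals the log-probability of y_t with x minus its log-probability without x. The function names are hypothetical.

```python
import math


def pmi(p_joint: float, p_x: float, p_y: float) -> float:
    """Pointwise mutual information log[p(x, y) / (p(x) p(y))], in nats."""
    return math.log(p_joint / (p_x * p_y))


def token_pmi(logp_with_context: list[float], logp_without_context: list[float]) -> list[float]:
    """Per-token PMI: log p(y_t | x, y_<t) - log p(y_t | y_<t).

    Same quantity as above, since p(y | x) / p(y) = p(x, y) / (p(x) p(y)).
    """
    if len(logp_with_context) != len(logp_without_context):
        raise ValueError("log-prob sequences must have equal length")
    return [a - b for a, b in zip(logp_with_context, logp_without_context)]


print(round(pmi(0.2, 0.4, 0.25), 3))          # 0.693  (log 2)
print(token_pmi([-0.5, -2.0], [-1.5, -2.0]))  # [1.0, 0.0]
```

Positive PMI means the context makes the token more likely; zero means the token is independent of the context.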
What is RMT overfitting detection?
RMT detects overfitting in reasoning models without requiring access to test data.
What is the status of verifiable reasoning scaling?
This highlight is listed as developing: new scaling laws and frameworks are still emerging.
SU-01 30B reaches gold-medal performance on IMO and IPhO via reverse-perplexity SFT, two-stage verifiable RL, and test-time scaling (TTS). Also covered: Code as Agent Harness (Plan-Execute-Verify), Anti-Self-Distillation via PMI, RMT overfitting detection without test data, and a survey of deep learning for combinatorial optimization. Newly added: RoPE failures in distinguishing positional from token information, and 50-95% sparse networks for accelerator efficiency.
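To make "50-95% sparse" concrete, the sketch below applies unstructured magnitude pruning. This technique is an assumption chosen for simplicity; the referenced work's sparsification method and hardware mapping are not described here. `magnitude_prune` is a hypothetical helper.

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of entries with the smallest absolute value.

    Returns a new array; the input is left unchanged.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    flat = weights.ravel().copy()
    k = int(round(sparsity * flat.size))  # number of weights to remove
    if k > 0:
        # indices of the k smallest-magnitude weights (stable sort breaks ties by position)
        drop = np.argsort(np.abs(flat), kind="stable")[:k]
        flat[drop] = 0.0
    return flat.reshape(weights.shape)


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
for s in (0.5, 0.9, 0.95):
    pruned = magnitude_prune(w, s)
    print(s, np.mean(pruned == 0))  # fraction of zeroed weights
```

Whether such unstructured zeros translate into accelerator speedups depends on hardware support for sparse formats, which this sketch does not model.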