ProgramBench Coding Benchmark Stumps Frontier LLMs: Top Models Score 0%
Key Questions
What is ProgramBench?
ProgramBench is a new coding benchmark launched by Meta FAIR, Stanford, and Harvard that tests real-world programming skill. It requires models to select a programming language, formulate algorithms, and design data structures on their own, with no predefined scaffolding. Top LLMs, including Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5, scored 0% on it.
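The article does not describe how ProgramBench actually scores submissions, but the description above implies a language-agnostic, behavior-based grader: the model hands back a complete program in whatever language it chose, and only observable input/output behavior is checked. The following Python sketch illustrates one way such a grader could work; the RUN_COMMANDS table, file handling, and test-case format are illustrative assumptions, not the benchmark's real harness.

```python
# Hypothetical sketch of a ProgramBench-style grader. Everything here
# (commands, file suffixes, test-case format) is an assumption for
# illustration; the article does not document the actual harness.
import subprocess
import tempfile
from pathlib import Path

# Assumed mapping from a model-chosen language to a run command.
RUN_COMMANDS = {
    "python": ["python3"],
    "node": ["node"],
}
SUFFIXES = {"python": ".py", "node": ".js"}


def run_submission(language: str, source: str, stdin_text: str,
                   timeout: float = 10.0) -> str:
    """Write the model's program to disk, execute it, and capture stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=SUFFIXES[language],
                                     delete=False) as handle:
        handle.write(source)
        path = Path(handle.name)
    result = subprocess.run(
        RUN_COMMANDS[language] + [str(path)],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout


def grade(language: str, source: str,
          test_cases: list[tuple[str, str]]) -> bool:
    """A task passes only if every hidden input/output pair matches."""
    for stdin_text, expected in test_cases:
        try:
            actual = run_submission(language, source, stdin_text)
        except Exception:
            return False  # crash, timeout, or missing toolchain counts as a fail
        if actual.strip() != expected.strip():
            return False
    return True


if __name__ == "__main__":
    # Toy example: a trivial "print the sum" task solved in Python.
    program = "print(sum(int(x) for x in input().split()))"
    print(grade("python", program, [("1 2 3", "6"), ("10 20", "30")]))
```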
Which AI models performed poorly on ProgramBench?
Claude Opus 4.7, Gemini 3.1 Pro, GPT-5, and other frontier LLMs all scored 0% on ProgramBench; nine models were tested in total. The results expose gaps in real-world programming capability relative to what coding tools such as Aider suggest.
Why does ProgramBench challenge current LLMs?
Unlike simpler benchmarks, ProgramBench demands full program synthesis from open-ended problems, forcing models to handle language choice and algorithm design independently. The same limitations surface in practice, from multi-agent Claude Code PR-review setups such as adamsreview to user-space IP stack experiments.
How does ProgramBench differ from traditional coding benchmarks?
Traditional benchmarks supply scaffolding and a fixed language, while ProgramBench requires models to choose a language and design everything from scratch. On these more realistic tasks, GPT, Claude, and Gemini all fail outright, underscoring the need for better evaluation of real programming ability.
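To make the contrast concrete, here is an illustrative pair of prompts in Python: the first mirrors a scaffolded, fill-in-the-function benchmark task, the second an open-ended ProgramBench-style brief. Both prompts are invented for illustration and are not drawn from the benchmark itself.

```python
# Illustrative contrast only; neither prompt is taken from ProgramBench.

# Scaffolded, fill-in-the-function style: the language, signature, and data
# representation are fixed in advance, so the model only writes a body.
SCAFFOLDED_PROMPT = '''
def median(values: list[float]) -> float:
    """Return the median of a non-empty list of numbers."""
'''

# Open-ended, ProgramBench-style brief (as described in this article): the
# model must pick the language, the algorithms, and the data structures, and
# deliver a complete program whose behavior is graded end to end.
OPEN_ENDED_PROMPT = (
    "Build a command-line tool that reads a stream of timestamped sensor "
    "readings and reports the rolling median over a configurable window. "
    "Choose your own language, algorithms, and data structures; only the "
    "observable behavior of the finished program is graded."
)
```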
What tools highlight gaps exposed by ProgramBench?
Projects like adamsreview, a multi-agent PR-review workflow for Claude Code, and user-space IP stack implementations built with Claude expose the same practical limitations. These hands-on efforts underline the benchmark's focus on genuine, end-to-end programming challenges.