BrokenArXiv: LLMs fail to reliably reject false math proofs
Key Questions
What is the BrokenArXiv benchmark?
BrokenArXiv is a benchmark that tests whether LLMs can reject perturbed or otherwise false math proofs. State-of-the-art LLMs accept around 60% of these false proofs, revealing weaknesses in verification and generalization. Commentators such as Chollet and Marcus point to this as evidence of base LLMs' poor generalization on math.
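The headline number is just an acceptance rate over perturbed proofs. A minimal sketch of how such a metric could be computed is below; the `false_proof_acceptance_rate` function and the example verdicts are hypothetical illustrations, not BrokenArXiv's actual harness.

```python
# Hypothetical sketch of scoring a verifier on perturbed (false) proofs.
# True = model accepted the flawed proof, False = model rejected it.

def false_proof_acceptance_rate(verdicts: list[bool]) -> float:
    """Fraction of false proofs the model accepted (lower is better)."""
    return sum(verdicts) / len(verdicts)

# Illustrative verdicts for ten perturbed proofs (6 accepted, 4 rejected):
verdicts = [True, True, False, True, False, True, True, False, True, False]

rate = false_proof_acceptance_rate(verdicts)
print(f"False-proof acceptance rate: {rate:.0%}")  # -> 60%
```

Under this framing, "accept ~60%" means roughly six in ten deliberately broken proofs pass the model's check.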
Why do LLMs struggle with math proof verification?
Without test-time adaptation, LLMs generalize poorly on math tasks and accept flawed proofs. The benchmark exposes verification weaknesses, which are connected to hallucinations in AI-written papers. Priorities for follow-up work include releasing code, running independent reproductions, and adopting formal proof-checking tools.
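Formal proof-checking tools are a priority precisely because they reject flawed steps mechanically rather than judging plausibility. As a minimal Lean 4 sketch (an illustration, not from the benchmark), the following compiles only because the cited lemma actually justifies the claim:

```lean
-- A formal checker accepts this only because `Nat.add_comm`
-- is exactly the statement being proved; substituting a wrong
-- or irrelevant lemma makes the proof fail to type-check.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

A perturbed version of this proof would be rejected at compile time, which is the guarantee an LLM judging proofs in natural language lacks.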
What is Paper Reconstruction Evaluation?
Paper Reconstruction Evaluation detects presentation flaws and hallucinations in AI-written papers by evaluating how accurately AI-generated content can be reconstructed. The tool addresses gaps in assessing the reliability of AI-written papers.
What advancements, such as Cog-DRIFT and TriAttention, are mentioned?
Cog-DRIFT enables models to learn from zero-reward examples in RLVR, breaking through exploration barriers, while TriAttention makes long reasoning more efficient via trigonometric KV compression. Both tie into broader reasoning improvements amid these verification challenges.
How does BrokenArXiv relate to RLHF jailbreaks and agentic pressure?
BrokenArXiv's findings connect to RLHF jailbreaks, in which models fail under pressure much as they accept false proofs here, underscoring agentic vulnerabilities in reasoning tasks. Tracking code releases and independent reproductions is emphasized for further study.
Summary: The benchmark shows state-of-the-art LLMs accept roughly 60% of perturbed or false math proofs, exposing verification weaknesses; Chollet and Marcus highlight base LLMs' poor generalization on math. A Paper Reconstruction Evaluation was added for detecting hallucinations and presentation flaws in AI-written papers, and the new Cog-DRIFT RLVR and TriAttention long-reasoning advances tie in. The findings connect to RLHF jailbreaks and agentic pressure; priorities are code releases, independent reproductions, and formal proof-checking tools.