AI Research Daily

LLM reasoning, hallucination drivers, verification failures, and deception

Key Questions

What new policy is arXiv implementing for hallucinated references?

arXiv has confirmed 1-year bans for authors who submit papers containing hallucinated references or AI-generated slop. The policy responds to a surge in fabricated citations, estimated at 147k (a 6x increase), with half of new papers now AI-generated.

How homogeneous are AI peer reviewers compared to human ones?

AI peer reviewers produce homogeneous feedback: only 26% of the issues they identify are unique, a lower share than for human reviewers. This raises concerns about verification failures in the review process.

What is π-Bench designed to evaluate?

π-Bench evaluates proactive personal assistant agents in long-horizon workflows. It focuses on real-world task performance for computer-use agents.

What benchmarks test computer-use agents and terminal tasks?

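TerminalWorld benchmarks agents on real-world terminal tasks, with a maximum reported score of 62.5%. Related work covers computer-use agents, touching on reinforcement learning with verifiable rewards (RLVR) and time-of-check to time-of-use (TOCTOU) issues, and SCRL, a curriculum RL approach to credit assignment.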

How many Erdős problems has AlphaProof Nexus solved?

AlphaProof Nexus solved 9 Erdős problems and proved 44 OEIS sequence conjectures. This advances LLM reasoning capabilities in mathematical domains.

What drives hallucination and deception in current LLMs?

Key drivers include verification failures in training data and scaling of AI-generated content. arXiv's ban policy targets these issues directly in submissions.

What is the impact of AI-generated papers on research integrity?

AI-generated papers now comprise half of new submissions, leading to widespread fabricated citations. This has prompted stricter verification measures like arXiv bans.

How does curriculum RL improve LLM reasoning credit assignment?

Curriculum reinforcement learning breaks reasoning chains into verifiable subproblems. Because each subproblem can be checked on its own, reward can be attributed to individual steps rather than only to the final answer, which improves credit assignment in approaches such as SCRL.
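
To make the credit-assignment point concrete, here is a minimal sketch that contrasts a single outcome reward with per-subproblem rewards on a toy two-step chain. It is an illustration only and not SCRL's actual method: the `Subproblem` type, the function names, and the string-matching checkers are all assumptions introduced for this example, and the curriculum ordering itself is not modeled.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: a subproblem is a (prompt, checker) pair,
# where the checker decides whether a candidate answer is accepted.
Subproblem = Tuple[str, Callable[[str], bool]]


def stepwise_rewards(subproblems: List[Subproblem], answers: List[str]) -> List[float]:
    """Return one reward per subproblem: 1.0 if its checker accepts the answer, else 0.0."""
    if len(subproblems) != len(answers):
        raise ValueError("need exactly one answer per subproblem")
    return [1.0 if check(ans) else 0.0 for (_, check), ans in zip(subproblems, answers)]


def outcome_reward(subproblems: List[Subproblem], answers: List[str]) -> float:
    """Return a single reward based only on whether the final answer is accepted."""
    _, final_check = subproblems[-1]
    return 1.0 if final_check(answers[-1]) else 0.0


# Toy chain (assumed for illustration): compute (3 + 4) * 2 in two checkable steps.
chain: List[Subproblem] = [
    ("3 + 4 = ?", lambda a: a.strip() == "7"),
    ("7 * 2 = ?", lambda a: a.strip() == "14"),
]
answers = ["7", "15"]  # first step right, second step wrong

per_step = stepwise_rewards(chain, answers)
outcome = outcome_reward(chain, answers)
print(per_step, outcome)
```

With only the outcome reward, the whole chain scores zero and nothing indicates which step failed. The per-step rewards show that the first step was correct and the second was not, which is the kind of signal that verifiable subproblems make available.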

Summary: arXiv has confirmed 1-year bans for hallucinated references and AI slop; half of new papers are AI-generated, with 147k fabricated citations (6x); AI peer review is homogeneous (26% unique issues). New: π-Bench, computer-use agents (RLVR/TOCTOU), SCRL curriculum RL, the TerminalWorld benchmark (max 62.5%), and AlphaProof Nexus (9 Erdős problems solved).

Updated May 24, 2026