AI Research Daily

LLM reasoning, hallucination drivers, and verification failures

Key Questions

What is BrokenArXiv and its role in LLM evaluation?

BrokenArXiv is a claim-verification benchmark on which models reject only about 40% of false claims. It helps surface reasoning and hallucination issues by testing whether LLMs can identify and refuse unsupported assertions.
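The "~40% false-claim rejection" figure is a simple ratio: of the claims known to be false, what fraction does the model correctly reject? A minimal sketch of that metric, using made-up placeholder verdicts rather than actual BrokenArXiv data:

```python
def rejection_rate(verdicts: list[str], is_false: list[bool]) -> float:
    """Fraction of known-false claims the model labeled 'reject'.

    verdicts: model output per claim, 'accept' or 'reject'
    is_false: ground-truth flag, True if the claim is false
    """
    false_idx = [i for i, f in enumerate(is_false) if f]
    if not false_idx:
        return 0.0
    rejected = sum(1 for i in false_idx if verdicts[i] == "reject")
    return rejected / len(false_idx)

# Toy illustration: 5 false claims, only 2 correctly rejected -> 0.4,
# i.e. the kind of ~40% rejection rate the benchmark reports.
verdicts = ["reject", "accept", "accept", "reject", "accept"]
labels = [True, True, True, True, True]
print(rejection_rate(verdicts, labels))  # 0.4
```

The function names and data here are illustrative assumptions, not BrokenArXiv's actual harness.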

What are H-Neurons in the context of LLM hallucinations?

H-Neurons are a small set of neurons in LLMs, fewer than 0.1% of the total, that are associated with hallucinations. They offer a neuron-level view of what drives hallucination.

What is EsoLang-Bench and what debates does it spark?

EsoLang-Bench probes LLM reasoning on esoteric programming languages, going beyond standard tasks. It has fueled debate over whether performance reflects memorization of training data or genuine gains from interpreter-style execution ("interpreter boosts").
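The "interpreter boost" side of that debate hinges on having a ground-truth oracle: actually executing the esolang program rather than pattern-matching it. As a hedged illustration (EsoLang-Bench's task set is not specified here), a minimal interpreter for a classic esolang like Brainfuck shows what such an oracle looks like:

```python
def run_bf(code: str, input_bytes: bytes = b"") -> bytes:
    """Minimal Brainfuck interpreter: a ground-truth execution oracle
    of the kind an 'interpreter boost' would give a model."""
    # Pre-compute matching-bracket jump table for loops.
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    tape = [0] * 30000
    ptr = pc = inp = 0
    out = bytearray()
    while pc < len(code):
        c = code[pc]
        if c == ">": ptr += 1
        elif c == "<": ptr -= 1
        elif c == "+": tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-": tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".": out.append(tape[ptr])
        elif c == ",":
            tape[ptr] = input_bytes[inp] if inp < len(input_bytes) else 0
            inp += 1
        elif c == "[" and tape[ptr] == 0: pc = jumps[pc]
        elif c == "]" and tape[ptr] != 0: pc = jumps[pc]
        pc += 1
    return bytes(out)

# 8*8 increments plus one, then output: prints 'A' (ASCII 65).
print(run_bf("++++++++[>++++++++<-]>+."))  # b'A'
```

Comparing a model's predicted program output against such an interpreter separates memorized answers from computed ones.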

What tools are highlighted for improving LLM reasoning and verification?

Key tools include MiroThinker-1.7/H1, TERMINATOR, One-Eval, and StreamingThinker, alongside insights into LLM memory interference. These address verification failures and enhance reasoning processes.

What gaps remain in LLM reasoning research according to this highlight?

Major gaps include adversarial robustness and formal, neuron-level editing methods. Further concerns come from papers such as 'The Depth Ceiling', on the limits of latent planning, and from hidden issues in agent benchmarks.

BrokenArXiv (~40% false claim rejection), H-Neurons (<0.1% hallucination neurons), EsoLang-Bench (esoteric reasoning probe, debates on memorization baselines/interpreter boosts), PrincipiaBench (math objects bench/training data). Tooling: MiroThinker-1.7/H1, TERMINATOR, One-Eval, StreamingThinker, LLM memory interference insights. Gaps: adversarial robustness, formal/neuron edits.

Sources (6)
Updated Apr 9, 2026