Academic provenance crisis and agentized forensics

Key Questions

What is llmtester and its scale?

llmtester is a new 50k CLI benchmark for LLMs. It tests capabilities across large-scale evaluations.

Which benchmarks exposed LLM limitations recently?

ADeLe, ARC-AGI-3, MIRAGE, and CirrusBench showed flops for LLMs. They highlight issues in reasoning and generalization.

What is the academic provenance crisis?

It involves fakes at NeurIPS/ICML, coding errors in 25% of papers, and LLM failures like ZEH/coding at 25%. Agentized forensics like Sakana/CORAL address it.

How do Sakana and CORAL contribute to research integrity?

Sakana, CORAL, FlowPIE, and AgentSLR automate and verify scientific discovery. They combat provenance issues with agentized tools.

What is the success rate of Meta's math models?

Meta's math models achieve 70th percentile, alongside daVinci/Lean advances. This contrasts with broader academic coding errors.

What tools aid paper analysis like PaperLens?

PaperLens and OpenSeeker enable agentized forensics for provenance. They support verification amid LLM hallucinations.

Can prompt injection fool LLM judges?

Yes, prompt injection can manipulate LLM judges for high scores like an 'A'. This raises concerns for academic evaluation.

What is self-execution simulation in research context?

Self-execution simulation verifies claims via execution-based checks, as in FactReview. It enhances agentized forensics for coding and math.

llmtester 50k CLI; ADeLe/ARC-AGI-3/MIRAGE/CirrusBench flops; NeurIPS fakes/ICML; Sakana/CORAL/FlowPIE/AgentSLR/PaperCircle/FactReview exec/lit pos; ZEH/coding 25%; Meta math 70p; daVinci/Lean; PaperLens/OpenSeeker; prompt injection judges; self-execution sim coding.

Sources (8)

Updated Apr 8, 2026

AI Space Insight

Academic provenance crisis and agentized forensics

Key Questions

What is llmtester and its scale?

Which benchmarks exposed LLM limitations recently?

What is the academic provenance crisis?

How do Sakana and CORAL contribute to research integrity?

What is the success rate of Meta's math models?

What tools aid paper analysis like PaperLens?

Can prompt injection fool LLM judges?

What is self-execution simulation in research context?

Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification

New LLM benchmark: llmtester — Hive

New method predicts the success of LLMs on untried tasks with high accuracy

LLM Agent Automates End-to-End Research Cycle

Efficient and Principled Scientific Discovery through Bayesian Optimization: A Tutorial

@emollick: New report from us: Can you prompt inject your way to an “A”? As LLMs increasingly are used as judg...

@salonium reposted: 6/ We also find: Coding errors in ~25% of papers. Major coding errors in about 1...