AI Innovation Nexus

OpenAI o1 Tops ER Triage Benchmarks

Key Questions

What performance did OpenAI's o1 achieve on medical benchmarks?

OpenAI's o1 scored 89% on NEJM diagnostic benchmarks and 67% on a Harvard ER triage benchmark, outperforming triage doctors, who scored 50-55% on the same task. A BRIDGE analysis covering more than 1M electronic health records (EHRs) confirms o1's edge, though Chain-of-Thought (CoT) prompting sometimes degrades its performance.

How does o1 compare to human doctors in ER triage?

In a Harvard trial, o1 correctly diagnosed 67% of ER patients, compared to 50-55% for triage doctors, demonstrating superior accuracy in emergency triage scenarios. The study drew wide attention, including viral discussion on Hacker News.

What is PhysicianBench?

PhysicianBench evaluates LLM agents in realistic Electronic Health Record (EHR) environments, extending benchmark coverage to real-world medical AI applications. The benchmark has been discussed in research papers and forums.

What challenges were noted in BRIDGE analysis?

BRIDGE, an analysis spanning more than 1M EHRs, confirms o1's advantage but shows that Chain-of-Thought (CoT) prompting can degrade performance, highlighting where clinical reasoning still needs improvement. Even so, the results support accelerating medical AI deployment.

What are the next steps for medical AI like o1?

Ethics review, regulatory approval, and clinical trials are the next priorities, and all three are essential before widespread deployment. Even so, o1's benchmark results are accelerating momentum in medical AI.

In brief: o1 hits 89% on NEJM diagnostics and 67% on Harvard ER triage versus doctors' 50-55%; a BRIDGE analysis of 1M+ EHRs confirms the edge but shows CoT can hurt; PhysicianBench extends evaluation to EHR agents; a viral critique of the study exposes gaps between text-only models and the multimodal imaging tasks FDA-cleared tools handle; MLHC 2026 signals a coming wave of medical papers. Ethics, regulations, and trials are accelerating medical deployments.

Updated May 5, 2026