LLM Benchmark Watch

LifeSciBench and domain-specific models — new benchmark for real scientific reasoning

Key Questions

What is LifeSciBench and how was it created?

LifeSciBench is a 750-task benchmark of real life-science research tasks, graded against expert-written rubrics that total 19,020 criteria (about 25 per task on average). OpenAI released it to evaluate practical scientific reasoning.

How do general models perform on LifeSciBench?

Top general models like GPT-5.5 pass only about 26% of tasks, highlighting the gap between broad capabilities and specialized scientific reasoning.

What specialized model was tested and what were the results?

GPT-Rosalind, a domain-specialized model, reached 36% on the benchmark, roughly 10 percentage points above the ~26% of top general models such as GPT-5.5. The result signals OpenAI's interest in vertical AI for life sciences.

Why was LifeSciBench introduced?

It addresses the limitations of narrow, fact-based biology benchmarks by focusing on realistic research workflows, and it introduces a rubric-based evaluation paradigm in which responses are graded against expert-written criteria.
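
To make the rubric idea concrete, here is a minimal sketch of how rubric-based grading could be turned into a task pass rate. It is illustrative only: the source does not say how LifeSciBench decides that a task passes, so the `Criterion` class, the function names, and the 0.8 threshold are all hypothetical assumptions, not OpenAI's method.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line; `met` is the grader's verdict for a given response."""
    description: str
    met: bool


def task_score(criteria):
    """Fraction of rubric criteria that the response satisfies."""
    if not criteria:
        raise ValueError("a rubric needs at least one criterion")
    return sum(c.met for c in criteria) / len(criteria)


def pass_rate(tasks, threshold=0.8):
    """Share of tasks whose rubric score reaches the threshold.

    The threshold is an assumption made for this sketch; the real
    benchmark may use a different pass rule.
    """
    passed = sum(task_score(criteria) >= threshold for criteria in tasks)
    return passed / len(tasks)


# Two toy tasks with made-up criteria (not taken from LifeSciBench).
tasks = [
    [Criterion("states the hypothesis", True), Criterion("names a control", True)],
    [Criterion("picks a valid assay", True), Criterion("reports units", False)],
]

print(pass_rate(tasks))
```

Under a rule like this, a response can satisfy many individual criteria and still fail the task, which is one way a task-level pass rate can remain low.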

What does LifeSciBench imply for future AI development?

The results underscore the need for domain-specific training and evaluation. The release also marks OpenAI's strategic push into vertical models tailored for scientific applications.

OpenAI has released LifeSciBench, a 750-task benchmark with expert-written rubrics (19,020 criteria) that grades AI on real life-science research. Top general models such as GPT-5.5 pass only ~26% of tasks, while GPT-Rosalind, a domain-specialized model, reaches 36%. The results highlight the gap between general capability and specialized scientific reasoning, and the benchmark introduces a rubric-based evaluation paradigm. The release also signals OpenAI's push into vertical models for life sciences.

Updated Jun 18, 2026