AI for Scientific Discovery, Robotics, Healthcare

Key Questions

What cautionary findings exist for clinical LLMs?

A Lancet Digital Health comment highlights deception risks in clinical LLMs, and the BRIDGE benchmark shows that 92% exam performance drops to 44.8% on real clinical language tasks.

How do traditional ML models compare to tabular foundation models in healthcare?

Established methods like XGBoost match or exceed tabular foundation models on clinical prediction tasks; TabPFN wins only 16.7% of tasks while running 5.5x slower.

What improvement does MRPO offer for medical multimodal reasoning?

MRPO reduces cascading errors from 64% to 13% in early-stage failure cases through step-aware reinforcement learning.

How does DiffusionGemma perform on radiology tasks?

It achieves parity with autoregressive models on radiology report drafting while providing a 3.5-4.4x speedup through discrete diffusion.

Which model outperforms AlphaFold3 in protein structure prediction?

ESMFold2, an open-source model, surpasses AlphaFold3 on relevant benchmarks for scientific discovery applications.

What robotics advancements are reported?

Xpeng VLA 2.0, RotVLA (98.2% on LIBERO), OASIS for zero-shot humanoid loco-manipulation, and ABot-M0.5 for unified mobility-manipulation are highlighted.

How effective is Mayo Clinic's REDMOD for cancer detection?

It detects pancreatic cancer up to 3 years early using multimodal AI on clinical data.

What does GeneBench-Pro reveal about AI agents?

OpenAI's GeneBench-Pro exposes a 'noticing-to-acting gap' where models struggle to translate observations into correct actions on research-grade genomics tasks.

Major cautionary signals: a Lancet Digital Health comment highlights deception risks in clinical LLMs; the BRIDGE benchmark shows 92% exam performance dropping to 44.8% on real clinical language; and a new study finds that established ML (XGBoost) matches tabular foundation models in clinical predictions, with TabPFN winning only 16.7% of tasks while running 5.5x slower.

On the positive side, MRPO reduces cascading errors in medical multimodal reasoning from 64% to 13% in early-stage failure cases. DiffusionGemma achieves parity with autoregressive models on radiology report drafting with a 3.5-4.4x speedup.

Also: ESMFold2, an open-source model, outperforms AlphaFold3; the Curia radiology foundation model reaches SOTA; Xpeng VLA 2.0; Flash-WAM delivers a 23x speedup; RotVLA scores 98.2% on LIBERO; OASIS performs zero-shot humanoid loco-manipulation; Mayo Clinic's REDMOD detects pancreatic cancer 3 years early; LifeSciBench; a paper on hallucination in world models; ABot-M0.5, a unified mobility-and-manipulation world action model; DiscoPER for autonomous scientific discovery; BioInsight for multi-agent biomedical discovery; ViTAdapter-CWMSDA reaches SOTA on 7 medical segmentation datasets; and OpenAI's GeneBench-Pro exposes a 'noticing-to-acting gap'.
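To put the headline percentages in perspective, the short sketch below separates the percentage-point change from the relative change for the two before/after figures quoted above (BRIDGE and MRPO). The helper name `change` is illustrative only, and the calculation is plain arithmetic on the digest's numbers, not a reproduction of how either source computed its results.

```python
def change(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point change, relative change in percent)."""
    points = after - before            # difference in percentage points
    relative = 100.0 * points / before # change as a share of the starting value
    return points, relative


# Figures quoted in this digest, in percent.
bridge = change(92.0, 44.8)  # exam performance vs. real clinical language
mrpo = change(64.0, 13.0)    # cascading errors before vs. after MRPO

for name, (points, relative) in {"BRIDGE": bridge, "MRPO": mrpo}.items():
    print(f"{name}: {points:+.1f} points, {relative:+.1f}% relative")
```

Read this way, the BRIDGE drop is about 47 percentage points (roughly half of the exam-level score), and the MRPO reduction is 51 percentage points (roughly four fifths of the original cascading-error rate).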
