LLM Evaluation: Proxies Meet Domain Benchmarks
- Proxy metrics derived from expert token statistics predict downstream reasoning performance with a Spearman correlation of 0.81, beating cross-entropy baselines.
- These...
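
For readers unfamiliar with the statistic in the first bullet, here is a minimal sketch of how a Spearman rank correlation between a proxy metric and a benchmark score could be computed. The score lists are made-up illustrative numbers, not data from the study, and the helper names `average_ranks` and `spearman` are hypothetical.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-model scores, for illustration only.
proxy_scores = [0.12, 0.35, 0.30, 0.58, 0.71]
benchmark_acc = [0.41, 0.48, 0.52, 0.66, 0.63]
rho = spearman(proxy_scores, benchmark_acc)
print(f"Spearman rho = {rho:.2f}")
```

Because only the ordering of models matters, a proxy can score well on this statistic even when its raw values are on a very different scale from the benchmark.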

Created by Jonathan Jones
Frontier LLM research, product launches, and commercial AI innovations
Explore the latest content tracked by LLM Innovation Tracker
OpenAI Codex background agents now run autonomously to handle infrastructure tasks like IaC maintenance, refactoring, and CI/CD workflows without...
US banks lead in AI maturity, with 59% of deployed initiatives now generating measurable value, yet at least 30% of GenAI projects are expected to...
AI workflows break traditional testing because outputs vary with context, temperature, and stochastic elements.
Four-level framework addresses...
Anthropic’s Claude Mythos Preview uncovered over 10,000 high- or critical-severity vulnerability candidates across 1,000 open-source projects in one...
Luma's AI agents cut Hollywood TV production from six weeks to one, with two major studios already deploying them for consistent multi-shot workflows....
The piece brands Anthropic's profitability narrative a "swindle", drawing 54 Hacker News comments in a pointed critique of its financial story.
Three fresh approaches signal a shift from standard transformers toward more efficient LLM designs:
Two papers push back on standard LLM data practices.
Symphonia, an open-source platform, uses LLMs to automate iterative expert consensus, delivering faster and more scalable evidence synthesis for...
Medical foundation models are moving beyond benchmarks into real deployments spanning genomics and clinical imaging.
RL struggles with out-of-distribution enterprise tasks, prompting alternatives that deliver richer training signals.
Antigravity 2.0 leads the OpenSCAD Architectural 3D LLM Benchmark, highlighting progress in LLM-driven 3D generation for architecture. The result scored 339 points on Hacker News.
A wide range of LLMs were evaluated on behavioral experiments with over 200,000 participants and nearly 26 million human responses. The study directly compares base models with their post-trained versions to uncover key differences.
MLLMs are rapidly moving from research concepts to deployed systems handling text, vision, and sensors.
AI features succeed when placed in background async pipelines rather than blocking user requests.
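
As a rough illustration of the background-pipeline pattern in the last item, the sketch below uses Python's `asyncio` to enqueue the slow AI step and return to the caller at once. The names `slow_ai_feature`, `handle_request`, and `worker` are hypothetical, and the model call is simulated with a short sleep rather than a real API.

```python
import asyncio

async def slow_ai_feature(text):
    """Stand-in for a model call (hypothetical); sleeps instead of calling an API."""
    await asyncio.sleep(0.05)
    return text.upper()

async def worker(queue, results):
    """Background consumer: pulls jobs off the queue and stores their results."""
    while True:
        job_id, text = await queue.get()
        try:
            results[job_id] = await slow_ai_feature(text)
        finally:
            queue.task_done()

async def handle_request(queue, job_id, text):
    """Enqueue the work and return immediately instead of awaiting the model."""
    queue.put_nowait((job_id, text))
    return {"job_id": job_id, "status": "queued"}

async def main():
    queue = asyncio.Queue()
    results = {}
    task = asyncio.create_task(worker(queue, results))

    response = await handle_request(queue, 1, "summarize this ticket")
    # The caller already has a response while the AI step is still pending.
    pending_at_response = 1 not in results

    await queue.join()  # wait for the background job (for this demo only)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return response, pending_at_response, results

response, pending_at_response, results = asyncio.run(main())
print(response, pending_at_response, results)
```

In a real service the queue would more likely be a durable job system than an in-process `asyncio.Queue`; the point is only that the request path never waits on the model.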