LLM Benchmark Watch

Claude 4.7/4.8 + Mythos leak + legal/tools + Mythos cyber + $900B valuation

Key Questions

What are Claude Opus 4.7's top benchmark scores?

Opus 4.7 leads with SWE-Pro at 64.3%, Vibe Code at 71%, GPQA at 91.3%, and SWE-Bench at 80.8%.

What was leaked about the Mythos model?

The Mythos leak revealed strong AISI cyber performance at 68.6% and leadership on HLE 2026 at 64.7%.

What business moves has Anthropic made recently?

Anthropic acquired Stainless to enhance API developer tools and is in talks for Microsoft Maia chips.

Who recently joined Anthropic and why is it significant?

Andrej Karpathy joined Anthropic, bringing expertise from Tesla AI and signaling strategic talent acquisition.

How is Claude expanding in enterprise settings?

Claude is seeing rapid adoption in corporate finance, with KPMG rolling it out to 276,000 staff, alongside partnerships such as the one with Bristol Myers for drug discovery.

What real-world gaps are noted for Claude models?

Real-world performance gaps versus smaller models have been observed despite high benchmark scores.

How does Mythos compare to OpenAI models on HLE 2026?

Mythos leads HLE 2026 at 64.7%, ahead of GPT-5.4 Pro at 58.7% and GPT-5.5 Pro at 57.2%.

What legal or acquisition activity supports Anthropic's growth?

The acquisition of Stainless strengthens Claude API tools, and enterprise spending shows Anthropic surpassing OpenAI in some business adoption metrics.

Opus 4.7: SWE-Pro 64.3% (#1), Vibe Code 71%, GPQA 91.3%, SWE-Bench 80.8%; Mythos leak (AISI 68.6%); Anthropic acquires Stainless; Karpathy joins; talks for Microsoft Maia chips. HLE 2026: Mythos leads at 64.7%. Real-world gaps vs. smaller models noted.

Sources (30)
Updated May 23, 2026