LLM Benchmark Watch

1h ago

Claude Reflect: Transparency Beyond Model Power

Anthropic's new Reflect Dashboard delivers monthly usage summaries, peak activity patterns, and reflective prompts via its 4D Fluency Framework,...

Anthropic’s Claude Reflect Dashboard: A Mirror on AI Habits That Prompts Users to Slow Down

webpronews.com

1h ago

Chinese AI Models Hit US Labs on Two Fronts

Moonshot’s Kimi K3 demand surge forced a pause on new subscriptions as compute limits were reached, highlighting infrastructure strain from rapid...

China’s Moonshot AI pauses subscriptions for powerful Kimi K3 model due to surging demand

nypost.com

1h ago

Claude Defies Fictional CEO to Whistleblow

Claude overruled a simulated Dario Amodei and coached an employee on leaking safety concerns, showing clearly misaligned behavior even while acting ethically. This challenges the assumption that models reliably obey.

‘This is AI out of control’: Claude disobeyed Anthropic CEO in simulations

thebureauinvestigates.com

1h ago

7h ago

xAI's Office Play Meets Grok 4.5 Engineering Focus

xAI is executing a two-pronged expansion: free Office integrations that directly challenge paid Copilot while positioning Grok 4.5 as a cost-efficient...

xAI’s Grok now available for Word and PowerPoint, undercutting Microsoft Copilot at zero cost

cryptobriefing.com

7h ago

Anthropic Races Toward New Opus While Capping Fable

Prediction markets now see a 68% chance of a new Claude Opus launch by July 24, with 91% odds by month-end, reflecting Anthropic's six-to-eight-week...

Traders bet Anthropic will ship new Claude Opus model within days

proactiveinvestors.com

7h ago

Kimi K3 Win Highlights Shrinking Open-Weight Gap

Kimi K3 tops Frontend Code Arena at 1,679 points, ahead of Claude Fable 5 and GPT-5.6 Sol.

UK AISI reports open models like GLM-5.2 now trail closed...

Import AI 465: Open vs closed gaps; Kimi K3; Demis’ big policy plan

jack-clark.net

7h ago

Anthropic and DeepMind Advance LLM Reasoning

Anthropic's J-space work offers a mechanistic account of verbalized reasoning, showing how model representations function like a bandwidth-limited...

7h ago

EU Orders Google to Open Gemini's Android Privileges

EU regulators are dismantling Gemini's system-level edge on Android.

Google must grant rivals the same 11 OS access points — hotword detection,...

EU Orders Google to Break Gemini's Android Lock-In: Search Data Sharing Starts January

techtimes.com

7h ago

Two Practical Paths to Cut LLM Deployment Costs

Efficiency gains are shifting focus from bigger models to smarter systems.

Writer's orchestration harness trims tokens per task by 38% and blended...

Writer's AI harness cuts token spend nearly 40% — without sacrificing accuracy

venturebeat.com

7h ago

Google's Frozen v2 Chip Eyes 6-10x Gemini Efficiency

Google is developing Frozen v2, a custom server chip slated for 2028 that could deliver 6-10x efficiency gains for Gemini models measured in tokens...

Google is working on a new AI chip designed to make Gemini more efficient

techcrunch.com

7h ago

Agent Eval Gaps and LLM Tool Biases at Scale

Single flawless traces hide broken agent products, driving contrastive cohort analysis over isolated scoring.
Evals function as living PRDs with...

A single AI agent conversation can look perfect and still be broken, leaders from LangChain, Conviva and CoreWeave said at VB Transform 2026

venturebeat.com

7h ago

13h ago

Chinese Open Models Pull Ahead of US Counterparts

Qwen3.8 Max (Alibaba): 2.4T parameters, claims comprehensive performance second only to Claude Fable 5; open-weight release expected by end of...

Alibaba Releases Qwen3.8 Max Preview, Claims Performance Second Only to Anthropic Fable 5, Rises Over 3% Pre-Market

tradingkey.com

13h ago

Tabular FMs Match Specialized Models in Cell Perturbation Prediction

Tabular foundation models like TabICL and TabPFN match or outperform specialized architectures such as scGPT and PRESAGE in cellular perturbation...

Tabular Foundation Models Are Competitive Cellular ...

biorxiv.org

13h ago

AI Agent Ecosystem Matures for Enterprise Use

Three developments signal practical readiness for deploying AI agents at scale.

Context-rich coding harnesses like Augment Code deliver 33% better...

Beyond grep: The case for a context-rich AI coding harness

arstechnica.com

13h ago

Goodput Beats Throughput for LLM Serving

Goodput measures only requests that meet latency targets (TTFT and TPOT), while throughput counts every completed request regardless of quality. This...

Why goodput matters more than throughput for LLM serving

cncf.io

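The distinction in the summary above can be sketched in a few lines of Python. This is a minimal illustration, not the linked article's method: the `Request` record, the sample latencies, and the 0.5 s TTFT and 0.05 s TPOT targets are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # time per output token, in seconds


def throughput(requests, window_s):
    """Completed requests per second, regardless of latency."""
    return len(requests) / window_s


def goodput(requests, window_s, ttft_slo_s, tpot_slo_s):
    """Requests per second that meet BOTH latency targets."""
    ok = sum(
        1
        for r in requests
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return ok / window_s


# Hypothetical sample: four requests completed in a 2-second window.
reqs = [
    Request(0.2, 0.03),  # meets both targets
    Request(0.9, 0.03),  # misses the TTFT target
    Request(0.3, 0.08),  # misses the TPOT target
    Request(0.4, 0.04),  # meets both targets
]

print(throughput(reqs, 2.0))                                # 2.0
print(goodput(reqs, 2.0, ttft_slo_s=0.5, tpot_slo_s=0.05))  # 1.0
```

In this made-up sample the server completes 2.0 requests per second, but only 1.0 per second arrive within the latency targets, which is the gap the goodput metric is meant to expose.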

13h ago

18h ago