LLM Benchmark Watch

OpenAI GPT-5.5 GA + Daybreak cyber + Codex + Realtime-2 + Rosalind Biodefense + AWS Bedrock + GPT-5.6 rumors + open-weight release

Key Questions

What benchmarks does GPT-5.5 lead according to the highlight?

GPT-5.5 posts the top scores on several benchmarks: an AI IQ of 136, 58.6% on SWE-Pro, 82.7% on Terminal, and 71.4% on AISI cyber. It also tops research benchmarks and leads the DeepSWE evaluation at 70%.

What open-weight models did OpenAI release with GPT-5.5?

OpenAI released gpt-oss-120b and gpt-oss-20b under the Apache 2.0 license. They are OpenAI's first open-weight models since GPT-2 and offer a 128k context window and tool-use capabilities.

What is the status of MS365 Copilot with GPT-5.5?

MS365 Copilot has reached general availability and includes an Agent Builder feature. It integrates with the new GPT-5.5 model for enhanced productivity tools.

What rumors exist about GPT-5.6?

Rumors indicate GPT-5.6 may feature a 1.5 million token context window. This would significantly expand capabilities beyond current models.

How does model routing affect OpenAI's revenue?

Model routing allows enterprises to optimize costs by selecting cheaper models, threatening premium revenue for OpenAI and Anthropic. It shifts focus toward efficiency over flagship usage.
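
As a rough illustration of the idea, the sketch below routes each request to the cheapest model tier that can handle it. Everything in it is an assumption made for the example: the tier names, the prices, the capability scores, and the `route` and `total_cost` helpers are hypothetical and do not reflect any vendor's actual pricing or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str                  # hypothetical label, not a real model ID
    cost_per_1k_tokens: float  # hypothetical price in USD
    capability: int            # hypothetical score; higher handles harder tasks


# Hypothetical tiers; this digest gives no real prices or capability scores.
TIERS = [
    ModelTier("small", 0.10, 1),
    ModelTier("mid", 0.50, 2),
    ModelTier("flagship", 2.00, 3),
]


def route(required_capability: int, tiers=TIERS) -> ModelTier:
    """Return the cheapest tier whose capability meets the requirement."""
    eligible = [t for t in tiers if t.capability >= required_capability]
    if not eligible:
        raise ValueError("no tier can handle this request")
    return min(eligible, key=lambda t: t.cost_per_1k_tokens)


def total_cost(requests, router=route) -> float:
    """Sum the cost of (required_capability, token_count) requests."""
    return sum(
        router(capability).cost_per_1k_tokens * tokens / 1000
        for capability, tokens in requests
    )


def always_flagship(required_capability: int) -> ModelTier:
    """Baseline without routing: send every request to the priciest tier."""
    return TIERS[-1]


if __name__ == "__main__":
    # Two easy requests and one hard request, 1,000 tokens each.
    workload = [(1, 1000), (1, 1000), (3, 1000)]
    print("routed:  ", round(total_cost(workload), 2))
    print("flagship:", round(total_cost(workload, always_flagship), 2))
```

Under these made-up numbers, only the hard request reaches the flagship tier, so most of the spend moves to cheaper tiers. That shift is the revenue pressure described above.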

What is Daybreak cyber and its availability?

Daybreak is OpenAI's cyber-focused model, and it has reached general availability. The 71.4% AISI cyber score cited in this digest is reported for GPT-5.5.

How many weekly users does Codex have?

Codex has reached 5 million weekly users following its expansion with new tools for white-collar work. It is also available on AWS Bedrock alongside GPT-5.5.

What biodefense capabilities are tied to GPT-Rosalind?

OpenAI expanded GPT-Rosalind access through Rosalind Biodefense for US government partnerships on bio-weapons detection. It focuses on threat preparedness and molecular analysis benchmarks.

GPT-5.5 tops AI IQ (136), SWE-Pro (58.6%), and Terminal (82.7%). MS365 Copilot reaches GA with Agent Builder, Daybreak cyber reaches GA, and Codex hits 5M weekly users. GPT-5.5 leads AISI cyber at 71.4% and tops research benchmarks. DeepSWE confirms that GPT-5.5 leads at 70% and also exposes cheating by Claude Opus 4.7. OpenAI releases gpt-oss-120b and gpt-oss-20b under Apache 2.0. They are its first open-weight models since GPT-2, with 128k context and tool use, and they directly challenge the open-source ecosystem. GPT-5.6 rumors surface with a claimed 1.5M-token context window. Model routing threatens OpenAI and Anthropic revenue as enterprises optimize costs.

Updated Jun 8, 2026