MAI-Image-2 multimodal model hits Arena top 3 and rolls out to Copilot and Microsoft Foundry; lineup expands with MAI-Transcribe-1 and MAI-Voice-1 as new foundation models
Key Questions
What is MAI-Image-2 and how does it perform?
MAI-Image-2 is Microsoft's multimodal image model, ranked in the Arena top 3 while generating at roughly 2x speed. It rolls out to Copilot, Bing, PowerPoint, Teams, and Microsoft Foundry, with general availability April 2-5, 2026.
What are the specs for MAI-Transcribe-1?
MAI-Transcribe-1 posts a 3.88% word error rate (WER) across 25 languages, transcribes 2.5x faster, and costs $0.36 per hour of audio. It is one of three new MAI models covering speech, image, and voice, available on Microsoft Foundry and in the Playground.
What is MAI-Voice-1 and its pricing?
MAI-Voice-1 is a text-to-speech model priced at $22 per million characters. It expands Microsoft's in-house AI lineup, a push toward self-sufficiency relative to OpenAI, and is optimized for human-centered use cases.
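The per-unit prices quoted above can be turned into rough workload estimates. A minimal sketch, using only the figures stated in this briefing (actual billing granularity, minimums, and rounding are assumptions, not confirmed):

```python
# Rough cost estimator from the prices quoted in this briefing.
# Billing granularity and rounding behavior are assumptions.
TRANSCRIBE_PER_HOUR = 0.36     # MAI-Transcribe-1: USD per hour of audio
TTS_PER_MILLION_CHARS = 22.0   # MAI-Voice-1: USD per 1M characters
IMAGE_PER_GEN = 0.034          # MAI-Image-2: approximate USD per image

def transcription_cost(audio_hours: float) -> float:
    """Cost to transcribe the given hours of audio."""
    return audio_hours * TRANSCRIBE_PER_HOUR

def tts_cost(characters: int) -> float:
    """Cost to synthesize speech for the given character count."""
    return characters / 1_000_000 * TTS_PER_MILLION_CHARS

def image_cost(images: int) -> float:
    """Approximate cost to generate the given number of images."""
    return images * IMAGE_PER_GEN

# Example workload: 10 hours of audio, 500k characters of speech, 100 images
print(round(transcription_cost(10), 2))  # 3.6
print(round(tts_cost(500_000), 2))       # 11.0
print(round(image_cost(100), 2))         # 3.4
```

At these rates, a mixed workload of that size would run under $20, which is the basis for the "undercut rivals" framing below.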
Where are the new MAI models available?
General availability runs April 2-5, 2026 across Copilot, Bing, PowerPoint, Teams, and Microsoft Foundry (formerly Azure AI Studio). The models compete with mid-size offerings from Google and OpenAI.
Why is Microsoft building its own AI models?
To reduce reliance on OpenAI after revisions to their partnership. The new in-house stack of MAI models serves marketing and enterprise scenarios, with an emphasis on cost-effective, proprietary technology.
How do MAI models undercut rivals?
They are priced below rivals while delivering strong performance, such as the Arena #3 image ranking and fast transcription. The three models target speech-to-text, voice generation, and image generation, and form part of a strategy to operate within compute limits.
What is the rollout timeline for MAI models?
General availability begins April 2-5, 2026 via Microsoft Foundry. The launch includes MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, signaling Microsoft's push for AI independence.
How do MAI models integrate with Microsoft products?
They roll out to Copilot, Bing, PowerPoint, Teams, and Microsoft Foundry. Built around human-centered design for transcription, voice, and image generation, they reduce dependency on OpenAI while maintaining the partnership.
Summary: MAI-Image-2 at Arena #3 (2x speed, ~$0.034/image); MAI-Transcribe-1 (3.88% WER, 25 languages); MAI-Voice-1 TTS ($22 per million characters); GA April 2-5, 2026 for Copilot, Bing, PowerPoint, Teams, and Microsoft Foundry; a Suleyman-led in-house multimodal push to reduce OpenAI dependence.