Introducing MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2: Microsoft Foundry’s Next Leap in AI

Microsoft has released three new models in public preview in Microsoft Foundry: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. Together they cover speech recognition, speech generation, and image generation, and developers can integrate them into real-world applications through Azure Speech and Foundry.

What’s New?

MAI-Transcribe-1

A first-generation speech recognition model, MAI-Transcribe-1 delivers enterprise-grade accuracy across 25 languages at roughly half the GPU cost of leading alternatives. It’s engineered for reliability across accents, languages, and challenging audio conditions, making it ideal for:

Real-time transcription for IVR systems, virtual assistants, and call centers
Live captioning for events and meetings
Automated media subtitling and archiving
Education and e-learning platforms
Customer and market insights through structured data extraction

MAI-Voice-1

This high-fidelity speech generation model can produce 60 seconds of expressive audio in under one second on a single GPU. It powers features like Copilot’s Audio Expressions and podcast tools, and is available for developers to create custom voices (with responsible AI safeguards). Use cases include:

Conversational AI and agent assist
Accessibility and live captioning
Media production and content creation
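The "60 seconds of audio in under one second" figure is easier to compare with other speech models when expressed as a real-time factor (RTF), a common speech metric defined as compute time divided by the duration of the audio produced. The minimal Python sketch below applies that definition to the stated numbers. The function names are illustrative only and are not part of any Microsoft SDK, and the calculation assumes the one-second figure is an upper bound on generation time.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Compute time divided by audio duration; values below 1.0 mean faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup(processing_seconds: float, audio_seconds: float) -> float:
    """Seconds of audio produced per second of compute (the inverse of the RTF)."""
    return 1.0 / real_time_factor(processing_seconds, audio_seconds)


if __name__ == "__main__":
    # Stated figure: 60 s of audio generated in under 1 s, so 1.0 s is an upper bound on compute.
    rtf = real_time_factor(processing_seconds=1.0, audio_seconds=60.0)
    print(f"RTF below {rtf:.4f}")  # 1/60, i.e. roughly 0.0167
    print(f"speed-up above {speedup(1.0, 60.0):.0f}x real time")
```

In other words, the claim corresponds to an RTF below 1/60, or generation more than 60 times faster than playback.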

MAI-Image-2

MAI-Image-2 is Microsoft’s most advanced text-to-image model, debuting in the top three on the Arena.ai leaderboard. It excels at generating photorealistic images, rendering text in graphics, and handling complex layouts. It already powers Bing Image Creator, Copilot, and PowerPoint, and is now available for developers to:

Ideate and visualize creative concepts
Generate custom visuals for enterprise communications
Prototype UX and product concepts

Why It Matters

These models are more than technological milestones: they already power Microsoft’s own products and are now available to developers everywhere. Their lower GPU cost, fast generation, and enterprise-grade reliability make them practical building blocks for scalable, production-ready AI solutions.

Getting Started

Try the models: Experiment with MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 in the MAI Playground
Build in Foundry: Deploy these models via Azure Speech and Foundry APIs
Learn more: Official announcement

Microsoft’s ongoing collaboration with NVIDIA and advances in Azure AI infrastructure ensure that these models are supported by world-class hardware and cloud engineering, ready for the most demanding enterprise workloads.
