Microsoft presents its own new transcription and speech generation models

By Editor

Apr 3, 2026

Microsoft has presented its first artificial intelligence models for transcription and voice generation, which already work in the company’s own services such as Copilot and Azure Speech, and are part of a strategy aimed at launching the most advanced frontier models in 2027.

The technology company has released its three most recent models in public early access: the MAI-Image-2 image generator, the MAI-Voice-1 voice generator, and the new MAI-Transcribe-1 transcription model.

While MAI-Image-2 was presented in mid-March as a model capable of generating professional photorealistic results from text, MAI-Transcribe-1 and MAI-Voice-1 are the first generation of two new models with which Microsoft aims to create “a comprehensive proprietary audio AI platform, designed specifically for developers.”

In this context, MAI-Transcribe-1 is a highly accurate speech recognition model that supports 25 languages. On its blog, Microsoft has highlighted the model's efficiency: its GPU cost is approximately 50 percent lower than that of the main alternatives.

It is designed to provide real-time transcription and captioning for live events, virtual assistants, call center workflows, meetings, and learning modules, among other use cases.

When it comes to MAI-Voice-1, Microsoft claims that it is “ultra-fast”, as it can generate up to 60 seconds of audio in less than a second using a single GPU. It currently powers expressive voice experiences in Copilot’s audio and podcast features.

All three models are already in use in Microsoft Copilot, Bing, PowerPoint, and Azure Speech services, and can be found in Playground and Foundry.

These models are part of Microsoft’s in-house development strategy, through which it aims to create cutting-edge models next year to compete with companies such as OpenAI and Anthropic.

The chief executive of Microsoft AI, Mustafa Suleyman, explained in an interview with Bloomberg that the company intends to reach “the absolute frontier”, and that for 2027 it has set the goal of “really reaching the latest technology” in models that can respond or generate text, images and audio.