Microsoft AI Introduces MAI-Transcribe-1.5: 2.4% WER on Artificial Analysis, Best-in-Class FLEURS Accuracy, and Up to 5x Faster Long-Form Audio Transcription

Last week Microsoft AI announced MAI-Transcribe-1.5. It is the second iteration of the company's in-house speech-to-text family. The model targets accuracy across 43 languages and a range of accents and acoustic environments. Microsoft is already using it for transcription in production.

What is MAI-Transcribe-1.5

MAI-Transcribe-1.5 is an automatic speech recognition (ASR) model. It takes audio as input and returns text. Microsoft built it in-house rather than on top of a third-party model. The model handles 43 languages in one system. It is optimized for varied pronunciations, dialects, and real-world acoustic environments.

Microsoft has integrated it into Copilot, Teams, GitHub, and Dynamics 365 Contact Center. It is also available in Foundry, Microsoft's AI model platform.

The Case for Accuracy

Accuracy here is measured by word error rate (WER). A lower WER means fewer errors per word transcribed. Microsoft reports best-in-class WER across all 43 languages on FLEURS, a widely used multilingual speech benchmark.
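To make the metric concrete, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The function name and the sample sentences are illustrative assumptions, not Microsoft's evaluation code, and real benchmarks usually apply heavier text normalization than the lowercasing used here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: three of six reference words are wrong.
sample_wer = word_error_rate(
    "Shaun met Aoife and Xochitl today",
    "Sean met Oif and Societal today",
)
print(f"WER: {sample_wer:.1%}")
```

A reported WER of 2.4% therefore corresponds to roughly 24 word-level errors per 1,000 reference words.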

On the Artificial Analysis leaderboard, the model posts a WER of 2.4%. That puts it third on that independent benchmark. So the picture is mixed: Microsoft claims first place on FLEURS, while Artificial Analysis ranks the model third.

Language expansion is another part of the accuracy story. Coverage increased from 25 to 43 languages, and Microsoft says the 18 new languages were added without compromising accuracy. Ten of them are South Asian, including Bengali, Tamil, and Telugu. Eight are European, such as Ukrainian, Greek, and Catalan.

Speed

MAI-Transcribe-1.5 leads on the combination of accuracy and speed on the Artificial Analysis leaderboard. It runs up to 5x faster than models with comparable accuracy. The effect is greatest for long audio files. The model can transcribe an hour of audio in less than 15 seconds.

Microsoft cites speedups of up to 5x over Gemini 3.1, Scribe v2, and GPT-4o-Transcribe on long audio. Against the previous MAI-Transcribe-1, the Azure model card lists up to 5.7x faster long-form transcription. For batch pipelines that process large archives, that latency gap compounds quickly.
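A quick back-of-the-envelope calculation shows what these figures imply. The sketch below treats "an hour of audio in 15 seconds" as an upper bound on processing time. The 1,000-hour archive, sequential single-stream processing, and a flat 5x slowdown for the comparison model are all assumptions for illustration, not reported measurements.

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of compute time."""
    return audio_seconds / processing_seconds


def archive_processing_hours(archive_hours: float,
                             seconds_per_audio_hour: float,
                             slowdown: float = 1.0) -> float:
    """Wall-clock hours to process an archive sequentially (assumed setup)."""
    return archive_hours * seconds_per_audio_hour * slowdown / 3600.0


# One hour of audio (3600 s) in at most 15 s, per the article.
rtf = real_time_factor(3600, 15)

# Hypothetical 1,000-hour archive, processed one file at a time.
fast_hours = archive_processing_hours(1000, 15)
slow_hours = archive_processing_hours(1000, 15, slowdown=5.0)

print(f"Real-time factor: at least {rtf:.0f}x")
print(f"1,000-hour archive: {fast_hours:.1f} h vs {slow_hours:.1f} h at 5x slower")
```

Under these assumptions the archive takes a little over four hours instead of nearly 21, which is the kind of gap that matters in batch pipelines.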

Keyword (Entity) Biasing: A Feature Worth Understanding

General-purpose transcription models often fail on domain-specific vocabulary. This includes people's names, product names, medical terms, and internal acronyms. Those are the words that tend to matter most to business users.

MAI-Transcribe-1.5 adds keyword biasing, also called entity biasing. You provide a list of domain-specific keywords. The Azure model card says up to 200 keywords are supported. The model biases its predictions toward those terms. Importantly, it does not force a match. It uses the surrounding context to decide when the bias should apply. Microsoft reports a 30% WER reduction on FLEURS when biasing is used.

A short example shows the effect. Without biasing, three names are transcribed as “Sean,” “Oif,” and “Societal.” With a list of the names supplied, the model correctly produces “Shaun,” “Aoife,” and “Xochitl.” This matters for meetings, healthcare, and call centers with niche vocabulary.
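Because the list is capped at 200 keywords, it can help to clean it before sending it anywhere. The sketch below is a hypothetical client-side helper, not part of any Microsoft SDK: it trims whitespace, drops empty entries and case-insensitive duplicates, and rejects lists that exceed the limit. The article does not describe how the list is passed to the service, so no API call is shown.

```python
from typing import Iterable, List

MAX_KEYWORDS = 200  # limit cited from the Azure model card in the article


def prepare_keywords(candidates: Iterable[str],
                     limit: int = MAX_KEYWORDS) -> List[str]:
    """Hypothetical helper: clean a keyword list and enforce the size cap."""
    seen = set()
    cleaned = []
    for term in candidates:
        term = term.strip()
        key = term.casefold()  # compare case-insensitively
        if not term or key in seen:
            continue  # skip blanks and repeated terms
        seen.add(key)
        cleaned.append(term)  # keep the first spelling, in original order
    if len(cleaned) > limit:
        raise ValueError(
            f"{len(cleaned)} keywords supplied; at most {limit} are allowed"
        )
    return cleaned


keywords = prepare_keywords(["Shaun", " Aoife ", "shaun", "", "Xochitl"])
print(keywords)
```

Raising an error instead of silently truncating is a deliberate choice here: which terms to drop from an oversized list is a domain decision best left to the caller.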

Use Cases

The Azure model card lists concrete production scenarios. Each maps to a common engineering workload:

Video captioning for media and content platforms.
Accessibility tools that depend on accurate captions.
Meeting transcription in Teams-style collaboration tools.
Call analytics for contact centers and support teams.
Content creation workflows that require a quick draft transcript.
Voice agents that convert speech to text before reasoning over it.

Automatic language detection helps when the input language is unknown. The model detects spoken language without manual setup.

MAI-Transcribe-1.5 vs MAI-Transcribe-1

The table below compares the two generations using only the facts stated in this article.

Attribute	MAI-Transcribe-1	MAI-Transcribe-1.5
Supported languages	25	43
Keyword/entity biasing	Not listed	Up to 200 keywords
Long-form transcription speed	Baseline	Up to 5.7x faster
WER on Artificial Analysis	Not specified	2.4% (ranked #3)
FLEURS standing (per Microsoft)	State of the art	Best in class across 43 languages
Automatic language detection	Not specified	Yes
Lifecycle	Early release	Generally Available (GA)
Input / Output	Audio / Text	Audio / Text