Microsoft AI Introduces MAI-Transcribe-1.5: 2.4% WER on Artificial Analysis, Best-in-Class FLEURS Accuracy, and Up to 5x Faster Long-Form Audio Transcription

Last week Microsoft AI announced MAI-Transcribe-1.5. It is the second iteration of the company's in-house speech-to-text family. The model targets accuracy across 43 languages, a range of accents, and varied acoustic environments. Microsoft is already using it for production transcription.
What is MAI-Transcribe-1.5
MAI-Transcribe-1.5 is an automatic speech recognition (ASR) model. It takes audio as input and returns text. Microsoft built it in-house rather than on top of a third-party model. The model handles 43 languages in one system and is optimized for varied accents, dialects, and real-world acoustic environments.
Microsoft uses it in Copilot, Teams, GitHub, and Dynamics 365 Contact Center. It is also available in Foundry, Microsoft's model platform.
The Case for Accuracy
Accuracy here is measured by word error rate (WER). A lower WER means fewer transcription errors per word spoken. Microsoft reports the best WER on all 43 languages in FLEURS, a widely used multilingual transcription benchmark.
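To make the metric concrete, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The function name and the simple lowercase-and-split normalization are assumptions for illustration. Benchmarks such as FLEURS and Artificial Analysis apply their own text normalization, so this is not their scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("please call shaun today", "please call sean today"))
```

Because the denominator is the reference length, a hypothesis with many inserted words can push WER above 1.0, which is why the metric is best read as errors per reference word rather than a strict percentage of wrong words.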
On the Artificial Analysis leaderboard, the model posts a WER of 2.4%. That places it third on that open leaderboard. So the picture is mixed: the Microsoft team claims first place on FLEURS and ranks third on Artificial Analysis.
Language expansion is another part of the accuracy story. Coverage increased from 25 to 43 languages. Microsoft says the 18 new languages were added without compromising accuracy. Ten of them are South Asian, including Bengali, Tamil, and Telugu. Eight are European, such as Ukrainian, Greek, and Catalan.
Speed
MAI-Transcribe-1.5 leads the accuracy-versus-speed tradeoff on the Artificial Analysis leaderboard. It runs up to 5x faster than models with comparable accuracy. The effect is most pronounced on long audio files. The model can transcribe an hour of audio in less than 15 seconds.
Microsoft cites speedups of up to 5x over Gemini 3.1, Scribe v2, and GPT-4o-Transcribe on long-form audio. Against the previous MAI-Transcribe-1, the Azure model card lists up to 5.7x faster long-form transcription. For batch pipelines that process large archives, that latency gap compounds quickly.
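The arithmetic behind those figures can be sketched as follows. The helper names, the 10,000-hour archive, and the "5x slower" comparison model are illustrative assumptions. The calculation also assumes strictly sequential processing and ignores upload, queueing, and parallelism, so treat the output as a rough bound derived from the quoted "under 15 seconds per hour" figure, not a measurement.

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of compute time."""
    return audio_seconds / processing_seconds


def archive_processing_hours(
    archive_audio_hours: float, seconds_per_audio_hour: float = 15.0
) -> float:
    """Wall-clock hours to process an archive sequentially.

    seconds_per_audio_hour defaults to the article's quoted upper bound of
    15 seconds of processing per hour of audio.
    """
    return archive_audio_hours * seconds_per_audio_hour / 3600.0


if __name__ == "__main__":
    # One hour of audio (3600 s) in 15 s is at least 240x real time.
    print(real_time_factor(3600, 15))

    # Hypothetical 10,000-hour archive at the quoted rate, compared with
    # a hypothetical model that is 5x slower (75 s per audio hour).
    print(round(archive_processing_hours(10_000), 1))
    print(round(archive_processing_hours(10_000, seconds_per_audio_hour=75.0), 1))
```

Under these assumptions the quoted rate handles the archive in roughly 42 hours, while a model 5x slower needs roughly 208 hours, which is the sense in which a per-file speedup compounds in batch workloads.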
Keyword (Entity) Biasing: A Feature Worth Understanding
General-purpose transcription models often fail on domain-specific vocabulary. This includes people's names, product names, medical terms, and internal acronyms. Those words tend to matter a lot to business users.
MAI-Transcribe-1.5 adds keyword biasing, also called entity biasing. You provide a list of domain-specific keywords. The Azure model card lists support for up to 200 keywords. The model biases its predictions toward those terms. Importantly, it does not force a match. It uses the surrounding context to decide when the bias should apply. Microsoft reports a 30% WER reduction on FLEURS when biasing is used.
A short example shows the effect. Without biasing, the names are transcribed as “Sean,” “Oif,” and “Societal.” With the list of names supplied, the model correctly returns “Shaun,” “Aoife,” and “Xochitl.” This matters for meetings, healthcare, and call centers with niche vocabulary.
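The article does not describe the request format for keyword biasing, so the sketch below covers only the client-side step of preparing a keyword list. It trims whitespace, removes case-insensitive duplicates, and enforces the 200-keyword limit quoted from the Azure model card. The function name `prepare_keyword_list` and the choice to raise an error when over the limit are assumptions for illustration, not part of any Microsoft SDK.

```python
from typing import Iterable, List

MAX_KEYWORDS = 200  # limit quoted in the article from the Azure model card


def prepare_keyword_list(
    candidates: Iterable[str], limit: int = MAX_KEYWORDS
) -> List[str]:
    """Clean a list of biasing keywords (hypothetical client-side helper)."""
    seen = set()
    cleaned: List[str] = []
    for term in candidates:
        # Collapse internal whitespace and strip the ends.
        normalized = " ".join(term.split())
        key = normalized.casefold()
        # Skip empty entries and case-insensitive duplicates.
        if not normalized or key in seen:
            continue
        seen.add(key)
        cleaned.append(normalized)
    if len(cleaned) > limit:
        raise ValueError(
            f"{len(cleaned)} unique keywords exceeds the limit of {limit}"
        )
    return cleaned


if __name__ == "__main__":
    print(prepare_keyword_list(["Shaun", " Aoife ", "shaun", "", "Xochitl"]))
```

Raising an error instead of silently truncating keeps the decision about which terms to drop with the caller, who knows which names and acronyms matter most for the domain.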
Use Cases
The Azure model card lists concrete production scenarios. Each maps to a common engineering task:
- Video captioning for media and content platforms.
- Accessibility tools that depend on accurate captions.
- Meeting transcription in Teams-style collaboration tools.
- Call analytics for contact centers and support desks.
- Content creation workflows that require a quick draft transcript.
- Voice agents that convert speech to text before reasoning.
Automatic language detection helps when the input language is unknown. The model detects spoken language without manual setup.
MAI-Transcribe-1.5 vs MAI-Transcribe-1
The table below compares the two generations using only the facts mentioned.
| Attribute | MAI-Transcribe-1 | MAI-Transcribe-1.5 |
|---|---|---|
| Supported languages | 25 | 43 |
| Keyword/entity biasing | Not listed | Up to 200 keywords |
| Long-form transcription speed | Baseline | Up to 5.7x faster |
| Artificial Analysis WER | Not specified | 2.4% (ranked #3) |
| FLEURS position (per Microsoft) | State of the art | Best in class across 43 languages |
| Automatic language detection | Not specified | Yes |
| Life cycle | Early release | Generally Available (GA) |
| Input / Output | Audio / Text | Audio / Text |
Strengths and Limitations
Strengths:
- Coverage of 43 languages from one model, up from 25.
- Keyword/entity biasing yields up to a 30% WER reduction on FLEURS.
- Transcribes an hour of audio in less than 15 seconds.
- Available now through Azure AI Foundry.
- Robust to noisy, real-world audio, according to Microsoft.
Limitations:
- There is no diarization yet, so speaker labels are not available.
- There is no native streaming API, so real-time usage is limited.
- Several of the accuracy, speed, and cost claims are first-party.
- It is ranked third on Artificial Analysis, behind two competitors.