Google Health AI Releases MedASR: Conformer-Based Medical Speech to Text Model for Clinical Dictation

The Google Health AI team released MedASR, an open-source medical speech-to-text model that transcribes clinical dictation and doctor-patient conversations and is designed to plug directly into modern AI workflows.
What is MedASR and where does it fit?
MedASR is a speech-to-text model based on the Conformer architecture and pre-trained for medical dictation and transcription. It is positioned as a starting point for developers building healthcare voice applications such as radiology dictation tools or note-taking systems.
The model has 105 million parameters and accepts mono-channel audio at 16,000 Hz as 16-bit integer waveforms. It produces only text output, so it feeds directly into natural language processing pipelines or generative models like MedGemma.
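That input contract, mono, 16 kHz, 16-bit integer PCM, can be produced and checked with Python's standard wave module. A minimal sketch (the file name, tone parameters, and helper names are illustrative, not part of the MedASR tooling):

```python
import math
import struct
import wave

RATE = 16_000      # MedASR expects a 16 kHz sample rate
SAMPLE_WIDTH = 2   # 16-bit integer samples
CHANNELS = 1       # mono audio

def write_test_tone(path, seconds=1.0, freq=440.0):
    """Write a mono 16 kHz 16-bit PCM WAV containing a sine tone."""
    n = int(RATE * seconds)
    with wave.open(path, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(RATE)
        samples = (
            int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / RATE))
            for i in range(n)
        )
        w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

def is_medasr_compatible(path):
    """Return True if the WAV matches MedASR's expected input format."""
    with wave.open(path, "rb") as w:
        return (
            w.getnchannels() == CHANNELS
            and w.getsampwidth() == SAMPLE_WIDTH
            and w.getframerate() == RATE
        )

write_test_tone("tone.wav")
print(is_medasr_compatible("tone.wav"))  # True
```

Audio at other sample rates or in stereo would need resampling and downmixing (for example with librosa) before being passed to the model.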
MedASR sits within the Health AI Developer Foundations portfolio, alongside MedGemma, MedSigLIP, and other medical-domain models that share common terms of use and a consistent governance context.
Training data and domain expertise
MedASR is trained on a diverse corpus of de-identified medical speech. The dataset includes approximately 5,000 hours of physician dictations and clinical interviews across radiology, internal medicine, and family medicine.
Training pairs audio segments with transcripts and metadata. Subsets of the interview data were annotated with medical named entities, including symptoms, medications, and conditions. This gives the model robust coverage of clinical terms and the sentence patterns common in clinical text.
The model is English-only, and most of the training audio comes from native English speakers who were raised in the United States. The documentation notes that performance may be lower for some speaker profiles or noisy microphones and recommends fine-tuning for such settings.
Architecture and decoding
MedASR follows the Conformer encoder design. Conformer combines convolution blocks and self-attention layers to capture local acoustic patterns and long-range temporal dependencies in the same stack.
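As a rough illustration of how a Conformer block interleaves those two mechanisms, here is a schematic NumPy sketch: single-head attention for long-range context, a depthwise convolution for local patterns, each in a residual connection. Real Conformer blocks also include macaron feed-forward modules, layer normalization, and gated convolutions, so treat all shapes and names here as illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head self-attention: long-range temporal dependencies."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def depthwise_conv(x, kernel):
    """Per-channel 1-D convolution: local acoustic patterns.
    x: (time, dim), kernel: (width, dim)."""
    width = kernel.shape[0]
    pad = width // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t : t + width] * kernel).sum(axis=0)
    return out

def conformer_block(x, wq, wk, wv, kernel):
    """Simplified Conformer block: attention, then convolution,
    each wrapped in a residual connection."""
    x = x + self_attention(x, wq, wk, wv)
    x = x + depthwise_conv(x, kernel)
    return x

rng = np.random.default_rng(0)
time_steps, dim = 50, 32
x = rng.normal(size=(time_steps, dim))
wq, wk, wv = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
kernel = rng.normal(size=(7, dim)) * 0.1
y = conformer_block(x, wq, wk, wv, kernel)
print(y.shape)  # (50, 32)
```

The key design point is that the block keeps the (time, dim) shape, so many such blocks can be stacked into a deep encoder.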
The model is presented as an automatic speech recognition model with a CTC-style output. In the reference usage, developers use AutoProcessor to create input features from waveform audio and AutoModelForCTC to generate a sequence of tokens. Decoding is greedy by default. The model can also be paired with an external 6-gram language model and a beam search of size 8 to improve the word error rate.
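Greedy CTC decoding itself is simple: take the most likely token at each frame, collapse consecutive repeats, and drop blanks. A minimal sketch (the vocabulary and frame scores below are toy values, not MedASR's real tokenizer or logits):

```python
import numpy as np

BLANK = 0  # CTC blank token id

def ctc_greedy_decode(logits, id_to_char):
    """Greedy CTC decoding: argmax per frame, collapse repeated
    tokens, then remove blanks."""
    best = logits.argmax(axis=-1)  # most likely token per frame
    collapsed = [t for i, t in enumerate(best) if i == 0 or t != best[i - 1]]
    return "".join(id_to_char[t] for t in collapsed if t != BLANK)

# Toy vocabulary: 0 = blank, 1 = 'c', 2 = 'a', 3 = 't'
vocab = {1: "c", 2: "a", 3: "t"}
# Frame-level tokens spelling "c c _ a t t" -> "cat"
frames = [1, 1, 0, 2, 3, 3]
logits = np.eye(4)[frames]  # one-hot scores per frame
print(ctc_greedy_decode(logits, vocab))  # cat
```

Beam search with an external n-gram language model replaces the per-frame argmax with a scored search over candidate prefixes, which is what buys the reported word-error-rate improvement.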
MedASR training uses JAX and ML Pathways on TPUv4p, TPUv5p, and TPUv5e hardware. These systems provide the scale needed for large speech models and align with Google's broader model-training stack.
Performance on medical speech tasks
The headline word error rates, with greedy decoding and with the 6-gram language model, are:
- RAD DICT, radiology dictation: MedASR greedy 6.6 percent, MedASR with language model 4.6 percent, Gemini 2.5 Pro 10.0 percent, Gemini 2.5 Flash 24.4 percent, Whisper v3 Large 25.3 percent.
- GENERAL DICT, general and internal medicine dictation: MedASR greedy 9.3 percent, MedASR with language model 6.9 percent, Gemini 2.5 Pro 16.4 percent, Gemini 2.5 Flash 27.1 percent, Whisper v3 Large 33.1 percent.
- FM DICT, family medicine dictation: MedASR greedy 8.1 percent, MedASR with language model 5.8 percent, Gemini 2.5 Pro 14.6 percent, Gemini 2.5 Flash 19.9 percent, Whisper v3 Large 32.5 percent.
- Eye Gaze, dictation for 998 MIMIC chest X-ray cases: MedASR greedy 6.6 percent, MedASR with language model 5.2 percent, Gemini 2.5 Pro 5.9 percent, Gemini 2.5 Flash 9.3 percent, Whisper v3 Large 12.5 percent.
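The metric behind these numbers, word error rate, is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch (the example sentences are invented for illustration):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference words,
    computed with a standard Levenshtein dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

ref = "patient denies chest pain or dyspnea"
hyp = "patient denies chest pain and dyspnea"
print(round(word_error_rate(ref, hyp), 3))  # 0.167
```

One substituted word out of six reference words gives a WER of about 16.7 percent, which puts the single-digit benchmark numbers above in perspective.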
Developer workflow and usage options
A small example using the transformers pipeline:

from transformers import pipeline
import huggingface_hub

# Download a sample audio file from the model repository
audio = huggingface_hub.hf_hub_download("google/medasr", "test_audio.wav")

# Build an ASR pipeline backed by MedASR
pipe = pipeline("automatic-speech-recognition", model="google/medasr")

# Transcribe in 20-second chunks with 2 seconds of stride overlap
result = pipe(audio, chunk_length_s=20, stride_length_s=2)
print(result)
For more control, developers can load AutoProcessor and AutoModelForCTC directly, resample the audio to 16,000 Hz with librosa, move the tensors to CUDA if available, and call model.generate followed by processor.batch_decode.
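The chunk_length_s and stride_length_s arguments control how long recordings are split into overlapping windows: each chunk carries extra context on each side that is transcribed but trimmed when the chunks are stitched back together. A rough sketch of how those boundaries fall (a simplification of the actual transformers chunking logic):

```python
def chunk_boundaries(total_s, chunk_s=20, stride_s=2):
    """Yield (start, end) second offsets of overlapping chunks.
    Each chunk advances by chunk_s - 2 * stride_s, so the stride
    regions overlap with the neighboring chunks."""
    step = chunk_s - 2 * stride_s
    start = 0
    while start < total_s:
        yield (start, min(start + chunk_s, total_s))
        start += step

for bounds in chunk_boundaries(45):
    print(bounds)  # (0, 20), (16, 36), (32, 45)
```

The overlap matters for dictation: without it, words falling exactly on a chunk boundary would be cut in half and transcribed incorrectly.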
Key Takeaways
- MedASR is a lightweight, portable medical ASR model: It has 105M parameters, is specifically trained for medical dictation and transcription, and is released under the Health AI Developer Foundations program as an English-only model for health developers.
- Domain-specific training on about 5,000 hours of de-identified medical audio: MedASR is pre-trained on physician dictations and clinical interviews across specialties such as radiology, internal medicine, and family medicine, giving it stronger coverage of clinical terminology than general-purpose ASR systems.
- Word error rates that rival or beat larger models on medical dictation benchmarks: On the radiology, general medicine, family medicine, and Eye Gaze datasets, MedASR with greedy or language-model decoding matches or surpasses large general-purpose models such as Gemini 2.5 Pro, Gemini 2.5 Flash, and Whisper v3 Large on medical English speech.
Check out the repo, the model on Hugging Face, and the technical details.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the power of Artificial Intelligence for the benefit of society. His latest endeavor is Marktechpost, an Artificial Intelligence media platform that stands out for its extensive coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts more than 2 million monthly views.



