xAI Launches Standalone Grok Speech-to-Text and Text-to-Speech APIs, Targeting Enterprise Voice Developers

nimda April 19, 2026

Elon Musk's AI company xAI has launched two standalone audio APIs, a Speech-to-Text (STT) API and a Text-to-Speech (TTS) API, both built on the same infrastructure that powers Grok Voice for mobile apps, Tesla cars, and Starlink customer support. The release moves xAI directly into the competitive speech API market currently occupied by ElevenLabs, Deepgram, and AssemblyAI.

What is the Grok Speech-to-Text API?

Speech-to-Text is a technology that converts spoken audio into written text. For developers building meeting transcription tools, voice agents, call center analytics, or accessibility features, an STT API is a core building block. Rather than developing this capability from scratch, developers call the endpoint, send the audio, and get a structured transcript in return.

The Grok STT API is now available, providing transcription in 25 languages in both batch and streaming modes. Batch mode is designed to process pre-recorded audio files, while streaming mode enables real-time transcription as the audio is captured. Pricing is straightforward: Speech-to-Text costs $0.10 per hour for batch and $0.20 per hour for streaming.
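To put the per-hour rates in perspective, here is a rough cost estimate in Python. It is only a sketch: the two rates come from the figures above, but the function name is made up for illustration, and the assumption that billing is strictly linear in audio duration (no minimum charge, no rounding up) is not confirmed by xAI.

```python
# Per-hour rates quoted in the article (USD).
STT_RATE_PER_HOUR = {"batch": 0.10, "streaming": 0.20}


def estimate_stt_cost(audio_seconds: float, mode: str = "batch") -> float:
    """Estimate transcription cost in USD.

    Assumes linear billing by duration, which is an illustrative
    simplification rather than a documented billing rule.
    """
    if mode not in STT_RATE_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = audio_seconds / 3600
    return round(hours * STT_RATE_PER_HOUR[mode], 4)


if __name__ == "__main__":
    # 1,000 hours of recorded calls, batch versus streaming.
    print(estimate_stt_cost(1000 * 3600, "batch"))
    print(estimate_stt_cost(1000 * 3600, "streaming"))
```

Under these assumptions, 1,000 hours of audio would cost about $100 in batch mode and about $200 in streaming mode.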

The API includes word-level timestamps, speaker diarization, and multi-channel support, as well as Inverse Text Normalization that handles numbers, dates, currencies, and more. It also accepts 12 audio formats: nine container formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV) and three raw formats (PCM, µ-law, A-law), with a maximum file size of 500 MB per request.
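A client could check a file against these limits before uploading it. The Python sketch below is illustrative only: the mapping from format names to file extensions is an assumption, the raw formats are left out because they have no canonical extension, and "500 MB" is treated as 500,000,000 bytes, which the source does not specify.

```python
from pathlib import Path

# Assumed extensions for the nine container formats named in the article.
CONTAINER_EXTENSIONS = {
    ".wav", ".mp3", ".ogg", ".opus", ".flac", ".aac", ".mp4", ".m4a", ".mkv",
}
# Assumption: decimal megabytes. The article only says "500 MB".
MAX_BYTES = 500 * 1000 * 1000


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in CONTAINER_EXTENSIONS:
        problems.append(f"unsupported container extension: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the {MAX_BYTES}-byte limit")
    return problems


if __name__ == "__main__":
    print(check_upload("standup.M4A", 42_000_000))
    print(check_upload("notes.txt", 600_000_000))
```

Running the check locally avoids a wasted upload when a request would be rejected anyway.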

Speaker diarization is the process of separating audio by individual speaker, answering the question 'who said what.' This is important for multi-speaker recordings such as meetings, interviews, or customer calls. Word-level timestamps provide precise start and end times for each word in the transcript, enabling use cases such as subtitle production, searchable recordings, and legal documentation. Inverse Text Normalization converts spoken forms such as 'one hundred sixty-seven thousand nine hundred eighty-three dollars and fifteen cents' into readable structured output: "$167,983.15."
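As an illustration of the subtitle use case, the Python sketch below turns word-level timestamps and speaker labels into SubRip (SRT) cues. The input shape (a list of dictionaries with "word", "start", "end", and "speaker" keys) is hypothetical and is not the documented Grok STT response format; a real integration would first need to map the API's response into this shape.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 8) -> str:
    """Group words into cues, starting a new cue on a speaker change
    or once a cue holds max_words words."""
    cues, current = [], []
    for w in words:
        if current and (
            len(current) >= max_words or w["speaker"] != current[-1]["speaker"]
        ):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for i, cue in enumerate(cues, start=1):
        text = " ".join(w["word"] for w in cue)
        span = f"{srt_time(cue[0]['start'])} --> {srt_time(cue[-1]['end'])}"
        blocks.append(f"{i}\n{span}\nSpeaker {cue[0]['speaker']}: {text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Hypothetical transcript data, for illustration only.
    sample = [
        {"word": "Hello", "start": 0.0, "end": 0.4, "speaker": 0},
        {"word": "there", "start": 0.45, "end": 0.8, "speaker": 0},
        {"word": "Hi", "start": 1.0, "end": 1.2, "speaker": 1},
    ]
    print(words_to_srt(sample))
```

Starting a new cue whenever the speaker changes keeps each subtitle attributable to a single person, which is the practical payoff of combining diarization with word timing.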

Benchmark Performance

The xAI research team makes strong claims about accuracy. For phone call entity recognition (names, account numbers, dates), Grok STT claims an error rate of 5.0%, compared with 12.0% for ElevenLabs, 13.5% for Deepgram, and 21.3% for AssemblyAI. That is a substantial margin if it holds in production. For video and podcast transcription, Grok and ElevenLabs tied with a 2.4% error rate, with Deepgram and AssemblyAI following at 3.0% and 3.2% respectively. The xAI team also reports a 6.9% word error rate on standard audio benchmarks.
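For readers unfamiliar with the metric, word error rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the number of words in the reference. The Python sketch below implements that textbook definition with a word-level edit distance. It is not xAI's evaluation code, and the entity error rates quoted above may be computed differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Textbook WER: word-level Levenshtein distance / reference length.

    Normalization here is just lowercasing and whitespace splitting;
    real benchmarks usually apply more careful text normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the account number is five five one two"
    hyp = "the account number is five one two"
    print(word_error_rate(ref, hyp))
```

In this example the hypothesis drops one of eight reference words, so the score is 0.125, or 12.5%. The dropped digit also shows why entity-focused benchmarks matter: a single missing word can make an account number unusable even when the overall error rate looks small.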

What is the Grok Text-to-Speech API?

Text-to-speech converts text into spoken audio. Developers use TTS APIs to power voice assistants, read-aloud features, podcast production, IVR (interactive voice response) systems, and accessibility tools.

The Grok TTS API delivers fast, natural speech synthesis with detailed control through speech tags, and is priced at $4.20 per 1 million characters. The API accepts up to 15,000 characters per REST request; for longer content, a WebSocket streaming endpoint is available that has no text length limit and starts returning audio before the full input is processed. The API supports 20 languages and five voices (Ara, Eve, Leo, Rex, and Sal), with Eve set as the default.
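For clients that use the REST endpoint rather than the WebSocket stream, long text has to be split to fit the 15,000-character limit. The Python sketch below shows one possible approach: greedy packing on sentence boundaries, plus a cost estimate from the quoted rate. The function names are hypothetical, and the sketch assumes that every character counts toward both the limit and the bill, which the source does not confirm.

```python
import re

MAX_CHARS_PER_REQUEST = 15_000   # REST limit quoted in the article
USD_PER_MILLION_CHARS = 4.20     # price quoted in the article


def chunk_text(text: str, limit: int = MAX_CHARS_PER_REQUEST) -> list[str]:
    """Pack whole sentences into chunks of at most `limit` characters.

    A single sentence longer than `limit` is hard-split as a fallback.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def estimate_tts_cost(char_count: int) -> float:
    """Estimated USD cost, assuming every character is billed."""
    return round(char_count / 1_000_000 * USD_PER_MILLION_CHARS, 4)


if __name__ == "__main__":
    article = "This is a sentence. " * 2000   # roughly 40,000 characters
    parts = chunk_text(article)
    print(len(parts), max(len(p) for p in parts))
    print(estimate_tts_cost(len(article)))
```

Splitting on sentence boundaries avoids cutting a request mid-sentence, which would otherwise tend to produce an audible break in prosody where two audio segments are joined.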

In addition to voice selection, developers can insert inline and wrapping speech tags to control delivery. These include inline tags such as [laugh], [sigh], and [breath], as well as wrapping tags that enclose a span of text, allowing developers to create engaging, lifelike delivery without complex markup. This expressiveness addresses one of the main limitations of traditional TTS systems, which tend to produce speech that is technically correct but emotionally flat.

Key Takeaways

xAI introduced two standalone audio APIs, Grok Speech-to-Text (STT) and Text-to-Speech (TTS), built on the same production stack that already serves millions of users across Grok mobile apps, Tesla vehicles, and Starlink customer support.
The Grok STT API provides real-time and batch transcription in 25 languages with speaker diarization, word-level timestamps, Inverse Text Normalization, and support for 12 audio formats, priced at $0.10/hour for batch and $0.20/hour for streaming.
On phone call entity recognition benchmarks, Grok STT reports an error rate of 5.0%, significantly outperforming ElevenLabs (12.0%), Deepgram (13.5%), and AssemblyAI (21.3%), with particularly strong performance in medical, legal, and financial use cases.
The Grok TTS API supports five voices (Ara, Eve, Leo, Rex, Sal) in 20 languages, with inline and wrapping speech tags such as [laugh] and [sigh] giving developers expressive control over voice delivery, priced at $4.20 per 1 million characters.

Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.
