Supertone Releases Supertonic 3: An On-Device Text-to-Speech Model With 31-Language Support, Fewer Reading Failures, and Inline Speech Tags

Supertone has released Supertonic 3, the third generation of its on-device, ONNX-based text-to-speech system. Supertonic 3 ships with support for 31 languages, improved reading accuracy, fewer repetitive failures and skips, and ONNX artifacts compatible with v2. In short: lightning-fast, on-device, multilingual, and accurate TTS.
What Changed from v2 to v3
Compared to Supertonic 2, Supertonic 3 reduces repetitive failures and skips, improves speaker similarity across the shared language set, and expands language coverage from 5 to 31 languages. Version 2 supports English, Korean, Spanish, Portuguese, and French. Version 3 adds Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Estonian, Finnish, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, and Vietnamese, for a total of 31 ISO language codes. There is also a special fallback code for text whose language is unknown or outside the supported set.
The model grows only modestly to accommodate the additional languages. At about 99M parameters across its ONNX assets, Supertonic 3 is significantly smaller than open TTS systems in the 0.7B to 2B range. The small model size is a practical advantage for download size, startup time, and on-device inference. The full set of ONNX assets totals 404 MB on disk. In addition, Supertone recently introduced Voice Builder, a system that lets developers create custom TTS models from their own voice recordings.
One new capability in v3 that wasn't in v2 is inline speech tag support. Supertonic 3 supports simple speech tags for non-verbal sounds such as breaths and laughter. This lets you embed prosodic markers directly in the input text without a separate pre-processing step or a separate expression model. For developers building voice interfaces or accessibility tools, this means you can specify breathing pauses or laughter inline in your text input.
Architecture and Runtime
The basic architecture carries over from previous versions: a speech autoencoder that encodes waveforms into continuous latent representations, a flow-matching module that maps text to latent audio features, and a duration predictor that controls natural timing. Flow matching is a generative modeling technique that learns a vector field transforming a simple distribution into the target distribution; it samples in fewer steps than diffusion models, which is why Supertonic can produce usable output in just 2 sampling steps. To further refine the output, v3 adds Length-Aware Rotary Position Embedding (LARoPE) for improved text-to-speech alignment and uses a self-cleaning flow-matching method during training to stay robust to noisy data labels.
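To make the step count concrete, here is a minimal, self-contained sketch of flow-matching inference with a fixed number of Euler integration steps. The toy vector_field below is a stand-in for Supertonic's learned, text-conditioned network; the shapes, names, and step count are illustrative assumptions, not the actual implementation:
import numpy as np

def vector_field(x, t, condition):
    # Stand-in for the learned network that predicts the velocity field
    # pointing from the simple (noise) distribution toward the data distribution.
    # In a real TTS system this is a neural network conditioned on the text.
    return condition - x

def sample_latent(condition, num_steps=2):
    # Start from a simple distribution: standard Gaussian noise.
    x = np.random.randn(*condition.shape)
    dt = 1.0 / num_steps
    t = 0.0
    for _ in range(num_steps):
        # One Euler integration step along the learned vector field.
        x = x + dt * vector_field(x, t, condition)
        t += dt
    return x  # latent audio features, to be decoded into a waveform

condition = np.full((8,), 0.5)  # toy text-conditioned target latent
latent = sample_latent(condition, num_steps=2)
print(latent)
Because the vector field is trained to point directly toward the data distribution, a couple of large integration steps can already land close to a usable sample, which is the intuition behind the 2-step figure above.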
In terms of performance, Supertonic 3 runs fast on CPU, even when compared against larger baselines measured on an A100 GPU, and uses very little memory. It doesn't require a GPU, which makes local, in-browser, and resource-constrained deployments much easier.
Reading Accuracy
Across all measured languages, Supertonic 3 stays within a WER/CER range competitive with larger TTS models such as VoxCPM2, while keeping its lightweight, on-device footprint. WER (Word Error Rate) and CER (Character Error Rate) are standard metrics for TTS reading accuracy: you synthesize a passage, run ASR over the output, and compare the transcription to the original text. CER is used for languages without clear word boundaries; the rest use WER. The system's efficiency shows most clearly on extreme edge hardware: it reaches an average RTF of 0.3x on an Onyx Boox Go 6 (an E-ink e-reader) in airplane mode. In addition, the ecosystem has expanded to include Flutter (with macOS support), .NET 9, and Go, while web deployments run on onnxruntime-web with purely client-side inference.
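For readers who want to see the metric itself, here is a generic illustration (not Supertone's evaluation code): WER and CER both reduce to an edit distance between the reference passage and the ASR transcription of the synthesized audio.
def edit_distance(ref, hyp):
    # Levenshtein distance over token sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)]

def wer(reference, transcription):
    # Word Error Rate: word-level edits divided by reference word count.
    ref, hyp = reference.lower().split(), transcription.lower().split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference, transcription):
    # Character Error Rate: the same idea at character level,
    # used for languages without clear word boundaries.
    return edit_distance(list(reference), list(transcription)) / max(len(reference), 1)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25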
Built-In Text Normalization
A distinguishing feature carried forward from v2 is built-in text normalization. Supertonic handles complex written forms without any pre-processing pipeline or phonetic annotations: financial expressions such as $5.2M, phone numbers with area codes and extensions such as (212) 555-0142 ext. 402, time and date formats such as 4:45 PM on Wed, Apr 3, 2024, and technical units like 2.3h and 30kph. The financial expression “$5.2M” should be read as “five point two million dollars,” and “$450K” as “four hundred fifty thousand dollars”; all four competing systems failed this. The technical unit “2.3h” should be read as “two point three hours” and “30kph” as “thirty kilometers per hour”; all four competitors failed here as well. The competing systems tested were ElevenLabs Flash v2.5, OpenAI TTS-1, Gemini 2.5 Flash TTS, and Microsoft.

Getting started
The Python SDK installs with pip install supertonic. On first run, the SDK downloads the model assets from Hugging Face by default. A small example:
from supertonic import TTS
tts = TTS(auto_download=True)  # downloads the model assets from Hugging Face on first run
style = tts.get_voice_style(voice_name="M1")  # pick one of the bundled voices
text = "A gentle breeze moved through the open window while everyone listened to the story."
wav, duration = tts.synthesize(text, voice_style=style, lang="en")
tts.save_audio(wav, "output.wav")
print(f"Generated {duration:.2f}s of audio")
Key Takeaways
- Supertonic 3 extends language support from 5 (v2) to 31 languages, growing from 66M to ~99M parameters with a total ONNX asset size of 404 MB
- New in v3: inline speech tags for breaths and laughter, more stable reading of short and long inputs, and improved speaker similarity relative to v2
- The ONNX artifact interface is compatible with v2, so existing integrations can upgrade without changing inference code
- Reading accuracy is benchmarked against VoxCPM2; v3 stays within a competitive WER/CER range while being far smaller
- RTF/throughput numbers specific to v3 have not been published; the 167× faster-than-real-time figure is a v2 benchmark and should not be attributed to v3
- Native output is 16-bit WAV files, ensuring predictable audio behavior for engineering applications
Check out the GitHub Repo and the Hugging Face Space.



