Google Releases Gemini 3.5 Live Translate, a speech-to-speech streaming audio model that covers 70+ languages across Meet, Translate, and the Live API.


Google recently announced Gemini 3.5 Live Translate, its latest audio model for live speech-to-speech translation. Speech-to-speech means that spoken audio goes in and translated spoken audio comes out. The model automatically detects more than 70 languages and produces translated speech. It preserves the speaker's intonation, pacing, and pitch in the output.

Turn-by-turn systems wait for the speaker to finish before responding. Gemini 3.5 Live Translate generates speech continuously instead. It balances the trade-off between waiting for context and rendering quickly: more context improves quality, while fast output keeps the translation in sync with the speaker. As a result, the translation stays a few seconds behind the speaker throughout the session.

Gemini 3.5 Live Translate

Gemini 3.5 Live Translate is a single audio model (gemini-3.5-live-translate-preview), not a conversational assistant. It processes speech as the audio comes in, instead of waiting for a full sentence. It handles multilingual input without manual configuration. Its noise robustness allows applications to operate in noisy, unpredictable environments.

The model is available in three places. Developers get it in public preview through the Gemini Live API and Google AI Studio. Enterprises get it in a private preview in Google Meet starting this month. Consumers get it in the Google Translate app on Android and iOS.

How Streaming Works

The design difference matters when building real-time features. A Live conversational agent uses turn-based interaction. It relies on pauses, intent detection, and interruption handling. Live Translate uses continuous stream processing instead. It translates as the speaker talks, without waiting for a turn to end.

To meet real-time latency constraints, translation mode accepts only audio input. Text input is not supported in translation mode. The model also drops support for tools and system instructions in this mode. That keeps it a focused translator instead of a general agent.

Building with Live API

Developers configure translation within the Live API session setup. You set a translationConfig block within generationConfig. The targetLanguageCode field takes a BCP-47 code, such as "pl" or "es". BCP-47 is the standard format for language tags such as en or pt-BR. The field defaults to "en". The echoTargetLanguage boolean controls what happens with input that is already in the target language. When true, the model echoes that speech. When false, it stays silent. You can also enable inputAudioTranscription and outputAudioTranscription to receive text transcripts.
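The configuration described above can be sketched as a plain setup payload. This is a minimal illustration, not Google's reference client: the field names translationConfig, targetLanguageCode, echoTargetLanguage, inputAudioTranscription, and outputAudioTranscription come from the article, while the helper name build_setup_message, the "models/" prefix, the outer "setup" wrapper, and the placement of the transcription flags are assumptions to check against the Live API reference.

```python
import json


def build_setup_message(target_language_code="en",
                        echo_target_language=False,
                        transcribe=False):
    """Build a session setup payload for translation mode (hypothetical helper)."""
    setup = {
        # Model ID from the article; the "models/" prefix is an assumption.
        "model": "models/gemini-3.5-live-translate-preview",
        "generationConfig": {
            "translationConfig": {
                # BCP-47 tag such as "es", "pl", or "pt-BR".
                "targetLanguageCode": target_language_code,
                # True: repeat speech that is already in the target language.
                # False: stay silent for that speech.
                "echoTargetLanguage": echo_target_language,
            }
        },
    }
    if transcribe:
        # Assumed placement: empty objects at the setup level enable transcripts.
        setup["inputAudioTranscription"] = {}
        setup["outputAudioTranscription"] = {}
    return {"setup": setup}


if __name__ == "__main__":
    print(json.dumps(build_setup_message("pt-BR", transcribe=True), indent=2))
```

Keeping the payload as a plain dictionary makes it easy to log and compare against the official schema before sending it over a real session.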

Audio formats are fixed. Input is raw 16-bit PCM at 16kHz, mono, little-endian. Output is raw 16-bit PCM at 24kHz, mono, little-endian. PCM is raw, uncompressed audio. Send audio in 100ms chunks. In client-side applications, ephemeral tokens on the v1alpha endpoint avoid exposing your API key.
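The chunk size follows directly from these formats: 100ms of 16kHz, 16-bit mono audio is 1,600 samples, or 3,200 bytes. The sketch below splits an input buffer into such chunks and wraps 24kHz output PCM in a WAV container for playback, using only the Python standard library. The function names are illustrative, and how the chunks are actually sent over the session is left out.

```python
import io
import wave

INPUT_RATE_HZ = 16_000   # input sample rate stated in the article
OUTPUT_RATE_HZ = 24_000  # output sample rate stated in the article
BYTES_PER_SAMPLE = 2     # 16-bit PCM
CHUNK_MS = 100           # recommended chunk duration


def chunk_size_bytes(sample_rate_hz, chunk_ms=CHUNK_MS):
    """Bytes in one mono 16-bit chunk of the given duration."""
    return sample_rate_hz * chunk_ms // 1000 * BYTES_PER_SAMPLE


def iter_chunks(pcm, sample_rate_hz=INPUT_RATE_HZ, chunk_ms=CHUNK_MS):
    """Yield consecutive chunks of raw PCM; the last one may be shorter."""
    size = chunk_size_bytes(sample_rate_hz, chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def pcm_to_wav_bytes(pcm, sample_rate_hz=OUTPUT_RATE_HZ):
    """Wrap raw mono 16-bit little-endian PCM in a WAV header."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(BYTES_PER_SAMPLE)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm)
    return buffer.getvalue()


if __name__ == "__main__":
    one_second_of_silence = bytes(INPUT_RATE_HZ * BYTES_PER_SAMPLE)
    chunks = list(iter_chunks(one_second_of_silence))
    print(len(chunks), len(chunks[0]))  # 10 3200
```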

Dimension	Live agent	Live Translate
Model role	An assistant that listens, reasons, and acts	Translation pipeline / real-time translator
Interaction	Turn-based, with interruption handling	Continuous stream processing, no turns
Tools	Function calling, Google Search, system instructions	Translation only, no tools or instructions
Input	Text, audio, video, and images	Audio only, for strict latency
Configuration	Generation, speech, tools, instructions	`targetLanguageCode` and `echoTargetLanguage`

Use Cases

The model targets live translation across several settings. Google lists multilingual calls, meetings, courses, and broadcasts. Developer platforms reduce the work of real-time media integration. Agora, Fishjam, LiveKit, Pipecat, and Vision Agents already integrate with the Live API. These platforms handle the complex real-time media streaming infrastructure. That lets developers focus on the user experience instead.

Google's example application demonstrates simultaneous transcription and translation across multiple languages. Grab is testing the model for driver-passenger communication at pickup locations. Grab users make more than 10 million voice calls per month. CJ ENM, LiveKit, and others have reported positive feedback on quality, accuracy, and low latency.

How it changes Google Meet and Translate

According to Google's official release, Google Meet will soon use 3.5 Live Translate for speech translation. The table below compares Meet before and after the change.

Capability	Previous Meet	With 3.5 Live Translate
Languages	5	70+
Combinations per meeting	Only to and from English	2000+ combinations
Access	Existing interface	Updated interface for quicker access
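The "2000+ combinations" figure is consistent with simple pair counting over 70 languages. The article does not say how Google counts combinations, so the sketch below is only a plausibility check under the assumption of exactly 70 languages.

```python
from math import comb

LANGUAGES = 70  # lower bound; the article says "70+"

# Unordered pairs: English-Spanish and Spanish-English count once.
unordered_pairs = comb(LANGUAGES, 2)

# Ordered pairs: each translation direction counts separately.
ordered_pairs = LANGUAGES * (LANGUAGES - 1)

print(unordered_pairs)  # 2415
print(ordered_pairs)    # 4830
```

Either count clears 2,000, which matches the table's figure under both readings.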

The Meet update is in private preview for select enterprise Workspace customers this month. A wider release follows later this year. In the Translate app, the Live translate feature works with any connected headphones. It preserves the speaker's tone across all 70+ languages. Android also gains a listening mode. You hold the phone to your ear as in a normal phone call. The translated audio then plays through the earpiece, without others hearing it.

Key Takeaways

Gemini 3.5 Live Translate is Google's latest audio model for live speech-to-speech translation in 70+ languages.
It streams continuously instead of taking turns, staying a few seconds behind the speaker.
Developers can configure it via the Live API using targetLanguageCode and echoTargetLanguage; audio only, 16kHz in, 24kHz out.
It is available through the Gemini Live API, Google Meet (5→70+ languages), and the Translate app.
All generated audio carries an imperceptible SynthID watermark for identification.


nimda 6 hours ago
