Gemini 2.5 Native Audio update and Text-to-Speech model update

What customers say
Google Cloud customers are already using Gemini's native audio capabilities to drive real business results, from mortgage processing to customer calls.
- “Users often forget they're talking to AI within a minute of using Sidekick, and in some cases have thanked the bot after a long chat … New Live API AI capabilities offered through Gemini [2.5 Flash Native Audio] empower our merchants to win.” – David Wurtz, VP of Product, Shopify
- “By integrating the Gemini 2.5 Flash Native Audio model … we've significantly enhanced Mia's capabilities since launching in May 2025. This powerful combination has enabled us to generate over 14,000 loans for our broker partners.” – Jason Bressler, Chief Technology Officer, United Wholesale Mortgage (UWM)
- “Working with the Gemini 2.5 Flash Native Audio model through Vertex AI allows Newo.ai AI Receptionists to achieve unmatched conversational intelligence …” – David Yang, Co-Founder, Newo.ai
Live speech translation
Gemini now supports new speech-to-speech translation capabilities designed to handle both continuous listening and two-way conversations.
With continuous listening, Gemini automatically translates multilingual speech into a single target language. This allows you to put on headphones and hear the world around you in your own language.
In a two-way conversation, Gemini's live speech translation handles translation between two languages in real time, automatically switching the output language based on who is speaking. For example, if you speak English and want to chat with a Hindi speaker, you'll hear English translations in real time in your headphones, while your phone broadcasts Hindi when you're done speaking.
Gemini's live speech translation has several key capabilities that help in the real world:
- Language coverage: Translates speech in more than 70 languages and 2,000 language pairs by combining the Gemini model's world knowledge and multilingual capabilities with its native audio capabilities.
- Style transfer: Captures the nuances of human speech, preserving the speaker's intonation, pacing and pitch so that the translation sounds natural.
- Multilingual input: Understands multiple languages at the same time in a single session, helping you follow multilingual conversations without needing to fiddle with language settings.
- Auto detection: Identifies the spoken language and begins translating, so you don't even need to know what language is being spoken to get started.
- Noise robustness: Filters out ambient noise so that you can converse comfortably even in loud, outdoor environments.
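
To make the difference between the two modes concrete, here is a minimal Python sketch of the output-language routing described above. It is an illustration only, not the Gemini API: the `TwoWaySession` class, both function names, and the language codes are hypothetical, and language detection is assumed to happen upstream in the model. The choice to return `None` when no translation is needed is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TwoWaySession:
    """Hypothetical settings for a two-way conversation (not a Gemini type)."""
    my_language: str      # language the listener wants to hear, e.g. "en"
    their_language: str   # language of the other speaker, e.g. "hi"


def continuous_target(detected_language: str, target_language: str) -> Optional[str]:
    """Continuous listening: any detected language maps to one target language.

    Returns None when the speech is already in the target language
    (an assumption of this sketch, not documented behavior).
    """
    if detected_language == target_language:
        return None
    return target_language


def two_way_target(session: TwoWaySession, detected_language: str) -> Optional[str]:
    """Two-way conversation: the output language flips based on who is speaking.

    Returns None for speech outside the configured pair
    (again an assumption of this sketch).
    """
    if detected_language == session.my_language:
        return session.their_language
    if detected_language == session.their_language:
        return session.my_language
    return None


if __name__ == "__main__":
    session = TwoWaySession(my_language="en", their_language="hi")
    for spoken in ("en", "hi"):
        print(spoken, "->", two_way_target(session, spoken))
```

In the English and Hindi example, speech detected as English is routed to Hindi output for the phone speaker, and speech detected as Hindi is routed to English output for the headphones.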


