OpenAI Releases Three Realtime Audio Models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper in the Realtime API

OpenAI has released three new audio models for its Realtime API, each targeting a different capability for live voice applications: GPT-Realtime-2 for intelligent voice agents, GPT-Realtime-Translate for live speech translation, and GPT-Realtime-Whisper for streaming transcription. Alongside the model releases, the Realtime API is officially out of beta and now generally available, a stability signal for developers currently building systems on it. All three models are available through the OpenAI API and can be tested in the Playground.
Together, they push voice applications beyond the basic question-and-answer loop, toward systems that can listen, think, translate, transcribe, and act within a single conversation.
GPT-Realtime-2: Voice Agents with a 128K Context Window
The flagship release is GPT-Realtime-2, which the OpenAI team describes as the first voice model with GPT-5-class reasoning. GPT-Realtime-2 can process complex requests, manage interruptions, and continue conversations naturally. OpenAI has expanded the model's context window from 32K to 128K tokens, allowing longer conversations and complex operations without losing context.
Previous voice models often stalled in multi-step tasks or dropped earlier context during long sessions. GPT-Realtime-2 is specifically designed to keep the conversation moving while reasoning through the request.
Developers can enable short introductory phrases, such as "let me check that" or "just a second", so users know the agent is working on the request. The model can also call multiple tools in parallel and explain what it is doing while it is doing it, so instead of dead air during a multi-step operation, the user hears a running commentary. These features directly address one of the most common failure modes in deployed voice agents: awkward silences that make the system feel broken.
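A minimal sketch of how a developer might wire up those behaviors. The config shape loosely mirrors the Realtime API's `session.update` event, but the exact schema for this model, and the `lookup_customer` tool, are assumptions for illustration, not details confirmed by the article:

```python
# Hypothetical session config for a GPT-Realtime-2 voice agent.
# The field names mirror the Realtime API's session.update shape;
# the exact schema for this model is an assumption.
def build_agent_session(tools):
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",
            "instructions": (
                "Before any tool call that may take time, say a short "
                "preamble such as 'let me check that' or 'just a second', "
                "and narrate multi-step work as you go."
            ),
            "tools": tools,        # the model may call several in parallel
            "tool_choice": "auto",
        },
    }

# Example tool definition (hypothetical customer-lookup function).
lookup_tool = {
    "type": "function",
    "name": "lookup_customer",
    "description": "Fetch a customer record by account id.",
    "parameters": {
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
        "required": ["account_id"],
    },
}

config = build_agent_session([lookup_tool])
```

The key design point is that the anti-silence behavior lives in the session instructions rather than in client code, so the same agent loop works for single-tool and multi-tool turns.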
A control that is especially useful for builders is adjustable reasoning effort. Developers can dial reasoning intensity across five levels: minimal, low, medium, high, and xhigh. The default is "low" to keep latency down for light tasks, while demanding tasks can spend more computation. This means teams can tune the performance-latency trade-off at the session level depending on the usage scenario: a quick customer lookup doesn't require the same depth of reasoning as a multi-step booking workflow.
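One way to express that trade-off in code: map a rough task-complexity score onto the five effort levels. The level names follow the article (with "minimal" assumed as the lowest tier), and the `"reasoning": {"effort": ...}` field shape is an assumption, not a documented parameter:

```python
# Sketch: choosing reasoning effort per session.
# Level names follow the article; the config field shape is assumed.
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

def session_with_effort(task_complexity: int) -> dict:
    """Map a rough 0-4 complexity score onto the five effort levels."""
    effort = EFFORT_LEVELS[max(0, min(task_complexity, 4))]
    return {
        "model": "gpt-realtime-2",
        "reasoning": {"effort": effort},  # article says the default is "low"
    }

# A quick customer lookup stays cheap; a booking workflow asks for more.
quick_lookup = session_with_effort(1)    # → effort "low"
booking_flow = session_with_effort(4)    # → effort "xhigh"
```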
GPT-Realtime-2 also adds tone control. The model can adjust its communication style to the situation, staying calm during troubleshooting, switching to empathy when users are frustrated, and brightening after a successful outcome. The model is also better at understanding industry-specific terms, including healthcare vocabulary and proper nouns.
On benchmarks, the gains are measurable. GPT-Realtime-2 at high reasoning effort scored 96.6% on Big Bench Audio, compared to 81.4% for GPT-Realtime-1.5, an improvement of 15.2 percentage points. On the Audio MultiChallenge benchmark, GPT-Realtime-2 with xhigh reasoning scored 48.5%, versus 34.7% for GPT-Realtime-1.5.
Big Bench Audio measures challenging reasoning in language models that accept audio input. Audio MultiChallenge tests multi-turn conversational ability in spoken dialogue systems, including following instructions across turns, retaining context, staying self-consistent, and handling natural speech corrections.
Price: GPT-Realtime-2 is priced at $32 per 1M audio input tokens ($0.40 per 1M cached input tokens) and $64 per 1M audio output tokens.
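A back-of-envelope cost estimate using those rates. The rates come from the article; the token counts in the example are purely illustrative, since the article doesn't state how many audio tokens a minute of speech consumes:

```python
# GPT-Realtime-2 audio rates as stated in the article (USD per 1M tokens).
AUDIO_IN_PER_M = 32.00
AUDIO_IN_CACHED_PER_M = 0.40
AUDIO_OUT_PER_M = 64.00

def session_cost(in_tokens: int, cached_in_tokens: int, out_tokens: int) -> float:
    """Estimated dollar cost of one session from its token counts."""
    return (
        in_tokens / 1e6 * AUDIO_IN_PER_M
        + cached_in_tokens / 1e6 * AUDIO_IN_CACHED_PER_M
        + out_tokens / 1e6 * AUDIO_OUT_PER_M
    )

# Illustrative session: 50K fresh input, 200K cached input, 30K output tokens.
cost = session_cost(50_000, 200_000, 30_000)
# 1.60 + 0.08 + 1.92 ≈ $3.60
```

Note how cheaply cached input is billed ($0.40 vs $32 per million), which rewards long sessions that keep reusing the same conversation prefix.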
GPT-Realtime-Translate: Live Speech Translation in 70+ Languages
GPT-Realtime-Translate is a new live translation model that translates speech from 70+ input languages into 13 output languages while preserving the speaker's voice. Unlike GPT-Realtime-2, this model is a dedicated translation pipeline: speech goes in in one language and comes out in another. It is not a chat agent; it is designed to convert one audio stream into another in real time.
The distinction matters for developers choosing the right tool. If your application needs two-way translated customer support calls or a live interpreter for an in-person event, GPT-Realtime-Translate is the purpose-built option. If you need a model that can reason, call functions, or track context across turns, GPT-Realtime-2 handles that.
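A translation session is correspondingly simpler to configure than an agent session, since there are no tools or agent instructions. The model name is from the article, but the `output_language` parameter name is an assumption for illustration:

```python
# Hypothetical config for a GPT-Realtime-Translate session.
# No tools or agent instructions: this is a pure audio-to-audio pipeline.
def build_translation_session(target_language: str) -> dict:
    return {
        "model": "gpt-realtime-translate",
        "output_language": target_language,  # assumed parameter name
    }

# e.g. translate whatever the caller speaks into Spanish.
cfg = build_translation_session("es")
```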
Price: GPT-Realtime-Translate costs $0.034 per minute.
GPT-Realtime-Whisper: Streaming Transcription as People Talk
GPT-Realtime-Whisper is a new streaming model built for low-latency speech-to-text: it transcribes audio as people speak, so live products feel faster, more responsive, and more natural.
The original Whisper model was designed for complete audio files, making it best suited for post-session transcription. GPT-Realtime-Whisper is its streaming counterpart, purpose-built for applications that require live output. For real-time transcription, GPT-Realtime-Whisper gives you controllable delay: lower delay settings produce text earlier, while higher delay settings can improve transcription quality.
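That latency/quality trade-off might be surfaced as a per-session setting like the sketch below. The model name is from the article; the `delay` parameter name and its units are assumptions:

```python
# Sketch of the latency/quality knob for streaming transcription.
# The "delay" field name and seconds unit are assumed, not documented.
def build_transcription_session(delay_seconds: float) -> dict:
    return {
        "model": "gpt-realtime-whisper",
        "transcription": {
            # lower → text appears sooner; higher → better final quality
            "delay": delay_seconds,
        },
    }

captions = build_transcription_session(0.5)  # live captions favor speed
notes = build_transcription_session(2.0)     # meeting notes favor accuracy
```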
Use cases include live broadcast captions, meeting notes generated during a conversation, and voice agents that need to continuously understand the user rather than waiting for turn-by-turn input.
Price: GPT-Realtime-Whisper costs $0.017 per minute.
Architectural Patterns and New Vocabularies
Developers can choose between three session types depending on the use case: a voice agent session when the application needs an assistant that responds to the user, a translation session when it needs a translator, and a transcription session when it needs text from audio without model-generated responses.
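The three session types can be sketched as a simple dispatcher. The model names are from the article; the config shapes (and the `modalities` field) are assumptions for illustration:

```python
# Sketch: one entry point that picks a session type per use case.
def make_session(kind: str) -> dict:
    if kind == "agent":
        # full voice assistant: listens and speaks back
        return {"model": "gpt-realtime-2", "modalities": ["audio", "text"]}
    if kind == "translation":
        # audio in one language, audio out in another
        return {"model": "gpt-realtime-translate", "modalities": ["audio"]}
    if kind == "transcription":
        # text only, no model-generated replies
        return {"model": "gpt-realtime-whisper", "modalities": ["text"]}
    raise ValueError(f"unknown session kind: {kind}")
```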
On the voices side, two new voices, Cedar and Marin, join the API's lineup with this release.
All three models — GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper — are now available through the OpenAI Realtime API, which is generally available starting today.
Key Takeaways
- GPT-Realtime-2 delivers GPT-5-class audio reasoning with a 128K context window, five-level adjustable reasoning effort, tone control, parallel tool calls, and interruption handling.
- On Big Bench Audio, GPT-Realtime-2 (high reasoning) scores 96.6% vs. 81.4% for GPT-Realtime-1.5; on Audio MultiChallenge, the xhigh variant scored 48.5% versus 34.7%.
- GPT-Realtime-Translate handles live speech translation from 70+ input languages to 13 output languages at $0.034/min
- GPT-Realtime-Whisper streams transcripts in real time with controllable latency at $0.017/min
- The Realtime API is out of beta and generally available today alongside two new voices, Cedar and Marin



