Google AI Introduces ZeroBAS: A Neural Method for Binaural Audio Synthesis from Monaural Sound Recordings and Position Information Without Training on Any Binaural Data

Humans have a remarkable ability to localize sound sources and interpret their environment using auditory cues, a phenomenon known as spatial hearing. This capability enables tasks such as identifying speakers in noisy settings or navigating complex environments. Simulating this spatial auditory perception is essential for creating immersive experiences in technologies such as augmented reality (AR) and virtual reality (VR). However, the transition from monaural (one-channel) to binaural (two-channel) audio, which captures these spatial effects, faces significant challenges, chiefly the limited availability of multi-channel and spatially annotated audio data.
Traditional mono-to-binaural synthesis methods often rely on digital signal processing (DSP) frameworks. These methods model spatial auditory cues with components such as the head-related transfer function (HRTF), the room impulse response (RIR), and ambient noise, typically treated as linear time-invariant (LTI) systems. Although DSP-based techniques are well established and can produce realistic-sounding results, they fail to capture the nonlinear acoustic wave effects present in real-world sound propagation.
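For context, a conventional DSP pipeline of this kind reduces binaural rendering to convolving the mono signal with a per-ear impulse response. The snippet below is a minimal illustrative sketch of that LTI view, not code from the paper; `mono`, `hrir_left`, and `hrir_right` are assumed NumPy arrays supplied by the reader (for example, HRIRs from any public HRTF database).

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural_lti(mono: np.ndarray,
                        hrir_left: np.ndarray,
                        hrir_right: np.ndarray) -> np.ndarray:
    """Classic LTI rendering: convolve the mono signal with a
    head-related impulse response (HRIR) for each ear."""
    left = fftconvolve(mono, hrir_left, mode="full")[: len(mono)]
    right = fftconvolve(mono, hrir_right, mode="full")[: len(mono)]
    return np.stack([left, right], axis=0)  # shape: (2, num_samples)
```

Because convolution is linear and time-invariant by construction, any nonlinear propagation effects fall outside what such a pipeline can represent.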
Supervised learning models have emerged as an alternative to DSP, using neural networks to synthesize binaural audio directly. However, such models face two major limitations: first, the scarcity of binaural recordings annotated with source position, and second, the risk of overfitting to particular acoustic environments, speaker characteristics, and training datasets. The specialized equipment required for data collection further hinders these approaches, making supervised methods expensive and difficult to scale.
To address these challenges, researchers from Google have proposed ZeroBAS, a zero-shot neural method for mono-to-binaural speech synthesis that requires no binaural training data. The method applies geometric time warping (GTW) and amplitude scaling (AS) based on the source position, and the resulting rough signals are then refined with a pre-trained denoising vocoder to yield perceptually realistic binaural audio. Notably, ZeroBAS generalizes across varied room conditions, as demonstrated on the newly introduced TUT Mono-to-Binaural dataset, and achieves performance comparable to, or even better than, state-of-the-art supervised methods.
The ZeroBAS framework consists of three stages:
- In stage 1, geometric time warping (GTW) converts the monaural input into two channels (left and right) by simulating the interaural time difference (ITD) implied by the relative positions of the sound source and the listener's ears. GTW computes a time delay for each of the left and right channels, and the warped signals form a preliminary pair of binaural channels (see the sketch after this list).
- In stage 2, amplitude scaling (AS) improves the spatial fidelity of the warped signals by simulating the interaural level difference (ILD) based on the inverse-square law. Human perception of sound location depends on both ITD and ILD, with ILD dominating for high-frequency sounds. Using the Euclidean distances from the source to each ear, the amplitude of each channel is scaled accordingly.
- In stage 3, the warped and scaled signals are iteratively refined with a pre-trained denoising vocoder, WaveFit. The vocoder operates on log-mel spectrogram features and builds on denoising diffusion probabilistic models (DDPMs) to generate clean binaural waveforms. Applying the vocoder iteratively suppresses acoustic artifacts and ensures high-quality binaural output.
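To make the first two stages concrete, here is a simplified, hypothetical sketch of GTW and AS: whole-sample delays stand in for the paper's warping formulation, and each channel is attenuated by the inverse of its source-to-ear distance. The geometry, ear spacing, and 343 m/s speed of sound are illustrative assumptions rather than values taken from the paper, and stage 3 (the pre-trained WaveFit vocoder) is indicated only by a comment.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second (approximate, room temperature)

def geometric_time_warp(mono, sample_rate, src_pos, left_ear, right_ear):
    """Stage 1 (GTW, simplified): delay the mono signal for each ear in
    proportion to the source-to-ear distance, inducing an interaural
    time difference (ITD)."""
    d_left = float(np.linalg.norm(src_pos - left_ear))
    d_right = float(np.linalg.norm(src_pos - right_ear))
    channels = []
    for d in (d_left, d_right):
        delay = int(round(d / SPEED_OF_SOUND * sample_rate))  # in samples
        channels.append(np.concatenate([np.zeros(delay), mono])[: len(mono)])
    return channels[0], channels[1], d_left, d_right

def amplitude_scale(left, right, d_left, d_right):
    """Stage 2 (AS, simplified): attenuate each channel by the inverse of
    its source-to-ear distance, inducing an interaural level difference
    (ILD) consistent with the inverse-square law on power."""
    return left / max(d_left, 1e-6), right / max(d_right, 1e-6)

# Stage 3 would iteratively refine the resulting rough binaural signal with
# a pre-trained denoising vocoder (WaveFit in the paper); that large
# pre-trained model is not reproduced here.

if __name__ == "__main__":
    sr = 16_000
    t = np.arange(sr) / sr
    mono = np.sin(2 * np.pi * 440.0 * t)     # one second of a test tone
    src = np.array([2.0, 1.0, 0.0])          # source position in metres (assumed)
    l_ear = np.array([-0.09, 0.0, 0.0])      # listener's left ear (assumed)
    r_ear = np.array([0.09, 0.0, 0.0])       # listener's right ear (assumed)
    left, right, dl, dr = geometric_time_warp(mono, sr, src, l_ear, r_ear)
    left, right = amplitude_scale(left, right, dl, dr)
    rough_binaural = np.stack([left, right], axis=0)  # input to the vocoder stage
```

A faithful implementation would warp with sub-sample interpolation rather than whole-sample delays, and would then pass the rough two-channel signal through the pre-trained vocoder for iterative refinement.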
Coming to the evaluation, ZeroBAS was tested on two datasets (results in Tables 1 and 2 of the paper): an established binaural speech benchmark and the newly introduced TUT Mono-to-Binaural dataset. The latter was designed to test how well mono-to-binaural methods generalize across diverse acoustic environments. In objective evaluations, ZeroBAS showed significant improvements over DSP baselines and approached the performance of supervised methods despite never being trained on binaural data. Notably, ZeroBAS achieved the best results on the out-of-distribution TUT dataset, highlighting its robustness across varied conditions.
Subjective testing also confirmed the effectiveness of ZeroBAS. In a Mean Opinion Score (MOS) test, human listeners rated the ZeroBAS output as less natural than that of supervised methods. In MUSHRA testing, however, ZeroBAS achieved spatial quality similar to supervised models, with listeners unable to detect statistically significant differences.
Although the method is remarkable, it has some limitations. ZeroBAS cannot exploit phase information directly, since the vocoder is not conditioned on it, and it relies on general pre-trained models rather than environment-specific ones. Despite these limitations, its ability to generalize successfully highlights the promise of zero-shot approaches to binaural audio synthesis.
In conclusion, ZeroBAS offers an attractive, room-agnostic approach to mono-to-binaural speech synthesis that achieves perceptual quality comparable to supervised methods without requiring any binaural training data. Its strong performance across different acoustic environments makes it a promising candidate for real-world applications in AR, VR, and immersive audio systems.
Check out the Paper for more details. All credit for this research goes to the researchers of this project.

Vineet Kumar is a consulting intern at MarktechPost. He is currently pursuing his BS at the Indian Institute of Technology (IIT), Kanpur. He is a machine learning enthusiast. He is interested in research and recent developments in Deep Learning, Computer Vision, and related fields.