
StereoFoley: Object-Aware Stereo Audio Generation from Video

We introduce StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo audio at 48 kHz. Although recent generative video-to-audio models achieve strong semantic and temporal fidelity, they mostly remain limited to mono or fail to deliver object-aware stereo imaging, hindered by the lack of professionally mixed, spatially accurate video-audio datasets. First, we develop and train a base model that generates stereo audio from video with high semantic and synchronization accuracy. Next, to overcome the dataset limitation, we present a synthetic data generation pipeline that combines video analysis, object tracking, and audio integration with range-based dynamic audio control, enabling accurate object-aware spatial audio. Finally, we fine-tune the base model on this synthetic dataset, yielding clear object-sound correspondence. Because no established metrics exist for this task, we propose stereo object-awareness measures and validate them with a human listening study, which shows strong correlation with perception. This work establishes the first end-to-end framework for object-aware stereo audio generation from video, addressing a critical gap and setting a new benchmark in the field.
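The abstract does not disclose how the pipeline's dynamic stereo control is implemented. As an illustration only, the sketch below shows one common way an object's tracked horizontal position could drive stereo placement: a constant-power (equal-power) pan law applied per sample. The function name, the [-1, 1] position convention, and the choice of pan law are assumptions, not the paper's method.

```python
import numpy as np

def constant_power_pan(mono: np.ndarray, x_positions: np.ndarray) -> np.ndarray:
    """Pan a mono signal into stereo from per-sample object positions.

    x_positions: normalized horizontal object position in [-1, 1]
    (-1 = hard left, +1 = hard right), one value per audio sample,
    e.g. interpolated from an object tracker's frame-rate trajectory.
    A constant-power (sin/cos) pan law keeps total energy constant
    as the source moves across the stereo field.
    """
    # Map position [-1, 1] to pan angle [0, pi/2].
    theta = (x_positions + 1.0) * (np.pi / 4.0)
    left = np.cos(theta) * mono
    right = np.sin(theta) * mono
    return np.stack([left, right], axis=0)  # shape: (2, n_samples)

# Example: a 1 kHz tone whose (hypothetical) source sweeps
# left to right over one second at 48 kHz.
sr = 48_000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
sweep = np.linspace(-1.0, 1.0, sr)  # tracked object crossing the frame
stereo = constant_power_pan(tone, sweep)
```

With this law, left-channel squared amplitude plus right-channel squared amplitude equals the mono signal's squared amplitude at every sample, so perceived loudness stays roughly uniform during the sweep.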

