VSSFlow: Unifying Video-Conditioned Sound and Speech Generation via Joint Learning

Video-conditioned sound and speech generation, encompassing the video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks, has typically been studied as two separate problems, with limited exploration of their integration within a unified framework. Recent attempts to unify V2S and VisualTTS struggle to handle heterogeneous condition types (e.g., video and transcript conditions) and rely on complex, multi-stage training procedures. Unifying these two tasks thus remains an open problem. To close this gap, we introduce VSSFlow, which seamlessly integrates both V2S and VisualTTS into a unified flow-matching framework. VSSFlow employs a novel condition aggregation mechanism to handle heterogeneous input conditions. We observe that cross-attention and self-attention layers exhibit different inductive biases when processing conditions, and VSSFlow exploits these biases to handle each representation effectively: cross-attention for ambiguous video conditions and self-attention for deterministic speech transcripts. Furthermore, contrary to the prevailing belief that jointly training the two tasks requires intricate training strategies and may degrade performance, we find that VSSFlow benefits from end-to-end joint learning of sound and speech generation without additional training-stage designs. Detailed analysis attributes this to a shared audio prior learned across the two tasks, which accelerates convergence, enhances conditional generation, and stabilizes the classifier-free guidance process. Extensive experiments show that VSSFlow surpasses domain-specific baselines on both V2S and VisualTTS benchmarks, underscoring the potential of unified generative models.
† Renmin University of China
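
To make the condition aggregation idea concrete, below is a minimal sketch (not the authors' released code) of how the two condition types could be routed through one transformer block: deterministic transcript tokens are concatenated with the audio latents and processed by self-attention, while ambiguous video features are injected via cross-attention. All names, dimensions, and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ConditionAggregationBlock(nn.Module):
    """Illustrative block: self-attention over [transcript; audio], cross-attention to video."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, audio_latents, transcript_tokens, video_feats):
        # Deterministic condition (transcript): concatenate with the audio latents
        # so both are handled jointly by self-attention.
        x = torch.cat([transcript_tokens, audio_latents], dim=1)
        h = self.n1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]

        # Ambiguous condition (video): injected via cross-attention from the
        # sequence onto the video frame features.
        x = x + self.cross_attn(self.n2(x), video_feats, video_feats, need_weights=False)[0]

        x = x + self.ff(self.n3(x))
        # Return only the audio portion of the sequence.
        return x[:, transcript_tokens.size(1):]


if __name__ == "__main__":
    block = ConditionAggregationBlock()
    audio = torch.randn(2, 100, 512)       # noisy audio latents (flow-matching state)
    transcript = torch.randn(2, 32, 512)    # embedded transcript tokens (VisualTTS condition)
    video = torch.randn(2, 64, 512)         # video frame features (V2S condition)
    out = block(audio, transcript, video)
    print(out.shape)  # torch.Size([2, 100, 512])
```

In this sketch, either condition can be dropped (e.g., replaced by learned null embeddings) so that a single network serves both tasks and supports classifier-free guidance; the actual VSSFlow architecture may differ in its layer arrangement and conditioning details.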



