Data-Centric Studies for Improving Speech-Language Training

nimda December 16, 2025

Spoken Question-Answering (SQA) is a core capability for useful artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, the lack of controlled ablations of pre-training data processing and curation makes it hard to understand which factors account for performance, even though similar studies have brought substantial gains in other data modalities. In this work, we address this gap by conducting a data-centric exploration of SpeechLM pre-training. We focus on three research questions that are fundamental to speech-language pre-training data: (1) how to process raw web-crawled audio content for speech-text pre-training, (2) how to construct pre-training datasets to augment web-crawled data, and (3) how to interleave (text, audio) segments into training sequences. We apply the insights from our controlled data-centric ablations to pre-train a 3.8B-parameter SpeechLM, called SpeLangy, which outperforms models up to 3x larger by 10.2% overall. We hope that our findings highlight the impact of effective data curation on speech-language pre-training and guide future data-centric exploration of SpeechLMs.