Bridging the Gap Between Text and Speech Comprehension in LLMs

Large Language Models (LLMs) can be adapted to extend their text capabilities to speech input. However, these speech-adapted LLMs, which also operate with modified processing pipelines, remain inferior to their text-based counterparts on language comprehension tasks. We call this deficit the text-to-speech comprehension gap: the drop in performance observed when a speech-adapted LLM processes spoken input compared to when the original text-based LLM processes the same content as text. Recent approaches to reducing this gap rely either on large-scale synthesis of spoken versions of text data, which is expensive and leaves models heavily dependent on synthetic data, or on large proprietary speech datasets, which cannot be reproduced. There is therefore still a need for more data-efficient methods to bridge the text-to-speech comprehension gap. In this work, we analyze the gap as being driven by two factors: (i) the forgetting of textual skills during adaptation, and (ii) the misalignment between the speech and text modalities. Based on this analysis, we present SALAD (Sample-Efficient Alignment and Learning through Active Selection and Cross-modal Distillation), which combines cross-modal filtering with targeted synthetic data to improve alignment while reducing forgetting. Applied to 3B and 7B LLMs, SALAD achieves performance competitive with a strong open-weight model on broad-domain benchmarks covering knowledge, language comprehension, and reasoning, while training on an order of magnitude less speech data drawn from public corpora.
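To make the definition above concrete, the gap can be stated informally as follows (notation ours, not taken from the abstract): for a text example $x$ with spoken rendering $s(x)$, an original text model $M_{\text{text}}$, a speech-adapted model $M_{\text{speech}}$, and a task metric $\mathrm{Acc}$,
\[
\Delta_{\text{gap}} \;=\; \mathbb{E}_{x}\!\left[\,\mathrm{Acc}\big(M_{\text{text}}(x)\big) \;-\; \mathrm{Acc}\big(M_{\text{speech}}(s(x))\big)\right],
\]
so a positive $\Delta_{\text{gap}}$ measures how much comprehension is lost when the same content is consumed as speech rather than text.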
† University of Toulon, Aix Marseille Université, CNRS, LIS



