Delayed Fusion: Combining Large-Scale Language Models for First-Pass Decoding in End-to-End Speech Recognition

This paper presents an efficient decoding method for end-to-end automatic speech recognition (E2E-ASR) with large-scale language models (LLMs). Although shallow fusion is the most common approach to incorporating language models into E2E-ASR decoding, we face two practical problems with LLMs: (1) LLM inference is computationally costly, and (2) there may be a vocabulary mismatch between the ASR model and the LLM. Resolving this mismatch requires retraining the ASR model and/or the LLM, which is time-consuming at best and in many cases infeasible. We propose "delayed fusion," which applies LLM scores to ASR hypotheses with a delay during decoding and enables easy use of pre-trained LLMs in ASR tasks. This method can reduce not only the number of hypotheses scored by the LLM but also the number of LLM inference calls. It also allows re-tokenization of ASR hypotheses during decoding when the ASR model and the LLM employ different tokenizations. Using the LibriHeavy ASR corpus and three public LLMs (OpenLLaMA 3B & 7B and Mistral 7B), we demonstrate that delayed fusion provides improved decoding speed and accuracy compared with shallow fusion and N-best rescoring.
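To make the core idea concrete, the following is a minimal toy sketch (not the paper's implementation) contrasting shallow fusion, where the LM scores every candidate at every beam-search step, with delayed fusion, where the expensive LM scores only the surviving beam every `DELAY` steps. All scoring functions, the vocabulary, and the delay value are hypothetical stand-ins chosen for illustration.

```python
BEAM = 3
VOCAB = ["a", "b", "c"]
STEPS = 8

def asr_logp(tokens, tok):
    # stand-in for the E2E-ASR model's token log-probability (hypothetical)
    return {"a": -0.2, "b": -0.7, "c": -1.2}[tok]

def lm_logp(tokens):
    # stand-in for an expensive LLM log-probability of the full hypothesis;
    # mildly rewards repeated tokens (purely illustrative)
    return sum(-0.1 if p == c else -0.4 for p, c in zip(tokens, tokens[1:]))

def decode(delay):
    """Beam search that applies LM scores every `delay` steps."""
    lm_calls = 0
    beams = [([], 0.0, 0.0)]  # (tokens, asr_score, lm_score)
    for step in range(1, STEPS + 1):
        # expand every beam hypothesis with every token
        cands = [(toks + [tok], asr + asr_logp(toks, tok), lm)
                 for toks, asr, lm in beams for tok in VOCAB]
        if delay == 1:
            # shallow fusion: LM scores every candidate before pruning
            cands = [(t, a, lm_logp(t)) for t, a, _ in cands]
            lm_calls += len(cands)
        cands.sort(key=lambda x: x[1] + x[2], reverse=True)
        beams = cands[:BEAM]
        if delay > 1 and step % delay == 0:
            # delayed fusion: LM scores only the pruned beam, every `delay` steps
            beams = [(t, a, lm_logp(t)) for t, a, _ in beams]
            lm_calls += len(beams)
            beams.sort(key=lambda x: x[1] + x[2], reverse=True)
    return beams[0][0], lm_calls

hyp_sf, calls_sf = decode(delay=1)  # shallow fusion
hyp_df, calls_df = decode(delay=4)  # delayed fusion
print("LM calls, shallow:", calls_sf, "delayed:", calls_df)
```

In this toy setup the delayed variant issues an order of magnitude fewer LM calls, which is the intuition behind the speedup; the real method additionally batches hypotheses and may re-tokenize them before LLM scoring.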



