
Principled Coarse-Grained Acceptance for Speculative Decoding in Speech

Speculative decoding accelerates autoregressive speech generation by letting a fast draft model propose tokens that a larger target model then verifies. However, for speech LLMs that generate acoustic tokens, exact token matching is overly restrictive: many distinct tokens are acoustically or semantically interchangeable, which lowers acceptance rates and limits the speedup. We present Principled Coarse-Graining (PCG), which verifies proposals at the level of Acoustic Similarity Groups (ASGs) derived from the target model's embedding space. By splitting each token's probability mass across the overlapping groups that contain it, we define an overlap-aware coarse-grained distribution and perform rejection sampling on the resulting group variable. This yields an exactness guarantee at the group level while allowing the accepted draft token to stand in for any member of the group in practice. On LibriTTS, PCG increases acceptance and throughput relative to standard speculative decoding and to prior speech-specific relaxations, while maintaining intelligibility and speaker similarity. These results suggest acoustically aware, group-level acceptance as a simple and general way to speed up speech token generation while maintaining speech quality.
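The group-level check described above can be illustrated with a small sketch. This is not the authors' implementation: the even split of a token's mass across its groups, the toy vocabulary, and the names `coarse_grain`, `group_accept`, and `groups_of` are all assumptions made for illustration. The abstract does not say how mass is divided or how ASGs are built from embeddings.

```python
import random
from collections import defaultdict

def coarse_grain(probs, groups_of):
    """Map a token distribution to a group distribution.

    Assumption: each token's mass is split evenly across the groups
    that contain it (the source does not specify the split rule).
    """
    group_probs = defaultdict(float)
    for token, p in probs.items():
        groups = groups_of[token]
        for g in groups:
            group_probs[g] += p / len(groups)
    return dict(group_probs)

def group_accept(draft_token, draft_probs, target_probs, groups_of, rng):
    """Run one speculative acceptance test on the group variable.

    Returns (accepted, group). The draft token must have nonzero
    probability under draft_probs.
    """
    Q = coarse_grain(draft_probs, groups_of)
    P = coarse_grain(target_probs, groups_of)
    # Choosing one of the draft token's groups uniformly makes the group
    # a sample from Q under the even-split assumption above.
    g = rng.choice(groups_of[draft_token])
    if rng.random() < min(1.0, P.get(g, 0.0) / Q[g]):
        return True, g
    # On rejection, resample the group from the normalized residual
    # max(0, P - Q), as in standard speculative sampling.
    residual = {h: max(0.0, P[h] - Q.get(h, 0.0)) for h in P}
    names = list(residual)
    weights = [residual[h] for h in names]
    return False, rng.choices(names, weights=weights)[0]

# Hypothetical toy setup: token "b" belongs to two overlapping groups.
groups_of = {"a": ("G1",), "b": ("G1", "G2"), "c": ("G2",)}
target = {"a": 0.5, "b": 0.2, "c": 0.3}
draft = {"a": 0.1, "b": 0.2, "c": 0.7}

rng = random.Random(0)
token = rng.choices(list(draft), weights=list(draft.values()))[0]
accepted, group = group_accept(token, draft, target, groups_of, rng)
```

With these toy values the coarse-grained target distribution places about 0.6 on `G1` and 0.4 on `G2`. Because acceptance is decided on the group rather than the token, the group that comes out follows the coarse-grained target distribution, whether the draft is accepted or a group is resampled after a rejection. How a concrete token is then chosen within that group is left out of this sketch.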