A Theoretical Framework for Acoustic Neighbor Embeddings

This paper provides a theoretical framework for interpreting acoustic neighbor embeddings, which are fixed-dimensional representations of the phonetic content of variable-length audio or text. An interpretation of the distances between embeddings is proposed, based on a general quantitative definition of phonetic similarity between words. This gives us a framework for understanding and applying the embeddings in a principled manner. Theoretical and empirical evidence supporting a cluster-wise isotropy approximation is presented, which allows the distances to be reduced to simple Euclidean distances. Four experiments that validate the framework and show how it can be applied to diverse problems are described. Nearest-neighbor search between audio and text embeddings can give word classification accuracy comparable to that of finite state transducers (FSTs) for vocabularies as large as 500k words. Embedding-based grading comes within 0.5 percentage points of the accuracy of phonetic graders in detecting out-of-vocabulary words, and yields assessments consistent with those derived from human listening tests of English speech synthesis. The framework also allows the embeddings to be used to predict the expected confusability of device wake words. All source code and pretrained models are provided.
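As a minimal sketch of the nearest-neighbor classification described above: once the cluster-wise isotropy approximation reduces embedding distances to plain Euclidean distances, classifying an audio embedding amounts to finding the closest text embedding in the vocabulary. The embeddings and word list below are synthetic placeholders, not the paper's released models.

```python
import numpy as np

# Hypothetical fixed-dimensional embeddings: each row of `text_embs` is
# the text embedding of one vocabulary word; `audio_vec` stands in for
# the embedding of a spoken utterance. In practice both would come from
# trained audio and text encoders.
rng = np.random.default_rng(0)
vocab = ["cat", "cap", "dog"]
text_embs = rng.normal(size=(3, 8))                    # (vocab_size, dim)
audio_vec = text_embs[1] + 0.05 * rng.normal(size=8)   # utterance near "cap"

# Under the isotropy approximation, classification is a nearest-neighbor
# search using simple Euclidean distance to every text embedding.
dists = np.linalg.norm(text_embs - audio_vec, axis=1)
predicted = vocab[int(np.argmin(dists))]
print(predicted)  # "cap"
```

For a 500k-word vocabulary, the same search is a single distance computation over a (500000, dim) matrix, or an approximate nearest-neighbor index if exact search is too slow.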


