Which Test Fits Which Model? A Taxonomy of Discourse Model Testing

Speech-based models have recently demonstrated impressive capabilities across a wide range of tasks. Their evaluation, however, remains fragmented across task types and model types: different models excel at different aspects of speech processing and therefore call for different experimental protocols. This paper proposes a unified taxonomy that answers the question: which test fits which model? The taxonomy defines three orthogonal axes: the feature a test measures, the model capabilities needed to attempt the task, and the protocol requirements needed to administer it. We categorize the existing landscape of tests and benchmarks along these axes, covering areas such as representation learning, speech production, and interactive conversation. By mapping each assessment to the capabilities it demands of a model (e.g., speech production, real-time processing) and to its methodological requirements (e.g., data efficiency, human judgment), the taxonomy provides a systematic framework for matching models with appropriate assessment methods. It also reveals systematic gaps, such as the limited coverage of prosody, interaction, and reasoning, that highlight priorities for future benchmark design. Overall, this work offers both a conceptual basis and a practical guide for selecting, interpreting, and extending the evaluation of discourse models.
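To make the three axes concrete, the following is a minimal Python sketch of how a test and a model might be encoded and matched under this taxonomy. The axis values, class names, and the example entries are hypothetical illustrations for this sketch, not items from the paper's actual catalogue.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Test:
    """One test, positioned along the taxonomy's three axes."""
    name: str
    measured_feature: str               # axis 1: the feature the test measures
    required_capabilities: frozenset    # axis 2: what the model must be able to do
    protocol_requirements: frozenset    # axis 3: what administering the test demands

@dataclass(frozen=True)
class Model:
    name: str
    capabilities: frozenset

def applicable_tests(model: Model, tests: list[Test]) -> list[Test]:
    """Return the tests whose required capabilities the model satisfies."""
    return [t for t in tests if t.required_capabilities <= model.capabilities]

# Hypothetical example: a prosody test that requires speech production
# and real-time processing, and is scored by human judges.
prosody_test = Test(
    name="prosodic-contour-imitation",
    measured_feature="prosody",
    required_capabilities=frozenset({"speech_production", "real_time_processing"}),
    protocol_requirements=frozenset({"human_judgment"}),
)

# A recognition-only model cannot attempt the prosody test above.
asr_model = Model(name="asr-encoder", capabilities=frozenset({"speech_recognition"}))
assert applicable_tests(asr_model, [prosody_test]) == []
```

Under this encoding, the taxonomy's central question reduces to a capability-subset check: a test fits a model when axis 2 is satisfied, while axis 3 captures costs (e.g., human judgment) that constrain which fitting tests are practical to run.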



