PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech

Text normalization (TN) is a key preprocessing step in text-to-speech (TTS) systems, converting written forms into their spoken equivalents. Traditional TN systems can achieve high accuracy, but they require substantial engineering effort, are difficult to scale, and make broad language coverage challenging, especially in low-resource settings. We propose PolyNorm, a prompt-based approach to TN using large language models (LLMs), which aims to reduce reliance on hand-crafted rules and enable broad linguistic applicability with minimal human intervention. In addition, we present a language-agnostic pipeline for automatic data curation and evaluation, designed to facilitate scalable experimentation across diverse languages. Experiments across eight languages show consistent reductions in word error rate (WER) compared to a production-grade baseline system. To support further research,
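The results above are reported as word error rate (WER). The abstract does not specify how WER is computed for TN. The sketch below assumes the standard definition: the word-level Levenshtein distance between a reference spoken form and the system's normalized output, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not part of PolyNorm.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER (assumed definition): word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words (one DP row kept at a time).
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution (or match when cost == 0)
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


# Hypothetical TN example: written "$25" normalized by a system and compared
# against a reference spoken form.
print(word_error_rate("twenty five dollars", "twenty five dollars"))  # 0.0
print(word_error_rate("on march third", "on march three"))           # one substitution over three words
```

Under this definition, a single wrongly verbalized token (e.g. "three" for "third") counts as one substitution, so lower WER directly reflects fewer normalization errors in the spoken form.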



