Optimal Scaling of Language Models from General to Specialized Domains

This paper was accepted at the ICLR 2026 Workshop on Navigating and Addressing Data Problems for Foundation Models.

Language models achieve remarkable performance on a wide variety of knowledge, language, and reasoning tasks thanks to the scale and diversity of their training data. The typical training recipe is a two-stage paradigm: pre-training on the full general corpus, followed by specialization via continued pre-training on a smaller subset of high-quality, domain-specific data drawn from that corpus. In a multi-domain setting, this entails continued pre-training of a separate model for each specialized domain, which we call split model training. We propose a method to pre-train multiple models independently over a common training corpus and to determine the optimal allocation of compute between pre-training and continued pre-training using scaling laws. Our method accurately predicts the loss of a model of size N trained on D pre-training tokens and D' continued pre-training tokens, and extrapolates to larger model sizes and token counts. Applied to language model training, our approach consistently improves performance on commonsense, knowledge, and reasoning benchmarks across models of different sizes and compute budgets.
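To make the prediction step concrete, the sketch below fits a hypothetical Chinchilla-style scaling law extended with a continued pre-training term, L(N, D, D') = E + A/N^alpha + B/D^beta + C/D'^gamma, to losses measured on small runs and then extrapolates to a larger budget. The functional form, the constants, and the data points are illustrative assumptions only; the paper's actual parametrization is not given here.

    # Minimal sketch, assuming a hypothetical scaling-law form
    #   L(N, D, D') = E + A / N**alpha + B / D**beta + C / D'**gamma
    # fitted to (model size, pre-training tokens, continued
    # pre-training tokens) -> loss measurements. All numbers below
    # are made up for illustration.
    import numpy as np
    from scipy.optimize import curve_fit

    def scaling_law(X, E, A, alpha, B, beta, C, gamma):
        # N: parameters, D: pre-training tokens, Dp: continued pre-training tokens
        N, D, Dp = X
        return E + A / N**alpha + B / D**beta + C / Dp**gamma

    # Hypothetical measurements from small training runs.
    N    = np.array([1e8, 1e8, 1e8, 4e8, 4e8, 4e8,  1e9, 1e9])
    D    = np.array([2e9, 8e9, 8e9, 2e9, 8e9, 3e10, 8e9, 3e10])
    Dp   = np.array([5e8, 5e8, 2e9, 2e9, 2e9, 2e9,  8e9, 8e9])
    loss = np.array([3.20, 3.02, 2.96, 2.88, 2.66, 2.55, 2.41, 2.28])

    # Fit the free parameters (E, A, alpha, B, beta, C, gamma).
    popt, _ = curve_fit(
        scaling_law, (N, D, Dp), loss,
        p0=(1.5, 4e2, 0.3, 4e2, 0.3, 1e2, 0.3),
        maxfev=20000,
    )

    # Extrapolate: predicted loss for a larger model and token budget.
    # Comparing such predictions across (D, D') splits at fixed total
    # compute is what lets one choose the allocation before training.
    print(scaling_law((3e9, 1e11, 1e10), *popt))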
