Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

While scaling laws for large language models (LLMs) traditionally focus on proxy metrics such as pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework for modeling how benchmark performance scales with the training budget. We find that, for a fixed token-to-parameter ratio, a simple power law accurately describes the scaling behavior of log accuracy on multiple popular downstream tasks. Our results show that this direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. In addition, we introduce functional forms that predict accuracy across token-to-parameter ratios and that account for inference compute under repeated sampling. We validate our findings on models with up to 17B parameters trained on up to 350B tokens across two dataset mixtures. To support reproducibility and encourage future research, we release the complete set of pretraining losses and downstream evaluation results.
** Work done while at Apple
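As a rough illustration of the direct approach described in the abstract, the sketch below fits a power law to log accuracy as a function of training budget and then extrapolates to a larger budget. The specific form, -log(accuracy) = A * C^(-alpha), is an assumption made for this example, since the abstract states only that a simple power law describes log accuracy. The function names, the choice of FLOPs as the budget unit, and the data points are likewise hypothetical, and the data are synthetic rather than results from the paper.

```python
import numpy as np


def fit_log_accuracy_power_law(compute, accuracy):
    """Fit -log(accuracy) = A * compute**(-alpha) by least squares in log-log space.

    The functional form is an assumption for illustration, not necessarily
    the exact form used by the authors.
    """
    compute = np.asarray(compute, dtype=float)
    accuracy = np.asarray(accuracy, dtype=float)
    neg_log_acc = -np.log(accuracy)  # positive whenever accuracy is in (0, 1)
    # log(-log acc) = log A - alpha * log C, so a straight-line fit suffices.
    slope, intercept = np.polyfit(np.log(compute), np.log(neg_log_acc), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_accuracy(compute, A, alpha):
    """Invert the assumed power law to get accuracy at a given budget."""
    return np.exp(-A * np.asarray(compute, dtype=float) ** (-alpha))


# Synthetic, noise-free points at a fixed token-to-parameter ratio
# (hypothetical values, not taken from the paper).
true_A, true_alpha = 50.0, 0.1
budgets = np.array([1e18, 1e19, 1e20, 1e21])  # assumed unit: training FLOPs
acc = predict_accuracy(budgets, true_A, true_alpha)

A_hat, alpha_hat = fit_log_accuracy_power_law(budgets, acc)
extrapolated = float(predict_accuracy(1e22, A_hat, alpha_hat))
print(f"A={A_hat:.3f}, alpha={alpha_hat:.4f}, accuracy at 1e22: {extrapolated:.3f}")
```

Fitting in log-log space keeps the example dependency-light; with noisy real measurements, a nonlinear least-squares fit on the original scale may be preferable.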



