
Hyperparameter Transfer Across Modules, Width, Depth, Batch Size, and Training Duration

Hyperparameter tuning can dramatically affect the training stability and final performance of large-scale models. Recent work on neural network parameterisations, such as μP, has enabled the transfer of optimal global hyperparameters across model sizes. These works suggest a practical recipe: tune global hyperparameters at a small model size, then transfer them to larger sizes. We extend this line of work in two important ways. First, to cover the most important scaling axes, we propose a complete parameterisation that handles scaling in width and depth (via CompleteP) as well as batch size and training duration. Second, under this parameterisation, we investigate per-module hyperparameters and their transfer. We illustrate the practical challenges of navigating this high-dimensional hyperparameter landscape and propose practical guidelines for addressing the resulting optimisation problem. We show that, with the right parameterisation, hyperparameter transfer holds even in the per-module hyperparameter regime. Our study covers a wide range of hyperparameters relevant to optimising modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments show significant training speedups for large language models when hyperparameters are tuned and transferred per module.
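As a concrete illustration of this kind of recipe, the sketch below shows μP-style per-module learning-rate scaling for AdamW (hidden matrices scaled by base_width/width, vector-like parameters kept at the base rate) combined with a CompleteP-style 1/depth residual-branch multiplier. The base width, the module grouping, and the exact scaling rules are illustrative assumptions, not the paper's released implementation.

```python
# A minimal sketch (assumed, not the authors' code) of muP-style per-module
# learning-rate scaling for AdamW, plus a CompleteP-style 1/depth residual
# multiplier. Base width/depth and grouping rules are illustrative.
import torch
import torch.nn as nn


class Block(nn.Module):
    def __init__(self, width: int, depth: int):
        super().__init__()
        self.fc1 = nn.Linear(width, 4 * width)
        self.fc2 = nn.Linear(4 * width, width)
        # CompleteP-style residual branch multiplier ~ 1/depth (assumption).
        self.branch_scale = 1.0 / depth

    def forward(self, x):
        return x + self.branch_scale * self.fc2(torch.relu(self.fc1(x)))


def build_param_groups(model: nn.Module, base_lr: float, base_width: int, width: int):
    """Per-module AdamW groups: hidden matrices get lr * base_width / width
    (muP-style rule for Adam); biases and other 1-D params keep the base lr."""
    matrix_params, vector_params = [], []
    for p in model.parameters():
        (matrix_params if p.ndim >= 2 else vector_params).append(p)
    return [
        {"params": matrix_params, "lr": base_lr * base_width / width},
        {"params": vector_params, "lr": base_lr},
    ]


if __name__ == "__main__":
    base_width, width, depth = 256, 1024, 8
    model = nn.Sequential(*[Block(width, depth) for _ in range(depth)])
    groups = build_param_groups(model, base_lr=1e-3, base_width=base_width, width=width)
    opt = torch.optim.AdamW(groups, betas=(0.9, 0.95), weight_decay=0.1)

    x = torch.randn(4, width)
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()
```

In this toy setup, the learning rates found at `base_width` would be reused unchanged at the larger `width`, with only the group-level scaling factors adjusting them; per-module transfer would further split the groups (e.g. embeddings, attention, MLP, output head) rather than lumping all matrices together.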

