
Hyperparameter Transfer Across Modules, Width, Depth, Batch Size, and Training Duration

Hyperparameter tuning can dramatically affect the training stability and final performance of large-scale models. Recent work on neural network parameterisations, such as μP, has enabled the transfer of optimal global hyperparameters across model sizes. These works suggest a practical recipe: tune global hyperparameters at a small model size, then transfer them to larger sizes. We extend this line of work in two important ways. First, to cover the most important scaling axes, we propose a complete parameterisation that handles scaling in width and depth (via CompleteP) as well as batch size and training duration. Second, under this parameterisation, we investigate per-module hyperparameters and their transfer. We illustrate the practical challenges of navigating this high-dimensional hyperparameter landscape and propose practical guidelines for addressing the resulting optimisation problem. We show that, with the right parameterisation, hyperparameter transfer holds even in the per-module hyperparameter regime. Our study covers a wide range of hyperparameters relevant to optimising modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments show significant training speedups for large language models when hyperparameters are tuned and transferred per module.
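As a concrete illustration of this kind of recipe, the sketch below shows μP-style per-module learning-rate scaling for AdamW (hidden matrices scaled by base_width/width, vector-like parameters kept at the base rate) combined with a CompleteP-style 1/depth residual-branch multiplier. The base width, the module grouping, and the exact scaling rules are illustrative assumptions, not the paper's released implementation.

```python
# A minimal sketch (assumed, not the authors' code) of muP-style per-module
# learning-rate scaling for AdamW, plus a CompleteP-style 1/depth residual
# multiplier. Base width/depth and grouping rules are illustrative.
import torch
import torch.nn as nn


class Block(nn.Module):
    def __init__(self, width: int, depth: int):
        super().__init__()
        self.fc1 = nn.Linear(width, 4 * width)
        self.fc2 = nn.Linear(4 * width, width)
        # CompleteP-style residual branch multiplier ~ 1/depth (assumption).
        self.branch_scale = 1.0 / depth

    def forward(self, x):
        return x + self.branch_scale * self.fc2(torch.relu(self.fc1(x)))


def build_param_groups(model: nn.Module, base_lr: float, base_width: int, width: int):
    """Per-module AdamW groups: hidden matrices get lr * base_width / width
    (muP-style rule for Adam); biases and other 1-D params keep the base lr."""
    matrix_params, vector_params = [], []
    for p in model.parameters():
        (matrix_params if p.ndim >= 2 else vector_params).append(p)
    return [
        {"params": matrix_params, "lr": base_lr * base_width / width},
        {"params": vector_params, "lr": base_lr},
    ]


if __name__ == "__main__":
    base_width, width, depth = 256, 1024, 8
    model = nn.Sequential(*[Block(width, depth) for _ in range(depth)])
    groups = build_param_groups(model, base_lr=1e-3, base_width=base_width, width=width)
    opt = torch.optim.AdamW(groups, betas=(0.9, 0.95), weight_decay=0.1)

    x = torch.randn(4, width)
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()
```

In this toy setup, the learning rates found at `base_width` would be reused unchanged at the larger `width`, with only the group-level scaling factors adjusting them; per-module transfer would further split the groups (e.g. embeddings, attention, MLP, output head) rather than lumping all matrices together.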

