Machine Learning

Choosing the best model size and dataset size under a fixed LLM compute budget

Getting started

When training large language models (LLMs), we are always constrained by a compute budget. This constraint creates a basic trade-off, so we ask the question:

Should we make the model bigger, with more parameters, or should we train it on more tokens?

The performance and efficiency of an LLM are greatly influenced by this trade-off. It is therefore important to find the right balance between the number of model parameters and the number of training tokens.

For transformer training, we use the following notation:

  • N is the number of model parameters.
  • D is the number of training tokens.
  • C is the compute budget.

It is straightforward to see that C, N and D are tightly coupled: for a fixed C, a larger model leaves fewer tokens to train on, and vice versa.

Previous studies (Kaplan et al., 2020; Hoffmann et al., 2022) found that the training loss of language models follows a power law in compute: L(C) ∝ C^(-α). The compute-optimal model size and dataset size also scale as power laws in compute, N_opt ∝ C^a and D_opt ∝ C^b, and the goal is to find the specific values of a and b.
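As a quick numerical illustration of what such exponents mean (the values below are made up for the example, not fitted ones):

# Hypothetical exponents, purely for illustration
a, b = 0.5, 0.5
compute_factor = 4            # suppose we quadruple the compute budget C
print(compute_factor ** a)    # optimal model size N_opt grows by 4^0.5 = 2x
print(compute_factor ** b)    # optimal dataset size D_opt grows by 4^0.5 = 2x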

In this article, we will use small transformers to test how N and D should be scaled under a fixed compute budget C.

Test setup

We design a small transformer model, which we call “Tiny Transformer”, with the following architectural properties that determine the parameter count:

  • Model dimension (d_model)
  • MLP dimension (d_mlp)
  • Number of layers (n_layers)

We train the transformer as a causal language model on sequences of length 64 from the WikiText-2 dataset.

To study the scaling effect, we define a grid of models from very small (d_model = 16, 1 layer) to larger (d_model = 128, 4 layers) and combine them with a range of token budgets from 5K to 1M. See the code below:

model_configs = [
    {"d_model": 16,  "d_mlp": 64,   "n_layers": 1},  
    {"d_model": 24,  "d_mlp": 96,   "n_layers": 1},   
    {"d_model": 32,  "d_mlp": 128,  "n_layers": 2},
    {"d_model": 48,  "d_mlp": 192,  "n_layers": 2},
    {"d_model": 64,  "d_mlp": 256,  "n_layers": 3},
    {"d_model": 96,  "d_mlp": 384,  "n_layers": 3},
    {"d_model": 128, "d_mlp": 512,  "n_layers": 4},   
]
# number of tokens (D) we train on — simulated via few steps × batch × seq_len
token_budgets = [5e3, 1e4, 3e4, 5e4, 1e5, 3e5, 5e5, 1e6]  # small for demo
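The configurations above are consumed by a TinyTransformer class and a count_params helper that the later snippets rely on but that are not shown here. A minimal sketch of what they could look like, assuming a standard decoder-style stack built from nn.TransformerEncoderLayer with a causal mask (an assumption, not necessarily the exact architecture):

import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Minimal causal transformer LM parameterized by d_model, d_mlp, n_layers."""
    def __init__(self, vocab_size, d_model, d_mlp, n_layers, n_heads=4, max_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_mlp,
            batch_first=True, norm_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        positions = torch.arange(seq_len, device=input_ids.device)
        x = self.embed(input_ids) + self.pos(positions)
        # Causal mask so each position only attends to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(input_ids.device)
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)

def count_params(model):
    """Total number of trainable parameters -- this is our N."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)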

Using the compute cost C ≈ N × D, our idea is to compare the final loss of each (N, D) pair and, for each compute level C, find the pair (N, D) that reaches the lowest loss: this is the scaling behavior we want to measure.

Implementation and observations

We use the code below to train a fresh model for a fixed number of steps for each unique (N, D) pair and record the results.
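The loop also relies on a collate_fn and a train_one routine that are not shown in the article. One possible implementation, under our assumptions (fixed-length token chunks, AdamW optimizer, next-token cross-entropy loss), is sketched first:

import torch
import torch.nn.functional as F

SEQ_LEN = 64  # sequence length used throughout

def collate_fn(batch):
    """Stack tokenized examples into a (batch, SEQ_LEN) tensor of token ids.
    Assumes the dataset was pre-chunked into fixed-length sequences of SEQ_LEN tokens."""
    return torch.stack([torch.tensor(ex["input_ids"][:SEQ_LEN]) for ex in batch])

def train_one(model, dataloader, steps, device, lr=3e-4):
    """Train for a fixed number of steps and return an average of the final losses."""
    model.to(device).train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    losses = []
    data_iter = iter(dataloader)
    for _ in range(max(steps, 1)):
        try:
            batch = next(data_iter)
        except StopIteration:            # restart the loader if we run out of batches
            data_iter = iter(dataloader)
            batch = next(data_iter)
        batch = batch.to(device)
        logits = model(batch[:, :-1])    # predict the next token at every position
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            batch[:, 1:].reshape(-1)
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    tail = losses[-10:]                  # smooth the final-loss estimate a bit
    return sum(tail) / len(tail)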


import torch
from torch.utils.data import DataLoader

BATCH_SIZE = 16

results = []
device = "cuda" if torch.cuda.is_available() else "cpu"

for cfg in model_configs:
    for D in token_budgets:
        # Train a fresh model for every (N, D) pair so runs do not contaminate each other
        model = TinyTransformer(vocab_size=len(tokenizer), **cfg)
        N_params = count_params(model)
        steps = int(D // (SEQ_LEN * BATCH_SIZE))  # number of steps that consumes ~D tokens
        dataloader = DataLoader(
            tokenized_dataset["train"].shuffle(seed=0),
            batch_size=BATCH_SIZE,
            collate_fn=collate_fn
        )
        avg_loss = train_one(model, dataloader, steps=steps, device=device)
        compute = N_params * D            # C ≈ N × D
        results.append({
            "N": N_params,
            "D": D,
            "C": compute,
            "loss": avg_loss
        })

We then plot the final loss against compute (N × D):
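One way such a plot could be produced from the results list (a sketch assuming pandas and matplotlib, not necessarily the exact plotting code used for the figure):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(results)
plt.figure(figsize=(6, 4))
for n_params, group in df.groupby("N"):
    plt.plot(group["C"], group["loss"], marker="o", label=f"N={n_params:,}")
plt.xscale("log")
plt.xlabel("Compute C = N x D")
plt.ylabel("Final training loss")
plt.legend(fontsize=7)
plt.title("Training loss vs. compute")
plt.show()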

Image by the author: Training loss vs. compute

We have the following important observations:

  1. For small compute budgets, small models trained on most of the available data perform better than large models trained on very little data.
  2. For large compute budgets, larger models are better when enough data is available.
  3. The optimal model size does not grow in direct proportion to the budget. For example, doubling the compute does not call for twice as many parameters.

The figure below shows the compute-efficient frontier of model sizes, that is, the set of model sizes that achieve the lowest loss for a given amount of compute.

Image by the author: The compute-efficient frontier
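The regression in the next section assumes a frontier table: for each compute level, the run with the lowest loss. One way to construct it from the results (a sketch; bucketing compute into logarithmic bins is our own choice, not necessarily the original approach):

import numpy as np
import pandas as pd

df = pd.DataFrame(results)
# Bucket compute into logarithmic bins, then keep the lowest-loss run per bin.
df["C_bin"] = pd.cut(np.log10(df["C"]), bins=10)
frontier = (
    df.loc[df.groupby("C_bin", observed=True)["loss"].idxmin()]
      .sort_values("C")
      .reset_index(drop=True)
)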

“Best” Model

To determine the “best” model, we choose the model size and the number of tokens that minimize the loss under a fixed compute budget.

We assume that both N_opt and D_opt follow power-law relationships in C. To estimate the exponents:

  1. Take logarithms: log(N_opt) = a·log(C) + const, and similarly log(D_opt) = b·log(C) + const.
  2. Fit a linear regression in log-log space; the slope of the fitted line is exactly the power-law exponent.

The following code performs this fit:

import numpy as np
from scipy import stats as st

# Fit a log-log linear regression on the frontier points
a_slope, a_intercept, *_ = st.linregress(np.log(frontier.C), np.log(frontier.N))
b_slope, b_intercept, *_ = st.linregress(np.log(frontier.C), np.log(frontier.D))

In our toy experiment, we found that N_opt ∝ C^0.14 and D_opt ∝ C^0.86. This result may not reflect the whole picture, because we only tested one family of model configurations. Still, we can see that the compute-optimal model size grows as compute grows, but at a decreasing rate; most of the extra budget should go to additional training tokens.

In addition, the fit above implies that the optimal ratio scales as N_opt / D_opt ∝ C^(0.14 - 0.86) = C^(-0.72). This means that as you increase compute, you should add training tokens faster than you grow the model.
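In code, the exponent of this ratio is simply the difference of the two fitted slopes:

# Exponent of N_opt / D_opt as a function of C: a - b
ratio_exponent = a_slope - b_slope      # about 0.14 - 0.86 = -0.72 in our toy run
print(f"N_opt / D_opt scales as C^{ratio_exponent:.2f}")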

Takeaways

Although this is a toy experiment, we can extract several insights:

  1. For a limited budget, a medium-sized model trained on more data can outperform a much larger model trained on limited data.
  2. The compute-optimal model size and dataset size both increase with compute. Don't train a model with too many parameters if you have a small budget.
  3. When the budget increases, look at the optimal ratio N_opt / D_opt first to decide whether you should increase the model size or add more training data.

Conclusion

In this blog tutorial, we presented a toy-scale study of the trade-off between model size and dataset size under a fixed LLM compute budget. The experiment shows that we can find the compute-optimal model size and number of training tokens that deliver the best performance for a given budget, which can help researchers design LLMs wisely and get the most out of their compute.

References

[1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models.

[2] Hoffmann, J., Borgeaud, S., Mensch, A., Guy, A., Osindero, S., Simonyan, K., Elsen, E., … Sifre, L. (2022). Training Compute-Optimal Large Language Models.
