
Why Massive AI Models Actually Generalize Better

Summary: While modern AI systems like ChatGPT and Gemini are incredibly powerful, they remain “black boxes” whose internal mechanisms are poorly understood. Researchers have developed a simplified mathematical “toy model” to peel back the curtain.

Using tools from statistical physics, the team has shown how high-dimensional statistical fluctuations, easily mistaken for destabilizing noise, actually stabilize learning and help explain why large models avoid overfitting, potentially marking a shift from empirical observation toward a fundamental “theory of gravity” for artificial intelligence.

Key Research Findings

  • The Keplerian Phase: AI research is currently in a phase similar to Johannes Kepler’s early planetary observations; we have identified “scaling laws” (performance improves with more data/size), but we lack a “Newtonian” theory explaining why.
  • Neural Networks as Organisms: Deep learning models are not manually engineered algorithms but are described as “organisms grown in a lab,” where intelligent behavior emerges from complex network structures rather than a set of human-written rules.
  • The Overfitting Mystery: In theory, large models should simply memorize their training data rather than learn general patterns (overfitting). In practice, however, AI models often generalize better as they grow. The Harvard team used ridge regression as a toy model to work out this puzzle with exact mathematics.
  • Renormalization Theory: The researchers suggest that the ability to learn without overfitting arises from principles of renormalization. In high-dimensional spaces (millions of variables), microscopic details are absorbed into a few parameters, allowing complex systems to display simple, stable large-scale behavior.
  • Statistical Fluctuations: The study shows that high-dimensional fluctuations, small random variations in data, actually stabilize the learning process rather than destabilizing it, helping the model generalize.

Source: SISSA

Artificial intelligence systems based on neural networks — such as ChatGPT, Claude, DeepSeek or Gemini — are extraordinarily powerful, yet their internal workings remain largely a “black box”.

To better understand how these systems produce their responses, a group of physicists at Harvard University has developed a simplified mathematical model of learning in neural networks that can be analysed mathematically using the tools of statistical physics.

By using simplified “toy models” and renormalization theory from statistical physics, Harvard researchers are uncovering the fundamental mathematical laws that allow large neural networks to stabilize learning and avoid overfitting. Credit: Neuroscience News

“Toy models”, like the one presented in the study just published in the Journal of Statistical Mechanics: Theory and Experiment (JSTAT), provide researchers with a controlled theoretical laboratory for investigating the fundamental mechanisms of neural networks.

A deeper understanding of how these systems work could help design artificial intelligence systems that are more efficient and reliable, while also addressing some of the current challenges.

The laws of AI

The situation is a bit like the moment when Kepler first described the laws governing the motion of the planets. “The way Newton’s laws of gravity were discovered was first by identifying scaling laws between the orbital periods of planets and their radii,” explains Alexander Atanasov, a PhD student in theoretical physics at Harvard University and first author of the new study.

Kepler formulated his laws by observing planetary motion, without fully understanding the mechanisms behind it. Yet that work proved crucial: it later enabled Newton to uncover gravity, leading to a much deeper understanding of the universe.

In studies of deep learning—the branch of artificial intelligence based on neural networks—we may still be in a similar Keplerian phase. Today researchers have identified several empirical laws that describe how neural networks behave, but we still lack a kind of “theory of gravity” explaining why they behave that way.

Scientists, for example, know about the scaling laws. “We know that if we take a model and make it bigger, or give it more data, its performance increases,” explains Cengiz Pehlevan, Associate Professor of Applied Mathematics at Harvard University and senior author of the study.

These laws make performance predictable, but they do not yet reveal the deeper mechanisms behind it. Scaling models up by trial and error is not only inefficient (today’s AI systems consume enormous amounts of energy) but also does little to advance our understanding of how these systems actually work.
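For readers who want a concrete picture of what such an empirical law looks like, one commonly cited parametric form from the broader scaling-law literature (an illustration from that literature, not a formula from this study) writes the loss L as a function of model size N and dataset size D:

\[
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]

Here E is an irreducible error floor and A, B, α, β are constants fitted to experiments. The formula predicts how much performance improves when a model is made bigger or given more data, but it says nothing about why the exponents take the values they do; that “why” is exactly the missing Newtonian layer.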

Neural networks as biological organisms

“Deep learning models are not algorithms written by hand as a set of rules. They’re not engineered manually,” explains Atanasov. “It’s much more similar to an organism being grown in a lab.”

Generative AI chatbots rely on neural networks, a technology that — in a very distant way — resembles the functioning of a biological brain. They consist of many small processing units, called artificial neurons, each performing simple operations but connected together in a complex network.

It is this networked structure that allows “intelligent” behaviour to emerge. Although we know the mathematical operations performed by each individual component, predicting and mechanistically explaining the behaviour of the system as a whole remains extremely difficult: as the number of components grows, the complexity increases rapidly. 
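As a minimal sketch of what those “simple operations” look like (a generic illustration, not the architecture of any particular chatbot), a single artificial neuron multiplies its inputs by learned weights, adds them up, and passes the result through a simple nonlinearity:

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: weighted sum of the inputs, then a ReLU nonlinearity."""
    return max(0.0, float(np.dot(w, x) + b))

# Three inputs, three learned weights and one bias (illustrative values only).
print(neuron(x=np.array([0.5, -1.0, 2.0]),
             w=np.array([0.3, -0.8, 0.1]),
             b=0.05))
```

Each unit on its own is trivial; the difficulty, and the interesting behaviour, comes from wiring millions of them together and training all the weights at once.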

A toy model

Since it is currently impossible to analyse a full-scale neural network with exact mathematical methods, Atanasov and his colleagues chose to work with a simplified model that still captures many key features of more complex systems.

“The model we’re studying is simple enough to be solved mathematically,” explains Jacob Zavatone-Veth, Junior Fellow at the Harvard Society of Fellows and co-author of the study. “At the same time, it reproduces several of the key phenomena seen in large neural networks.”

The toy model used in the study is ridge regression, a variant of linear regression. 

Linear regression is a statistical method used to estimate relationships between variables. For example, if we know the height and weight of 100 people, we can use linear regression to identify a mathematical relationship between the two and estimate the height of a new person based only on their weight.
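A minimal sketch of that height-and-weight example, with made-up numbers purely for illustration:

```python
import numpy as np

# Hypothetical measurements: weights (kg) and heights (cm) of a few people.
weight = np.array([55.0, 62.0, 70.0, 81.0, 90.0])
height = np.array([160.0, 166.0, 171.0, 178.0, 184.0])

# Ordinary least squares fit of the line: height ~ a * weight + b.
a, b = np.polyfit(weight, height, deg=1)

new_weight = 75.0
print(f"predicted height for a {new_weight} kg person: {a * new_weight + b:.1f} cm")
```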

The mystery of overfitting — and why it often doesn’t happen

Ridge regression is a type of regression that helps reduce the phenomenon known as overfitting. When models are trained on large datasets, a neural network — a bit like a very diligent but perhaps not particularly insightful student — may end up simply memorising the training data instead of learning patterns that allow it to generalise and make reliable predictions on new data.
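To make that concrete, here is a small synthetic sketch (our own illustration, not the setup analysed in the paper) with more parameters than training points. Plain least squares can then fit the training data almost perfectly yet predict poorly, while the ridge penalty trades a little training error for better predictions on new data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, d = 40, 200, 100            # fewer training samples than parameters
w_true = rng.standard_normal(d) / np.sqrt(d) # hidden "true" relationship

def make_data(n):
    X = rng.standard_normal((n, d))
    y = X @ w_true + 0.5 * rng.standard_normal(n)   # noisy linear targets
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in (1e-8, 25.0):                     # (almost) no penalty vs. a ridge penalty
    w = ridge_fit(X_tr, y_tr, lam)
    train_mse = np.mean((X_tr @ w - y_tr) ** 2)
    test_mse = np.mean((X_te @ w - y_te) ** 2)
    print(f"lambda = {lam:g}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The penalty lam shrinks the fitted weights toward zero, which is what keeps the diligent-but-unimaginative student from simply memorising the practice test. Part of what makes ridge regression a useful toy model is that this trade-off can be worked out with exact formulas.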

Yet deep learning models often behave in a surprising way. “Despite being extremely large, these models can learn from the data without overfitting,” explains Atanasov, calling it “one of the great mysteries of deep learning.”

At first glance this seems counterintuitive: in theory, larger models should be more prone to overfitting. Instead, the scaling laws show that performance often improves as models grow larger and are trained on more data.

New insights

The new study offers one possible piece of that explanation. According to the researchers, the ability of neural networks to learn without overfitting may arise from principles related to renormalization theory, a framework widely used in statistical physics.

To see why, it helps to consider the dimensionality of the data processed by modern AI systems. In the earlier example of linear regression we considered only two variables — height and weight. Real systems such as ChatGPT, however, operate in spaces with thousands or even millions of variables, making an exact mathematical analysis extremely difficult.

Here ideas from statistical physics become useful. In very high-dimensional data, small random variations — known as statistical fluctuations — naturally appear. Renormalization theory shows that many microscopic details can be effectively absorbed into a small number of parameters, meaning that even very complex systems can display relatively simple large-scale behaviour.
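One standard way this absorption is written down in the high-dimensional regression literature (a paraphrase of the general idea, not a formula quoted from the article) replaces the explicit ridge parameter λ by a single renormalized value κ that also accounts for the finite amount of data. With n training samples and data-covariance eigenvalues λ_i, κ solves the self-consistent equation

\[
\kappa = \lambda + \frac{\kappa}{n} \sum_{i} \frac{\lambda_i}{\lambda_i + \kappa}
\]

All the microscopic details of a particular finite dataset then enter the large-scale description only through this one effective parameter, which is the sense in which the system “renormalizes”.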

Using this framework and their simplified toy model, the researchers show how these high-dimensional fluctuations can actually stabilise learning rather than destabilise it.
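A rough numerical sketch of that claim, under assumptions chosen here purely for illustration (Gaussian data, a power-law covariance spectrum, and the self-consistent equation for κ given above): a quantity computed from one noisy, finite sample of data is closely matched by the same quantity computed from the noise-free covariance once the ridge is renormalized.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 500, 1000, 1e-3                # samples, dimension, explicit ridge
eigs = 1.0 / np.arange(1, d + 1) ** 1.2    # assumed power-law covariance spectrum

# "Experiment": one finite, noisy sample covariance, fluctuations and all.
X = rng.standard_normal((n, d)) * np.sqrt(eigs)    # rows ~ N(0, diag(eigs))
sample_cov = X.T @ X / n
empirical = lam * np.trace(np.linalg.inv(sample_cov + lam * np.eye(d)))

# "Theory": absorb the fluctuations into a renormalized ridge kappa, found by
# iterating kappa = lam + (kappa / n) * sum_i eig_i / (eig_i + kappa).
kappa = lam
for _ in range(1000):
    kappa = lam + (kappa / n) * np.sum(eigs / (eigs + kappa))
prediction = kappa * np.sum(1.0 / (eigs + kappa))

print(f"explicit ridge lambda = {lam:.4f}, renormalized ridge kappa = {kappa:.4f}")
print(f"from the fluctuating sample:   {empirical:.1f}")
print(f"from the renormalized theory:  {prediction:.1f}")
```

If the two printed numbers come out close, the point of the paragraph above is visible in miniature: the random wiggles of one particular dataset need not be tracked individually, because their average effect is captured by shifting a single parameter from λ to κ.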

“This is something we can understand by analysing simpler linear models,” explains Pehlevan, suggesting that the same mechanism may explain why current neural networks avoid overfitting even when they are highly over-parameterised.

The simplified model may also serve another purpose. As Zavatone-Veth notes, it could be a kind of baseline for understanding how learning might behave in very high-dimensional systems.

By studying a model that is simple enough to analyse mathematically, researchers can identify which aspects of learning are likely to be generic—that is, expected to appear across many different neural networks—and which instead depend on the details of a specific model. In this sense, studies like this may help clarify some of the more fundamental principles underlying learning in complex systems.

Key Questions Answered:

Q: Why call it a “toy model”? Is it just a game?

A: A “toy model” is a simplified version of a complex system that is stripped of unnecessary details so it can be solved with exact mathematics. It’s like a physicist studying a “spherical cow” to understand the basics of biology—it provides a controlled laboratory to find the “laws” of learning that apply to the giant black boxes of modern AI.

Q: What is the “mystery of overfitting” exactly?

A: Imagine a student who memorizes every single answer to a practice test but then fails the actual exam because they didn’t understand the underlying concepts. That’s overfitting. AI models are massive enough to “memorize” the whole internet, yet they somehow manage to understand the patterns of language instead. This study suggests physics-based “renormalization” is what keeps them on track.

Q: How does this help make AI better?

A: Currently, building AI is incredibly energy-intensive and involves a lot of trial and error. If we understand the “physics” of how these models grow and learn, we can design them to be more efficient from the start, requiring less data and power to achieve the same “intelligence.”

Editorial Notes:

  • This article was edited by a Neuroscience News editor.
  • Journal paper reviewed in full.
  • Additional context added by our staff.

About this AI research news

Author: Federica Sgorbissa
Source: SISSA
Contact: Federica Sgorbissa – SISSA
Image: The image is credited to Neuroscience News

Original Research: Open access.
“Scaling and renormalization in high-dimensional regression” by Alexander Atanasov, Jacob A Zavatone-Veth and Cengiz Pehlevan. Journal of Statistical Mechanics: Theory and Experiment
DOI: 10.1088/1742-5468/ae4bba


Abstract

Scaling and renormalization in high-dimensional regression

From benign overfitting in overparameterized models to rich power-law scalings in performance, simple ridge regression displays surprising behaviors sometimes thought to be limited to deep neural networks.

This balance of phenomenological richness with analytical tractability makes ridge regression the model system of choice in high-dimensional machine learning. In this paper, we present a unifying perspective on recent results on ridge regression using the basic tools of random matrix theory and free probability, aimed at readers with backgrounds in physics and deep learning.

We highlight the fact that statistical fluctuations in empirical covariance matrices can be absorbed into a renormalization of the ridge parameter. This “deterministic equivalence” allows us to obtain analytic formulas for the training and generalization errors in a few lines of algebra by leveraging the properties of the S-transform of free probability.

From these precise asymptotics, we can easily identify sources of power-law scaling in model performance. In all models, the S-transform corresponds to the train-test generalization gap, and yields an analog of the generalized-cross-validation estimator. Using these techniques, we derive fine-grained bias-variance decompositions for a very general class of random feature models with structured covariates.

This allows us to discover a scaling regime for random feature models where the variance due to the features limits performance in the overparameterized setting. We also demonstrate how anisotropic weight structure in random feature models can limit performance and lead to nontrivial exponents for finite-width corrections in the overparameterized setting.

Our results extend and provide a unifying perspective on earlier models of neural scaling laws.
