Machine learning meets panel data: What experts need to know

Authors: Augusto Cerqua, Marco Letta, Gabriele Pinto

Machine learning (ML) has gained a central role in economics, the social sciences, and business decision-making. In the public sector, ML is increasingly used for prediction policy problems, for example targeting public funding or forecasting local economic conditions. In the private sector, similar tasks arise when firms seek to predict customer churn or develop financial risk assessments. In both domains, better predictions translate into a more efficient allocation of resources and more effective interventions.

To achieve these goals, ML algorithms are increasingly applied to panel data, which are characterized by repeated observations of the same units over multiple time periods. However, ML models were not originally designed for panel data, which have distinct cross-sectional and longitudinal dimensions. When ML is applied to panel data, there is a significant risk of a subtle but serious problem: data leakage. This happens when information that would not be available at prediction time mistakenly enters the model training process, artificially inflating measured predictive performance. In our paper “On the (Mis)Use of Machine Learning with Panel Data” (Cerqua, Letta, & Pinto, 2025), recently published in the Oxford Bulletin of Economics and Statistics, we provide the first systematic evaluation of data leakage in ML with panel data, propose clear guidelines for practitioners, and illustrate the results through an empirical application with US county data.

The leakage problem

Panel data combine two structures: a temporal dimension (units observed over time) and a cross-sectional dimension (multiple units, such as regions or firms). Standard ML practice, which splits the sample randomly into training and test sets, implicitly assumes independent and identically distributed (i.i.d.) data. This assumption is violated when default ML procedures (such as random splitting) are applied to panel data, creating two main types of leakage:

  • Temporal leakage: future information leaks into the model during the training phase, making forecasts look unrealistically accurate. In addition, past information can end up in the test set, so that the "forecasts" are in fact retrospective.
  • Cross-sectional leakage: the same or very similar units appear in both the training and test sets, which means that the model has already "seen" the cross-sectional dimension of the data.

Figure 1 shows how different splitting strategies affect the risk of leakage. A random split at the unit-time level (panel a) is the most problematic, as it introduces both temporal and cross-sectional leakage. The alternatives, splitting by units (panel b), by groups (panel c), or by time (panel d), remove one type of leakage but not the other. As a result, no strategy eliminates the problem completely: the right choice depends on the task at hand (see below), because in some cases one type of leakage is not a concern.
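The four strategies can be compared on a toy panel. The sketch below is illustrative only and is not taken from the paper: the units, groups, periods, and the two leakage checks (`temporal_leakage`, `cross_sectional_leakage`) are hypothetical names and simplified definitions chosen for this example.

```python
import random

# Toy balanced panel (hypothetical): 6 units in 2 groups, observed over 5 periods.
units = ["u1", "u2", "u3", "u4", "u5", "u6"]
group_of = {"u1": "g1", "u2": "g1", "u3": "g1", "u4": "g2", "u5": "g2", "u6": "g2"}
periods = [2000, 2001, 2002, 2003, 2004]
panel = [(u, t) for u in units for t in periods]  # one (unit, period) pair per row


def split(rows, in_test):
    """Partition rows into (train, test) using a boolean rule on each row."""
    train = [r for r in rows if not in_test(r)]
    test = [r for r in rows if in_test(r)]
    return train, test


def temporal_leakage(train, test):
    """True if some training period is not strictly earlier than every test period."""
    return max(t for _, t in train) >= min(t for _, t in test)


def cross_sectional_leakage(train, test, key=lambda u: u):
    """True if a unit (or, with key=group_of.get, a group) is on both sides."""
    return bool({key(u) for u, _ in train} & {key(u) for u, _ in test})


rng = random.Random(0)
random_test = set(rng.sample(panel, 9))  # 9 of 30 rows held out at random

splits = {
    "a) random unit-time": split(panel, lambda r: r in random_test),
    "b) by unit": split(panel, lambda r: r[0] in {"u5", "u6"}),
    "c) by group": split(panel, lambda r: group_of[r[0]] == "g2"),
    "d) by time": split(panel, lambda r: r[1] >= 2003),
}

report = {
    name: (temporal_leakage(tr, te), cross_sectional_leakage(tr, te))
    for name, (tr, te) in splits.items()
}
for name, (temporal, cross) in report.items():
    print(f"{name:20s} temporal={temporal!s:5s} cross-sectional={cross}")

# A unit split can still leave "similar" units (same group) on both sides;
# a group split removes that overlap as well.
similar_units_b = cross_sectional_leakage(*splits["b) by unit"], key=group_of.get)
similar_units_c = cross_sectional_leakage(*splits["c) by group"], key=group_of.get)
```

In this toy setup the random split fails both checks, the unit and group splits fail only the temporal check, and the time split fails only the cross-sectional check, which mirrors the pattern described for the four panels.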

Figure 1 | Training and test sets under different splitting rules

Notes: In this example, the panel data are organized with years as the time variable, counties as the unit variable, and states as the group variable. Image by the authors.

Two types of prediction policy problems

A key insight of the study is that researchers must clearly define their prediction goal in advance. We distinguish two broad classes of prediction policy problems:

1. Cross-sectional prediction: The task is to map outcomes across units within the same period. An example is imputing missing data on GDP per capita across regions when only some regions have reliable estimates. The best split here is at the unit level: different units are assigned to the training and test sets, while all time periods are kept. This eliminates cross-sectional leakage, although temporal leakage remains. But since forecasting is not the objective, this is not a real problem.

2. Sequential forecasting: The task is to predict future outcomes from past data. Here, the split is by time: earlier periods for training, later periods for testing. This avoids temporal leakage but not cross-sectional leakage, which is not a concern since the same units are being forecast over time.

The wrong approach in both cases is the random split at the unit-time level (panel a of Figure 1), which contaminates the results with both types of leakage and produces inflated performance metrics.

Practical Guidelines

To help practitioners, we summarize a set of dos and don'ts for applying ML to panel data:

  • Choose the sample split according to the research question: unit-based for cross-sectional problems, time-based for forecasting.
  • Temporal leakage can arise not only through observations but also through predictors. For forecasting, use only lagged or time-invariant predictors. Using contemporaneous predictors, measured in the same period as the outcome, creates temporal leakage.
  • Adapt cross-validation to panel data. The random k-fold CV offered by most off-the-shelf software packages is inappropriate for forecasting, because it mixes future and past information. Instead, use rolling or expanding windows for forecasting, or CV folds built from whole units or groups for cross-sectional prediction.
  • Ensure that out-of-sample performance is tested on truly unseen data, not on data already encountered during training.
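The guidelines on lagged predictors and panel-aware cross-validation can be sketched in a few lines of plain Python. This is a minimal illustration under assumed names, not the authors' implementation: the row layout (`unit`, `year`, `x`, `y`) and the helpers `add_lagged_predictor`, `expanding_window_folds`, and `unit_folds` are hypothetical.

```python
def add_lagged_predictor(rows, lag=1):
    """Attach x observed `lag` years earlier; drop rows with no lagged value."""
    x_at = {(r["unit"], r["year"]): r["x"] for r in rows}
    out = []
    for r in rows:
        key = (r["unit"], r["year"] - lag)
        if key in x_at:
            out.append({**r, "x_lag": x_at[key]})
    return out


def expanding_window_folds(years, min_train=3, window=None):
    """Yield (train_years, test_year); training years always precede the test year.

    With window=None the training window expands; with an integer it rolls.
    """
    ordered = sorted(set(years))
    for i in range(min_train, len(ordered)):
        start = 0 if window is None else max(0, i - window)
        yield ordered[start:i], ordered[i]


def unit_folds(units, k):
    """Yield (train_units, test_units) so that each unit sits in exactly one test fold."""
    ordered = sorted(set(units))
    for f in range(k):
        test = ordered[f::k]
        train = [u for u in ordered if u not in test]
        yield train, test


# Hypothetical toy panel: 4 units, years 2000-2005.
rows = [
    {"unit": u, "year": t, "x": 10 * i + (t - 2000), "y": float(t - 2000)}
    for i, u in enumerate(["A", "B", "C", "D"])
    for t in range(2000, 2006)
]

lagged = add_lagged_predictor(rows)  # year 2000 is dropped: no lag available
time_folds = list(expanding_window_folds(r["year"] for r in lagged))
group_folds = list(unit_folds((r["unit"] for r in lagged), k=2))

for train_years, test_year in time_folds:
    print("forecasting fold:", train_years, "->", test_year)
for train_units, test_units in group_folds:
    print("cross-sectional fold:", train_units, "->", test_units)
```

The time-based folds suit sequential forecasting, while the unit-based folds suit cross-sectional prediction; in both cases a model would be fitted only on rows matching the training side of each fold.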

An Empirical Application

To illustrate these issues, we analyze a balanced panel of 3,058 US counties from 2000 to 2019, focusing on sequential forecasting. We consider two tasks: a regression problem (forecasting per capita income) and a classification problem (forecasting whether income will decrease in the following year).
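A forward-looking classification target of this kind can be built as follows. This is a sketch under assumptions: the helper `label_income_decrease`, the strict-inequality definition of a "decrease", and the toy income values are all hypothetical and are not drawn from the paper's data.

```python
def label_income_decrease(income):
    """Map {(unit, year): income} to {(unit, year): 0 or 1}.

    The label at year t is 1 if income in t+1 is strictly lower than in t
    (an assumed definition). The last observed year of each unit gets no
    label, because the following year's income is unknown.
    """
    labels = {}
    for (unit, year), value in income.items():
        next_value = income.get((unit, year + 1))
        if next_value is not None:
            labels[(unit, year)] = int(next_value < value)
    return labels


# Hypothetical per capita income for two counties.
income = {
    ("c1", 2000): 100.0, ("c1", 2001): 95.0, ("c1", 2002): 97.0,
    ("c2", 2000): 50.0, ("c2", 2001): 50.0, ("c2", 2002): 48.0,
}
labels = label_income_decrease(income)
print(labels)
```

Because the label stored at year t describes what happens in t+1, the predictors paired with it should be dated t or earlier, and a time-based split should assign each row according to the year being forecast.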

We estimate hundreds of models, varying the split strategy, the use of contemporaneous predictors, the inclusion of lagged outcomes, and the algorithm (random forest, XGBoost, logit, and OLS). This comprehensive design allows us to quantify how much leakage inflates performance. Figure 2 below reports the main findings.

Panel A of Figure 2 shows forecasting performance for the classification task. Random splits yield very high accuracy, but this is an illusion: the model has already seen similar data during training.

Panel B shows forecasting performance for the regression task. Again, random splits make the models look much better than they really are, while time-based splits show much lower, but realistic, accuracy.
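The mechanism behind this gap can be reproduced in a small simulation. The example below is a toy illustration, not the paper's design or data: the data-generating process (a unit-specific level plus a common linear trend), the nearest-year predictor, and all parameter values are assumptions made for this sketch, and the random split holds out four years per unit as a simplification.

```python
import random

rng = random.Random(42)
n_units, n_years, trend = 50, 20, 1.0

# Simulated outcome: unit-specific level + common linear trend + small noise.
level = {u: rng.gauss(0, 5) for u in range(n_units)}
panel = {
    (u, t): level[u] + trend * t + rng.gauss(0, 0.1)
    for u in range(n_units)
    for t in range(n_years)
}


def nearest_year_prediction(train, unit, year):
    """Predict with the same unit's training outcome that is closest in time."""
    candidates = [(abs(t - year), y) for (u, t), y in train.items() if u == unit]
    return min(candidates)[1]


def partition(test_keys):
    """Split the panel into train and test dictionaries."""
    train = {k: v for k, v in panel.items() if k not in test_keys}
    test = {k: v for k, v in panel.items() if k in test_keys}
    return train, test


def mse(train, test):
    """Mean squared error of the nearest-year predictor on the test set."""
    errors = [(y - nearest_year_prediction(train, u, t)) ** 2 for (u, t), y in test.items()]
    return sum(errors) / len(errors)


# Random split: hold out 4 randomly chosen years per unit (future years stay in training).
random_keys = {(u, t) for u in range(n_units) for t in rng.sample(range(n_years), 4)}
# Time split: hold out the last 4 years of every unit.
time_keys = {(u, t) for (u, t) in panel if t >= n_years - 4}

mse_random = mse(*partition(random_keys))
mse_time = mse(*partition(time_keys))
print(f"random split MSE: {mse_random:.2f}")
print(f"time split MSE:   {mse_time:.2f}")
```

Under the random split the predictor can borrow values from adjacent years, including later ones, so its error looks small; under the time split it must extrapolate from the last training year, which is the situation a real forecast faces.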

Figure 2 | Temporal leakage in the forecasting problem

Panel A – Classification task

Panel B – Regression task

In the paper, we also show that the overestimation of accuracy is larger in years marked by distribution shifts and structural breaks, such as the Great Recession, which makes the results especially misleading for policy purposes.

Why it matters

Data leakage is more than a technical pitfall; it has real-world consequences. In policy applications, a model that appears very accurate during validation may break down once deployed, leading to misallocated resources or misguided targeting. In business contexts, the same issue can translate into poor investment decisions, ineffective customer targeting, or false confidence in risk assessments. The danger is especially serious where machine learning models are intended to serve as early-warning systems, where misplaced trust in inflated performance can lead to costly failures.

Conversely, well-designed models, even if they are less accurate on paper, provide reliable and trustworthy predictions that can meaningfully inform decision-making.

Takeaway

ML has the potential to transform decision-making in both policy and business, but only if it is used correctly. Panel data offer rich opportunities, yet they are especially vulnerable to data leakage. To extract reliable insights, practitioners must align their ML workflow with the prediction goal, account for both temporal and cross-sectional structures, and use validation strategies that prevent overoptimistic assessments and an illusion of high accuracy. When these principles are followed, models avoid the trap of inflated performance and instead provide guidance that genuinely helps policymakers and businesses make better decisions. Given the rapid adoption of ML with panel data in both the public and private domains, addressing these issues is now a pressing priority for applied research.

Reference

A. Cerqua, M. Letta, and G. Pinto, “On the (Mis)Use of Machine Learning with Panel Data”, Oxford Bulletin of Economics and Statistics (2025): 1-13, https://doi.org/10.11111/EBES.70019.
