Ensembles of Ensembles: A Guide to Stacking

Machine learning is a complex game of engineering. The difference a small improvement in error or loss makes can be measured in the millions of dollars it brings a team willing to do what it takes to be the best. Not only does every part of the system need to be excellent, the way it is all put together needs to be excellent as well.
State of the art
Gradient-boosted models have historically been the most competitive models for tabular and time series forecasting problems. They are ensemble methods: they combine the outputs of many weak learners to produce a final prediction that is better than any individual prediction alone. But the state of the art is starting to shift. Pre-trained models such as TabPFN for tabular data and Chronos for time series are starting to match or exceed tuned gradient-boosted models on some benchmarks. In a way these are also ensembles; instead of aggregating multiple predictions, they aggregate the many datasets they were pre-trained on. The intuition behind this applies widely, and it can be taken further.
There is now a situation where two completely different model families are fighting for the top spot on ML leaderboards, closely followed by dozens of other architectures with their own strengths and weaknesses. Given that they all learn in different ways, and often from different data, they can be used together in a complementary ensemble that preserves many of the strengths while cancelling out many of the weaknesses. Done right, this usually yields better performance and a more robust model.
Assertions and assumptions
The same techniques used to determine which data are important for making a particular prediction can also be used to determine which models are important for making a particular prediction. Just as the combination of weak learners inside a gradient-boosted model is better than any single learner, a combination of models is better than any single model.
Throughout this discussion, there is a standing assumption that all relevant data are used in the modeling process. In other words, all relevant information is known at time t (or decision time). In data science this is not a trivial assumption to make, and making it falsely invalidates the claims made here. As it turns out, most work in data science is just trying to satisfy this assumption by getting the right data into the right format. Also note that the covariates/features fed to the models are not fixed: different architectures perform better with different features, and some may not be able to handle certain types of data at all (this will be a particularly relevant point for fine-tuning the pre-trained tabular/time series models, which are still maturing).
Multi-Layer Stacking
A general approach that works for time series forecasting or tabular regression/classification problems
Layer 1
There are many ways to set up a stacking/blending pipeline, and it makes a lot of sense to organize the steps in layers. The first layer is a set of base models (e.g. CatBoost, MLPs, TabPFN, etc.).
For tabular problems, these can be trained with bootstrap aggregation (bagging), where new training sets are created by sampling from the original training set with replacement. Individual models are then trained on each new set and their predictions are averaged. Hyperparameter optimization can also be performed for each of these models, although this is computationally more expensive since each model is retrained multiple times for each sample (or “bag”). To reduce training time, a hyperparameter optimization framework like Optuna can be used so that trials that are not performing well are pruned early and promising regions of the search space are found quickly via Bayesian optimization. Alternatively, several hyperparameter presets can be applied to each model based on what tends to work best for that particular model on similar datasets. Different copies of a model with different presets can be averaged together to “represent” a single model, or they can be kept as distinct variants of the model and used in the next layer.
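As a concrete illustration, here is a minimal sketch of bagging a base model, assuming scikit-learn-style estimators and NumPy arrays; the function names and data variables are hypothetical, not part of any particular library.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingRegressor

def train_bagged(base_model, X, y, n_bags=10, seed=0):
    """Fit one copy of `base_model` per bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_bags):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        models.append(clone(base_model).fit(X[idx], y[idx]))
    return models

def predict_bagged(models, X):
    """Average the predictions of all bagged copies."""
    return np.mean([m.predict(X) for m in models], axis=0)

# Hypothetical usage:
# bag = train_bagged(GradientBoostingRegressor(), X_train, y_train)
# y_valid_pred = predict_bagged(bag, X_valid)
```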
For time series forecasting, traditional bootstrapping becomes problematic. Since the time dimension must be respected, the data cannot be randomly shuffled and resampled to create new training sets. Instead, cross-validation should be done with a rolling (or expanding) window. In this process a new model is trained to predict a validation window whose timestamps come strictly after those in its training set. After training and evaluation, that validation window is added to the training set and the process is repeated for the next time slice (the next validation window). This gives a good idea of how well the model performs over time, but models are rarely ensembled at this step. Since recent time series data are usually the most informative, only the model trained in the last step is kept for forecasting. However, the predictions from earlier windows can still be used in the next layer.
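A minimal sketch of that expanding-window loop, assuming the rows of X and y are already sorted in time order and `model_factory` (a hypothetical callable) returns a fresh scikit-learn-style estimator:

```python
import numpy as np

def expanding_window_predictions(model_factory, X, y, n_windows=5, horizon=24):
    """Train on everything strictly before each validation window, predict
    that window, then absorb it into the training set and repeat."""
    n = len(X)
    start = n - n_windows * horizon            # first validation window starts here
    preds, idxs = [], []
    model = None
    for w in range(n_windows):
        lo, hi = start + w * horizon, start + (w + 1) * horizon
        model = model_factory()
        model.fit(X[:lo], y[:lo])              # only past data
        preds.append(model.predict(X[lo:hi]))
        idxs.append(np.arange(lo, hi))
    # `model` (the one fit on the most data) is kept for forecasting; the
    # collected window predictions become training material for layer 2
    return np.concatenate(idxs), np.concatenate(preds), model
```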
Layer 2
After training the base models, their metrics on the training set and the validation set are available. At all intermediate steps, the test set should remain untouched. In layer 2, new strategies can be applied because the performance of each model is now known, and strong predictions have (hopefully) already been made.
For tabular problems, a second round of bagged models can be trained with the predictions of the layer 1 models added as features. If a base model does not perform well in validation, it can be dropped at this step.
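A minimal sketch of building those layer 2 features from out-of-fold predictions, again assuming scikit-learn estimators and NumPy arrays (the variable names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge

def add_layer1_features(base_models, X, y, cv=5):
    """Append each base model's out-of-fold predictions as extra columns, so
    layer 2 never sees a prediction made on rows the base model trained on."""
    oof_preds = [cross_val_predict(m, X, y, cv=cv) for m in base_models]
    return np.column_stack([X] + oof_preds)

# Hypothetical usage:
# base_models = [GradientBoostingRegressor(), Ridge()]
# X_layer2 = add_layer1_features(base_models, X_train, y_train)
# layer2_model = GradientBoostingRegressor().fit(X_layer2, y_train)
```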
For time series, the same strategy cannot be applied directly, since the layer 1 models never make predictions for the entire training set. The earliest validation window has no preceding data to train on, so there are no layer 1 predictions for the start of the training set, and a model trained on anything later cannot be used to backfill those predictions without leaking future information. The caveat is that if the layer 2 model can handle missing values, or if only the subset of the training set that has predictions is used, then a full retraining (on the training data plus the layer 1 predictions) can still be performed at this layer. While this is possible, and perhaps useful, there are better ways.
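If that route is taken anyway, here is a small sketch of attaching the partial layer 1 predictions, assuming a pandas DataFrame indexed by timestamp (the names here are hypothetical):

```python
import pandas as pd

def attach_window_predictions(train: pd.DataFrame, oof: pd.Series) -> pd.DataFrame:
    """Join the layer 1 rolling-window predictions onto the full training set.
    Timestamps before the first validation window get NaN, so the layer 2
    model must either tolerate missing values or be fit on `.dropna()` rows."""
    out = train.copy()
    out["layer1_pred"] = oof.reindex(train.index)  # NaN where no prediction exists
    return out
```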
Since the performance of the models is known and validation predictions are in hand, combinations of the base model predictions can be used as new predictions. There are several ways to do this (a sketch of the greedy approach appears below):
- Simply average them all
- Weight each prediction by its validation performance and take the weighted average
- Take the linear combination of all the predictors that minimizes the loss via ordinary least squares
- Build a greedy ensemble, starting with the best-performing model and gradually adding weight to the other models until performance stops improving
- If that's not enough, an entire model can be trained on just the base model predictions (this is only really useful when there are enough out-of-fold predictions)
Note that the validation windows of layer 1 become the training set of layer 2, and only the last validation window of layer 1 is used as the validation set of layer 2. Instead of trying to figure out which single method is best, layer 2 should try all of them, since these steps are computationally cheap.
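As an illustration of the greedy option, here is a minimal Caruana-style ensemble selection sketch; it assumes lower metric values are better, and the prediction arrays and metric in the usage comment are hypothetical placeholders:

```python
import numpy as np

def greedy_ensemble(preds, y_val, metric, max_iters=50):
    """Start from the best single model, then repeatedly add (with
    replacement) whichever model's predictions most improve the validation
    metric of the running average; stop when nothing improves."""
    preds = np.asarray(preds)                      # shape: (n_models, n_samples)
    scores = [metric(y_val, p) for p in preds]
    chosen = [int(np.argmin(scores))]
    best = scores[chosen[0]]
    for _ in range(max_iters):
        blend = preds[chosen].mean(axis=0)
        k = len(chosen)
        trials = [(metric(y_val, (blend * k + p) / (k + 1)), i)
                  for i, p in enumerate(preds)]
        score, i = min(trials)
        if score >= best:
            break
        best, chosen = score, chosen + [i]
    weights = np.bincount(chosen, minlength=len(preds)) / len(chosen)
    return weights, best

# Hypothetical usage:
# from sklearn.metrics import mean_squared_error
# weights, val_score = greedy_ensemble([p_catboost, p_mlp, p_tabpfn],
#                                      y_val, mean_squared_error)
```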
Layer 3
Time to stack more layers… The tabular approach yielded predictions from another round of bagged models, and the time series approach yielded predictions from the different combination techniques. Layer 3 simply applies one of the combination techniques described in the time series part of layer 2 to produce the final meta-model. This is the model that should be evaluated on the test set, although it is a good idea to confirm that it outperforms the base models there. The stacked model should almost always win, and it will not be overly sensitive to bad predictions from any one model, since bad predictions can be down-weighted and tend to average out. Conversely, if one model picks up on a pattern that the others miss, a multi-layer stack can learn to lean on those predictions. The only cases where this doesn't help are when one model is consistently better across the board, which is quite rare, or when one or more base models are so bad that they should be removed entirely.
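A small sketch of that final check, assuming the layer 3 weights come from a combination step like the one above and that the test predictions, labels, and model names are hypothetical placeholders:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def final_forecast(test_preds, weights):
    """Weighted blend of the base model test-set predictions (layer 3 output)."""
    return np.average(np.asarray(test_preds), axis=0, weights=weights)

def sanity_check(test_preds, weights, y_test, names):
    """Confirm the stacked model beats every base model on the held-out set."""
    blend = final_forecast(test_preds, weights)
    print("stack:", mean_squared_error(y_test, blend))
    for name, p in zip(names, test_preds):
        print(f"{name}:", mean_squared_error(y_test, p))
```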
Was it all worth it?
Maybe. The downside is that it requires training many models instead of just one. When datasets are large enough, training and inference time can quickly become a bottleneck for some applications. The counter to this is that the process is highly parallelizable, and faster classical algorithms can be used in place of deep learning if needed. LightGBM is an order of magnitude faster than deep learning and is often still competitive.
This philosophy of combining ensembles in machine learning has been popularized and fully embraced by AutoGluon. In fact, it is the de facto standard in their AutoML offering, and their team has contributed a great deal to the open-source community and to the bleeding edge of research in this field. As the pre-training frontier of tabular/time series transformers has yet to be fully explored, expect additional model variants in the future to improve this technique further.
There is good reason to believe that this philosophy will continue to win, as it has in many other domains:
- A democracy is an ensemble of elected officials, and elected officials represent their constituents (in theory at least). Although not perfect, it is still the best system we have for now.
- Medical diagnosis improves with more perspectives. Combining examinations from multiple radiologists, pathologists, or specialists reduces diagnostic error rates. Each doctor may catch different patterns or edge cases, and their collective judgment is more reliable than any individual assessment.
- Even stock markets are an aggregation of beliefs about the future. Although historically the information contained in the movements of these markets has not been directly accessible to the masses, prediction markets and forecasting platforms are changing this.
- In the latest release of Claude Code (February 2026), Anthropic introduced collaborative “agent teams” where multiple instances of Claude work together on tasks, coordinating through shared task lists and peer-to-peer communication. xAI uses a similar multi-agent approach with Grok 4 Heavy/Grok 4.20, where independent agents work in parallel and “verify” each other's solutions before converging on a final answer.
It turns out that collaboration is the way to go. Ensembles of ensembles appear repeatedly in the best human-built systems, and machine learning is no different. In the age of machine intelligence, ignoring this idea will not be an option.



