Are Foundation Models Ready for Your Production Tabular Data?

Foundation models are large-scale AI models trained on a vast and diverse range of data, such as audio, text, images, or a combination of them. Because of this versatility, foundation models are revolutionizing Natural Language Processing, Computer Vision, and even Time Series analysis. Unlike traditional AI algorithms, foundation models offer out-of-the-box predictions without the need to train from scratch for every specific application. They can also be adapted to more specific tasks through fine-tuning.
In recent years, we have seen an explosion of foundation models applied to unstructured data and time series. These include OpenAI’s GPT series and BERT for text tasks, CLIP and SAM for object detection, classification, and segmentation, and PatchTST, Lag-Llama, and Moirai-MoE for Time Series forecasting. Despite this growth, foundation models for tabular data remain largely unexplored due to several challenges. First, tabular datasets are heterogeneous by nature. They have variations in the feature types (Boolean, categorical, integer, float) and different scales in numerical features. Tabular data also suffer from missing information, redundant features, outliers, and imbalanced classes. Another challenge in building foundation models for tabular data is the scarcity of high-quality, open data sources. Often, public datasets are small and noisy. Take, for instance, the tabular benchmarking website openml.org. Here, 76% of the datasets contain fewer than 10 thousand rows [2].
Despite these challenges, several foundation models for tabular data have been developed. In this post, I review most of them, highlighting their architectures and limitations. Some questions I want to answer are: What’s the current status of foundation models for tabular data? Can they be applied in production, or are they only good for prototyping? Are foundation models better than classic Machine Learning algorithms like Gradient Boosting? In a world where tabular data represents most data in companies, knowing which foundation models are being implemented and their current capabilities is of great interest to the data science community.
TabPFN
Let’s start by introducing the most well-known foundation model for small-to-medium-sized tabular data: TabPFN. This algorithm was developed by Prior Labs. The first version dropped in 2022 [1], but updates to its architecture were released in January of 2025 [2].
TabPFN is a Prior-Data Fitted Network, which means it uses Bayesian inference to make predictions. There are two important concepts in Bayesian inference: the prior and the posterior. The prior is a probability distribution reflecting our beliefs or assumptions about parameters before observing any data. For instance, the probability of getting a 6 with a die is 1/6. The posterior is the updated belief or probability distribution after observing data. It combines your initial assumptions (the prior) with the new evidence. For example, you might encounter that the probability of getting a 6 with a die is actually not 1/6, because the die is biased.
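To make the prior-to-posterior update concrete, here is a minimal sketch in plain Python, purely illustrative and unrelated to TabPFN's internals. It updates the belief about whether a die is fair or biased toward 6 after observing a few rolls; the bias value and the roll sequence are made up.

```python
# Toy Bayesian update: is the die fair or biased toward 6?
fair_p6, biased_p6 = 1 / 6, 1 / 2          # P(rolling a 6) under each hypothesis
prior = {"fair": 0.5, "biased": 0.5}       # belief before seeing any rolls

rolls = [6, 6, 3, 6, 6]                    # observed data

def likelihood(p6, rolls):
    """Probability of the observed rolls, assuming non-6 faces are equally likely."""
    out = 1.0
    for r in rolls:
        out *= p6 if r == 6 else (1 - p6) / 5
    return out

unnormalized = {
    "fair": prior["fair"] * likelihood(fair_p6, rolls),
    "biased": prior["biased"] * likelihood(biased_p6, rolls),
}
evidence = sum(unnormalized.values())
posterior = {h: v / evidence for h, v in unnormalized.items()}
print(posterior)  # the belief shifts strongly toward the "biased" hypothesis
```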
In TabPFN, the prior is defined by 100 million synthetic datasets that were carefully designed to capture a wide range of potential scenarios that the model might encounter. These datasets contain a wide range of relationships between features and targets (you can find more details in [2]).
In TabPFN, the posterior is the predictive distribution over the target of a new data point. This posterior is approximated by training TabPFN's architecture on the synthetic datasets.
Model architecture
The TabPFN architecture is shown in the following figure:

The left side of the diagram shows a typical tabular dataset. It’s composed of a few training rows with input features (x1, x2) and their corresponding target values (y). It also includes a single test row, which has input features but a missing target value. The network’s goal is to predict the target value for this test row.
The TabPFN architecture is composed of a series of 12 identical layers. Each layer contains two attention mechanisms. The first is a 1D feature attention, which learns the relationships between the features of the dataset. It essentially allows the model to “attend” to the most relevant features for a given prediction. The second attention mechanism is the 1D sample attention. This module looks at the same feature across all other samples. Sample attention is the key mechanism that enables In-Context Learning (ICL), where the model learns from the provided training data without needing any backpropagation. These two attention mechanisms enable the architecture to be invariant to the order of both samples and features.
The output of the 12 layers is a vector that is fed into a Multilayer Perceptron (MLP). The MLP is a small neural network that transforms the vector into a final prediction. For a classification task, the final prediction is not a class label. Instead, the MLP outputs a vector of probabilities, where each value represents the model’s confidence that the input belongs to a specific class. For example, for a three-class problem, the output might be [0.1, 0.85, 0.05]. This means the model is 85% confident that the input belongs to the second class.
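As a quick illustration, here is how such a probability vector is typically turned into a hard label and a confidence score (this mirrors what predict usually does on top of predict_proba in Scikit-learn-style APIs; the numbers are the example above):

```python
import numpy as np

proba = np.array([0.10, 0.85, 0.05])        # class probabilities for one test row

predicted_class = int(np.argmax(proba))     # index of the most likely class -> 1
confidence = float(proba[predicted_class])  # -> 0.85
print(predicted_class, confidence)
```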
For regression tasks, the MLP’s output layer is modified to produce a continuous value instead of a probability distribution over discrete classes.
Usage
Using TabPFN is quite easy! You can install it via pip or from the source. There is great documentation provided by Prior Labs that links to the different GitHub repositories, where you can find Colab Notebooks to explore this algorithm right away. The Python API mirrors that of Scikit-learn, with fit and predict functions.
The fit function in TabPFN doesn't train the model in the classical Machine Learning sense. Instead, it uses the training dataset as context, because TabPFN leverages ICL. In this approach, the model combines its existing knowledge with the training samples to recognize patterns and generate better predictions; the training data simply guides the model's behavior instead of updating its weights.
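Here is a minimal classification sketch using the public fit/predict API. I use a small Scikit-learn dataset so the example stays within TabPFN's pre-training limits; check Prior Labs' documentation for installation details and additional options.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from tabpfn import TabPFNClassifier  # pip install tabpfn

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = TabPFNClassifier()       # no gradient-based training happens here
clf.fit(X_train, y_train)      # "fit" stores the training data as ICL context
y_pred = clf.predict(X_test)   # predictions come from a single forward pass
print(accuracy_score(y_test, y_pred))
```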
TabPFN has a great ecosystem where you can also find several utilities to interpret your model through SHAP. It also offers tools for outlier detection and the generation of synthetic tabular data. You can even combine TabPFN with traditional models like Random Forest in hybrid approaches to enhance predictions. All these functionalities can be found in the TabPFN GitHub repository.
Remarks and limitations
After testing TabPFN on a large private dataset containing both numerical and categorical features, here are some takeaways:
- Make sure you preprocess the data first. Categorical columns must contain only strings; otherwise, the code raises an error (see the sketch after this list).
- TabPFN is a great tool for small- to medium-sized datasets, but not for large tables. If you work with big datasets (i.e., more than 10,000 rows, over 500 features, or more than 10 classes), you’ll hit the pre-training limits, and the prediction performance will be affected.
- Be aware that you may encounter CUDA errors that are difficult to debug.
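For the first point, a small preprocessing step like the following (using a hypothetical pandas DataFrame, shown only to illustrate the cast) avoids mixed-type categorical columns:

```python
import pandas as pd

# Hypothetical table with a categorical column containing mixed types
df = pd.DataFrame({"color": ["red", 3, "blue"], "amount": [1.2, 5.0, 0.7]})

categorical_cols = ["color"]                             # columns treated as categorical
df[categorical_cols] = df[categorical_cols].astype(str)  # cast every value to string
```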
If you are interested in seeing how TabPFN performs on different datasets compared to classical boosted methods, I highly recommend this excellent post from Bahadir Akdemir:
TabPFN: How a Pretrained Transformer Outperforms Traditional Models on Tabular Data (Medium blog post)
CARTE
The second foundation model for tabular data leverages graph structures to create an interesting model architecture: I’m talking about the Context Aware Representation of Table Entries, or CARTE model [3].
Unlike images, where an object keeps its characteristic features no matter where it appears in the picture, numbers in tabular data have no meaning on their own; context must be added through their respective column names. One way to account for both the numbers and their respective column names is by using a graph representation of the corresponding table. The SODA team used this idea to develop CARTE.
CARTE transforms a table into a graph structure by converting each row into a graphlet. A row in a dataset is represented as a small, star-like graph where each row value becomes a node connected to a center node. The column names serve as the edges of the graph.

For categorical row values and column names, CARTE uses a d-dimensional embedding generated from a language model. In this way, prior data preprocessing, such as categorical encoding on the original table, is not needed.
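To give an idea of what a graphlet looks like, here is a toy construction with networkx. This is my own illustration of the idea, not CARTE's actual preprocessing code; in CARTE, the nodes and edges additionally carry language-model embeddings.

```python
import networkx as nx
import pandas as pd

# One table row becomes a small star-like graph: a center node, one node per
# cell value, and the column name stored on the connecting edge.
row = pd.Series({"city": "London", "population": 8_900_000, "country": "UK"})

graphlet = nx.Graph()
graphlet.add_node("center")
for column, value in row.items():
    graphlet.add_node(value)
    graphlet.add_edge("center", value, column=column)

print(graphlet.edges(data=True))
```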
Model architecture
Each of the created graphlets contains node (X) and edge (E) features. These features are passed to a graph-attentional network that adapts the classical Transformer encoder architecture. A key component of this graph-attentional network is its self-attention layer, which computes attention from both the node and edge features. This allows the model to understand the context of each data entry.

The model architecture also includes an Aggregate & Readout layer that acts on the center node. During pretraining, the outputs are used to compute a contrastive loss.
CARTE was pretrained on a large knowledge base called YAGO3 [4]. This knowledge base was built from sources like Wikidata and contains over 18.1 million triplets of 6.3 million entries.
Usage
The GitHub repository for CARTE is under active development. It contains a Colab Notebook with examples of how to use this model for regression and classification tasks. According to this notebook, installation is straightforward via pip install. Like TabPFN, CARTE uses the Scikit-learn interface (fit/predict) to make predictions on unseen data.
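The workflow in that notebook roughly follows the pattern below. Treat this as a hedged sketch rather than a copy of the official API: the import names, the graph transformer, and its arguments are assumptions based on the repository and may differ in the current release, so defer to the official notebook.

```python
# Hedged sketch of the CARTE workflow; names and signatures are assumptions.
# X_train, X_test (pandas DataFrames) and y_train (Series) are assumed to exist.
from carte_ai import Table2GraphTransformer, CARTERegressor  # pip install carte-ai

preprocessor = Table2GraphTransformer()            # turns each row into a graphlet
X_train_graphs = preprocessor.fit_transform(X_train, y=y_train)
X_test_graphs = preprocessor.transform(X_test)

model = CARTERegressor()                           # Scikit-learn-style estimator
model.fit(X_train_graphs, y_train)
y_pred = model.predict(X_test_graphs)
```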
Limitations
According to the CARTE paper [3], this algorithm has some major advantages, such as being robust to missing values. Additionally, entity matching is not required when using CARTE. Because it uses a language model to embed strings and column names, this algorithm can handle entities that might appear under different names, for instance, “Londres” instead of “London”.
While CARTE performs well on small tables (fewer than 2,000 samples), tree-based models can be more effective on larger datasets. Additionally, for large datasets, CARTE might be computationally more intensive than traditional Machine Learning models.
For more details on the experiments conducted by the developers of this foundational model, here’s a great blog written by Gaël Varoquaux:
CARTE: toward table foundation models
TabuLa-8b
The third foundation model we’ll review was built by fine-tuning the Llama 3-8B language model. According to the authors of TabuLa-8b, language models can be trained to perform tabular prediction tasks by serializing rows as text, converting the text to tokens, and then using the same loss function and optimization methods used in language modeling [5].

Figure: rows are serialized as text, with each example ending in an <|endinput|> token. Image taken from [5].
TabuLa-8b’s architecture features an efficient attention masking scheme called the Row-Causal Tabular Masking (RCTM) scheme. This masking allows the model to attend to all previous rows from the same table in a batch, but not to rows from other tables. This structure encourages the model to learn from a small number of examples within a table, which is crucial for few-shot learning. For detailed information on the methodology and results, check out the original paper by Josh Gardner et al. [5].
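To make the serialization idea concrete, here is a toy example of turning a row into text. The template and column names are made up; the actual prompt format and special tokens used by TabuLa-8b are defined in [5] and in the rtfm code.

```python
# Toy row-to-text serialization (illustrative template, not TabuLa-8b's exact format)
row = {"age": 34, "occupation": "teacher", "country": "Spain"}
target_column = "income_bracket"

serialized = (
    " ".join(f"The {col} is {val}." for col, val in row.items())
    + f" What is the {target_column}?"
)
print(serialized)
# The age is 34. The occupation is teacher. The country is Spain. What is the income_bracket?
```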
Usage and limitations
The GitHub repository rtfm contains the code for TabuLa-8b. In its Notebooks folder, you will find an example of how to run inference. Note that, unlike TabPFN or CARTE, TabuLa-8b doesn’t have a Scikit-learn interface. If you want to make zero-shot predictions or further fine-tune the existing model, you need to run the Python scripts developed by the authors.
According to the original paper, TabuLa-8b performs well in zero-shot prediction tasks. However, using this model on large tables, with many samples, a large number of features, or long column names, can be limiting, as this information can quickly exceed the LLM’s context window (the Llama 3-8B model has a context window of 8,000 tokens).
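A rough back-of-the-envelope calculation shows how quickly the context window fills up. The tokens-per-cell figure below is an assumption for illustration, not a measurement:

```python
context_window = 8_000    # tokens available in Llama 3-8B
n_features = 50           # columns in a hypothetical wide table
tokens_per_cell = 8       # assumed cost of serializing one "column is value" cell

tokens_per_row = n_features * tokens_per_cell
max_context_rows = context_window // tokens_per_row
print(tokens_per_row, max_context_rows)   # 400 tokens per row -> only ~20 example rows
```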
TabDPT
The last foundation model we’ll cover in this blog is the Tabular Discriminative Pre-trained Transformer, or TabDPT for short [6]. Like TabPFN, TabDPT relies on ICL, but it combines ICL with self-supervised learning and is trained on real-world data (the authors used 123 public tabular datasets from OpenML). According to the authors, the model can generalize to new tasks without additional training or hyperparameter tuning.
Model architecture
TabDPT uses a row-based transformer encoder similar to TabPFN’s, where each row serves as a token. To handle training tables with different numbers of features (F), the authors standardize the feature dimension to a fixed size Fmax, via padding when F < Fmax or dimensionality reduction when F > Fmax.
This foundation model leverages self-supervised learning, essentially learning by itself without needing a labeled target for every task. During training, it randomly picks one column in a table to be the target and then learns to predict its values based on the other columns. This process helps the model understand the relationships between different features. Now, when training on a large dataset, the model doesn’t use the entire table at once. Instead, it finds and uses only the most similar rows (called the “context”) to predict a single row (the “query”). This method makes the training process faster and more effective.
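The following sketch illustrates these two ideas, picking a random column as the self-supervised target and retrieving the most similar rows as context, using numpy and Scikit-learn on synthetic data. It is only an illustration of the training setup described above, not TabDPT's actual code:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
table = rng.normal(size=(1_000, 12))          # synthetic table: 1,000 rows, 12 columns

# 1) Self-supervised target: pick one column at random and predict it from the rest
target_col = int(rng.integers(table.shape[1]))
X = np.delete(table, target_col, axis=1)
y = table[:, target_col]

# 2) Retrieval: use only the rows most similar to the query as the ICL context
query = X[0]
knn = NearestNeighbors(n_neighbors=32).fit(X[1:])
_, idx = knn.kneighbors(query.reshape(1, -1))
X_context, y_context = X[1:][idx[0]], y[1:][idx[0]]   # context passed to the model
```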
TabDPT’s architecture is shown in the following figure:

The figure illustrates how the training of this foundation model was carried out. First, the authors sample B tables from different datasets to construct a set of features (X) and a set of targets (y). Both X and y are partitioned into context (Xctx, yctx) and query (Xqry, yqry). The query Xqry is passed through an embedding function (indicated in the figure by a rectangle or a triangle). The model also creates embeddings for Xctx and yctx. These context embeddings are summed together and concatenated with the embedding of Xqry. The result is then passed through a transformer encoder to produce a classification prediction ŷcls or a regression prediction ŷreg for the query. The loss between the prediction and the true targets is used to update the model weights.
Usage and limitations
There is a GitHub repository that provides code to generate predictions on new tabular datasets. Like TabPFN and CARTE, TabDPT uses an API similar to Scikit-learn’s, where the fit function uses the training data as context for ICL. The code for this model is currently under active development.
While the paper doesn’t have a dedicated limitations section, the authors mention a few constraints and how they are handled:
- The model has a predefined maximum number of features and classes. The authors suggest using Principal Component Analysis (PCA) to reduce the number of features if a table exceeds the limit (a sketch follows this list).
- For classification tasks with more classes than the model’s limit, the problem can be broken down into multiple sub-tasks by representing the class number in a different base.
- The retrieval process can add some latency during inference, although the authors note that this can be minimized with modern libraries.
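Here is a minimal sketch of that PCA workaround on synthetic data; the feature limit used below is a made-up placeholder, not TabDPT's actual value:

```python
import numpy as np
from sklearn.decomposition import PCA

max_features = 100                     # hypothetical feature limit of the model
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 350))      # synthetic table with too many columns

X_reduced = PCA(n_components=max_features).fit_transform(X)
print(X_reduced.shape)                 # (5000, 100): now within the limit
```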
Take-home messages
In this blog, I have summarized foundation models for tabular data. Most of them were released in 2024, and all are under active development. Despite being quite new, some of these models already have good documentation and are easy to use. For instance, you can install TabPFN, CARTE, or TabDPT through pip. Additionally, these models expose a Scikit-learn-like API, which makes them easy to integrate into existing Machine Learning applications.
According to the authors of the foundation models presented here, these algorithms outperform classical boosting methods such as XGBoost or CatBoost. However, foundation models still cannot be used on large tabular datasets, which limits their use, especially in production environments. This means that the classical approach of training a Machine Learning model per dataset is still the way to go in creating predictive models from tabular data.
Great strides have been made toward a foundation model for tabular data. Let’s see what the future holds for this exciting area of research!
Thank you for reading!
I’m Carmen Martínez Barbosa, a data scientist who loves to share new algorithms useful for the community. Read my content on Medium or TDS.
References
[1] N. Hollmann et al., TabPFN: A transformer that solves small tabular classification problems in a second (2023), Table Representation Learning Workshop.
[2] N. Hollmann et al., Accurate predictions on small data with a tabular foundation model (2025), Nature.
[3] M.J. Kim, L. Grinsztajn, and G. Varoquaux, CARTE: Pretraining and Transfer for Tabular Learning (2024), Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria.
[4] F. Mahdisoltani, J. Biega, and F.M. Suchanek, YAGO3: A knowledge base from multilingual Wikipedias (2013), in CIDR.
[5] J. Gardner, J.C. Perdomo, and L. Schmidt, Large Scale Transfer Learning for Tabular Data via Language Modeling (2025), NeurIPS.
[6] J. Ma et al., TabDPT: Scaling Tabular Foundation Models on Real Data (2024), arXiv preprint, arXiv:2410.18164.