
How to Generate Synthetic Data: A Comprehensive Guide Using Bayesian Sampling and Univariate Distributions

Data is the fuel that drives organizations nowadays. But what happens when observations are scarce, costly, or hard to collect? That's where synthetic data comes into play: we can generate artificial data that mimics the statistical properties of real-world observations. In this blog, I will provide background on synthetic data, together with practical hands-on examples. I will discuss two powerful techniques for generating synthetic data: Bayesian Sampling and Univariate Distribution Sampling. In addition, I will show how to generate data from only the expert's knowledge. All practical examples are created with the help of the bnlearn and distfit libraries. By the end of this blog, you will understand how probability density functions and Bayesian techniques can be leveraged to generate high-quality synthetic data.



An Introduction To Synthetic Data

In the last decade, the amount of data has grown rapidly and has led to the insight that data quality is more important than quantity. Higher data quality helps to draw more accurate conclusions and allows us to make better-informed decisions. In many domains, such as healthcare, finance, cybersecurity, and autonomous systems, real-world data can be sensitive, expensive, imbalanced, or difficult to collect, particularly for rare or edge-case scenarios. This is where synthetic data becomes a powerful alternative. In the last few years, we have also seen a huge trend of synthetic data generation for artificially generated images, texts, and audio. Whatever the goal, synthetic data is becoming more important, which is also stressed by companies like Gartner [1], which predicts that real data will soon be overshadowed by synthetic data. There are, roughly speaking, two main categories of creating synthetic data (Figure 1): probabilistic and generative.

  • Probabilistic (distribution-based): We estimate statistical distributions from real measurements (or define them theoretically) and then sample new synthetic observations from these distributions. Examples include fitting univariate distributions or constructing Bayesian networks for multivariate data.
  • Generative (simulation-based): Learned models are used, such as neural networks, agent-based systems, or rule-based engines, to produce synthetic data without relying strictly on predefined probability distributions. This includes approaches like GANs for image data, discrete-event simulation for process modeling, and large language models (LLMs) for generating realistic synthetic text or structured records based on prompt-driven patterns.
Figure 1. Schematic overview of synthetic data generation approaches. Image by author.

In this blog, I will focus on probabilistic methods (Figure 1, blue/left part), where the goal is to estimate the underlying distribution so that we can mirror either an existing dataset or generate data from an expert's knowledge. I will take a deep dive into univariate distribution fitting and Bayesian sampling, and discuss the following four concepts of synthetic data generation:

  1. Synthetic Data That Mimics Existing Continuous Measurements (continuous, independent variables).
    We start with an existing dataset where the variables have continuous values. The goal is to fit a model per variable that can be used to generate measurements that mirror the original properties. The measurements are assumed to be independent of each other.
  2. Synthetic Data That Mimics Expert Knowledge (continuous, independent variables).
    We start without a dataset, only with expert knowledge. We will determine the best probability density functions (PDFs) with their parameters that mimic the expert's domain knowledge. The designed model can then be used to generate new measurements.
  3. Synthetic Data That Mimics an Existing Categorical Dataset (categorical, dependent variables).
    We start with an existing categorical dataset. We will learn the structure, the parameters, and the feature interdependence from the data. The fitted model can be used to generate measurements that mirror the properties of the original dataset.
  4. Synthetic Data That Mimics Expert Knowledge (categorical, dependent variables).
    We start without a dataset, only with expert knowledge. The difference with approach 2 is that this model captures the expert's knowledge to encode dependencies between multiple variables using a directed graph. The fitted model can be used to generate a synthetic dataset solely based on the knowledge of the expert.

In the next section, I will explain the four approaches in more detail, including hands-on examples. But before we go into the details, I will first provide a background about probability density functions and Bayesian Sampling.


What You Need To Know About Probability Density Functions

Before we dive into the creation of synthetic data using probability distributions (approaches 1 and 2), I will start with a brief introduction to probability density functions (PDFs). First of all, there are many probability distributions, as depicted in Figure 2. It is important to understand the characteristics of these PDFs, as this builds intuition about how they can mimic real-world observations. The basics are as follows: a PDF describes the likelihood of a continuous variable taking on a specific value, and different distributions have characteristic shapes: bell curves, exponential decays, uniform spreads, etc. The goal is to match real-world behavior (e.g., response times, income levels, or temperature readings) with candidate distributions whose shapes are shown in Figure 2.

Figure 2. Overview of Probability Density Functions and their parameters. Created by: Rasmus Bååth (2012) (MIT license).
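To build intuition for these characteristic shapes, the following minimal sketch evaluates three of them with scipy.stats (the parameters are chosen purely for illustration):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, expon, uniform

# Evaluate three characteristic PDF shapes on a shared grid
x = np.linspace(-5, 10, 500)
plt.plot(x, norm.pdf(x, loc=2, scale=1), label='Normal (bell curve)')
plt.plot(x, expon.pdf(x, scale=2), label='Exponential (decay)')
plt.plot(x, uniform.pdf(x, loc=0, scale=6), label='Uniform (flat spread)')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
plt.show()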

The better a PDF matches the distribution of the real variable, the better our synthetic data will be. The challenge with real-world variables, however, is that they often exhibit skewness, multimodality, heavy tails, etc., and thus do not always align neatly with well-known distributions. Selecting the wrong distribution can lead to misleading simulations and unreliable results.

Luckily, various packages can help us find the best PDF for our variables, such as distfit [2]. This library is incredibly useful because it automates the process of scanning through a wide range of theoretical distributions, fitting them to the variables in our dataset, and ranking them based on goodness-of-fit metrics such as the Kolmogorov-Smirnov statistic or log-likelihood. This approach finds the best-fitting theoretical distribution without relying on intuition or trial-and-error. A minimal sketch of that workflow looks as follows, with randomly drawn normal data standing in for real measurements:
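import numpy as np
from distfit import distfit

# Illustrative stand-in data: 1000 points drawn from a normal distribution
X = np.random.normal(loc=35, scale=5, size=1000)

# Scan the candidate distributions and rank them by goodness-of-fit
dfit = distfit()
dfit.fit_transform(X)

# The summary dataframe ranks all candidates; the best fit is stored in dfit.model
print(dfit.summary)
print(dfit.model['name'])

In the use case, I will demonstrate this on real measurements, but first, a brief introduction to Bayesian sampling.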


What You Need To Know About Bayesian Sampling

Before we dive into the creation of synthetic data using Bayesian sampling (approaches 3 and 4), I will explain the concepts of sampling from multinomial distributions. At its core, Bayesian sampling refers to generating data points from a probabilistic model defined by a Directed Acyclic Graph (DAG) and its associated Conditional Probability Distributions (CPDs). The structure of the DAG encodes the dependencies between variables, while the CPDs define the exact probability of each variable conditioned on its parents. Combined, they form a joint probability distribution over all variables in the network. The two best-known Bayesian sampling techniques are Forward Sampling and Gibbs Sampling, both available in the bnlearn for Python package [4].

Bayesian Forward Sampling is an intuitive technique that samples values by traversing the graph in topological order, starting with root nodes that have no parents. Each variable is then sampled based on its Conditional Probability Distribution (CPD) and the previously sampled values of its parent nodes. This method is ideal when you want to simulate new data that follows the generative assumptions of your Bayesian Network. In bnlearn this is the default method. It is particularly powerful for creating synthetic datasets from expert-defined DAGs, where we explicitly encode our domain knowledge without requiring observational data.

Alternatively, when some values are missing or when exact inference is computationally expensive, Gibbs Sampling can be used. This is a Markov Chain Monte Carlo (MCMC) method that iteratively samples from the conditional distribution of each variable given the current values of all others. This produces samples from the joint distribution, even without needing to compute it explicitly. While Forward Sampling is better suited for full synthetic data generation, Gibbs Sampling excels in scenarios involving partial observations, imputation, or approximate inference. This method can be set in bnlearn as follows: bn.sampling(DAG, methodtype="gibbs").
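A minimal sketch contrasting the two methods, using the small 'sprinkler' example network that ships with bnlearn:

import bnlearn as bn

# Load a small example network with a predefined structure and CPDs
model = bn.import_DAG('sprinkler')

# Forward Sampling (default): traverses the DAG in topological order
df_forward = bn.sampling(model, n=1000)

# Gibbs Sampling: iteratively samples each variable given all the others
df_gibbs = bn.sampling(model, n=1000, methodtype='gibbs')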

Let’s go to the next section, where we will experiment with probability distribution parameters to see how they affect the shape and behavior of synthetic data. We will use distfit to find the best PDF that matches real-world variables and evaluate how well they replicate the original data structure.


The Predictive Maintenance Dataset

The hands-on examples are based on the predictive maintenance dataset [3] (CC BY 4.0 license), which contains 10,000 sensor data points from machinery over time. The dataset is a so-called mixed-type dataset containing a combination of continuous, categorical, and binary variables. It captures operational data from machines, including both sensor readings and failure events. For instance, it includes physical measurements like rotational speed, torque, and tool wear (all continuous variables reflecting how the machine is behaving over time). Alongside these, we have categorical information such as the machine type and environmental data like air temperature. The dataset also depicts whether specific types of failures occurred, such as tool wear failure or heat dissipation failure (these are represented as binary variables).

Table 1: The table provides an overview of the variables in the predictive maintenance dataset. There are different types of variables, identifiers, sensor readings, and target variables (failure indicators). Each variable is characterized by its role, data type, and a brief description.

Generate Continuous Synthetic Data

In the following two sections, we will generate synthetic data where the variables have continuous values, under the assumption that the variables are independent of each other. The two flavors of generating synthetic data with this approach are (1) starting with an existing dataset, and (2) translating expert domain knowledge into a structured, synthetic dataset. If we need multiple continuous variables, we (1) treat each variable separately and independently, (2) identify the best probability distribution per variable, and (3) generate synthetic values from it. This approach is particularly useful when we need to simulate realistic inputs for testing or modeling, or when working with small datasets.

1. Generate Continuous Synthetic Data that Closely Mirrors the Distribution of Real Data

The aim in this section is to generate synthetic data that closely mirrors the distribution of real data. The predictive maintenance dataset contains five continuous variables, among them the Torque measurements, for which the description is as follows:

Torque should normally be within expected operation range: low torque is less critical, but excessively high torque suggests mechanical strain or stress.

In the code block below, we will import the distfit library [2], load the dataset, and visually inspect the Torque measurements to get an intuition of the range and possible outliers.

# Install library: pip install distfit
# Import library
from distfit import distfit

# Initialize distfit
dfit = distfit()

# Import dataset
df = dfit.import_example(data='predictive_maintenance')

# print dataframe
print(df)
+-------+------------+------+------------------+----+-----+-----+-----+-----+
|  UDI | Product ID  | Type | Air temperature  | .. | HDF | PWF | OSF | RNF |
+-------+------------+------+------------------+----+-----+-----+-----+-----+
|    1 | M14860      |   M  | 298.1            | .. |   0 |   0 |   0 |   0 |
|    2 | L47181      |   L  | 298.2            | .. |   0 |   0 |   0 |   0 |
|    3 | L47182      |   L  | 298.1            | .. |   0 |   0 |   0 |   0 |
|    4 | L47183      |   L  | 298.2            | .. |   0 |   0 |   0 |   0 |
|    5 | L47184      |   L  | 298.2            | .. |   0 |   0 |   0 |   0 |
| ...  | ...         | ...  | ...              | .. | ... | ... | ... | ... |
| 9996 | M24855      |   M  | 298.8            | .. |   0 |   0 |   0 |   0 |
| 9997 | H39410      |   H  | 298.9            | .. |   0 |   0 |   0 |   0 |
| 9998 | M24857      |   M  | 299.0            | .. |   0 |   0 |   0 |   0 |
| 9999 | H39412      |   H  | 299.0            | .. |   0 |   0 |   0 |   0 |
|10000 | M24859      |   M  | 299.0            | .. |   0 |   0 |   0 |   0 |
+-------+------------+------+------------------+----+-----+-----+-----+-----+
[10000 rows x 14 columns]

# Make plot
dfit.lineplot(df['Torque [Nm]'], xlabel='Time', ylabel='Torque [Nm]', title='Torque Measurements')

We can see from Figure 3 that the range across the 10,000 data points is mainly between 20 and 50 Nm. Values excessively above this range can thus be critical. This information, together with the line plot, helps to build an intuition of the expected distribution.

Figure 3. Line plot of the torque measurements across the 10,000 data points.

With the use of distfit, we can now search over 90 univariate distributions to determine the best fit for the Torque measurements. However, testing every distribution can take some time, especially with the bootstrap parameter, which validates the fit of each distribution more accurately. In the code block below, you can set the n_boots=100 parameter lower to speed up the computations. It is also possible to test only the most popular PDFs (with the distr parameter). See the code block below to determine the best PDF with its parameters for the Torque measurements.

# Import library
from distfit import distfit
import matplotlib.pyplot as plt

# Initialize distfit and set the bootstraps to validate the fit.
dfit = distfit(distr='popular', n_boots=100)

# Fit model
dfit.fit_transform(df['Torque [Nm]'])

# Plot PDF/CDF
fig, ax = plt.subplots(1,2, figsize=(25, 10))
dfit.plot(chart='PDF', n_top=10, ax=ax[0])
dfit.plot(chart='CDF', n_top=10, ax=ax[1])
plt.show()

# Create line plot
dfit.lineplot(df['Torque [Nm]'], xlabel='Time', ylabel='Torque [Nm]', title='Torque Measurements', projection=True)

# Print fitted parameters
print(dfit.model)
{'name': 'loggamma',
 'score': 0.00010374408112953594,
 'loc': -1900.0760925689528,
 'scale': 288.3648181697778,
 'arg': (835.7558898693087,),
 'params': (835.7558898693087, -1900.0760925689528, 288.3648181697778),
 'model': <frozen scipy.stats distribution>,
 'bootstrap_score': 0.12,
 'bootstrap_pass': True,
 'color': '#e41a1c',
 'CII_min_alpha': 23.457570647289003,
 'CII_max_alpha': 56.28002364712847}
Figure 4. Left: PDF, and right: CDF. The top 5 fitted theoretical distributions are shown in different colors. The best fit is Loggamma and is colored in red. (image by the author)

After running the code block, we can see that the Loggamma distribution is detected as the best fit (Figure 4, red solid line). The upper bound of the confidence interval (CII, alpha=0.05) is 56.28, which seems a reasonable threshold based on visual inspection (red vertical dashed line). Note that the CII is not needed for the generation of synthetic data. A full projection of the estimated PDF can be seen in Figure 5.

Figure 5. Line plot of the torque measurements across the 10,000 data points. The empirical PDF, estimated from the data, is plotted on the right side together with the estimated theoretical PDF. (image by the author)

With the estimated Loggamma distribution and the fine-tuned population parameters (c=835.7, loc=-1900.07, scale=288.36), we can now generate synthetic data for Torque. The .generate() function automatically uses the model parameters, and we only need to specify the number of samples that we want to generate. For example, we can generate 200 samples and plot the data points (Figure 6, code block below).

# Create synthetic data
X = dfit.generate(200)

# Plot the Synthetic data (X)
dfit.lineplot(X, xlabel='Time', ylabel='Generated Torque [Nm]', title='Synthetic Data')
Figure 6. Synthetic data is generated for n=200 Torque measurements. We can see a range between the values [20, 60] with some outliers. The red horizontal line is the previously estimated confidence interval for alpha=0.05 (image by the author).

At this point, we have estimated a PDF that mirrors the measurements of the variable Torque. With the estimated parameters of the PDF, we can sample from the fitted distribution and generate synthetic data. This model for generating synthetic data provides many opportunities. For instance, it allows testing machine learning pipelines under rare or critical operating conditions that may not be present in the original dataset, thereby improving performance evaluation. Or, in case your dataset is small, it allows you to generate more data points. Note that the predictive maintenance dataset contains four more continuous measurements, and if we need to mimic those as well, we must repeat this entire procedure for each variable separately, as sketched below.
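A minimal sketch of that per-variable loop is shown below. Note that the column names are illustrative and may differ slightly from those in the loaded dataframe:

from distfit import distfit

# Continuous variables to mimic (illustrative names; check df.columns)
continuous_cols = ['Air temperature [K]', 'Process temperature [K]',
                   'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]']

synthetic = {}
for col in continuous_cols:
    # Fit the best PDF per variable (fewer bootstraps to speed up the search)
    dfit = distfit(distr='popular', n_boots=10)
    dfit.fit_transform(df[col])
    # Generate 200 synthetic samples from the fitted distribution
    synthetic[col] = dfit.generate(200)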


2. Generate Continuous Synthetic Data Using Expert Knowledge

In this section, we will generate synthetic data that closely mirrors expert knowledge. In other words, we do not have any data at the start, only an expert's knowledge, yet we aim to create a synthetic dataset. To demonstrate this approach, I will use a hypothetical use case: suppose that experts physically operate the machinery, and we need to understand the intensity of the activities so we can include it in the model that determines failures. An expert provided us with the following information about the operational activities:

Most people start to work at 8, but the intensity of machinery operations peaks around 10. Some machinery operations will also be seen before 8, but not a lot. In the afternoon, the machinery operations gradually decrease and stop around 6 pm. There is usually also a small peak of intense machinery operations around 1–2 pm.

Step 1: Translate domain knowledge into a statistical model.

With this description, we now need to decide on the best-matching theoretical distribution. However, choosing the best theoretical distribution requires investigating the properties of many distributions (see Figure 2). In addition, you may need more than one distribution, namely a mixture of probability density functions. In our example, we will create a mixture of two distributions: one PDF for the morning and one PDF for the afternoon activities.

Model for the morning: Most people start to work at 8, but the intensity of machinery operations peaks around 10. Some machinery operations will also be seen before 8, but not a lot.

To model the morning machinery operations, we can use the normal distribution. This distribution is symmetrical and has no heavy tails. A few normal PDFs with different mu and sigma parameters are shown in Figure 7A. Try to get a feeling for how the shape changes with the sigma parameter. For our machinery operations, we can set the mean at 10 AM with a relatively narrow spread, such as sigma=1.

Model for the afternoon: The machinery operations gradually decrease and stop around 6 pm. There is usually also a small peak of intense machinery operations around 1–2 pm.

A suitable distribution for the afternoon machinery operations could be a skewed distribution with a heavy right tail that can capture the gradually decreasing activities. The Weibull distribution can be a candidate, as it is used to model data with a monotonically increasing or decreasing trend. However, if we do not always expect a monotonic decrease in machinery operations (because Tuesdays differ, for example), it may be better to consider a distribution such as the gamma (Figure 7B). To tune the parameters so that the distribution matches the afternoon description, it is practical to use the generalized gamma distribution, as it provides more control over the parameter tuning.

Figure 7. (A) Normal distribution with various parameters. Source: Wikipedia. (B) gamma distribution with various parameters. Source: Wikipedia.

At this point, we have chosen our two candidate distributions to model the machinery operations: Normal PDF for the morning and the Generalized Gamma PDF for the afternoon. In the next section, we will fine-tune the PDF parameters to create a mixture of PDFs that matches the machinery operations for the entire day.

Step 2: Parameter Fine-Tuning To Determine The Best Fit.

To create a model that closely resembles the machinery operations, we will generate data separately for the morning and the afternoon (see code block below). For the morning machinery operations, we decided to use the normal distribution with a mean of 10 (representing the peak at 10 am) and a standard deviation of 1. We will draw 8000 samples. For the afternoon machinery operations, we use the generalized gamma distribution. After playing around with the loc parameter, I decided to set the second peak at loc=13. We could also have used loc=14 but this creates a slightly larger gap between the morning and afternoon machinery operations. Furthermore, the peak in the afternoon was described to be smaller, and therefore, we will generate 2000 samples.

The next step is to combine the two synthetic measurements into a mixture of PDFs that matches the machinery operations for the entire day. Note that shuffling the samples is important: without it, the samples are ordered first by the 8000 samples from the normal distribution and then by the 2000 samples from the generalized gamma distribution. This ordering could introduce bias into any analysis or modeling performed on the dataset, for example when creating train/test splits. We can now plot the distribution and see what it looks like (Figure 8). Usually, it takes a few iterations to fine-tune the parameters.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, gengamma

# Set seed for reproducibility
np.random.seed(1)

# Generate data from a normal distribution
normal_samples = norm.rvs(10, 1, 8000)
# Create a generalized gamma distribution with the specified parameters
dist = gengamma(a=1.4, c=1, scale=0.8, loc=13)
# Generate data from the generalized gamma distribution
gamma_samples = dist.rvs(size=2000)

# Combine the two datasets by concatenation
X = np.concatenate((normal_samples, gamma_samples))

# Shuffle the dataset
np.random.shuffle(X)

# Plot
bar_properties={'color': '#607B8B', 'linewidth': 1, 'edgecolor': '#5A5A5A'}
plt.figure(figsize=(20, 15)); plt.hist(X, bins=100, **bar_properties)
plt.grid(True)
plt.xlabel('Time', fontsize=22)
plt.ylabel('Intensity of Machinery Operations', fontsize=22)
Figure 8. A mixture of probability density functions: the Normal and generalized Gamma distribution. Image by Author.

We converted the expert's knowledge into a mixture of PDFs and created synthetic data that allows us to model the normal/expected behavior of machinery operations (Figure 8). The histogram clearly shows a major peak at 10 am, with machinery operations starting from 6 am up to 1 pm, and a second peak around 1–2 pm with a heavy right tail towards 8 pm.


Generate Categorical Synthetic Data

In the following two sections, we will generate synthetic data where the variables are categorical and assumed to be dependent on each other. Here again, we can follow the same two approaches: starting from an existing dataset to learn the distributions and their dependencies, or defining a DAG based on expert domain knowledge and then generating synthetic data.

1. Generate Categorical Synthetic Data That Mimics an Existing Dataset

The aim in this section is to generate synthetic data that closely mirrors the distribution of a real categorical dataset with dependent variables. The difference with section 1 is that we now aim to mimic an existing categorical dataset while taking the (inter)dependence between the features into consideration. The dataset we will use is again the predictive maintenance dataset [3]. In the code block below, we import the bnlearn library and load the dataset.

# Install library: pip install bnlearn
# Import library
import bnlearn as bn

# Load dataset
df = bn.import_example('predictive_maintenance')

# Print dataframe
print(df)
+-------+------------+------+------------------+----+-----+-----+-----+-----+
|  UDI | Product ID  | Type | Air temperature  | .. | HDF | PWF | OSF | RNF |
+-------+------------+------+------------------+----+-----+-----+-----+-----+
|    1 | M14860      |   M  | 298.1            | .. |   0 |   0 |   0 |   0 |
|    2 | L47181      |   L  | 298.2            | .. |   0 |   0 |   0 |   0 |
|    3 | L47182      |   L  | 298.1            | .. |   0 |   0 |   0 |   0 |
|    4 | L47183      |   L  | 298.2            | .. |   0 |   0 |   0 |   0 |
|    5 | L47184      |   L  | 298.2            | .. |   0 |   0 |   0 |   0 |
| ...  | ...         | ...  | ...              | .. | ... | ... | ... | ... |
| 9996 | M24855      |   M  | 298.8            | .. |   0 |   0 |   0 |   0 |
| 9997 | H39410      |   H  | 298.9            | .. |   0 |   0 |   0 |   0 |
| 9998 | M24857      |   M  | 299.0            | .. |   0 |   0 |   0 |   0 |
| 9999 | H39412      |   H  | 299.0            | .. |   0 |   0 |   0 |   0 |
|10000 | M24859      |   M  | 299.0            | .. |   0 |   0 |   0 |   0 |
+-------+------------+------+------------------+----+-----+-----+-----+-----+
[10000 rows x 14 columns]

Before we can learn the causal structure and the parameters of the entire system using Bayesian methods, we first need to clean the dataset. As a first step, we keep only the seven relevant categorical variables: [Type, Machine failure, TWF, HDF, PWF, OSF, RNF]. Other variables, such as the unique identifiers (UDI and Product ID), hold no meaningful information for modeling. In addition, modeling mixed datasets (categorical and continuous) at the same time is not supported.

# Load dataset
df = bn.import_example('predictive_maintenance')

# Get discrete columns
cols = ['Type', 'Machine failure', 'TWF', 'HDF', 'PWF', 'OSF', 'RNF']
df = df[cols]

# Structure learning
model = bn.structure_learning.fit(df, methodtype='hc', scoretype='bic')
# [bnlearn] >Computing best DAG using [hc]
# [bnlearn] >Set scoring type at [bic]
# [bnlearn] >Compute structure scores for model comparison (higher is better).

# Compute edge weights using ChiSquare independence test.
model = bn.independence_test(model, df, test='chi_square', prune=True)

# Plot the best DAG
bn.plot(model, edge_labels='pvalue', params_static={'maxscale': 4, 'figsize': (15, 15), 'font_size': 14, 'arrowsize': 10})

dotgraph = bn.plot_graphviz(model, edge_labels='pvalue')
dotgraph

# Store to pdf
dotgraph.view(filename='bnlearn_predictive_maintanance')

In the code block above, we determined the causal relationships. The Bayesian model learned these relationships from the data using a search strategy and a scoring function. A scoring function quantifies how well a specific DAG explains the observed data, and the search strategy walks efficiently through the search space of DAGs to find the most optimal DAG without testing them all. For this use case, we use HillClimbSearch as the search strategy and the Bayesian Information Criterion (BIC) as the scoring function. The causal DAG is shown in Figure 9, where the detected root variable is PWF (power failure) and the target variable is Machine failure. We can see from the figure that the failure modes (TWF, HDF, PWF, OSF, RNF) have a complex dependency on the Machine failure, as expected. The RNF variable (random failures) is not included as a node, and Type is not a cause of Machine failure. The structure learning process detected these relationships quite well.

Figure 9. DAG based on Hillclimbsearch and BIC scoring function. All the continuous values are removed. The edges are the -log10(P-values) that are determined using the chi-square test. The image is created using Bnlearn. Image by the author.
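Note that the choice of scoring function influences which DAG is found. A quick sketch to inspect this sensitivity is to learn a second structure with a different score and compare the two results with the compare_networks utility of bnlearn:

# Learn a second structure with a different scoring function
model_bic = bn.structure_learning.fit(df, methodtype='hc', scoretype='bic')
model_k2 = bn.structure_learning.fit(df, methodtype='hc', scoretype='k2')

# Plot the agreements and differences between the two learned structures
bn.compare_networks(model_bic, model_k2)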

Given the dataset and the DAG, we can estimate the (conditional) probability distributions of the individual variables using parameter learning. The bnlearn library supports parameter learning for discrete and continuous nodes:

# Parameter learning
model = bn.parameter_learning.fit(model, df, methodtype='bayes')

# [bnlearn] >Parameter learning> Computing parameters using [bayes]
# [bnlearn] >Converting [] to BayesianNetwork model.
# [bnlearn] >Converting adjmat to BayesianNetwork.

# [bnlearn] >CPD of TWF:
+--------+-----------+
| TWF(0) | 0.950364  |
+--------+-----------+
| TWF(1) | 0.0496364 |
+--------+-----------+

# [bnlearn] >CPD of Machine failure:
+--------------------+-----+--------+--------+--------+
| HDF                | ... | HDF(1) | HDF(1) | HDF(1) |
+--------------------+-----+--------+--------+--------+
| OSF                | ... | OSF(1) | OSF(1) | OSF(1) |
+--------------------+-----+--------+--------+--------+
| PWF                | ... | PWF(0) | PWF(1) | PWF(1) |
+--------------------+-----+--------+--------+--------+
| TWF                | ... | TWF(1) | TWF(0) | TWF(1) |
+--------------------+-----+--------+--------+--------+
| Machine failure(0) | ... | 0.5    | 0.5    | 0.5    |
+--------------------+-----+--------+--------+--------+
| Machine failure(1) | ... | 0.5    | 0.5    | 0.5    |
+--------------------+-----+--------+--------+--------+

# [bnlearn] >CPD of HDF:
+--------+---------------------+--------------------+
| OSF    | OSF(0)              | OSF(1)             |
+--------+---------------------+--------------------+
| HDF(0) | 0.9654874062680254  | 0.5719063545150501 |
+--------+---------------------+--------------------+
| HDF(1) | 0.03451259373197462 | 0.4280936454849498 |
+--------+---------------------+--------------------+

# [bnlearn] >CPD of PWF:
+--------+-----------+
| PWF(0) | 0.945909  |
+--------+-----------+
| PWF(1) | 0.0540909 |
+--------+-----------+

# [bnlearn] >CPD of OSF:
+--------+---------------------+--------------------+
| PWF    | PWF(0)              | PWF(1)             |
+--------+---------------------+--------------------+
| OSF(0) | 0.9677078327727054  | 0.5596638655462185 |
+--------+---------------------+--------------------+
| OSF(1) | 0.03229216722729457 | 0.4403361344537815 |
+--------+---------------------+--------------------+

# [bnlearn] >CPD of Type:
+---------+---------------------+---------------------+
| OSF     | OSF(0)              | OSF(1)              |
+---------+---------------------+---------------------+
| Type(H) | 0.11225405370762033 | 0.28205128205128205 |
+---------+---------------------+---------------------+
| Type(L) | 0.5844709350765879  | 0.42419175027870676 |
+---------+---------------------+---------------------+
| Type(M) | 0.3032750112157918  | 0.29375696767001114 |
+---------+---------------------+---------------------+

Generate Synthetic Data.

At this point, we have our learned structure in the form of a DAG, and the estimated parameters in the form of CPTs. This means that we captured the system in a probabilistic graphical model, which can now be used to generate synthetic data. We can now use the bn.sampling() function (see the code block below) and generate, for example, 100 samples. The output is a full dataset with all dependent variables.

# Generate synthetic data
X = bn.sampling(model, n=100, methodtype='bayes')

print(X)
+-----+------------------+-----+-----+-----+------+
| TWF | Machine failure | HDF | PWF | OSF | Type  |
+-----+------------------+-----+-----+-----+------+
|  0  |        1         |  1  |  1  |  1  |  L   |
|  0  |        0         |  0  |  0  |  0  |  L   |
|  0  |        0         |  0  |  0  |  0  |  L   |
|  0  |        0         |  0  |  0  |  0  |  M   |
|  0  |        0         |  0  |  0  |  0  |  M   |
|  .. |        ..        |  .. |  .. |  .. |  ..  |
|  0  |        0         |  0  |  0  |  0  |  M   |
|  0  |        1         |  1  |  0  |  0  |  L   |
|  0  |        0         |  0  |  0  |  0  |  M   |
|  0  |        0         |  0  |  0  |  0  |  L   |
+-----+------------------+-----+-----+-----+------+

2. Generate Categorical Synthetic Data That Mimics Expert Knowledge

The aim in this section is to generate synthetic data that closely mirrors expert knowledge. In other words, there is no dataset at the start, only knowledge about the workings of a system. The difference with section 2 is that we now aim to generate an entire categorical dataset with multiple variables that depend on each other. The final Bayesian model can then be used to generate data that mimics the knowledge of the expert.

Before we dive into building knowledge-based systems: the steps we need to take are similar to those of the previous section. The difference is that we now manually define and draw the causal structure (DAG) and define the parameters (CPTs). Alternatively, if a dataset is available, we can use it to learn the parameters. So there are multiple possibilities to generate data based on an expert's knowledge. For an in-depth overview, I recommend reading this blog.

For this use case, we will start without a dataset and define the DAG and CPTs ourselves. I will again use predictive maintenance as the use case. Suppose that experts need to understand how Machine failures occur, but there are no physical sensors that measure data. An expert can provide us with the following information about the operational activities:

Machine failures are mainly seen when the process temperature is high or the torque is high. A high torque or tool wear causes overstrain failures (OSF). The process temperature is influenced by the air temperature.

Define simple one-to-one relationships.

From this point on, we need to convert the expert’s knowledge into a Bayesian model. This can be done systematically by first creating the graph and then defining the CPTs that connect the nodes in the graph.

A complex system is built by combining simpler parts. This means that we don’t need to create or design the whole system at once, but we can define the simpler parts first. These are the one-to-one relationships. In this step, we will convert the expert’s view into relationships. We know from the expert that we can make the following directed one-to-one relationships:

  • Process Temperature → Machine Failure
  • Torque → Machine Failure
  • Torque → Overstrain Failure (OSF)
  • Tool Wear → Overstrain Failure (OSF)
  • Air Temperature → Process Temperature
  • Overstrain Failure (OSF) → Machine Failure

A DAG is based on one-to-one relationships.

The directed relationships can now be used to build a graph with nodes and edges. Each node corresponds to a variable, and each edge represents a conditional dependency between pairs of variables. In bnlearn, we can assign and graphically represent the relationships between variables.

import bnlearn as bn

# Define the causal dependencies based on your expert/domain knowledge.
# Left is the source, and right is the target node.
edges = [('Process Temperature', 'Machine Failure'),
         ('Torque', 'Machine Failure'),
         ('Torque', 'Overstrain Failure (OSF)'),
         ('Tool Wear', 'Overstrain Failure (OSF)'),
         ('Air Temperature', 'Process Temperature'),
         ('Overstrain Failure (OSF)', 'Machine Failure'),
         ]

# Create the DAG
DAG = bn.make_DAG(edges)

# DAG is stored in an adjacency matrix
DAG["adjmat"]

# Plot the DAG (static)
bn.plot(DAG)

# Plot the DAG
dotgraph = bn.plot_graphviz(DAG, edge_labels='pvalue')
dotgraph.view(filename='bnlearn_predictive_maintanance_expert.pdf')

The resulting DAG is shown in Figure 10. We call this a causal DAG because we have assumed that the edges we encoded represent our causal assumptions about the predictive maintenance system.

Figure 10: DAG for the predictive maintenance system. (image by author).

At this point, the DAG does not know the underlying dependencies. In other words, there are no differences in the strength of the relationships between the one-to-one parts; these need to be defined using the CPTs. We can check the CPTs with bn.print_CPD(DAG), which will result in the message that no CPD can be printed. We need to add knowledge to the DAG with so-called Conditional Probabilistic Tables (CPTs), and we will rely on the expert's knowledge to fill them.

Setting up the Conditional Probabilistic Tables.

The predictive maintenance system is a simple Bayesian network where the child nodes are influenced by the parent nodes. We now need to associate each node with a probability function that takes, as input, a particular set of values for the node’s parent variables and gives (as output) the probability of the variable represented by the node. Let’s do this for the six nodes.

CPT: Air Temperature

The Air Temperature node has two states: low and high, and no parent dependencies. This means we can directly define the prior distribution based on expert assumptions or historical distributions. Suppose that 70% of the time, machines operate under low air temperature and 30% under high. The CPT is as follows:

# Import TabularCPD to define the conditional probability tables
from pgmpy.factors.discrete import TabularCPD

cpt_air_temp = TabularCPD(variable='Air Temperature', variable_card=2,
                          values=[[0.7],    # P(Air Temperature = Low)
                                  [0.3]])   # P(Air Temperature = High)

CPT: Tool Wear

Tool Wear represents whether the tool is still in a low wear or high wear state. It also has no parent dependencies, so its distribution is directly specified. Based on domain knowledge, let’s assume 80% of the time, the tools are in low wear, and 20% of the time in high wear:

cpt_toolwear = TabularCPD(variable='Tool Wear', variable_card=2,
                          values=[[0.8],    # P(Tool Wear = Low)
                                  [0.2]])   # P(Tool Wear = High)

CPT: Torque

Torque is a root node as well, with no dependencies. It reflects the rotational force in the process. Let’s assume high torque is relatively rare, occurring only 10% of the time, with 90% of processes running at normal torque:

cpt_torque = TabularCPD(variable='Torque', variable_card=2,
                        values=[[0.9],     # P(Torque = Normal)
                                [0.1]])    # P(Torque = High)

CPT: Process Temperature

Process Temperature depends on Air Temperature. Higher air temperatures generally lead to higher process temperatures, although there’s some variability. The probabilities reflect the following assumptions:

  • If Air Temp is low → 70% chance of low Process Temp, 30% high
  • If Air Temp is high → 20% low, 80% high
cpt_process_temp = TabularCPD(variable='Process Temperature', variable_card=2,
                               values=[[0.7, 0.2],     # P(ProcTemp = Low | AirTemp = Low/High)
                                       [0.3, 0.8]],    # P(ProcTemp = High | AirTemp = Low/High)
                               evidence=['Air Temperature'],
                               evidence_card=[2])

CPT: Overstrain Failure (OSF)

Overstrain Failure (OSF) occurs when either Torque or Tool Wear is high. If both are high, the risk increases. The CPT is structured to reflect:

  • Low Torque & Low Tool Wear → 10% OSF
  • High Torque & High Tool Wear → 90% OSF
  • Mixed combinations → 30–50% OSF
cpt_osf = TabularCPD(variable='Overstrain Failure (OSF)', variable_card=2,
                     values=[[0.9, 0.5, 0.7, 0.1],  # OSF = No | Torque, Tool Wear
                             [0.1, 0.5, 0.3, 0.9]], # OSF = Yes | Torque, Tool Wear
                     evidence=['Torque', 'Tool Wear'],
                     evidence_card=[2, 2])

CPT: Machine Failure

The Machine Failure node is the most complicated one because it has the most dependencies: Process Temperature, Torque, and Overstrain Failure (OSF). The risk of failure increases if Process Temp is high, Torque is high, and an OSF occurred. The CPT reflects the additive risk, assigning the highest failure probability when all three are problematic:

cpt_machine_fail = TabularCPD(variable='Machine Failure', variable_card=2,
                               values=[[0.9, 0.7, 0.6, 0.3, 0.8, 0.5, 0.4, 0.2],  # Failure = No
                                       [0.1, 0.3, 0.4, 0.7, 0.2, 0.5, 0.6, 0.8]], # Failure = Yes
                               evidence=['Process Temperature', 'Torque', 'Overstrain Failure (OSF)'],
                               evidence_card=[2, 2, 2])
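With three binary parents, there are 2 x 2 x 2 = 8 evidence combinations, one per column. In pgmpy (which bnlearn uses under the hood), the columns are ordered with the last-listed evidence variable varying fastest. Printing the CPT is a quick way to verify that the values line up as intended:

# Sanity check: print the CPT to inspect the 8 evidence combinations
print(cpt_machine_fail)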

Update the DAG with CPTs:

This is it! At this point, we have defined the strength of the relationships in the DAG with the CPTs. Now we need to connect the DAG with the CPTs. As a sanity check, the CPTs can be examined using the bn.print_CPD() functionality.

# Update DAG with the CPTs
model = bn.make_DAG(DAG, CPD=[cpt_process_temp, cpt_machine_fail, cpt_torque, cpt_osf, cpt_toolwear, cpt_air_temp])

# Print the CPDs (Conditional Probability Distributions)
bn.print_CPD(model)

Generate Synthetic Data.

At this point, we have our manually defined DAG and the parameters of the CPTs. This means that we captured the system in a probabilistic graphical model, which can now be used to generate synthetic data. We can now use the bn.sampling() function (see the code block below) and generate, for example, 100 samples. The output is a full dataset with all dependent variables.


# Generate synthetic data
X = bn.sampling(model, n=100, methodtype='bayes')

print(X)
+---------------------+------------------+--------+----------------------------+----------+---------------------+
| Process Temperature | Machine Failure  | Torque | Overstrain Failure (OSF)   | ToolWear | Air Temperature     |
+---------------------+------------------+--------+----------------------------+----------+---------------------+
|         1           |        0         |   1    |             0              |    0     |         1           |
|         0           |        0         |   1    |             1              |    1     |         1           |
|         1           |        0         |   1    |             0              |    0     |         1           |
|         1           |        1         |   1    |             1              |    1     |         1           |
|         0           |        0         |   0    |             0              |    0     |         0           |
|        ...          |       ...        |  ...   |            ...             |   ...    |        ...          |
|         0           |        0         |   1    |             1              |    1     |         0           |
|         1           |        1         |   1    |             1              |    1     |         0           |
|         0           |        0         |   0    |             0              |    1     |         0           |
|         1           |        1         |   1    |             1              |    1     |         0           |
|         1           |        0         |   0    |             0              |    1     |         0           |
+---------------------+------------------+--------+----------------------------+----------+---------------------+

The bnlearn library

A few words about the bnlearn library that is used for the analyses. The bnlearn library is designed to tackle the following challenges:

  • Structure learning. Given the data, estimate a DAG that captures the dependencies between the variables.
  • Parameter learning. Given the data and DAG, estimate the (conditional) probability distributions of the individual variables.
  • Inference. Given the learned model, determine the exact probability values for your queries (see the sketch below).
  • Sampling. Given the learned model, we can generate synthetic data.
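Inference is the only step not demonstrated above. As a minimal sketch, reusing the expert-defined model from the previous section (variable and state names follow the CPTs defined earlier), we can query the probability of a machine failure given high torque:

# Query the model: P(Machine Failure | Torque = High)
query = bn.inference.fit(model, variables=['Machine Failure'], evidence={'Torque': 1})
print(query)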

What benefits does bnlearn offer over other Bayesian analysis implementations?


Wrapping up

Synthetic data enables modeling when real data is unavailable, sensitive, or incomplete. I demonstrated a predictive maintenance use case, but other fields of interest include, for example, the privacy domain and rare-event modeling in the cybersecurity domain.

I demonstrated how to create synthetic data using probabilistic models through Probability Density Functions (PDFs) and Bayesian Sampling. These two approaches differ fundamentally. PDFs are typically used to generate synthetic data from univariate continuous distributions, assuming that variables are independent of one another. In contrast, Bayesian Sampling is suited for categorical data, where we sample from multinomial (or categorical) distributions, and crucially, can model and preserve the dependencies between variables using a Bayesian Network. We can thus use univariate sampling for independent continuous features, and Bayesian sampling when modeling variable dependencies is critical.

While synthetic data offers many advantages, it also comes with important limitations. First, it may not fully capture the complexity and variability of real-world phenomena, which can result in models that fail to generalize when trained solely on synthetic samples. Additionally, synthetic data can inadvertently introduce biases due to incorrect assumptions, oversimplified models, or poorly estimated parameters. It is therefore essential to perform thorough sanity checks and validation to ensure that the generated data aligns with domain expectations and does not mislead downstream analysis. Always compare the distribution, dependency structure, and outcome patterns with real data or expert knowledge.
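As a minimal sanity-check sketch for the continuous use case (reusing df and the fitted dfit model from earlier), the two-sample Kolmogorov-Smirnov test is one simple way to compare real and synthetic samples:

from scipy.stats import ks_2samp

# Generate a large synthetic sample from the fitted Torque model
X_synth = dfit.generate(10000)

# Compare the real and synthetic distributions
stat, pvalue = ks_2samp(df['Torque [Nm]'], X_synth)
print(f'KS statistic: {stat:.3f}, p-value: {pvalue:.3f}')
# A small KS statistic indicates the synthetic data tracks the real distribution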

Be safe. Stay frosty.

Cheers, E.



References

  1. Gartner, Maverick Research: Forget About Your Real Data — Synthetic Data Is the Future of AI, Leinar Ramos, Jitendra Subramanyam, 24 June 2021.
  2. E. Taskesen, distfit Python library, How to Find the Best Theoretical Distribution for Your Data.
  3. AI4I 2020 Predictive Maintenance Dataset. (2020). UCI Machine Learning Repository. Licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0).
  4. E. Taskesen, bnlearn for Python library. An Extensive Starter Guide For Causal Discovery Using Bayesian Modeling.
