Building Modern EDA Pipelines with Pingouin

# Introduction
Anyone who has spent enough time doing data science learns, sooner or later, the golden rule of machine learning modeling: garbage in, garbage out (GIGO).
For example, feeding a linear regression model highly collinear data, or running ANOVA tests on heteroscedastic data, is a great recipe… for ineffective models that will not learn well.
Exploratory data analysis (EDA) has a lot to say about tools such as scatter plots and histograms, but these are not enough when we need to rigorously validate the data against the statistical assumptions required by downstream analyses or models. Pingouin helps with this by bridging the gap between two well-known libraries in data science and statistics: SciPy and pandas. As such, it can be a great aid in building robust, automated EDA pipelines. This article teaches you how to build a complete, statistically rigorous EDA pipeline that validates several key data properties.
# Initial Setup
Let's start by making sure we install Pingouin in our Python environment (and pandas, if you don't have it yet):
!pip install pingouin pandas
After that, it's time to import these libraries and load our data. As an example, we will use an open dataset that contains samples of wine properties and their quality ratings.
import pandas as pd
import pingouin as pg
# Loading the wine quality dataset from an open GitHub repository
url = "..."  # URL of the wine quality CSV
df = pd.read_csv(url)
# Displaying the first few rows to understand our features
df.head()
# Testing for Univariate Normality
The first check we will perform is testing for univariate normality. Many traditional machine learning algorithms – and statistical tests like ANOVA and t-tests, for that matter – rely on the assumption that continuous variables follow a normal (Gaussian) distribution. Pingouin's pg.normality() function performs this check with the Shapiro-Wilk test across the entire data frame:
# Selecting a subset of continuous features for normality checks
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'pH', 'alcohol']
# Running the normality test
normality_results = pg.normality(df[features])
print(normality_results)
Output:
                         W          pval  normal
fixed acidity     0.879789  2.437973e-57   False
volatile acidity  0.875867  6.255995e-58   False
citric acid       0.964977  5.262332e-37   False
pH                0.991448  2.204049e-19   False
alcohol           0.953532  2.918847e-41   False
It seems that none of our numerical features satisfy normality. This is by no means a flaw in the data; it is simply part of its characteristics. It is, however, a signal that in later stages of data processing, beyond our EDA, we might want to apply transformations such as the log transform or Box-Cox that make the raw data look more "normal" and thus better suited to models that assume normality.
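As a quick, minimal sketch of what that could look like (using NumPy and SciPy on the alcohol column as an example, with no guarantee that these transforms will fix every feature):
import numpy as np
from scipy import stats
# Log-transform (log1p handles zero values gracefully) and re-test normality
log_alcohol = np.log1p(df['alcohol'])
print(pg.normality(log_alcohol))
# Box-Cox requires strictly positive values; it returns the transformed
# array plus the fitted lambda parameter
transformed, lam = stats.boxcox(df['alcohol'])
print(pg.normality(transformed))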
# Testing for Multivariate Normality
Beyond testing each feature in isolation, it is also interesting to test whether the features are jointly normally distributed, accounting for their interactions. Let's see how to test multivariate normality: a key requirement in techniques like multivariate ANOVA (MANOVA), for example.
# Henze-Zirkler multivariate normality test
multivariate_normality_results = pg.multivariate_normality(df[features])
print(multivariate_normality_results)
Output:
HZResults(hz=np.float64(23.72107048442373), pval=np.float64(0.0), normal=False)
As the normal=False flag in the result shows, multivariate normality does not hold either. If you plan to train a machine learning model on this dataset, this means non-parametric, tree-based models like gradient boosting and random forests may be a more robust alternative to parametric models like SVMs, linear regression, and so on.
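Since the returned HZResults object is a named tuple, an automated pipeline can branch on its fields directly; a small sketch:
# HZResults exposes hz, pval, and normal, so we can unpack and act on them
hz, pval, normal = pg.multivariate_normality(df[features])
if not normal:
    print(f"Multivariate normality rejected (HZ={hz:.2f}, p={pval:.4f}): "
          "consider tree-based models over parametric ones.")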
# Testing for Homoscedasticity
Next comes a deceptively tricky word: homoscedasticity. It refers to equal, constant variance across groups (or across a model's prediction errors), and is a common assumption for many statistical analyses to be reliable. We'll test this property (sorry, it's too hard to write its name again!) using Levene's test in Pingouin, like this:
# Levene's test for equal variances across groups
# 'dv' is the target, dependent variable, 'group' is the categorical variable
homoscedasticity_results = pg.homoscedasticity(data=df, dv='alcohol', group='quality')
print(homoscedasticity_results)
Output:
                W          pval  equal_var
levene  66.338684  2.317649e-80      False
Since equal_var is False, we have a problem called heteroscedasticity, which must be accounted for in any downstream analysis. One possible approach is to use robust standard errors when training regression models.
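For instance, here is a minimal sketch of fitting a regression with heteroscedasticity-robust (HC3) standard errors, assuming statsmodels is installed:
import statsmodels.api as sm
# Regress alcohol on the other features, replacing the default
# standard errors with heteroscedasticity-robust (HC3) ones
predictors = ['fixed acidity', 'volatile acidity', 'citric acid', 'pH']
X = sm.add_constant(df[predictors])
robust_model = sm.OLS(df['alcohol'], X).fit(cov_type='HC3')
print(robust_model.summary())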
# Testing for Sphericity
Another statistical property to check is sphericity, which indicates whether the variances of the differences between all possible pairs of conditions are equal. Examining this structure is often desirable before performing principal component analysis (PCA) for dimensionality reduction, as it helps us understand whether the variables are correlated. Here is how to test it:
# Mauchly's sphericity test
sphericity_results = pg.sphericity(df[features])
print(sphericity_results)
Output:
SpherResults(spher=False, W=np.float64(0.004437706589942777), chi2=np.float64(35184.26583883276), dof=9, pval=np.float64(0.0))
It seems that we have chosen a statistically unruly dataset! But fear not – this article is purposely designed to focus on the EDA process and help you identify data problems like these. At the end of the day, finding them and knowing what to do about them before downstream machine learning analysis is much better than building a potentially flawed model. In this case, there is a silver lining: the p-value of 0.0 means the null hypothesis of an identity correlation matrix is rejected, i.e., there is a meaningful relationship between the variables. So if we had many features and wanted to reduce dimensionality, using PCA would be a good idea.
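As a minimal sketch of that follow-up step (assuming scikit-learn is installed), here is how PCA could be applied to these features after standardizing them:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardize first so PCA is not dominated by differences in scale
X_scaled = StandardScaler().fit_transform(df[features])
# Keep as many components as needed to explain 90% of the variance
pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)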
# Testing for Multicollinearity
Finally, we will examine multicollinearity: a property indicating whether some predictors are highly correlated with each other. This can be undesirable in interpretable models like linear regressors. Let's check it out:
# Calculating a robust correlation matrix with p-values
correlation_matrix = pg.rcorr(df[features], method='pearson')
print(correlation_matrix)
Output:
                  fixed acidity  volatile acidity  citric acid     pH  alcohol
fixed acidity                 -               ***          ***    ***      ***
volatile acidity          0.219                 -          ***    ***       **
citric acid               0.324            -0.378            -    ***
pH                       -0.253             0.261        -0.33      -      ***
alcohol                  -0.095            -0.038        -0.01  0.121        -
While pandas' corr() could also be used, its Pingouin counterpart adds asterisks to indicate the statistical significance of each correlation (* for p < 0.05, ** for p < 0.01, and *** for p < 0.001). Note that correlations may be statistically significant yet small in magnitude: multicollinearity only becomes a concern when the correlation itself is high (usually greater than 0.8 in absolute value). In our case, no pairwise correlation is dangerously large, so all five features tested provide largely non-overlapping, separate information for further analysis.
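To automate that 0.8 rule of thumb in a pipeline, a small sketch using plain pandas could flag risky pairs:
# Flag any feature pair whose absolute correlation exceeds 0.8
corr = df[features].corr(method='pearson').abs()
for i, col_a in enumerate(features):
    for col_b in features[i + 1:]:
        if corr.loc[col_a, col_b] > 0.8:
            print(f"High collinearity: {col_a} vs {col_b} "
                  f"(|r| = {corr.loc[col_a, col_b]:.2f})")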
# Wrapping up
Through a series of examples used and explained one by one, we saw how to unleash the power of Pingouin, an open-source Python library, to create robust, modern EDA pipelines. Grounded in rigorous statistical tests, such pipelines help you make better decisions in data preprocessing and downstream analysis, guiding you toward the right actions to take and the right machine learning models to use.
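As a closing illustration, here is a minimal sketch (the run_statistical_eda function name is our own, not part of Pingouin) bundling the checks from this article into one reusable step:
def run_statistical_eda(data, features, dv, group):
    """Run the statistical checks covered in this article in one call."""
    return {
        'univariate_normality': pg.normality(data[features]),
        'multivariate_normality': pg.multivariate_normality(data[features]),
        'homoscedasticity': pg.homoscedasticity(data, dv=dv, group=group),
        'sphericity': pg.sphericity(data[features]),
        'correlations': data[features].rcorr(method='pearson'),
    }
results = run_statistical_eda(df, features, dv='alcohol', group='quality')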
Iván Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning and LLMs. He trains and guides others in using AI in the real world.



