The Robust Data Scientist: Winning with Messy Data and Pingouin

# Introduction
The hard truth to start with: textbook data science is often false in the real world. Concepts and techniques are taught on clean, well-behaved datasets, but as soon as we enter the wilderness of real projects, we are bombarded with outliers, heavily skewed distributions, and variances that refuse to be equal.
A previous article on building an exploratory data analysis (EDA) pipeline with Pingouin showed how to detect, through statistical tests, cases where the data violate assumptions such as normality and homoscedasticity. But what if those tests fail? Discarding the data is not the solution: making the analysis robust is.
This article explores the art of applying robust statistics to data science workflows. These are statistical methods designed to produce reliable and valid results even when the data do not meet classical assumptions or are full of outliers and noise. Adopting a “choose your own adventure” approach, we will walk through three scenarios using Python's Pingouin library to handle the messiest data you may encounter in your daily work.
# Initial Setup
Let's start by installing (if needed) and importing Pingouin and Pandas; after that we will load the wine quality dataset found here.
```python
!pip install pingouin pandas

import pandas as pd
import pingouin as pg

# Loading our messy, real-world-like dataset, containing red and white wine samples
url = "
df = pd.read_csv(url)

# Take a small peek at what we are about to deal with
df.head()
```
If you've read the previous Pingouin article, you already know that this is a messy dataset that fails to meet a few common assumptions. We will now begin three separate “adventures,” each highlighting a situation, a core problem, and a proposed robust fix.
# Adventure 1: When Standard Tests Fail
Suppose we perform normality tests on two groups: white wine samples and red wine samples.
```python
white_wine_alcohol = df[df['type'] == 'white']['alcohol']
red_wine_alcohol = df[df['type'] == 'red']['alcohol']

print("Normality test for White Wine Alcohol content:")
print(pg.normality(white_wine_alcohol))
print("\nNormality test for Red Wine Alcohol content:")
print(pg.normality(red_wine_alcohol))
```
You will find that neither distribution is normal, with very low p-values. Although non-normality itself does not directly indicate outliers or skewness, strong deviations from normality often suggest that such features are present in the data. Comparing means with the t-test in this situation is risky and may produce unreliable results.
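Before reaching for a robust test, it can help to quantify just how non-normal a column is. The sketch below uses synthetic, seeded data (a lognormal stand-in for an alcohol column, not the real wine dataset) to illustrate two quick diagnostics: pandas' sample skewness and the classic 1.5 × IQR outlier rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed "alcohol" values for illustration only
alcohol = pd.Series(rng.lognormal(mean=2.3, sigma=0.5, size=1000))

# Skewness is 0 for symmetric data; values beyond roughly |1| signal strong skew
print(f"Skewness: {alcohol.skew():.2f}")

# IQR rule: flag points beyond 1.5 * IQR from the quartiles as potential outliers
q1, q3 = alcohol.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = alcohol[(alcohol < q1 - 1.5 * iqr) | (alcohol > q3 + 1.5 * iqr)]
print(f"Potential outliers: {len(outliers)} of {len(alcohol)}")
```

On real data, a strongly positive skew plus a non-trivial count of IQR outliers is exactly the situation where the rank-based test below earns its keep.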
A robust fix for a situation like this is the Mann-Whitney U test. Instead of comparing means, this test compares ranks in the data – ordering all wines in a group from lowest to highest alcohol content, for example. This rank-based approach is a key strategy that strips outliers of their sometimes dangerous influence. Here's how:
```python
# Separating our two groups
red_wine = df[df['type'] == 'red']['alcohol']
white_wine = df[df['type'] == 'white']['alcohol']

# Running the robust Mann-Whitney U test
mwu_results = pg.mwu(x=red_wine, y=white_wine)
print(mwu_results)
```
Output:

```
         U_val alternative     p_val       RBC      CLES
MWU  3829043.5   two-sided  0.181845 -0.022193  0.488903
```
Since the p-value is not below 0.05, there is no statistically significant difference in alcohol content between the two types of wine – and this conclusion is robust to outliers and skewness.
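The CLES column in the output above has a direct probabilistic meaning: it estimates the probability that a randomly drawn observation from the first group exceeds a randomly drawn observation from the second. A minimal sketch, using small synthetic groups (stand-ins for the red and white samples, not the real dataset), computes it by brute force:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic groups standing in for red / white alcohol values
red = rng.normal(10.4, 1.0, size=200)
white = rng.normal(10.5, 1.2, size=300)

# CLES by brute force: P(random red > random white), ties counted as half
wins = (red[:, None] > white[None, :]).sum()
ties = (red[:, None] == white[None, :]).sum()
cles = (wins + 0.5 * ties) / (len(red) * len(white))
print(f"Hand-computed CLES: {cles:.3f}")
```

A CLES near 0.5, like the 0.489 Pingouin reported, means the two groups overlap almost completely – which is exactly what the non-significant p-value says in probability language.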
# Adventure 2: When the Paired t-Test Fails
Suppose you now want to compare two measurements taken from the same subject – e.g. a patient's blood sugar level before and after taking a medication, or two attributes measured on the same bottle of wine. Here the focus is on how the differences between paired measurements are distributed. If these differences are not normally distributed, a standard paired t-test will yield unreliable results.
A suitable solution in this situation is the Wilcoxon signed-rank test: the robust sibling of the paired t-test, which works by ranking the absolute values of the differences between pairs. In Pingouin, this test is called via pg.wilcoxon(), passing two columns containing paired measurements from the same dataset – e.g. two types of wine acidity.
```python
# Run the robust Wilcoxon signed-rank test for paired data
wilcoxon_results = pg.wilcoxon(x=df['fixed acidity'], y=df['volatile acidity'])
print(wilcoxon_results)
```
Result:

```
          W_val alternative  p_val  RBC  CLES
Wilcoxon    0.0   two-sided    0.0  1.0   1.0
```
The above result indicates a statistically significant difference between the two measurements. Not only do the two wine features differ, they also operate on completely different scales in the dataset.
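That extreme result (W = 0, RBC = 1) simply means every pair differs in the same direction. The sketch below reproduces the effect with synthetic values that mimic the two acidity columns (fixed acidity in the single digits of g/L, volatile acidity an order of magnitude lower – illustrative numbers, not the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-ins: fixed acidity ~ g/L, volatile acidity an order of magnitude lower
df = pd.DataFrame({
    "fixed acidity": rng.normal(7.2, 1.3, size=500),
    "volatile acidity": rng.normal(0.34, 0.16, size=500).clip(0.05),
})
print(df.describe().loc[["mean", "min", "max"]])

# When every pair differs in the same direction, W drops to 0 and RBC hits 1
diffs = df["fixed acidity"] - df["volatile acidity"]
print(f"Pairs where fixed > volatile: {(diffs > 0).mean():.0%}")
```

A quick `describe()` like this on the real columns is often enough to spot such scale mismatches before running any test at all.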
# Adventure 3: When ANOVA Fails
In this third and last adventure, we want to check whether the levels of residual sugar in wine differ significantly across quality ratings – note that the latter range between 3 and 9 and take integer values, so they can be treated as discrete categories.
If Pingouin's Levene test of homoscedasticity fails significantly – for example, because the variance of sugar in average wines is large but very small in high-quality wines – a classical one-way ANOVA may produce misleading results, since this test assumes equal variances between groups.
The fix, of course, is Welch's ANOVA, which down-weights groups with high variance, thus equalizing the scales and making comparisons fair across the categories. Here's how to use this method, more robust than traditional ANOVA, with Pingouin:
```python
# Run Welch's ANOVA to compare sugar across quality ratings
welch_results = pg.welch_anova(data=df, dv='residual sugar', between='quality')
print(welch_results)
```
Result:

```
    Source  ddof1      ddof2          F         p_unc       np2
0  quality      6  54.507934  10.918282  5.937951e-08  0.008353
```
Even where a one-way ANOVA may struggle due to unequal variances, Welch's ANOVA delivers a solid conclusion. The very small p-value is clear evidence that levels of residual sugar vary significantly across wine quality ratings. Remember, however, that sugar is only a small part of the puzzle that influences wine quality – a point emphasized by the low eta-squared value of 0.008.
# Wrapping up
Through three case studies, each matching a dirty-data problem to a robust statistical strategy, we learned that being a good data scientist does not mean having perfect data or endlessly cleaning it – it means knowing what to do when the data gets difficult for different reasons. Pingouin provides a variety of robust tests that help avoid the guesswork trap and extract meaningful statistical information with little extra effort.
Iván Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning and LLMs. He trains and guides others in using AI in the real world.



