
5 Key Ways to Detect Outliers Robustly

0 4 4 minutes read

# Introduction

Have you ever come across strange data points in your dataset while exploring it? One or a few that look unreasonably different from most observations, skewing your statistics and inflating the variance? I have been there too. These points are outliers. Their impact is not limited to distorting data analysis: outliers can easily degrade the performance of any predictive models you build, so finding and handling them carefully is critical to any data project. This article lists and compares five important detection methods, along with a brief Python example for each.

# 1. Z-Score method

The Z-score calculation is a simple method that works best for normally distributed variables. It measures how many standard deviations each point lies from the mean. Typically, a data point with a Z-score above 3 (or below -3) is flagged as an outlier: there are more than three standard deviations between that point and the mean. Despite its simplicity, the method has a hidden weakness: the mean and standard deviation it relies on are themselves very sensitive to extreme values.

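```python
import numpy as np
from scipy import stats

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250])

# Absolute Z-score of every point
z_scores = np.abs(stats.zscore(data))
# Keep the points more than 3 standard deviations from the mean
outliers = data[z_scores > 3]

print(outliers)
```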

Output: `[250]`
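The sensitivity of the mean and standard deviation to extreme values can be seen in a small experiment. The sketch below is illustrative only: it adds a second extreme value (260, an assumption that is not part of the example above) to the same data and computes the Z-scores by hand with NumPy.

```python
import numpy as np

# Same values as the example above, plus a second extreme point (260).
# The extra value is an assumption added purely for illustration.
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250, 260])

# Z-score by hand: distance from the mean in units of standard deviation.
# np.std defaults to the population standard deviation (ddof=0),
# which matches the default of scipy.stats.zscore.
mean = data.mean()
std = data.std()
z_scores = np.abs((data - mean) / std)

# Both extreme points pull the mean up and inflate the standard deviation,
# so their own Z-scores shrink.
outliers = data[z_scores > 3]

print(np.round(z_scores[-2:], 2))
print(outliers)
```

Here the filter returns an empty array even though 250 and 260 are obviously unusual: together they inflate the standard deviation so much that neither crosses the threshold of 3. This effect is often called masking.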

# 2. Interquartile Range (IQR) method

Are your variables not normally distributed? Then the IQR is a better and more robust bet than Z-scores. This method relies on percentiles, specifically the spread between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile). Cutoff points that lie 1.5 times the IQR below Q1 and above Q3 are calculated, as shown below, and serve as "fences." In other words, any point that falls outside these two fences on either side is flagged as an outlier. The good news: the robustness of the IQR comes from the fact that extreme values do not shift the quartiles the way they shift the mean and standard deviation.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250])

# First and third quartiles, and the spread between them
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
# Fences sit 1.5 * IQR below Q1 and above Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(outliers)
```

Output: `[250]`
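To make the robustness claim concrete, the following sketch compares summary statistics of the same data with and without the extreme value 250. It is a minimal illustration rather than a benchmark, and the `summarize` helper is a hypothetical name introduced here.

```python
import numpy as np

clean = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13])
contaminated = np.append(clean, 250)  # same data plus one extreme value


def summarize(values):
    """Return mean, standard deviation, Q1 and Q3 for a 1-D array."""
    q1, q3 = np.percentile(values, [25, 75])
    return {"mean": values.mean(), "std": values.std(), "q1": q1, "q3": q3}


before = summarize(clean)
after = summarize(contaminated)

# Print how much each statistic moves when the extreme value is added.
for name in before:
    shift = after[name] - before[name]
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f} (shift {shift:+.2f})")
```

On this data the quartiles move by at most half a unit, while the mean and standard deviation shift by tens of units, which is why fences built from Q1 and Q3 remain meaningful.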

# 3. Isolation Forests

When handling complex, high-dimensional datasets, traditional methods such as Z-scores and IQR fall short. Enter isolation forests, a machine learning technique that learns to distinguish anomalies from "normal" data. The idea builds on classical decision trees for classification and regression: outliers are rare and different from the rest, so isolating them takes very few tree splits. Therefore, when a point is easily separated from the others by the tree algorithm, it is likely to be an outlier.

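```python
import numpy as np
from sklearn.ensemble import IsolationForest

# scikit-learn expects a 2-D array: one row per sample, one column per feature
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250]).reshape(-1, 1)

# contamination is the expected share of outliers in the data
model = IsolationForest(contamination=0.1, random_state=42)
# fit_predict returns -1 for outliers and 1 for inliers
predictions = model.fit_predict(data)
outliers = data[predictions == -1]

print(outliers)
```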

Output: `[[250]]`
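Since isolation forests are aimed at multi-dimensional data, a two-feature example may be more representative than the single column used above. The sketch below is an assumption-laden illustration: the synthetic cluster, the planted point at (10, 10), and the seeds are all invented here. It uses `score_samples`, which in scikit-learn returns lower values for points that are easier to isolate.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for illustration): 200 points around the
# origin in two dimensions, plus one planted anomaly far from the cluster.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[10.0, 10.0]])
X = np.vstack([cluster, planted])

model = IsolationForest(n_estimators=100, random_state=42)
model.fit(X)

# score_samples returns one score per row; lower means more anomalous.
scores = model.score_samples(X)
most_anomalous = int(np.argmin(scores))

print("Index of lowest score:", most_anomalous)
print("Point:", X[most_anomalous])
```

Ranking points by score, rather than relying only on the -1/1 labels, can be a reasonable option when the `contamination` rate is hard to guess up front.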

# 4. Median Absolute Deviation (MAD)

This is a more robust version of the Z-score, so to speak: MAD uses the median, which is resistant to extreme values, and the absolute deviations from it to calculate a modified "Z-score." Note, however, that it is usually applied to one-dimensional data, one variable at a time, i.e. it is a univariate method.

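```python
import numpy as np
from scipy.stats import median_abs_deviation

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250])

# scale="normal" rescales the MAD so it is comparable to a standard deviation
mad = median_abs_deviation(data, scale="normal")
median = np.median(data)
# Modified Z-score: distance from the median in units of scaled MAD
modified_z_scores = np.abs(data - median) / mad
outliers = data[modified_z_scores > 3]

print(outliers)
```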

Output: `[250]`
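For readers who want to see what `scale="normal"` does, here is the same calculation written out by hand, under the usual assumption that the data are roughly normal apart from the outliers. Dividing the raw MAD by the 75th percentile of the standard normal distribution (about 0.6745) makes it comparable to a standard deviation. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import median_abs_deviation, norm

data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250])

median = np.median(data)
raw_mad = np.median(np.abs(data - median))  # unscaled MAD

# Rescale so the MAD estimates the standard deviation for normal data.
scaled_mad = raw_mad / norm.ppf(0.75)

# Should agree with SciPy's built-in normal scaling.
print(np.isclose(scaled_mad, median_abs_deviation(data, scale="normal")))

# Guard against a zero MAD, which happens when more than half of the
# values equal the median and would make the division below undefined.
if scaled_mad > 0:
    modified_z_scores = np.abs(data - median) / scaled_mad
    print(data[modified_z_scores > 3])
```

The zero-MAD guard is a design choice worth keeping in real pipelines: on heavily repeated values the modified Z-score cannot be computed, and another method may be a better fit.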

# 5. Density-Based Clustering: DBSCAN

This is a good way to identify outliers in spatial data or in datasets with complex groupings. The DBSCAN algorithm forms clusters around points that lie close to each other in high-density regions. As it runs, isolated data points in low-density regions are automatically labeled as noise, i.e. outliers. Like method number 3 (isolation forests), this is a multivariate technique that lets all features of a data point be taken into account during outlier detection.

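```python
import numpy as np
from sklearn.cluster import DBSCAN

# scikit-learn expects a 2-D array: one row per sample, one column per feature
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 250]).reshape(-1, 1)

# eps: neighborhood radius; min_samples: points needed to form a dense region
model = DBSCAN(eps=5, min_samples=2)
labels = model.fit_predict(data)
# DBSCAN assigns the label -1 to noise points
outliers = data[labels == -1]

print(outliers)
```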

Output: `[[250]]`
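Because DBSCAN is most useful on spatial or multi-feature data, the following sketch applies it to a small two-dimensional layout. Everything in it is made up for illustration: two tight grids of points stand in for dense groups, and a single point is placed far from both. The `eps` and `min_samples` values are chosen to suit this toy layout and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy layout (an assumption for illustration): two 5x5 grids of points
# spaced 0.25 apart, one near (0, 0) and one near (5, 5).
offsets = np.arange(5) * 0.25
xx, yy = np.meshgrid(offsets, offsets)
grid = np.column_stack([xx.ravel(), yy.ravel()])

group_a = grid
group_b = grid + 5.0
lonely_point = np.array([[2.5, -3.0]])  # far from both groups

X = np.vstack([group_a, group_b, lonely_point])

# eps is the neighborhood radius; min_samples is how many points
# (including the point itself) a neighborhood needs to count as dense.
model = DBSCAN(eps=0.5, min_samples=4)
labels = model.fit_predict(X)

# DBSCAN labels noise points as -1.
print("Cluster labels found:", sorted(set(labels.tolist())))
print("Noise points:", X[labels == -1])
```

Each grid ends up as its own cluster, and the lone point is the only one labeled -1, without the number of clusters ever being specified.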

# Wrapping up

Choosing the right method for outlier detection comes down to understanding your data. Z-score and IQR are quick, easy options for univariate data, and IQR is the safer choice if your variable is not normally distributed. MAD provides a robust alternative in cases where extreme values may distort the result. If your data has many dimensions or a complex structure, isolation forests and DBSCAN extend outlier detection beyond the limits of simple statistics, capturing relationships that simpler methods miss entirely. There is no single best method, only the one that best suits the shape and scale of your data.

Iván Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning and LLMs. He trains and guides others in using AI in the real world.


nimda 3 weeks ago
