
Spearman Correlation Coefficient: Where Pearson Is Not Sufficient

In the post on the Pearson correlation coefficient, we discussed how it measures the strength of the linear relationship between two variables (years of experience and salary).

Not all relationships between variables are linear, and the Pearson correlation works best when the relationship follows a linear pattern.

When the relationship is not linear but moves consistently in the same direction, we use Spearman's correlation coefficient to capture that pattern.

To understand Spearman's correlation coefficient, let's consider a fish market dataset.

This data includes physical characteristics of each fish, such as:

  • Weight – the weight of the fish in grams (this will be our target variable)
  • Length1, Length2, Length3 – various length measurements (in cm)
  • Height – the height of the fish (in cm)
  • Width – the diagonal width of the fish body (in cm)

We need to predict the weight of a fish based on its various length, height, and width measurements.

This is the same example we used to understand the math behind multiple linear regression in the previous blog, but there we used only Height and Width as independent variables to work out the individual calculations for the slopes and intercept.

Here we are trying to fit a linear regression model with five independent variables and one target variable.

Now let's calculate the Pearson correlation coefficient between each independent variable and the target variable.

Code:

import pandas as pd

# Load the Fish Market dataset
df = pd.read_csv("C:/Fish.csv")

# Drop the categorical 'Species' column 
if 'Species' in df.columns:
    df_numeric = df.drop(columns=['Species'])
else:
    df_numeric = df.copy()

# Calculate Pearson correlation between each independent variable and the target (Weight)
target = 'Weight'
pearson_corr = df_numeric.corr(method='pearson')[target].drop(target)  # drop self-correlation

print(pearson_corr.sort_values(ascending=False))

The Pearson correlation coefficient between Weight and:

  • Length3 is 0.923044
  • Length2 is 0.918618
  • Length1 is 0.915712
  • Width is 0.886507
  • Height is 0.724345

Among all the variables, Height has the weakest correlation, so we might think we should discard this variable before fitting the multiple linear regression model.

But first: is it correct to drop an independent variable based only on the Pearson correlation coefficient?

No.

First, let's look at the scatter plot of height versus weight.

[Scatter plot of Height vs. Weight. Image by the author.]

From the scatter plot we can see that as the height increases, the weight also increases, but the relationship is not linear.

At small heights, the weight increases only slightly. At the extremes, it increases very quickly.

Here the trend is not linear, but it is nevertheless monotonic, because it always moves in the same direction.

Since the Pearson correlation coefficient assumes a direct (linear) relationship, it gives a lower value here.
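A quick synthetic sketch (made-up numbers, not the fish data) illustrates the gap: on a perfectly monotonic but curved relationship, Spearman is exactly 1 while Pearson comes out noticeably lower.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic monotonic but non-linear data (illustration only)
x = np.linspace(0, 10, 50)
y = np.exp(x)  # strictly increasing, strongly curved

pearson = pearsonr(x, y)[0]
spearman = spearmanr(x, y)[0]

print(f"Pearson:  {pearson:.4f}")   # well below 1, since the trend is curved
print(f"Spearman: {spearman:.4f}")  # exactly 1, since the trend is monotonic
```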

This is where the Spearman correlation coefficient comes into play.

Now let's calculate the Spearman correlation coefficient between height and weight.

Code:

import pandas as pd
from scipy.stats import spearmanr

# Load the dataset
df = pd.read_csv("C:/Fish.csv") 

# Calculate Spearman correlation coefficient between Height and Weight
spearman_corr = spearmanr(df["Height"], df["Weight"])[0]

print(f"Spearman Correlation Coefficient: {spearman_corr:.4f}")

The Spearman correlation coefficient is 0.8586, which shows a strong positive correlation between height and weight.

This means that as the height of the fish increases, the weight also tends to increase.

Earlier, we found a Pearson correlation coefficient of 0.72 between height and weight, which underestimates the real relationship between these variables.

If we select features based only on the Pearson correlation and remove the Height feature, we may lose a variable that actually has a strong relationship with the target, leading to poor predictions.

This is why the Spearman correlation is useful: it captures non-linear but monotonic trends.

The Spearman result can also guide our next steps, such as transforming variables (for example, with a log transform) or choosing algorithms that do not assume linear relationships, such as decision trees and random forests.
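As a sketch of the transformation idea (made-up numbers, not the fish data): a log transform can straighten an exponential trend so that Pearson again sees a linear relationship.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic exponential trend (illustration only)
x = np.linspace(1, 10, 50)
y = np.exp(x)

r_raw = pearsonr(x, y)[0]          # underestimates the association
r_log = pearsonr(x, np.log(y))[0]  # log(y) = x, so this is exactly linear

print(f"Pearson on raw y:  {r_raw:.4f}")
print(f"Pearson on log(y): {r_log:.4f}")  # ~1.0
```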


Now that we have understood the importance of the Spearman correlation coefficient, it is time to understand the math behind it.

How does the Spearman coefficient capture the relationship even when the data is not linear but is monotonic?

To understand this, let's look at a sample of 10 points from the dataset.

[Sample of 10 rows of Height and Weight. Image by the author.]

Now, we sort the values of each column in ascending order and assign ranks.

[Height and Weight values sorted and ranked. Image by the author.]

Now that we have assigned ranks to height and weight, we do not keep them in sorted order.

Each rank needs to be returned to its original position in the dataset, so that every fish's height rank stays paired with its weight rank.

We only sort the columns to assign the ranks. After that, we put the ranks back in their original order and calculate the Spearman correlation using these two rank columns.
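In pandas, this whole rank-and-restore step is a single call: `Series.rank()` sorts internally but returns the ranks aligned with the original row order (shown here on made-up numbers, not the actual sample).

```python
import pandas as pd

# Made-up heights (illustration only)
heights = pd.Series([11.5, 4.0, 7.2, 12.4, 5.9])

# rank() assigns ranks by sorted order, but returns them
# in the original positions, already paired row by row
ranks = heights.rank()  # ties get average ranks by default
print(ranks.tolist())   # [4.0, 1.0, 3.0, 5.0, 2.0]
```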

[Ranks placed back in the original row order. Image by the author.]

Here, while assigning ranks to the sorted Weight column, we encountered a tie at positions 5 and 6, so we assign both values the average rank of 5.5.

Similarly, we found another tie at positions 7, 8, 9, and 10, so we assign all of these the average rank of 8.5.
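This tie handling can be reproduced with `scipy.stats.rankdata`, which averages the ranks of tied values by default. With made-up weights tied at positions 5–6 and 7–10, it returns exactly the 5.5 and 8.5 ranks described above.

```python
from scipy.stats import rankdata

# Made-up weights (illustration only) with a 2-way and a 4-way tie
weights = [5, 10, 15, 20, 25, 25, 30, 30, 30, 30]

ranks = rankdata(weights)  # method='average' is the default
print(ranks)  # [1.  2.  3.  4.  5.5 5.5 8.5 8.5 8.5 8.5]
```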

Now we calculate the Spearman correlation, which is really just the Pearson correlation applied to the ranks.
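This identity, Spearman as Pearson-on-ranks, is easy to confirm numerically (on arbitrary made-up data, including a tie):

```python
from scipy.stats import pearsonr, rankdata, spearmanr

# Arbitrary data (illustration only); x contains a tie
x = [3.1, 0.5, 1.7, 9.9, 4.2, 4.2]
y = [10, 2, 7, 30, 14, 12]

spearman = spearmanr(x, y)[0]
pearson_on_ranks = pearsonr(rankdata(x), rankdata(y))[0]

# The two values agree up to floating-point precision
print(spearman, pearson_on_ranks)
```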

We already know the formula for the Pearson correlation coefficient.

\[
r = \frac{\text{Cov}(X, Y)}{s_X \cdot s_Y}
= \frac{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2} \cdot \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
\]

\[
= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \cdot \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
\]

Now, the formula for the Spearman correlation coefficient is:

\[
r_s =
\frac{
\sum_{i=1}^{n}
\underbrace{(R_{X_i} - \bar{R}_X)}_{\text{Rank deviation of } X_i}
\cdot
\underbrace{(R_{Y_i} - \bar{R}_Y)}_{\text{Rank deviation of } Y_i}
}{
\sqrt{
\sum_{i=1}^{n}
\underbrace{(R_{X_i} - \bar{R}_X)^2}_{\text{Squared rank deviations of } X}
}
\cdot
\sqrt{
\sum_{i=1}^{n}
\underbrace{(R_{Y_i} - \bar{R}_Y)^2}_{\text{Squared rank deviations of } Y}
}
}
\]

\[
\begin{aligned}
\text{Where:} \\
R_{X_i} & = \text{ rank of the } i^\text{th} \text{ value in variable } X \\
R_{Y_i} & = \text{ rank of the } i^\text{th} \text{ value in variable } Y \\
\bar{R}_X & = \text{ mean of all ranks in } X \\
\bar{R}_Y & = \text{ mean of all ranks in } Y
\end{aligned}
\]

Now, let's calculate the Spearman correlation coefficient for the sample data.

\[
\textbf{Step 1: Ranks from the original data}
\]

\[
\begin{array}{c|cccccccccc}
R_{x_i} & 3 & 1 & 2 & 5 & 8 & 4 & 7 & 9 & 10 & 6 \\[2pt]
R_{y_i} & 1 & 2 & 4 & 5.5 & 8.5 & 3 & 5.5 & 8.5 & 8.5 & 8.5
\end{array}
\]

\[
\textbf{Step 2: Formula of Spearman's correlation (Pearson on ranks)}
\]

\[
\rho_s =
\frac{\sum_{i=1}^{n}\bigl(R_{x_i}-\bar{R_x}\bigr)\bigl(R_{y_i}-\bar{R_y}\bigr)}
{\sqrt{\sum_{i=1}^{n}\bigl(R_{x_i}-\bar{R_x}\bigr)^2} \;
\sqrt{\sum_{i=1}^{n}\bigl(R_{y_i}-\bar{R_y}\bigr)^2}},
\qquad n = 10
\]

\[
\textbf{Step 3: Mean of rank variables}
\]

\[
\bar{R_x} = \frac{3+1+2+5+8+4+7+9+10+6}{10} = \frac{55}{10} = 5.5
\]

\[
\bar{R_y} = \frac{1+2+4+5.5+8.5+3+5.5+8.5+8.5+8.5}{10}
= \frac{55}{10} = 5.5
\]

(Ranks of 10 observations always sum to 55, even with average ranks for ties, so both rank means are 5.5.)

\[
\textbf{Step 4: Deviations and cross-products}
\]

\[
\begin{array}{c|c|c|c}
i & R_{x_i}-\bar{R_x} & R_{y_i}-\bar{R_y} & (R_{x_i}-\bar{R_x})(R_{y_i}-\bar{R_y}) \\ \hline
1 & -2.5 & -4.5 & 11.25 \\
2 & -4.5 & -3.5 & 15.75 \\
3 & -3.5 & -1.5 & 5.25 \\
4 & -0.5 & 0 & 0 \\
5 & 2.5 & 3 & 7.5 \\
6 & -1.5 & -2.5 & 3.75 \\
7 & 1.5 & 0 & 0 \\
8 & 3.5 & 3 & 10.5 \\
9 & 4.5 & 3 & 13.5 \\
10 & 0.5 & 3 & 1.5
\end{array}
\]

\[
\sum (R_{x_i}-\bar{R_x})(R_{y_i}-\bar{R_y}) = 69.0
\]

\[
\textbf{Step 5: Sum of squares for each rank variable}
\]

\[
\sum (R_{x_i}-\bar{R_x})^2 = 82.5,
\qquad
\sum (R_{y_i}-\bar{R_y})^2 = 77.0
\]

(The second sum is smaller because the averaged tie ranks in the weight column pull the rank deviations closer together.)

\[
\textbf{Step 6: Substitute into the formula}
\]

\[
\rho_s
= \frac{69.0}{\sqrt{(82.5)(77.0)}}
= \frac{69.0}{79.70}
= 0.866
\]

\[
\textbf{Step 7: Interpretation}
\]

\[
\rho_s = 0.866
\]

The value \( \rho_s = 0.866 \) shows the strong monotonic relationship between height and weight: as height increases, weight also tends to increase.

This is how we calculate the Spearman correlation coefficient.

We also have another formula to calculate the Spearman correlation coefficient, but it is only used when there are no ties.

\[
\rho_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}
\]
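A minimal check of this shortcut against scipy, on made-up tie-free data:

```python
from scipy.stats import rankdata, spearmanr

# Made-up tie-free data (illustration only)
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]

n = len(x)
d = rankdata(x) - rankdata(y)  # rank differences d_i
rho_shortcut = 1 - 6 * sum(d**2) / (n * (n**2 - 1))

# Both give the same value on tie-free data
print(rho_shortcut, spearmanr(x, y)[0])  # 0.8 and 0.8
```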

where:

\[
\begin{aligned}
\rho_s &: \text{ Spearman correlation coefficient} \\[4pt]
d_i &: \text{ the difference between the ranks of each observation, } R_{x_i} - R_{y_i} \\[4pt]
n &: \text{ total number of paired observations}
\end{aligned}
\]

If ties are present, the rank differences no longer represent the exact distances between the ranks, so instead we calculate ρ using the Pearson-on-ranks formula shown above.


About the Dataset

The dataset used in this blog is the Fish Market dataset, which contains measurements of fish species sold in a market, including attributes such as weight, height, and width.

It is publicly available on Kaggle and licensed under the Creative Commons Zero (CC0: Public Domain) license. This means it can be freely used, modified, and shared for both non-commercial and commercial purposes without restriction.


Spearman's correlation coefficient helps us understand how two variables move together when the relationship is not perfectly linear.

By converting the data into ranks, it captures whether one variable consistently increases (or decreases) as the other increases, whatever the shape of the trend.

It is particularly useful when the data contain outliers, are not normally distributed, or when the relationship is monotonic but not linear.
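A small sketch of that robustness to outliers (made-up numbers): one extreme value drags Pearson down, while Spearman, which only sees ranks, stays at 1.

```python
from scipy.stats import pearsonr, spearmanr

# Made-up data (illustration only): a clean increasing trend
# whose last point is an extreme outlier
x = list(range(1, 11))
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

pearson = pearsonr(x, y)[0]
spearman = spearmanr(x, y)[0]

print(f"Pearson:  {pearson:.3f}")   # pulled well below 1 by the outlier
print(f"Spearman: {spearman:.3f}")  # still 1.000: the ordering is unchanged
```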

I hope this post helped you understand not only how to calculate the Spearman correlation coefficient, but also when to use it and why it is an important tool in data analysis.

Thanks for reading!

