The Pearson Correlation Coefficient, Simply Explained

Before building a regression model, which fits a straight line to the data to predict future values, we first visualize our data to get an idea of how it looks and to spot patterns and relationships.
The data may appear to show a positive linear relationship, but we confirm it by calculating the Pearson correlation coefficient, which tells us how closely our data follows a straight line.
Let's look at a simple salary dataset to understand the Pearson correlation coefficient.
The data consists of two columns:
- YearsExperience: the number of years a person has been working
- Salary (target): the salary corresponding to those years of experience, in US dollars
Now we need to build a model that predicts salary based on years of experience.
We know that this can be done with a simple linear regression model, because we have one predictor and one continuous target variable.
But can we directly use such a simple regression algorithm?
No.
Linear regression has several assumptions that must hold before we apply it, and one of them is linearity.
We need to check for linearity, and for that, we calculate the correlation coefficient.
But what is linearity?
Let's understand this with an example.
From the table above, we see that for every one-year increase, there is an increase of $5,000 in salary.
The change is constant, and when we plot these values, we get a straight line.
This type of relationship is called a linear relationship.
In linear regression, we fit a straight line to the data to predict future values, and this works only when the data has a linear relationship.
Therefore, we need to check the linearity of our data.
For that, let's calculate the correlation.
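As a quick sketch of the "constant change gives a straight line" idea, here is a tiny example with made-up numbers (the $30,000 base salary below is an illustrative assumption, not from the dataset):

```python
import numpy as np

# Hypothetical, perfectly linear data: salary rises by a constant
# $5,000 for each additional year (the $30,000 base is made up)
years = np.array([1, 2, 3, 4, 5])
salary = 30000 + 5000 * years

# The Pearson correlation of perfectly linear rising data is exactly 1
r = np.corrcoef(years, salary)[0, 1]
print(round(r, 4))
```

Any data that lies exactly on a rising straight line gives r = 1, no matter how steep the line is.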
Before that, we first visualize the data using a scatter plot to get an idea of the relationship between these two variables.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Load the dataset
df = pd.read_csv("C:/Salary_dataset.csv")
# Set plot style
sns.set(style="whitegrid")
# Create scatter plot
plt.figure(figsize=(8, 5))
sns.scatterplot(x='YearsExperience', y='Salary', data=df, color='blue', s=60)
plt.title("Scatter Plot: Years of Experience vs Salary")
plt.xlabel("Years of Experience")
plt.ylabel("Salary (USD)")
plt.tight_layout()
plt.show()

We can see from the scatter plot that as years of experience increase, salary also tends to increase.
Although the points do not form a perfect straight line, the relationship appears to be strong and linear.
To confirm this, let's now calculate the Pearson correlation coefficient.
import pandas as pd
# Load the dataset
df = pd.read_csv("C:/Salary_dataset.csv")
# Calculate Pearson correlation
pearson_corr = df['YearsExperience'].corr(df['Salary'], method='pearson')
print(f"Pearson correlation coefficient: {pearson_corr:.4f}")
The Pearson correlation coefficient comes out to 0.9782.
The correlation coefficient always lies between -1 and +1:
- Close to +1: strong positive linear relationship
- Near 0: no linear relationship
- Close to -1: strong negative linear relationship
Here, we obtained a correlation coefficient of 0.9782, which means the data very closely follows a linear pattern and there is a strong positive relationship between the variables.
In this case, simple linear regression is a suitable choice for modeling this relationship.
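These three cases can be sketched with toy arrays (the numbers below are invented purely for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rising = np.array([2.1, 3.9, 6.2, 8.0, 9.8])    # moves with x
shuffled = np.array([2.0, 5.0, 1.0, 4.0, 3.0])  # no consistent pattern
falling = np.array([9.9, 8.1, 5.8, 4.2, 2.0])   # moves against x

r_pos = np.corrcoef(x, rising)[0, 1]    # close to +1
r_zero = np.corrcoef(x, shuffled)[0, 1]  # near 0
r_neg = np.corrcoef(x, falling)[0, 1]   # close to -1
print(round(r_pos, 3), round(r_zero, 3), round(r_neg, 3))
```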
But how is this Pearson correlation coefficient actually calculated?
Let's look at 10 sample rows from our dataset:

| YearsExperience (X) | Salary (Y) |
|---|---|
| 1.2 | 39344 |
| 3.3 | 64446 |
| 3.8 | 57190 |
| 4.1 | 56958 |
| 5.0 | 67939 |
| 5.4 | 83089 |
| 8.3 | 113813 |
| 8.8 | 109432 |
| 9.7 | 112636 |
| 10.4 | 122392 |

Now, let's calculate the Pearson correlation coefficient for this sample step by step.
When X and Y increase together, the correlation is said to be positive. On the other hand, if one increases while the other decreases, the correlation is negative.
First, let's calculate the variance of each variable.
Variance helps us understand how much the values spread out from their mean.
We will start by calculating the variance of X (years of experience).
To do that, we first need to compute the mean of X, denoted X̄.
\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]
\[
= \frac{1.2 + 3.3 + 3.8 + 4.1 + 5.0 + 5.4 + 8.3 + 8.8 + 9.7 + 10.4}{10}
\]
\[
= \frac{60.0}{10}
\]
\[
= 6.0
\]
Next, we subtract the mean from each value and square the result, so that positive and negative deviations do not cancel out:

| X_i | X_i − X̄ | (X_i − X̄)² |
|---|---|---|
| 1.2 | −4.8 | 23.04 |
| 3.3 | −2.7 | 7.29 |
| 3.8 | −2.2 | 4.84 |
| 4.1 | −1.9 | 3.61 |
| 5.0 | −1.0 | 1.00 |
| 5.4 | −0.6 | 0.36 |
| 8.3 | 2.3 | 5.29 |
| 8.8 | 2.8 | 7.84 |
| 9.7 | 3.7 | 13.69 |
| 10.4 | 4.4 | 19.36 |

We have now calculated the squared deviation of each value from the mean.
Next, we can find the variance of X by taking the average of these squared deviations.
\[
\text{Sample Variance of } X = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2
\]
\[
= \frac{23.04 + 7.29 + 4.84 + 3.61 + 1.00 + 0.36 + 5.29 + 7.84 + 13.69 + 19.36}{10 - 1}
\]
\[
= \frac{86.32}{9} \approx 9.59
\]
Here we divide by n − 1 instead of n because we are working with a sample rather than the full population; this is known as Bessel's correction.
The sample variance of X is 9.59. Because variance is expressed in squared units (squared years, here), it measures how spread out the values are around the mean, but it is hard to interpret directly.
Since the variance is in squared units, we take the square root to bring it back to the same unit as the original data.
This is called the standard deviation.
\[
s_X = \sqrt{\text{Sample Variance}} = \sqrt{9.59} \approx 3.10
\]
The standard deviation of X is about 3.10, which means the years-of-experience values typically fall around 3.10 years above or below the mean.
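The mean, sample variance, and sample standard deviation computed above can be double-checked with Python's built-in statistics module, which also divides by n − 1 for the sample statistics:

```python
import statistics

# The ten YearsExperience values from the sample
X = [1.2, 3.3, 3.8, 4.1, 5.0, 5.4, 8.3, 8.8, 9.7, 10.4]

mean_x = statistics.mean(X)      # sum of the values divided by n
var_x = statistics.variance(X)   # sum of squared deviations / (n - 1)
std_x = statistics.stdev(X)      # square root of the sample variance

print(round(mean_x, 2), round(var_x, 2), round(std_x, 2))
```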
In the same way, we calculate the variance and standard deviation of Y.
\[
\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i
\]
\[
= \frac{39344 + 64446 + 57190 + 56958 + 67939 + 83089 + 113813 + 109432 + 112636 + 122392}{10}
\]
\[
= \frac{827239}{10}
\]
\[
= 82,\!723.90
\]
\[
\text{Sample Variance of } Y = \frac{1}{n - 1} \sum (Y_i - \bar{Y})^2
\]
\[
= \frac{7,\!898,\!632,\!198.90}{9} = 877,\!625,\!799.88
\]
\[
\text{Standard Deviation of } Y \text{ is } s_Y = \sqrt{877,\!625,\!799.88} \approx 29,\!624.75
\]
We have now calculated the variance and standard deviation of both X and Y.
The next step is to calculate the covariance between X and Y.
We already have the means of X and Y, and the deviation of each value from its respective mean.
Now, we multiply these paired deviations to see how the two variables vary together:

| X_i | Y_i | (X_i − X̄)(Y_i − Ȳ) |
|---|---|---|
| 1.2 | 39344 | 208223.52 |
| 3.3 | 64446 | 49350.33 |
| 3.8 | 57190 | 56174.58 |
| 4.1 | 56958 | 48955.21 |
| 5.0 | 67939 | 14784.90 |
| 5.4 | 83089 | −219.06 |
| 8.3 | 113813 | 71504.93 |
| 8.8 | 109432 | 74782.68 |
| 9.7 | 112636 | 110674.77 |
| 10.4 | 122392 | 174539.64 |

By multiplying these deviations, we capture how X and Y move together.
If both X and Y are above their means, both deviations are positive, so the product is positive.
If both X and Y are below their means, both deviations are negative, and since a negative times a negative is positive, the product is again positive.
If one value is above its mean and the other is below, the product is negative.
These products tell us whether the two variables tend to move in the same direction (both increase or both decrease) or in opposite directions.
Using the sum of these deviation products, we now calculate the sample covariance.
\[
\text{Sample Covariance} = \frac{1}{n - 1} \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})
\]
\[
= \frac{808771.5}{10 - 1}
\]
\[
= \frac{808771.5}{9} = 89,\!863.5
\]
We obtained a sample covariance of 89,863.5. The positive sign shows that as experience increases, salary also tends to increase.
But the magnitude of the covariance depends on the units of the variables (here, year-dollars), so it does not tell us exactly how strong the relationship is.
This value only shows the direction.
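As a sketch, the covariance calculation above can be reproduced in plain Python using the ten sample pairs:

```python
# Ten sample (YearsExperience, Salary) pairs from this post
X = [1.2, 3.3, 3.8, 4.1, 5.0, 5.4, 8.3, 8.8, 9.7, 10.4]
Y = [39344, 64446, 57190, 56958, 67939, 83089,
     113813, 109432, 112636, 122392]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Multiply paired deviations, then divide their sum by n - 1
products = [(x - mean_x) * (y - mean_y) for x, y in zip(X, Y)]
sample_cov = sum(products) / (n - 1)
print(round(sample_cov, 1))  # matches the hand-calculated 89,863.5
```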
Now we divide the covariance by the product of the standard deviations of X and Y.
This gives us the Pearson correlation coefficient, which can be thought of as a standardized version of covariance.
Since the standard deviation of X is in years and the standard deviation of Y is in dollars, their product is in year-dollars, the same units as the covariance.
These units cancel out in the division, so the Pearson correlation is unitless.
The main reason we divide the covariance by the standard deviations is to normalize it, so the result is easy to interpret and can be compared across different datasets.
\[
r = \frac{\text{Cov}(X, Y)}{s_X \cdot s_Y}
= \frac{89,\!863.5}{3.0970 \times 29,\!624.75}
= \frac{89,\!863.5}{91,\!747.85} \approx 0.9795
\]
Therefore, the Pearson correlation coefficient (r) for this 10-row sample works out to approximately 0.9795 (using the unrounded s_X ≈ 3.0970), close to the 0.9782 we obtained from the full dataset with pandas.
This tells us that there is a strong positive linear relationship between years of experience and salary.
This is how the Pearson correlation coefficient is calculated by hand.
Formula for the Pearson correlation coefficient:
\[
r = \frac{\text{Cov}(X, Y)}{s_X \cdot s_Y}
= \frac{\frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
{\sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2} \cdot \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
\]
\[
= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \cdot \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
\]
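This formula can be implemented from scratch in a few lines of Python. The sketch below uses the ten sample pairs listed earlier and the simplified second form, in which the 1/(n − 1) factors cancel:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from the simplified formula."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of paired deviations
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the root sums of squared deviations
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)) * \
          math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return num / den

# The ten sample pairs from this post
X = [1.2, 3.3, 3.8, 4.1, 5.0, 5.4, 8.3, 8.8, 9.7, 10.4]
Y = [39344, 64446, 57190, 56958, 67939, 83089,
     113813, 109432, 112636, 122392]

print(round(pearson_r(X, Y), 4))
```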
We need to make sure that certain conditions are met before relying on the Pearson correlation coefficient:
- The relationship between the variables should be linear.
- Both variables must be continuous and numeric.
- There should be no strong outliers.
- The data should be approximately normally distributed.
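The outlier condition can be sketched in code. The |z| > 3 cutoff below is a common rule of thumb for flagging "strong" outliers, not something fixed by the dataset:

```python
import numpy as np

X = np.array([1.2, 3.3, 3.8, 4.1, 5.0, 5.4, 8.3, 8.8, 9.7, 10.4])
Y = np.array([39344, 64446, 57190, 56958, 67939, 83089,
              113813, 109432, 112636, 122392])

def strong_outliers(a):
    """Flag values more than 3 sample standard deviations from the mean."""
    z = (a - a.mean()) / a.std(ddof=1)
    return a[np.abs(z) > 3]

print(strong_outliers(X))  # empty: no strong outliers in this sample
print(strong_outliers(Y))  # empty as well
```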
Dataset attribution
The dataset used in this blog is the Salary dataset.
It is publicly available on Kaggle and licensed under the Creative Commons Zero (CC0: Public Domain) license. This means it can be freely used, modified, and shared for both non-commercial and commercial purposes without restriction.
I hope this has given you a clear understanding of how the Pearson correlation coefficient is calculated and when it is used.
Thanks for reading!



