Multiple Linear Regression, Explained Simply (Part 1)

In this blog post, we discuss multiple linear regression.
It is one of the first algorithms to learn on our Machine Learning journey, as it is an extension of simple linear regression.
In simple linear regression we have one independent variable and one target variable; in multiple linear regression we have two or more independent variables and one target variable.
Instead of just applying the algorithm in Python, in this blog let’s explore the math behind the multiple linear regression algorithm.
Let’s consider the Fish Market dataset to understand the math behind multiple linear regression.
This dataset includes physical attributes of each fish, such as:
- Species – the type of fish (e.g., Bream, Roach, Pike)
- Weight – the weight of the fish in grams (this will be our target variable)
- Length1, Length2, Length3 – various length measurements (in cm)
- Height – the height of the fish (in cm)
- Width – the diagonal width of the fish body (in cm)
To understand multiple linear regression, we’ll use just two independent variables to keep things simple and easy to visualize.
We will work with a 20-point sample from this dataset: measurements of 20 individual fish, specifically their height and width along with the corresponding weight. These three values will help us understand how multiple linear regression works in practice.
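If you want to draw a similar sample yourself, here is a minimal sketch. It assumes you have downloaded Fish.csv from the Kaggle Fish Market dataset (column names as they appear on Kaggle); the exact rows you get may differ from the 20 hard-coded below.
import pandas as pd
# Assumes Fish.csv from the Kaggle Fish Market dataset is in the working directory
fish = pd.read_csv("Fish.csv")
# Keep only the columns used in this post: the two features and the target
sample = fish[["Height", "Width", "Weight"]].head(20)
print(sample)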
First, let’s use Python to fit a multiple linear regression model on our 20-point sample data.
Code:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# 20-point sample data from Fish Market dataset
data = [
[11.52, 4.02, 242.0],
[12.48, 4.31, 290.0],
[12.38, 4.70, 340.0],
[12.73, 4.46, 363.0],
[12.44, 5.13, 430.0],
[13.60, 4.93, 450.0],
[14.18, 5.28, 500.0],
[12.67, 4.69, 390.0],
[14.00, 4.84, 450.0],
[14.23, 4.96, 500.0],
[14.26, 5.10, 475.0],
[14.37, 4.81, 500.0],
[13.76, 4.37, 500.0],
[13.91, 5.07, 340.0],
[14.95, 5.17, 600.0],
[15.44, 5.58, 600.0],
[14.86, 5.29, 700.0],
[14.94, 5.20, 700.0],
[15.63, 5.13, 610.0],
[14.47, 5.73, 650.0]
]
# Create DataFrame
df = pd.DataFrame(data, columns=["Height", "Width", "Weight"])
# Independent variables (Height and Width)
X = df[["Height", "Width"]]
# Target variable (Weight)
y = df["Weight"]
# Fit the model
model = LinearRegression().fit(X, y)
# Extract coefficients
b0 = model.intercept_ # β₀
b1, b2 = model.coef_ # β₁ (Height), β₂ (Width)
# Print results
print(f"Intercept (β₀): {b0:.4f}")
print(f"Height slope (β₁): {b1:.4f}")
print(f"Width slope (β₂): {b2:.4f}")
Results:
Intercept (β₀): -1005.2810
Height slope (β₁): 78.1404
Width slope (β₂): 82.0572
Here, we haven’t done a train-test split because the dataset is small and our goal is to understand the math behind the model, not to build a production model.
We applied multiple linear regression in Python on our sample dataset and got the results above.
What’s the next step?
Evaluating the model to see how good its predictions are?
Not today!
We are not going to evaluate the model until we understand how we got those slope and intercept values in the first place.
First, we will understand how the model works behind the scenes and then approach those slope and intercept values using math.
First, let’s plot our sample data.

When it comes to simple linear regression, we only have one independent variable, and the data is two-dimensional. We try to find the line that best fits the data.
In multiple linear regression, we may have two or more independent variables, and the data is three-dimensional. We try to find a plane that best fits the data.
Here, we considered two independent variables, which means we have to find a plane that best fits the data.

The Equation of the Plane is:
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2
\]
where
y: the predicted value of the dependent (target) variable
β₀: the intercept (the value of y when all x’s are 0)
β₁: the coefficient (or slope) for feature x₁
β₂: the coefficient for feature x₂
x₁, x₂: the independent variables (features)
Let’s say we calculated the intercept and slope values, and we want to calculate the weight at a particular point i.
For that, we substitute the respective values into the equation; the result is called the predicted value, whereas the actual value is the one recorded in our dataset.
Let us denote the predicted value by ŷᵢ.
\[
\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}
\]
yᵢ represents the actual value and ŷᵢ represents the predicted value.
Now, at point i, let’s find the difference between the actual value and the predicted value, i.e., the residual.
\[
\text{Residual}_i = y_i - \hat{y}_i
\]
For n data points, the total residual will be
\[
\sum_{i=1}^{n} (y_i - \hat{y}_i)
\]
If we calculate just the sum of residuals, the positive and negative errors can cancel out, resulting in a misleadingly small total error.
Squaring the residuals solves this by ensuring all errors contribute positively, while also giving more importance to larger deviations.
So, we calculate the sum of squared residuals:
\[
\text{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]
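To make this concrete, here is a small sketch (reusing df from the code above) of the SSR as a function of a candidate β₀, β₁, β₂:
import numpy as np
# Sum of squared residuals for a candidate plane (reuses df from the earlier code)
def ssr(b0, b1, b2):
    y_pred = b0 + b1 * df["Height"] + b2 * df["Width"]
    residuals = df["Weight"] - y_pred
    return float(np.sum(residuals ** 2))
# SSR at the coefficients sklearn found earlier
print(ssr(-1005.2810, 78.1404, 82.0572))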

Here in multiple linear regression, the model tries to fit a plane through the data such that the sum of squared residuals is minimized.
We already know the equation of the plane:
\[
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2
\]
Now we need to find the equation of the plane that best fits our sample data, minimizing the sum of squared residuals.
We already know that ŷ is the predicted value and x₁ and x₂ are the values from the dataset.
That leaves the terms β₀, β₁ and β₂.
How can we find these slopes and intercept values?
Before that, let’s see what happens to the plane when we change the intercept (β₀).

Now, let’s see what happens when we change the slopes β₁ and β₂.


We can observe how changing the slopes and intercept affects the regression plane.
We need to find those exact values of slopes and intercept, where the sum of squared residuals is minimum.
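We can check this numerically with the ssr helper sketched above: nudging any coefficient away from the fitted values increases the SSR.
print(ssr(-1005.2810, 78.1404, 82.0572))  # at the fitted plane (smallest)
print(ssr(-1005.2810, 90.0, 82.0572))     # β₁ moved away: SSR grows
print(ssr(-900.0, 78.1404, 82.0572))      # β₀ moved away: SSR grows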
Now, we want to find the best fitting plane
\[
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2
\]
that minimizes the Sum of Squared Residuals (SSR):
\[
SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2})^2
\]
where
\[
\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}
\]
How can we find this equation of best fitting plane?
Before proceeding further, let’s go back to our school days.
I used to wonder why we needed to learn topics like differentiation, integration, and limits. Do we really use them in real life?
I thought that way because I found these topics difficult to understand. But when it came to relatively simpler topics like matrices (at least to some extent), I never questioned why we were learning them or what their use was.
It was when I began learning about Machine Learning that I started focusing on these topics.
Now coming back to the discussion, let’s consider a straight line.
y = 2x + 1

Let’s plot these values.

Let’s consider two points on the straight line.
(x₁, y₁) = (2, 3) and (x₂, y₂) = (3, 5)
Now we find the slope.
\[
m = \frac{y_2 - y_1}{x_2 - x_1} = \frac{\text{change in } y}{\text{change in } x}
\]
\[
m = \frac{y_2 - y_1}{x_2 - x_1} = \frac{5 - 3}{3 - 2} = \frac{2}{1} = 2
\]
The slope is ‘2’.
If we consider any two points and calculate the slope, the value remains the same, which means the change in y with respect to the change in x is the same throughout the line.
Now, let’s consider the equation y = x².

Let’s plot these values.

y = x² represents a curve (a parabola).
What’s the slope of this curve?
Do we have a single slope for this curve?
NO.
We can observe that the slope changes continuously, meaning the rate of change in y with respect to x is not the same throughout the curve.
This shows that the slope changes from one point on the curve to another.
In other words, we can find the slope at each specific point, but there isn’t one single slope that represents the entire curve.
So, how do we find the slope of this curve?
This is where we introduce Differentiation.
First, let’s consider a point x on the x-axis and another point that is at a distance h from it, i.e., the point x+h.
The corresponding y-coordinates for these x-values would be f(x) and f(x+h), since y is a function of x.
Now we considered two points on the curve (x, f(x)) and (x+h, f(x+h)).
Now we join those two points; the line that joins two points on a curve is called a secant line.
Let’s find the slope between those two points.
\[
\text{slope} = \frac{f(x + h) - f(x)}{(x + h) - x}
\]
This gives us the average rate of change of ‘y’ with respect to ‘x’ over that interval.
But since we want to find the slope at a particular point, we gradually decrease the distance ‘h’ between the two points.
As these two points come closer and eventually coincide, the secant line (which joins the two points) becomes a tangent line to the curve at that point. This limiting value of the slope can be found using the concept of limits.
A tangent line is a straight line that just touches a curve at one single point.
It shows the instantaneous slope of the curve at that point.
\[
\frac{dy}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
\]


This is the concept of differentiation.
Now let’s find the slope of the curve y = x².
\[
\text{Given: } f(x) = x^2
\]
\[
\text{Derivative: } f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
\]
\[
= \lim_{h \to 0} \frac{(x + h)^2 - x^2}{h}
\]
\[
= \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h}
\]
\[
= \lim_{h \to 0} \frac{2xh + h^2}{h}
\]
\[
= \lim_{h \to 0} (2x + h)
\]
\[
= 2x
\]
So 2x is the slope of the curve y = x².
For example, at x = 2 on the curve y = x², the slope is 2x = 2 × 2 = 4.
At this point, we have the coordinate (2,4) on the curve, and the slope at that point is 4.
This means that at that exact point, for every 1 unit change in x, there is a 4 unit change in y.
Now consider x = 0: the slope is 2 × 0 = 0, which means there is no change in y with respect to x at that point.
At x = 0 we also have y = 0, so the point (0, 0) has slope 0, which means (0, 0) is the minimum point of the curve.
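We can see the same thing numerically: a quick sketch that shrinks h shows the secant slope of f(x) = x² approaching 4 at x = 2 and 0 at x = 0.
def f(x):
    return x ** 2
# Secant slope (f(x + h) - f(x)) / h for smaller and smaller h
for h in [1.0, 0.1, 0.01, 0.001]:
    slope_at_2 = (f(2 + h) - f(2)) / h  # approaches 4
    slope_at_0 = (f(0 + h) - f(0)) / h  # approaches 0
    print(h, slope_at_2, slope_at_0)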
Now that we’ve understood the basics of differentiation, let’s proceed to find the best-fitted plane.
Now, let’s go back to the cost function
\[
SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2})^2
\]
This is also a quadratic, bowl-shaped function of the coefficients, since it contains squared terms.
In simple linear regression the cost function is:
\[
SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
\]
When we take a range of slope and intercept values, compute the SSR for each, and plot them, we can see a bowl-shaped surface.

In the same way as in simple linear regression, we need to find the point where the slope equals zero, which means the point at which we get the minimum value of the Sum of Squared Residuals (SSR).
Here, this corresponds to finding the values of β₀, β₁, and β₂ where the SSR is minimum. This happens when the derivatives of SSR with respect to each coefficient are equal to zero.
In other words, at this point, there is no change in SSR even with a slight change in β₀, β₁ or β₂, indicating that we have reached the minimum point of the cost function.
In simple words: in our y = x² example, the derivative (slope) 2x equals 0 at x = 0, and at that point y reaches its minimum, which in this case is zero.
In our loss function, the SSR plays the role that y played in that example: we look for the point where the slope of the loss function becomes zero.
In the y = x² example, the slope depends on only one variable, x, but in our loss function the slope depends on three variables: β₀, β₁ and β₂.
So we are searching in a four-dimensional space. Just as (0, 0) was the minimum point for y = x², in multiple linear regression we need to find the point (β₀, β₁, β₂, SSR) where the slopes (partial derivatives) equal zero.
Now let’s proceed with the derivation.
Since the Sum of Squared Residuals (SSR) depends on the parameters β₀, β₁ and β₂,
we can represent it as a function of these parameters:
\[
L(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2})^2
\]
Derivation:
Here, we are working with three variables, so we cannot use regular differentiation. Instead, we differentiate each variable separately while keeping the others constant. This process is called Partial Differentiation.
Partial Differentiation w.r.t β₀
\[
\textbf{Loss:}\quad L(\beta_0,\beta_1,\beta_2)=\sum_{i=1}^{n}\big(y_i-\beta_0-\beta_1 x_{i1}-\beta_2 x_{i2}\big)^2
\]
\[
\textbf{Let } e_i = y_i-\beta_0-\beta_1 x_{i1}-\beta_2 x_{i2}\quad\Rightarrow\quad L=\sum e_i^2.
\]
\[
\textbf{Differentiate:}\quad
\frac{\partial L}{\partial \beta_0}
= \sum_{i=1}^{n} 2 e_i \cdot \frac{\partial e_i}{\partial \beta_0}
\quad\text{(chain rule: } \frac{d}{d\theta}u^2=2u\,\frac{du}{d\theta}\text{)}
\]
\[
\text{But }\frac{\partial e_i}{\partial \beta_0}
=\frac{\partial}{\partial \beta_0}(y_i-\beta_0-\beta_1 x_{i1}-\beta_2 x_{i2})
=\frac{\partial y_i}{\partial \beta_0}
-\frac{\partial \beta_0}{\partial \beta_0}
-\frac{\partial (\beta_1 x_{i1})}{\partial \beta_0}
-\frac{\partial (\beta_2 x_{i2})}{\partial \beta_0}.
\]
\[
\text{Since } y_i,\; x_{i1},\; x_{i2} \text{ are constants w.r.t. } \beta_0,\;
\text{their derivatives are zero. Hence } \frac{\partial e_i}{\partial \beta_0}=-1.
\]
\[
\Rightarrow\quad \frac{\partial L}{\partial \beta_0}
= \sum 2 e_i \cdot (-1) = -2\sum_{i=1}^{n} e_i.
\]
\[
\textbf{Set to zero (first-order condition):}\quad
\frac{\partial L}{\partial \beta_0}=0 \;\Rightarrow\; \sum_{i=1}^{n} e_i = 0.
\]
\[
\textbf{Expand } e_i:\quad
\sum_{i=1}^{n}\big(y_i-\beta_0-\beta_1 x_{i1}-\beta_2 x_{i2}\big)=0
\Rightarrow
\sum y_i - n\beta_0 - \beta_1\sum x_{i1} - \beta_2\sum x_{i2}=0.
\]
\[
\textbf{Solve for } \beta_0:\quad
\beta_0=\bar{y}-\beta_1 \bar{x}_1-\beta_2 \bar{x}_2
\quad\text{(divide by }n\text{ and use } \bar{y}=\frac{1}{n}\sum y_i,\; \bar{x}_k=\frac{1}{n}\sum x_{ik}).
\]
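This first condition has a nice interpretation: at the best-fitting plane the residuals sum to zero. A quick check on the sklearn model fitted earlier (any tiny non-zero value is just floating-point error):
# Residuals of the fitted model sum to (approximately) zero
residuals = y - model.predict(X)
print(residuals.sum())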
Partial Differentiation w.r.t β₁
\[
\textbf{Differentiate:}\quad
\frac{\partial L}{\partial \beta_1}
= \sum_{i=1}^{n} 2 e_i \cdot \frac{\partial e_i}{\partial \beta_1}.
\]
\[
\text{Here }\frac{\partial e_i}{\partial \beta_1}
=\frac{\partial}{\partial \beta_1}(y_i-\beta_0-\beta_1 x_{i1}-\beta_2 x_{i2})=-x_{i1}.
\]
\[
\Rightarrow\quad
\frac{\partial L}{\partial \beta_1}
= \sum 2 e_i (-x_{i1})
= -2\sum_{i=1}^{n} x_{i1} e_i.
\]
\[
\textbf{Set to zero:}\quad
\frac{\partial L}{\partial \beta_1}=0
\;\Rightarrow\; \sum_{i=1}^{n} x_{i1} e_i = 0.
\]
\[
\textbf{Expand } e_i:\quad
\sum x_{i1}\big(y_i-\beta_0-\beta_1 x_{i1}-\beta_2 x_{i2}\big)=0
\]
\[
\Rightarrow\;
\sum x_{i1}y_i - \beta_0\sum x_{i1} - \beta_1\sum x_{i1}^2 - \beta_2\sum x_{i1}x_{i2}=0.
\]
Partial Differentiation w.r.t β₂
\[
\textbf{Differentiate:}\quad
\frac{\partial L}{\partial \beta_2}
= \sum_{i=1}^{n} 2 e_i \cdot \frac{\partial e_i}{\partial \beta_2}.
\]
\[
\text{Here }\frac{\partial e_i}{\partial \beta_2}
=\frac{\partial}{\partial \beta_2}(y_i-\beta_0-\beta_1 x_{i1}-\beta_2 x_{i2})=-x_{i2}.
\]
\[
\Rightarrow\quad
\frac{\partial L}{\partial \beta_2}
= \sum 2 e_i (-x_{i2})
= -2\sum_{i=1}^{n} x_{i2} e_i.
\]
\[
\textbf{Set to zero:}\quad
\frac{\partial L}{\partial \beta_2}=0
\;\Rightarrow\; \sum_{i=1}^{n} x_{i2} e_i = 0.
\]
\[
\textbf{Expand } e_i:\quad
\sum x_{i2}\big(y_i-\beta_0-\beta_1 x_{i1}-\beta_2 x_{i2}\big)=0
\]
\[
\Rightarrow\;
\sum x_{i2}y_i - \beta_0\sum x_{i2} - \beta_1\sum x_{i1}x_{i2} - \beta_2\sum x_{i2}^2=0.
\]
We obtained these three equations after performing partial differentiation.
\[
\sum y_i - n\beta_0 - \beta_1\sum x_{i1} - \beta_2\sum x_{i2} = 0 \quad (1)
\]
\[
\sum x_{i1}y_i - \beta_0\sum x_{i1} - \beta_1\sum x_{i1}^2 - \beta_2\sum x_{i1}x_{i2} = 0 \quad (2)
\]
\[
\sum x_{i2}y_i - \beta_0\sum x_{i2} - \beta_1\sum x_{i1}x_{i2} - \beta_2\sum x_{i2}^2 = 0 \quad (3)
\]
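Since these three normal equations are linear in β₀, β₁ and β₂, we can also hand them straight to a linear solver. A minimal sketch on our 20-point sample (reusing df from above):
import numpy as np
x1 = df["Height"].to_numpy()
x2 = df["Width"].to_numpy()
yv = df["Weight"].to_numpy()
n = len(yv)
# Normal equations (1)-(3) written as A @ [β₀, β₁, β₂] = b
A = np.array([
    [n,        x1.sum(),      x2.sum()],
    [x1.sum(), (x1**2).sum(), (x1*x2).sum()],
    [x2.sum(), (x1*x2).sum(), (x2**2).sum()],
])
b = np.array([yv.sum(), (x1*yv).sum(), (x2*yv).sum()])
beta0, beta1, beta2 = np.linalg.solve(A, b)
print(beta0, beta1, beta2)  # matches sklearn's intercept and slopes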
Now we solve these three equations to get the values of β₀, β₁ and β₂.
From equation (1):
\[
\sum y_i - n\beta_0 - \beta_1\sum x_{i1} - \beta_2\sum x_{i2} = 0
\]
Rearranged:
\[
n\beta_0 = \sum y_i - \beta_1\sum x_{i1} - \beta_2\sum x_{i2}
\]
Divide both sides by \( n \):
\[
\beta_0 = \frac{1}{n}\sum y_i - \beta_1\frac{1}{n}\sum x_{i1} - \beta_2\frac{1}{n}\sum x_{i2}
\]
Define the averages:
\[
\bar{y} = \frac{1}{n}\sum y_i,\quad
\bar{x}_1 = \frac{1}{n}\sum x_{i1},\quad
\bar{x}_2 = \frac{1}{n}\sum x_{i2}
\]
Final form for the intercept:
\[
\beta_0 = \bar{y} - \beta_1\bar{x}_1 - \beta_2\bar{x}_2
\]
Let’s substitute β₀ into equation (2).
Step 1: Start with Equation (2)
\[
\sum x_{i1}y_i - \beta_0\sum x_{i1} - \beta_1\sum x_{i1}^2 - \beta_2\sum x_{i1}x_{i2} = 0
\]
Step 2: Substitute the expression for \( \beta_0 \)
\[
\beta_0 = \frac{\sum y_i - \beta_1\sum x_{i1} - \beta_2\sum x_{i2}}{n}
\]
Step 3: Substitute into Equation (2)
\[
\sum x_{i1}y_i
- \left( \frac{\sum y_i - \beta_1\sum x_{i1} - \beta_2\sum x_{i2}}{n} \right)\sum x_{i1}
- \beta_1 \sum x_{i1}^2
- \beta_2 \sum x_{i1}x_{i2} = 0
\]
Step 4: Expand and simplify
\[
\sum x_{i1}y_i
- \frac{ \sum x_{i1} \sum y_i }{n}
+ \beta_1 \cdot \frac{ \left( \sum x_{i1} \right)^2 }{n}
+ \beta_2 \cdot \frac{ \sum x_{i1} \sum x_{i2} }{n}
- \beta_1 \sum x_{i1}^2
- \beta_2 \sum x_{i1}x_{i2}
= 0
\]
Step 5: Rearranged form (Equation 4)
\[
\beta_1 \left( \sum x_{i1}^2 - \frac{ \left( \sum x_{i1} \right)^2 }{n} \right)
+
\beta_2 \left( \sum x_{i1}x_{i2} - \frac{ \sum x_{i1} \sum x_{i2} }{n} \right)
=
\sum x_{i1}y_i - \frac{ \sum x_{i1} \sum y_i }{n}
\quad \text{(4)}
\]
Now let’s substitute β₀ into equation (3):
Step 1: Start with Equation (3)
\[
\sum x_{i2}y_i - \beta_0\sum x_{i2} - \beta_1\sum x_{i1}x_{i2} - \beta_2\sum x_{i2}^2 = 0
\]
Step 2: Use the expression for \( \beta_0 \)
\[
\beta_0 = \frac{\sum y_i - \beta_1\sum x_{i1} - \beta_2\sum x_{i2}}{n}
\]
Step 3: Substitute \( \beta_0 \) into Equation (3)
\[
\sum x_{i2}y_i
- \left( \frac{\sum y_i - \beta_1\sum x_{i1} - \beta_2\sum x_{i2}}{n} \right)\sum x_{i2}
- \beta_1 \sum x_{i1}x_{i2}
- \beta_2 \sum x_{i2}^2 = 0
\]
Step 4: Expand the expression
\[
\sum x_{i2}y_i
- \frac{ \sum x_{i2} \sum y_i }{n}
+ \beta_1 \cdot \frac{ \sum x_{i1} \sum x_{i2} }{n}
+ \beta_2 \cdot \frac{ \left( \sum x_{i2} \right)^2 }{n}
- \beta_1 \sum x_{i1}x_{i2}
- \beta_2 \sum x_{i2}^2 = 0
\]
Step 5: Rearranged form (Equation 5)
\[
\beta_1 \left( \sum x_{i1}x_{i2} - \frac{ \sum x_{i1} \sum x_{i2} }{n} \right)
+
\beta_2 \left( \sum x_{i2}^2 - \frac{ \left( \sum x_{i2} \right)^2 }{n} \right)
=
\sum x_{i2}y_i - \frac{ \sum x_{i2} \sum y_i }{n}
\quad \text{(5)}
\]
We got these two equations:
\[
\beta_1 \left( \sum x_{i1}^2 - \frac{ \left( \sum x_{i1} \right)^2 }{n} \right)
+
\beta_2 \left( \sum x_{i1}x_{i2} - \frac{ \sum x_{i1} \sum x_{i2} }{n} \right)
=
\sum x_{i1}y_i - \frac{ \sum x_{i1} \sum y_i }{n}
\quad \text{(4)}
\]
\[
\beta_1 \left( \sum x_{i1}x_{i2} - \frac{ \sum x_{i1} \sum x_{i2} }{n} \right)
+
\beta_2 \left( \sum x_{i2}^2 - \frac{ \left( \sum x_{i2} \right)^2 }{n} \right)
=
\sum x_{i2}y_i - \frac{ \sum x_{i2} \sum y_i }{n}
\quad \text{(5)}
\]
Now, we use Cramer’s rule to get the formulas for β₁ and β₂.
We start from the simplified equations (4) and (5) above.
Let us define:
\( A = \sum x_{i1}^2 - \frac{(\sum x_{i1})^2}{n} \)
\( B = \sum x_{i1}x_{i2} - \frac{(\sum x_{i1})(\sum x_{i2})}{n} \)
\( D = \sum x_{i2}^2 - \frac{(\sum x_{i2})^2}{n} \)
\( C = \sum x_{i1}y_i - \frac{(\sum x_{i1})(\sum y_i)}{n} \)
\( E = \sum x_{i2}y_i - \frac{(\sum x_{i2})(\sum y_i)}{n} \)
Now, rewrite the system:
\[
\begin{cases}
\beta_1 A + \beta_2 B = C \\
\beta_1 B + \beta_2 D = E
\end{cases}
\]
We solve this 2×2 system using Cramer’s Rule.
First, compute the determinant:
\[
\Delta = AD - B^2
\]
Then apply Cramer’s Rule:
\[
\beta_1 = \frac{CD - BE}{AD - B^2}, \qquad
\beta_2 = \frac{AE - BC}{AD - B^2}
\]
Now substitute back the original summation terms:
\[
\beta_1 =
\frac{
\left( \sum x_{i2}^2 - \frac{(\sum x_{i2})^2}{n} \right)
\left( \sum x_{i1}y_i - \frac{(\sum x_{i1})(\sum y_i)}{n} \right)
-
\left( \sum x_{i1}x_{i2} - \frac{(\sum x_{i1})(\sum x_{i2})}{n} \right)
\left( \sum x_{i2}y_i - \frac{(\sum x_{i2})(\sum y_i)}{n} \right)
}{
\left[
\left( \sum x_{i1}^2 - \frac{(\sum x_{i1})^2}{n} \right)
\left( \sum x_{i2}^2 - \frac{(\sum x_{i2})^2}{n} \right)
-
\left( \sum x_{i1}x_{i2} - \frac{(\sum x_{i1})(\sum x_{i2})}{n} \right)^2
\right]
}
\]
\[
\beta_2 =
\frac{
\left( \sum x_{i1}^2 - \frac{(\sum x_{i1})^2}{n} \right)
\left( \sum x_{i2}y_i - \frac{(\sum x_{i2})(\sum y_i)}{n} \right)
-
\left( \sum x_{i1}x_{i2} - \frac{(\sum x_{i1})(\sum x_{i2})}{n} \right)
\left( \sum x_{i1}y_i - \frac{(\sum x_{i1})(\sum y_i)}{n} \right)
}{
\left[
\left( \sum x_{i1}^2 - \frac{(\sum x_{i1})^2}{n} \right)
\left( \sum x_{i2}^2 - \frac{(\sum x_{i2})^2}{n} \right)
-
\left( \sum x_{i1}x_{i2} - \frac{(\sum x_{i1})(\sum x_{i2})}{n} \right)^2
\right]
}
\]
If the data are centered (means are zero), then the second terms vanish and we get the simplified form:
\[
\beta_1 =
\frac{
(\sum x_{i2}^2)(\sum x_{i1}y_i)
-
(\sum x_{i1}x_{i2})(\sum x_{i2}y_i)
}{
(\sum x_{i1}^2)(\sum x_{i2}^2) - (\sum x_{i1}x_{i2})^2
}
\]
\[
\beta_2 =
\frac{
(\sum x_{i1}^2)(\sum x_{i2}y_i)
-
(\sum x_{i1}x_{i2})(\sum x_{i1}y_i)
}{
(\sum x_{i1}^2)(\sum x_{i2}^2) - (\sum x_{i1}x_{i2})^2
}
\]
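As a sketch, here is the same computation in code: center the variables, build the five sums, and apply the simplified formulas (reusing df from above). The numbers should match the hand computation later in this post.
# Center the two features and the target
x1c = df["Height"] - df["Height"].mean()
x2c = df["Width"] - df["Width"].mean()
yc = df["Weight"] - df["Weight"].mean()
# The five centered sums that appear in the simplified formulas
S11 = (x1c ** 2).sum()
S22 = (x2c ** 2).sum()
S12 = (x1c * x2c).sum()
S1y = (x1c * yc).sum()
S2y = (x2c * yc).sum()
denom = S11 * S22 - S12 ** 2
beta1 = (S22 * S1y - S12 * S2y) / denom
beta2 = (S11 * S2y - S12 * S1y) / denom
beta0 = df["Weight"].mean() - beta1 * df["Height"].mean() - beta2 * df["Width"].mean()
print(beta0, beta1, beta2)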
Finally, we have derived the formulas for β₁ and β₂.
Let us compute β₀, β₁ and β₂ for our sample dataset, but before that let’s understand what centering actually means.
We start with a small dataset of 3 observations and 2 features:
\[
\begin{array}{|c|c|c|c|}
\hline
\text{i} & x_{i1} & x_{i2} & y_i \\
\hline
1 & 2 & 3 & 10 \\
2 & 4 & 5 & 14 \\
3 & 6 & 7 & 18 \\
\hline
\end{array}
\]
Step 1: Compute means
\[
\bar{x}_1 = \frac{2 + 4 + 6}{3} = 4, \quad
\bar{x}_2 = \frac{3 + 5 + 7}{3} = 5, \quad
\bar{y} = \frac{10 + 14 + 18}{3} = 14
\]
Step 2: Center the data (subtract the mean)
\[
x'_{i1} = x_{i1} - \bar{x}_1, \quad
x'_{i2} = x_{i2} - \bar{x}_2, \quad
y'_i = y_i - \bar{y}
\]
\[
\begin{array}{|c|c|c|c|}
\hline
\text{i} & x'_{i1} & x'_{i2} & y'_i \\
\hline
1 & -2 & -2 & -4 \\
2 & 0 & 0 & 0 \\
3 & +2 & +2 & +4 \\
\hline
\end{array}
\]
Now check the sums:
\[
\sum x'_{i1} = -2 + 0 + 2 = 0, \quad
\sum x'_{i2} = -2 + 0 + 2 = 0, \quad
\sum y'_i = -4 + 0 + 4 = 0
\]
Step 3: Understand what centering does to certain terms
In the normal equations, we see terms like:
\[
\sum x_{i1} y_i - \frac{ \sum x_{i1} \sum y_i }{n}
\]
If the data are centered:
\[
\sum x_{i1} = 0, \quad \sum y_i = 0 \quad \Rightarrow \quad \frac{0 \cdot 0}{n} = 0
\]
So the term becomes:
\[
\sum x_{i1} y_i
\]
And if we directly use the centered values:
\[
\sum x'_{i1} y'_i
\]
These are equivalent:
\[
\sum (x_{i1} - \bar{x}_1)(y_i - \bar{y}) = \sum x_{i1} y_i - \frac{ \sum x_{i1} \sum y_i }{n}
\]
Step 4: Compare raw and centered calculation
Using original values:
\[
\sum x_{i1} y_i = (2)(10) + (4)(14) + (6)(18) = 184
\]
\[
\sum x_{i1} = 12, \quad \sum y_i = 42, \quad n = 3
\]
\[
\frac{12 \cdot 42}{3} = 168
\]
\[
\sum x_{i1} y_i - \frac{ \sum x_{i1} \sum y_i }{n} = 184 - 168 = 16
\]
Now using centered values:
\[
\sum x'_{i1} y'_i = (-2)(-4) + (0)(0) + (2)(4) = 8 + 0 + 8 = 16
\]
Same result.
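The same check takes a couple of lines of numpy (using the 3-point example above):
import numpy as np
x1 = np.array([2, 4, 6])
yv = np.array([10, 14, 18])
raw = (x1 * yv).sum() - x1.sum() * yv.sum() / len(yv)     # 184 - 168 = 16
centered = ((x1 - x1.mean()) * (yv - yv.mean())).sum()    # also 16
print(raw, centered)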
Step 5: Why we center
– Simplifies the formulas by removing extra terms
– Ensures mean of all variables is zero
– Improves numerical stability
– Makes intercept easier to calculate:
\[
\beta_0 = \bar{y} - \beta_1 \bar{x}_1 - \beta_2 \bar{x}_2
\]
Step 6: Work with the centered sums
After centering, we can directly use:
\[
\sum (x'_{i1})(y'_i), \quad
\sum (x'_{i2})(y'_i), \quad
\sum (x'_{i1})^2, \quad
\sum (x'_{i2})^2, \quad
\sum (x'_{i1})(x'_{i2})
\]
And the simplified formulas for \( \beta_1 \) and \( \beta_2 \) become easier to compute.
This is how we derived the formulas for β₀, β₁ and β₂.
\[
\beta_1 =
\frac{
\left( \sum x_{i2}^2 \right)\left( \sum x_{i1} y_i \right)
-
\left( \sum x_{i1} x_{i2} \right)\left( \sum x_{i2} y_i \right)
}{
\left( \sum x_{i1}^2 \right)\left( \sum x_{i2}^2 \right)
-
\left( \sum x_{i1} x_{i2} \right)^2
}
\]
\[
\beta_2 =
\frac{
\left( \sum x_{i1}^2 \right)\left( \sum x_{i2} y_i \right)
-
\left( \sum x_{i1} x_{i2} \right)\left( \sum x_{i1} y_i \right)
}{
\left( \sum x_{i1}^2 \right)\left( \sum x_{i2}^2 \right)
-
\left( \sum x_{i1} x_{i2} \right)^2
}
\]
\[
\beta_0 = \bar{y} - \beta_1 \bar{x}_1 - \beta_2 \bar{x}_2
\quad \text{(recovered from the original, uncentered means)}
\]
Note: After centering, we continue using the same symbols \( x_{i1}, x_{i2}, y_i \) to represent the centered variables.
Now, let’s compute β₀, β₁ and β₂ for our sample dataset.
Step 1: Compute Means (Original Data)
$$
\bar{x}_1 = \frac{1}{n} \sum x_{i1} = 13.841, \quad
\bar{x}_2 = \frac{1}{n} \sum x_{i2} = 4.9385, \quad
\bar{y} = \frac{1}{n} \sum y_i = 481.5
$$
Step 2: Center the Data
$$
x'_{i1} = x_{i1} - \bar{x}_1, \quad
x'_{i2} = x_{i2} - \bar{x}_2, \quad
y'_i = y_i - \bar{y}
$$
Step 3: Compute Centered Summations
$$
\sum x'_{i1} y'_i = 2465.60, \quad
\sum x'_{i2} y'_i = 816.57
$$
$$
\sum (x'_{i1})^2 = 24.3876, \quad
\sum (x'_{i2})^2 = 3.4531, \quad
\sum x'_{i1} x'_{i2} = 6.8238
$$
Step 4: Compute Shared Denominator
$$
\Delta = (24.3876)(3.4531) - (6.8238)^2 = 37.6470
$$
Step 5: Compute Slopes
$$
\beta_1 =
\frac{
(3.4531)(2465.60) - (6.8238)(816.57)
}{
37.6470
}
=
\frac{2940.99}{37.6470}
= 78.14
$$
$$
\beta_2 =
\frac{
(24.3876)(816.57) - (6.8238)(2465.60)
}{
37.6470
}
=
\frac{3089.79}{37.6470}
= 82.06
$$
Note: While the slopes were computed using centered variables, the final model uses the original variables.
So, compute the intercept using:
$$
\beta_0 = \bar{y} - \beta_1 \bar{x}_1 - \beta_2 \bar{x}_2
$$
Step 6: Compute Intercept
$$
\beta_0 = 481.5 - (78.14)(13.841) - (82.06)(4.9385)
$$
$$
= 481.5 - 1081.77 - 405.01 = -1005.28
$$
Final Regression Equation:
$$
\hat{y}_i = -1005.28 + 78.14 \cdot x_{i1} + 82.06 \cdot x_{i2}
$$
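As a final sanity check, this sketch (reusing df and model from the first code block) compares the hand-derived equation with sklearn’s prediction for the first fish in our sample; they agree up to rounding.
h, w = df.loc[0, "Height"], df.loc[0, "Width"]
manual = -1005.28 + 78.14 * h + 82.06 * w                     # hand-derived equation
sk = model.predict(df.loc[[0], ["Height", "Width"]])[0]       # sklearn's prediction
print(manual, sk)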
This is how we get the final slope and intercept values when applying multiple linear regression in Python.
Dataset
The dataset used in this blog is the Fish Market dataset, which contains measurements of fish species sold in markets, including attributes like weight, height, and width.
It is publicly available on Kaggle and is licensed under the Creative Commons Zero (CC0 Public Domain) license. This means it can be freely used, modified, and shared for both non-commercial and commercial purposes without restriction.
Whether you’re new to machine learning or simply interested in understanding the math behind multiple linear regression, I hope this blog gave you some clarity.
Stay tuned for Part 2, where we’ll see what changes when more than two predictors come into play.
Meanwhile, if you’re interested in how credit scoring models are evaluated, my recent blog on the Gini Coefficient explains it in simple terms. You can read it here.
Thanks for reading!



