Lasso Regression: Why the Solution Lives on a Diamond

In my previous blog on linear regression, we solved a linear regression problem using vectors and projections instead of calculus.
Now, in this blog, we once again use those same concepts of vectors and projections to understand Lasso Regression.
While I was learning this topic, I was stuck at explanations like “we add a penalty term” and “Lasso shrinks the coefficients to zero.”
I was unable to grasp what’s actually happening behind this method.
I am sure many of you have felt the same, and I think it’s common for beginners and, for that matter, anyone solving real-world problems using linear regression.
But today, we are once again taking a new way to approach this classic topic so that we can clearly see what is really happening behind the scenes.
When a Perfect Model Starts to Fail
Before proceeding further, let’s get a basic idea of why we actually use Lasso regression.
For example, consider we have some data and we apply linear regression to it and get zero error.
We might think we have a perfect model, but when we test it on new data, the predictions turn out to be unreliable and far from reality.
In this case, we can say that our model has low bias and high variance.
Generally, we use Lasso when there are a large number of features, especially when they are comparable to or more than the number of observations, which can lead to overfitting.
This means the model, instead of learning patterns from the data, simply memorizes it.
Lasso helps in selecting only the important features by shrinking some coefficients to zero.
Now, to make the model more reliable, we use Lasso regression, and you will understand it in detail once we solve the actual problem.
Let’s say we have this house data. Now we need to build a model that predicts the price of a house using its size and age.
Let’s Build the Model First
First, let’s use Python to build this linear regression model.
Code:
import numpy as np
from sklearn.linear_model import LinearRegression
# Data
# Features: Size (1000 sqft), Age (years)
X = np.array([
[1, 1],
[2, 3],
[3, 2]
])
# Target: Price ($100k)
y = np.array([4, 8, 9])
# Create model
model = LinearRegression()
# Fit model
model.fit(X, y)
# Coefficients
print("Intercept:", model.intercept_)
print("Coefficients [Size, Age]:", model.coef_)
Result:

We got the result: β₀ = 1, β₁ = 2, β₂ = 1
Understanding Regression as Movement in Space
Now, let’s solve this using vectors and projections.
We already know how to solve this linear regression problem with vectors, and now we will use this data to understand the geometry behind it.
The math to find the solution was covered in part 2 of my linear regression blog, so we will not repeat it here; we already have the solution from Python.
Instead, let’s understand the actual geometry behind this data.
If you remember, we used this same data when we discussed linear regression using vectors.

Let’s consider this data as old data.
Now, to explain Lasso regression, we will use this data.

We just added a new feature, “Age”, to our data.
Now, let’s look at this GIF for our old data.

From Lines to Planes
Let’s just recall what we have done here. We considered each house as an axis and plotted the points, and we considered them as vectors.
We got the price vector and the size vector, and we realized the need for an intercept and added the intercept vector.
Now we had two directions in which we could move to reach the tip of the price vector. Based on these two directions, there are many possible points we can reach, and those points form a plane.
Now our target point, the price vector, is not on this plane, so we need to find the point on the plane that is closest to the tip of the price vector.
We calculate that closest point using the concept of projection, where the shortest distance occurs when we are perpendicular to the plane.
For that point, we use the concept of orthogonal projection, where the dot product between two orthogonal vectors is zero.
Here, projection is the key, and this is how we find the closest point on the plane when we do the math.
Now, let’s observe the GIF below for our new data.

What Changes When We Add One More Feature
We have the same goal here as well.
We want to reach the tip of the price vector, but now we have a new direction to move, which is the direction of the age vector, and that means we can now move in three different directions to reach our destination.
In our old data, we had two directions, and by combining both directions to reach the tip of the price vector, we got many points which collectively formed a 2D plane in that 3D space.
But now we have three directions to move in that 3D space, and what does that mean?
That means, if these directions are independent, we can reach every point in that 3D space using those directions, and that means we can also reach the tip of the price vector directly.
In this specific case, since the feature vectors span the space of the target, we can reach it exactly without needing projection.
We already have β₀ = 1, β₁ = 2, β₂ = 1
\[
\text{Now, let's represent our new data in matrix form.}
\]
\[
X =
\begin{bmatrix}
1 & 1 & 1 \\
1 & 2 & 3 \\
1 & 3 & 2
\end{bmatrix}
\quad
y =
\begin{bmatrix}
4 \\ 8 \\ 9
\end{bmatrix}
\quad
\beta =
\begin{bmatrix}
b_0 \\ b_1 \\ b_2
\end{bmatrix}
\]
\[
\text{Here, the columns of } X \text{ represent the base, size, and age directions.}
\]
\[
\text{And we are trying to combine them to reach } y.
\]
\[
\hat{y} = X\beta
= b_0
\begin{bmatrix}
1 \\ 1 \\ 1
\end{bmatrix}
+ b_1
\begin{bmatrix}
1 \\ 2 \\ 3
\end{bmatrix}
+ b_2
\begin{bmatrix}
1 \\ 3 \\ 2
\end{bmatrix}
\]
\[
\text{Let's check if we can reach } y \text{ directly.}
\]
\[
\text{Using the values } b_0 = 1,\ b_1 = 2,\ b_2 = 1:
\]
\[
\hat{y} =
1
\begin{bmatrix}
1 \\ 1 \\ 1
\end{bmatrix}
+ 2
\begin{bmatrix}
1 \\ 2 \\ 3
\end{bmatrix}
+ 1
\begin{bmatrix}
1 \\ 3 \\ 2
\end{bmatrix}
=
\begin{bmatrix}
1 \\ 1 \\ 1
\end{bmatrix}
+
\begin{bmatrix}
2 \\ 4 \\ 6
\end{bmatrix}
+
\begin{bmatrix}
1 \\ 3 \\ 2
\end{bmatrix}
=
\begin{bmatrix}
4 \\ 8 \\ 9
\end{bmatrix}
= y
\]
\[
\text{This shows that we can reach the target vector exactly using these directions.}
\]
\[
\text{So, there is no need to find a closest point or perform projection.}
\]
\[
\text{We have directly reached the destination.}
\]
From this, we can say that if we go 1 unit in the direction of the intercept vector, 2 units in the direction of the size vector, and 1 unit in the direction of the age vector, we can reach the tip of the price vector directly.
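We can verify this arithmetic quickly with NumPy (a small check, not part of the original derivation):

```python
import numpy as np

# Columns of X are the intercept, size, and age directions.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 2.0]])
y = np.array([4.0, 8.0, 9.0])
beta = np.array([1.0, 2.0, 1.0])  # [beta0, beta1, beta2]

# Combining the three direction vectors with these weights lands exactly on y.
y_hat = X @ beta
print(y_hat)  # [4. 8. 9.]
```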
OK, now we have built our linear regression model using our data, and it seems to be a perfect model. But we know that a perfect model doesn’t exist, so let’s test it.
A Perfect Fit… That Fails Completely
Now let’s consider a new house, which is House D.

Now, let’s use our model to predict the price of House D.
\[
X_D =
\begin{bmatrix}
1 & 1.5 & 20
\end{bmatrix}
\quad
\beta =
\begin{bmatrix}
1 \\ 2 \\ 1
\end{bmatrix}
\]
\[
\text{We use our model to predict the price of this house.}
\]
\[
\hat{y}_D = X_D \beta
= 1 \cdot 1 + 2 \cdot 1.5 + 1 \cdot 20
= 1 + 3 + 20
= 24
\]
\[
\text{So the predicted price is 24 (in \$100k units).}
\]
\[
\text{But the actual price is 5.5, which shows a large difference.}
\]
\[
\text{This gives us an idea that the model may not generalize well.}
\]
We can observe the difference between actual price and predicted price.
From this, we can say that the model has high variance. The model used all the possible directions to match the training data.
Instead of finding patterns in the data, we can say that the model memorized the data, and we can call this overfitting.
This usually happens when we have a large number of features compared to the number of observations, or when the model has too much flexibility (more directions = more flexibility).
In practice, we decide whether a model is overfitting based on its performance on a set of new data points, not just one.
Here, we are considering a single point only to build intuition and understand how Lasso Regression works.
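The overfit prediction above takes only a couple of lines to reproduce (a quick check using the fitted coefficients):

```python
import numpy as np

beta = np.array([1.0, 2.0, 1.0])      # [intercept, size, age] from the exact fit
x_D = np.array([1.0, 1.5, 20.0])      # House D: intercept term, size 1.5, age 20
predicted = x_D @ beta                # 1 + 3 + 20
actual = 5.5
print(predicted, predicted - actual)  # 24.0 18.5
```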
So What’s the Problem?
How can we make this model perform well on unseen data?
One way to address this is using Lasso.
But what actually happens when we apply Lasso?
For our new data we got β₀ = 1, β₁ = 2, β₂ = 1, which means, as we already discussed, we move 1 unit in the direction of the intercept vector, 2 units in the direction of the size vector, and 1 unit in the direction of the age vector.

Breaking Down the Price Vector
Now let’s consider our target price vector (4, 8, 9). We need to reach the tip of that fixed price vector, and for that we have three directions.
In part 2 of my linear regression blog, we already discussed the need for a base vector, which helps us add a base value because even when size or age is zero, we still have a base price.
Now, for our price vector (4, 8, 9), which represents the prices of houses A, B, and C, the average value is 7.
We can write our price vector as (7, 7, 7) + (-3, 1, 2), which is equal to (4, 8, 9).
We can rewrite this as 7(1, 1, 1) + (-3, 1, 2).
What can we observe from this?
We can say that to reach the tip of our price vector, we need to move 7 units in the direction of the intercept vector and then adjust using the vector (-3, 1, 2).
Here, (-3, 1, 2) is a vector that represents the deviation of prices from the average. Also, we do not get any slope values here because we are not expressing the price vector in terms of feature directions, but simply separating it into average and variation.
So, if we only consider this representation, we would need to move 7 units in the direction of the intercept vector.
But when we applied the linear regression model to our data, we got a different intercept value, which is β₀ = 1.
Why is this happening?
We get an intercept value of 7 only when we do not have any other directions, meaning the size and age vectors are not present.
But when we include these feature directions, they also contribute to reaching the price vector.
Where Did the Intercept Go?
We obtained β₀ = 1, β₁ = 2, β₂ = 1. This means we move only 1 unit in the direction of the intercept vector. Then how do we still reach the price vector?
Let’s see.
We also have two more directions: the size vector (1, 2, 3) and the age vector (1, 3, 2).
First, consider the size vector (1, 2, 3).
We can write it as (2, 2, 2) + (-1, 0, 1), which is equal to 2(1, 1, 1) + (-1, 0, 1).
This shows that when we move along the size vector, we are also partially moving in the direction of the intercept vector.
If we move 2 units in the direction of the size vector, we get (2, 4, 6), which can be written as 4(1, 1, 1) + (-2, 0, 2).
We can say that size vector has a component along intercept direction.
Now consider the age vector (1, 3, 2).
We can write it as (2, 2, 2) + (-1, 1, 0), which is equal to 2(1, 1, 1) + (-1, 1, 0).
We can say that age vector also has a component along intercept direction.
Now, if we observe carefully, to reach the price vector, we effectively move a total of 7 units in the direction of the intercept vector, but this movement is distributed across the intercept, size, and age directions.
Introducing the Constraint (This Is Lasso)
Now we are applying lasso to generalize the model.
Earlier, we saw that we could reach the target by moving freely in different directions, with no restriction, and the model could use any amount of movement along each direction.
But now, we introduce a limit.
This means the coefficients cannot take arbitrary values anymore; they are restricted to stay within a certain total budget.
For example, we have β₀ = 1, β₁ = 2, β₂ = 1, and if we add their absolute values, we get |β₀| + |β₁| + |β₂| = 4.
This 4 represents the total allowed contribution across all directions.
Now don’t get confused. Earlier, we said we moved 7 units in the intercept direction, and now we are saying 4 units in total.
These are completely different.
Earlier, we expressed the price vector in terms of its average and deviations, where the intercept was taking care of the entire average.
But now, we are expressing the same vector using feature directions like size and age.
Because of that, part of the movement is already handled by these feature directions, so the intercept does not need to take full responsibility anymore.
We are restricting how much the model can move in total, but why do we do this?
In the real world, we often have many features, and the Ordinary Least Squares (OLS) method tries to assign a coefficient to every feature, even if some are not useful.
This makes the model complex, unstable, and prone to overfitting.
Lasso addresses this by adding a constraint. When we limit the total contribution, coefficients start shrinking, and some shrink all the way to zero.
When a coefficient becomes zero, that feature is effectively removed from the model.
That is how lasso performs feature selection, not by choosing features, but by forcing the model to stay within a limited budget.
Our goal is not just to fit the data perfectly, but to capture the true pattern using only the most important directions.
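We can watch this zeroing happen on our own data using scikit-learn’s Lasso. Note that sklearn uses the penalty form of Lasso (it minimizes (1/(2n))·‖y − Xβ‖² + α·‖β‖₁) rather than an explicit budget, and the value α = 1.2 here is just an illustrative choice strong enough to kill the age coefficient for this dataset:

```python
import numpy as np
from sklearn.linear_model import Lasso

X = np.array([[1.0, 1.0],
              [2.0, 3.0],
              [3.0, 2.0]])   # columns: size, age
y = np.array([4.0, 8.0, 9.0])

# A strong enough penalty drives the age coefficient all the way to zero,
# effectively removing that feature from the model.
model = Lasso(alpha=1.2)
model.fit(X, y)
print(model.coef_)   # the second (age) coefficient is exactly 0
```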
Are We Using This Limit Wisely?

Now let’s say we set the limit to 2.
Before that, we need to understand one important thing. When we apply lasso, we are shrinking the coefficients.
Here, the coefficients are β₀ = 1, β₁ = 2, β₂ = 1.
β₀ represents the intercept. But think about this for a moment. Why should we shrink the intercept? What is the need?
The intercept represents the average level of the target. It is not telling us how the price changes with features like size and age.
What we actually care about is how much the price depends on these features, which is captured by β₁ and β₂. These should reflect the pure effect of each feature.
If the data is not adjusted, the intercept mixes with the feature contributions, and we don’t get a clean understanding of how each feature is influencing the target.
Also, since we are putting a limit on the total of the coefficients, we only have limited movement. So why waste it by moving in the intercept direction?
We should use this limited budget to move along the actual deviation directions, like size and age, with respect to the price.
The Fix: Centering the Data
So what do we do?
We separate the baseline from the variations. This is done using a process called centering, where we subtract the mean from each vector.
For the price vector (4, 8, 9), the mean is 7, so the centered vector becomes (4, 8, 9) − (7, 7, 7) = (−3, 1, 2).
For the size vector (1, 2, 3), the mean is 2, so the centered vector becomes (1, 2, 3) − (2, 2, 2) = (−1, 0, 1).
For the age vector (1, 3, 2), the mean is 2, so the centered vector becomes (1, 3, 2) − (2, 2, 2) = (−1, 1, 0).
Now we have three centered vectors: price (−3, 1, 2), size (−1, 0, 1), and age (−1, 1, 0).
At this stage, the intercept is removed from the problem because everything is expressed relative to the mean.
We now build the model using these centered vectors, focusing only on how features explain deviations from the average.
Once the model is built, we bring back the intercept by adding the mean of the target to the predictions.
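Centering is one line per vector in NumPy (a sketch of the three steps above):

```python
import numpy as np

price = np.array([4.0, 8.0, 9.0])
size = np.array([1.0, 2.0, 3.0])
age = np.array([1.0, 3.0, 2.0])

# Subtract each vector's own mean to separate the baseline from the variation.
price_c = price - price.mean()   # [-3.  1.  2.]
size_c = size - size.mean()      # [-1.  0.  1.]
age_c = age - age.mean()         # [-1.  1.  0.]
print(price_c, size_c, age_c)
```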

Now let’s solve this once again without using lasso.
This time without using the intercept vector.
We know that here we have two directions to reach the target of price deviations.
Here we are modeling the deviations in the data.
We already know that a 2D plane will be formed in that 3D space using different combinations of β₁ and β₂.
This time let’s do the math first.
\[
\text{Now we solve OLS again, but using centered vectors.}
\]
\[
y =
\begin{bmatrix}
-3 \\ 1 \\ 2
\end{bmatrix}
\quad
x_1 =
\begin{bmatrix}
-1 \\ 0 \\ 1
\end{bmatrix}
\quad
x_2 =
\begin{bmatrix}
-1 \\ 1 \\ 0
\end{bmatrix}
\]
\[
X =
\begin{bmatrix}
-1 & -1 \\
0 & 1 \\
1 & 0
\end{bmatrix}
\]
\[
\text{We use the normal equation again.}
\]
\[
\beta = (X^T X)^{-1} X^T y
\]
\[
X^T =
\begin{bmatrix}
-1 & 0 & 1 \\
-1 & 1 & 0
\end{bmatrix}
\]
\[
X^T X =
\begin{bmatrix}
2 & 1 \\
1 & 2
\end{bmatrix}
\]
\[
X^T y =
\begin{bmatrix}
5 \\ 4
\end{bmatrix}
\]
\[
\text{Now compute the inverse.}
\]
\[
(X^T X)^{-1}
= \frac{1}{2 \cdot 2 - 1 \cdot 1}
\begin{bmatrix}
2 & -1 \\
-1 & 2
\end{bmatrix}
= \frac{1}{3}
\begin{bmatrix}
2 & -1 \\
-1 & 2
\end{bmatrix}
\]
\[
\text{Now multiply with } X^T y.
\]
\[
\beta =
\frac{1}{3}
\begin{bmatrix}
2 & -1 \\
-1 & 2
\end{bmatrix}
\begin{bmatrix}
5 \\ 4
\end{bmatrix}
= \frac{1}{3}
\begin{bmatrix}
10 - 4 \\
-5 + 8
\end{bmatrix}
= \frac{1}{3}
\begin{bmatrix}
6 \\ 3
\end{bmatrix}
=
\begin{bmatrix}
2 \\ 1
\end{bmatrix}
\]
\[
\text{So the centered solution is: } \beta_1 = 2,\ \beta_2 = 1
\]
\[
\hat{y} = 2x_1 + 1x_2
\]
We get the same values because centering only removes the average but not the relationship between features and target.
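The normal-equation calculation above is easy to check with NumPy (using `np.linalg.solve` rather than forming the inverse explicitly):

```python
import numpy as np

# Centered features as columns, centered target.
X = np.array([[-1.0, -1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
y = np.array([-3.0, 1.0, 2.0])

# Solve (X^T X) beta = X^T y; this gives beta_1 = 2, beta_2 = 1.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)
```

Solving the linear system directly is the standard numerical practice; inverting `X.T @ X` as in the hand derivation gives the same answer but is less stable on larger problems.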

\[
\text{Now we bring back the intercept to get actual predictions.}
\]
\[
\text{We know that centering was done by subtracting the mean.}
\]
\[
y_{\text{centered}} = y - \bar{y}
\]
\[
\text{So the original vector can be written as:}
\]
\[
y = y_{\text{centered}} + \bar{y}
\]
\[
\text{Similarly, our prediction also follows the same idea.}
\]
\[
\hat{y} = \hat{y}_{\text{centered}} + \bar{y}
\]
\[
\text{From earlier, we have:}
\]
\[
\hat{y}_{\text{centered}} = 2x_1 + 1x_2
\]
\[
\text{Note: these centered vectors are obtained by subtracting the mean from each feature.}
\]
\[
x_1 - \bar{x}_1 = x_1 - 2, \quad x_2 - \bar{x}_2 = x_2 - 2
\]
\[
\text{So instead of using } x_1 \text{ and } x_2, \text{ we are using } (x_1 - 2) \text{ and } (x_2 - 2).
\]
\[
\text{Now substitute the centered vectors.}
\]
\[
\hat{y}_{\text{centered}} =
2
\begin{bmatrix}
-1 \\ 0 \\ 1
\end{bmatrix}
+ 1
\begin{bmatrix}
-1 \\ 1 \\ 0
\end{bmatrix}
=
\begin{bmatrix}
-2 \\ 0 \\ 2
\end{bmatrix}
+
\begin{bmatrix}
-1 \\ 1 \\ 0
\end{bmatrix}
=
\begin{bmatrix}
-3 \\ 1 \\ 2
\end{bmatrix}
\]
\[
\text{Now add back the mean of } y.
\]
\[
\bar{y} = 7
\quad \Rightarrow \quad
\bar{y}\mathbf{1} =
\begin{bmatrix}
7 \\ 7 \\ 7
\end{bmatrix}
\]
\[
\hat{y} =
\begin{bmatrix}
-3 \\ 1 \\ 2
\end{bmatrix}
+
\begin{bmatrix}
7 \\ 7 \\ 7
\end{bmatrix}
=
\begin{bmatrix}
4 \\ 8 \\ 9
\end{bmatrix}
\]
\[
\text{So we recover the actual prediction by adding back the intercept.}
\]
We got β₁ = 2 and β₂ = 1.
In total, we used 3 units to reach our target.
Now we apply lasso.
Let’s say we put a limit of 2 units. This means that across both directions combined, we only have 2 units of movement available.
We can distribute this in different ways. For example, we can use 1 unit in the size direction and 1 unit in the age direction, or we can use all 2 units in either the size direction or the age direction.
Let’s see all the possible values of β₁ and β₂ using a plot.

We can observe that when we plot all possible combinations of β₁ and β₂ under this constraint, they form a diamond shape, and our solution lies on this diamond.
Now let’s go back to the centered vector space and see where we reach on the plane under this constraint.

From the above visual, we can get a clear idea.
We already know that a 2D plane is formed in 3D space, and our target lies on that plane.
Now, after applying lasso, the movement on this plane is restricted. We can see this restricted region in the visual, and our solution now lies within this region.
So how can we reach that solution?
Let’s think. Here, the movements are restricted. We can see that the target lies on the plane, but we can’t reach it directly because we’ve applied a limit on the movement.
So what’s the best we can do?
We can go as close as possible to the target, right?
Yes, and that is our solution. Now the question is, how do we know which point in the limited region is closest to our target on that plane?
Let’s see.
Solving Lasso Along a Constraint Boundary
Let’s once again look at our diamond plot, which lies in coefficient space.
We obtain this diamond by considering all combinations of coefficients that satisfy the condition |β₁| + |β₂| ≤ 2.
This gives us a limited region on the plane within which we are allowed to move.
If we observe this region, the points inside mean we are not using the full limit of 2, while the points on the boundary mean we are using the full limit.
Now we are trying to find the point in our limited region that is closest to the OLS solution.
We can observe that the closest point which we are looking for lies on the boundary of our limited region.

The Lasso constraint gives us a diamond shape in coefficient space. This diamond has four edges, and each edge represents a situation where we are fully using the limit.
When we are on an edge, the coefficients are no longer free. They are tied together by the equation of that edge; for the edge nearest our solution, that equation is β₁ + β₂ = 2. This means we cannot move in any direction we want. We are forced to move along that edge.
Now when we translate this into data space, something interesting happens. Each edge turns into a line of possible predictions. So instead of thinking about a full region, we can think in terms of these lines.
If we look at where the OLS solution lies, we can see that it is closest to the boundary β₁ + β₂ = 2. So, we now focus on this boundary.

Since this boundary is fixed, all predictions we can make along it lie on a single line. So instead of searching everywhere, we just move along this line.
Now the problem becomes simple. We take our target and project it onto this line to find the closest point. That point gives us the Lasso solution.
Now that we understand what Lasso is doing, let’s work through the math to find the solution.
\[
\textbf{Solving Lasso Using Projection on a Boundary}
\]
\[
\text{Now that we understand the boundaries, let us find the solution using the nearest one.}
\]
\[
\text{From the constraint, we have:}
\quad
\beta_1 + \beta_2 = 2
\]
\[
\text{This means the two coefficients are no longer independent.}
\]
\[
\text{We can express one coefficient in terms of the other:}
\quad
\beta_2 = 2 - \beta_1
\]
\[
\text{Now substitute this into the model:}
\]
\[
\hat{y} = \beta_1 x_1 + (2 - \beta_1)x_2
\]
\[
\text{Rearranging terms:}
\]
\[
\hat{y} = 2x_2 + \beta_1(x_1 - x_2)
\]
\[
\text{This shows that all predictions lie on a line.}
\]
\[
\text{We can write this as:}
\quad
\hat{y} = \text{fixed point} + \beta_1 \cdot \text{direction}
\]
\[
\text{where}
\quad
\text{fixed point} = 2x_2,
\quad
d = x_1 - x_2
\]
\[
\text{Compute the direction vector:}
\]
\[
d =
\begin{bmatrix}
-1 \\ 0 \\ 1
\end{bmatrix}
-
\begin{bmatrix}
-1 \\ 1 \\ 0
\end{bmatrix}
=
\begin{bmatrix}
0 \\ -1 \\ 1
\end{bmatrix}
\]
\[
\text{Compute the starting point:}
\quad
2x_2 =
2
\begin{bmatrix}
-1 \\ 1 \\ 0
\end{bmatrix}
=
\begin{bmatrix}
-2 \\ 2 \\ 0
\end{bmatrix}
\]
\[
\text{So any point on this boundary is:}
\]
\[
\hat{y} =
\begin{bmatrix}
-2 \\ 2 \\ 0
\end{bmatrix}
+ \beta_1
\begin{bmatrix}
0 \\ -1 \\ 1
\end{bmatrix}
\]
\[
\text{Now we find the point on this line closest to } y.
\]
\[
y =
\begin{bmatrix}
-3 \\ 1 \\ 2
\end{bmatrix}
\]
\[
\text{We use the projection formula:}
\quad
\beta_1 = \frac{(y - 2x_2) \cdot d}{d \cdot d}
\]
\[
\text{Compute the shifted vector:}
\]
\[
y - 2x_2 =
\begin{bmatrix}
-1 \\ -1 \\ 2
\end{bmatrix}
\]
\[
\text{Compute } d \cdot d:
\quad
d \cdot d = 2
\]
\[
\text{Compute } (y - 2x_2) \cdot d:
\quad
(-1)(0) + (-1)(-1) + (2)(1) = 3
\]
\[
\text{So we get:}
\quad
\beta_1 = \frac{3}{2}
\]
\[
\text{Now compute } \beta_2:
\quad
\beta_2 = 2 - \beta_1 = \frac{1}{2}
\]
\[
\text{Substitute back to get the closest point on the line:}
\]
\[
\hat{y} =
\begin{bmatrix}
-2 \\ 2 \\ 0
\end{bmatrix}
+ \frac{3}{2}
\begin{bmatrix}
0 \\ -1 \\ 1
\end{bmatrix}
=
\begin{bmatrix}
-2 \\ 0.5 \\ 1.5
\end{bmatrix}
\]
\[
\textbf{Closest point to } y \textbf{ on this boundary is:}
\quad
\hat{y} =
\begin{bmatrix}
-2 \\ 0.5 \\ 1.5
\end{bmatrix}
\]
\[
\text{Residual (error vector):}
\quad
y - \hat{y} =
\begin{bmatrix}
-1 \\ 0.5 \\ 0.5
\end{bmatrix}
\]
\[
\text{Squared error:}
\quad
\|y - \hat{y}\|^2 = (-1)^2 + 0.5^2 + 0.5^2 = 1.5
\]
\[
\textbf{Final Lasso solution:}
\quad
\beta_1 = 1.5,
\quad
\beta_2 = 0.5
\]
\[
\text{This shows that the 2D problem reduces to finding the closest point on a line.}
\]
If you observe the above calculation, here’s what we actually did.
We started with the full 2D plane, where predictions can lie anywhere in the space formed by the features.
Then we focused on the closest boundary of the Lasso constraint, β₁ + β₂ = 2, instead of the full region. This ties the coefficients together and removes their independence.
When we substitute this into the model, the plane collapses into a line of possible predictions.
This line represents all the predictions we can get along that boundary.
We can see that the problem reduced to projecting the target onto this line.
Once we reduce the problem to a line, the solution is just a projection.
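The whole boundary-projection computation fits in a few NumPy lines (a sketch mirroring the derivation above):

```python
import numpy as np

x1 = np.array([-1.0, 0.0, 1.0])   # centered size
x2 = np.array([-1.0, 1.0, 0.0])   # centered age
y = np.array([-3.0, 1.0, 2.0])    # centered price

# On the boundary beta1 + beta2 = 2, predictions trace the line 2*x2 + beta1*(x1 - x2).
start = 2 * x2                    # fixed point on the line
d = x1 - x2                       # direction of the line

# Project the target onto the line to get the closest point.
beta1 = (y - start) @ d / (d @ d)
beta2 = 2 - beta1
print(beta1, beta2)               # 1.5 0.5
```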

Previously, we got β₁ = 2 and β₂ = 1.
Now, after applying Lasso, we have β₁ = 1.5 and β₂ = 0.5.
We can observe that the coefficients have shrunk.
Now, let’s predict the price for House D.

Until now, we worked with centered data. Now we convert the solution back to the original scale.
\[
\textbf{Centering the Data}
\]
\[
\text{We first centered the features and target:}
\]
\[
x_1' = x_1 - \bar{x}_1, \quad
x_2' = x_2 - \bar{x}_2, \quad
y' = y - \bar{y}
\]
\[
\text{After centering, the model becomes:}
\quad
y' = \beta_1 x_1' + \beta_2 x_2'
\]
\[
\text{Since the data is centered, the intercept becomes zero.}
\]
\[
\textbf{Solving the Model}
\]
\[
\text{From Lasso, we obtained:}
\quad
\beta_1 = 1.5, \quad \beta_2 = 0.5
\]
\[
\textbf{Returning to Original Scale}
\]
\[
\text{We now express the model in terms of original variables:}
\]
\[
y - \bar{y} = \beta_1 (x_1 - \bar{x}_1) + \beta_2 (x_2 - \bar{x}_2)
\]
\[
\text{Expanding:}
\]
\[
y = \beta_1 x_1 + \beta_2 x_2 + \bar{y} - \beta_1 \bar{x}_1 - \beta_2 \bar{x}_2
\]
\[
\text{Comparing with } \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2:
\]
\[
\beta_0 = \bar{y} - \beta_1 \bar{x}_1 - \beta_2 \bar{x}_2
\]
\[
\textbf{Compute the Means}
\]
\[
\bar{y} = \frac{4 + 8 + 9}{3} = 7
\]
\[
\bar{x}_1 = \frac{1 + 2 + 3}{3} = 2, \quad
\bar{x}_2 = \frac{1 + 3 + 2}{3} = 2
\]
\[
\textbf{Compute the Intercept}
\]
\[
\beta_0 = 7 - (1.5 \cdot 2) - (0.5 \cdot 2) = 7 - 3 - 1 = 3
\]
\[
\textbf{Final Model}
\]
\[
\hat{y} = 3 + 1.5x_1 + 0.5x_2
\]
\[
\textbf{Prediction for House D}
\]
\[
x_1 = 1.5, \quad x_2 = 20
\]
\[
\hat{y} = 3 + 1.5(1.5) + 0.5(20) = 3 + 2.25 + 10 = 15.25
\]
Before applying Lasso, we predicted the price of House D as 24, which is far from the actual price of 5.5.
After applying Lasso, the predicted price becomes 15.25.
This happens because we do not allow the model to freely fit the target data, but instead force it to stay within a limited region.
As a result, the model becomes more stable and relies less on any single feature.
This may increase the bias on the training data, but it reduces the variance on unseen data.
But how do we choose the best limit to apply?
We can find this using cross-validation by trying different values.
Ultimately, we need to balance the bias and variance of the model to make it suitable for future predictions.
In some cases, depending on the data and the limit we choose, some coefficients may become zero.
This effectively removes those features from the model and helps it generalize better to new data.
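As a sanity check, scikit-learn’s Lasso reproduces our solution. Note that sklearn solves the penalty form, minimizing (1/(2n))·‖y − Xβ‖² + α·‖β‖₁, rather than enforcing an explicit budget; for this particular dataset, α = 0.5 happens to correspond to our budget of 2 (the constraint and penalty forms are linked through duality, and the mapping between them is data-dependent):

```python
import numpy as np
from sklearn.linear_model import Lasso

X = np.array([[1.0, 1.0],
              [2.0, 3.0],
              [3.0, 2.0]])   # columns: size, age
y = np.array([4.0, 8.0, 9.0])

# For this data, alpha = 0.5 matches the budget-of-2 solution derived by hand:
# coefficients ~[1.5, 0.5], intercept ~3 (sklearn centers internally).
model = Lasso(alpha=0.5)
model.fit(X, y)
print(model.coef_, model.intercept_)
print(model.predict([[1.5, 20.0]]))   # ~15.25 for House D
```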
What Really Changed After Applying Lasso?
Here we must observe one important thing.
Without Lasso, we predicted the price of House D as 24, whereas with Lasso we got 15.25.
What happened here?
The real price of the house is 5.5, but our model overfits the training data and predicts a much higher value. It incorrectly learns that age increases the price of a house.
Now consider a real-world situation. Suppose we see a house that was built 30 years ago and is priced low. Then we see another house of the same age, but recently renovated, and it is priced much higher.
From this, we can understand that age alone is not a reliable feature. We cannot rely too heavily on it while predicting house prices.
Instead, features like size may play a more consistent role.
When we apply Lasso, it reduces the influence of both features, especially those that are less reliable. As a result, the prediction becomes 15.25, which is closer to the actual value, though still not perfect.
If we increase the strength of the constraint further, for example by reducing the limit, the coefficient of age may become zero, effectively removing it from the model.
You might think that Lasso shrinks all coefficients equally, but that’s rarely the case. It depends entirely on the hidden geometry of your data.
By the way, the full form of LASSO is Least Absolute Shrinkage and Selection Operator.
I hope this gave you a clearer understanding of what Lasso Regression really is and the geometry behind it.
I’ve also written a detailed blog on solving linear regression using vectors and projections.
If you’re interested, you can check it out here.
Feel free to share your thoughts.
Thanks for reading!



