Machine Learning “Advent Calendar” Day 13: Lasso and Ridge Regression in Excel

One day, a data scientist told me that Ridge Regression was a complex model, because the training formula looks more complicated.
Clarifying exactly this kind of difficulty is the purpose of my machine learning “Advent Calendar”.
So today, we are going to talk about the penalized versions of linear regression.
- First, we will see why a penalty (regularization) is needed, and how the model changes
- Then we will examine the different types of penalties and their effects
- We will also train the models with Gradient Descent and test different hyperparameters
- And we will ask whether other models, such as logistic regression, can be penalized too (confused? You'll see)
Linear Regression and Its Assumptions
When we talk about linear regression, people often say that certain assumptions must be satisfied.
You may have heard statements like:
- Residuals must be Gaussian (sometimes confused with the target being Gaussian, which is false)
- The explanatory variables should not be collinear
In classical statistics, these conditions are required for valid inference on the coefficients. In machine learning, the focus is on prediction, so these assumptions are less central, but the underlying issues still exist.
Here, let's look at an example with two collinear features; in fact, let's make them exactly equal.
So we have the relation: y = x1 + x2, with x1 = x2
I know that if they are perfectly equal, we could just write: y = 2 * x1. But the idea is that two features can be very similar, and we can still build a model using both, right?
So what's the problem?
When the features are collinear, the solution is no longer unique. Here is an example from the spreadsheet below:
y = 10000 * x1 − 9998 * x2
And we can see that the norm of the coefficients is huge.
So the idea is to limit the overall norm of the coefficients.
And after applying the penalty, the form of the model is the same!
That's right: the parameters returned by training change, but the model itself is the same.
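The article builds this in Excel; the same effect can be sketched in Python with scikit-learn (the data, noise scales, and alpha below are illustrative, not the article's spreadsheet values):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=1e-3, size=200)     # almost perfectly collinear
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=1.0, size=200)  # true relation: y = x1 + x2

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)    # often huge and opposite-signed
print("Ridge coefficients:", ridge.coef_)  # both close to 1, small norm
```

The penalty keeps the coefficient norm small, which is exactly the fix described above.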
Different Types of Penalties
So the idea is to combine the MSE and a norm of the coefficients.
Instead of minimizing the MSE alone, we minimize the sum of the two terms.
Which norm? We can use the L1 norm, the L2 norm, or a combination of both.
There are three classic ways to do this, with well-established names for the models.
1. Ridge Regression (L2 Penalty)
Ridge regression adds a penalty on the sum of the squared coefficients.
As a result:
- Large coefficients are heavily penalized (because of the square)
- The coefficients are shrunk towards zero
- But they are never exactly zero
Result:
- All features remain in the model
- The coefficients are smaller and more stable
- It's very effective against multicollinearity
Ridge shrinks, but it does not select.

2. Lasso Regression (L1 Penalty)
Lasso uses a different penalty: the sum of the absolute values of the coefficients.
This small change has a big effect.
With the lasso:
- Some coefficients can become exactly zero
- The model automatically ignores certain features
That's why it is called the Lasso: Least Absolute Shrinkage and Selection Operator.
- Least: derived from Least Squares regression
- Absolute: uses the sum of the absolute values of the coefficients (L1 norm)
- Shrinkage: shrinks the coefficients towards zero
- Selection: it can set some coefficients to zero, performing feature selection
- Operator: refers to the penalty operator added to the loss function
Important nuance:
- The model still has the same number of coefficients
- But some of them are forced to zero during training
The form of the model does not change, but the Lasso effectively removes features by driving their coefficients to zero.
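To make the selection effect concrete, here is a small Python/scikit-learn sketch (hypothetical data where only the first two of five features matter; the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
# only features 0 and 1 actually influence the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)  # the three irrelevant coefficients are driven exactly to zero
```

Note that the two useful coefficients survive but are shrunk towards zero as well; selection and shrinkage come together.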

3. Elastic Net (L1 + L2)
Elastic Net is a combination of Ridge and Lasso.
It uses:
- an L1 penalty (like Lasso)
- an L2 penalty (like Ridge)
Why combine them?
Because:
- Lasso becomes unstable when features are highly correlated
- Ridge handles correlated features well but doesn't select features
Elastic Net offers a balance between:
- stability
- shrinkage
- selection
It is often the most effective choice for real datasets.
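In scikit-learn, the two knobs are exposed directly (a minimal sketch; `alpha` is the overall strength, `l1_ratio` the L1/L2 mix, and the data here is made up):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=100)

# l1_ratio=1 would be pure Lasso, l1_ratio=0 pure Ridge
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```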
What Really Changes: Model, Training, Hyperparameters
Let's look at this from a machine learning perspective.
The model doesn't really change
Of course, for all the penalized variants, the model is still written:
y = ax + b.
- The same number of coefficients
- Same prediction formula
- But, the coefficients will be different.
From this perspective, Ridge, Lasso, and Elastic Net are not really different models.
The training principle is also the same
We still:
- Define the loss function
- Minimize it
- Compute gradients
- Update the coefficients
The only difference is:
- The loss function now includes a penalty term
That's it.
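Written as code, the three penalized losses differ only in the penalty term (a sketch; `lam` and `l1_ratio` stand for the hyperparameters):

```python
import numpy as np

def mse(y, y_pred):
    return np.mean((y - y_pred) ** 2)

def ridge_loss(y, y_pred, coefs, lam):
    # MSE + lambda * sum of squared coefficients (L2 penalty)
    return mse(y, y_pred) + lam * np.sum(coefs ** 2)

def lasso_loss(y, y_pred, coefs, lam):
    # MSE + lambda * sum of absolute coefficients (L1 penalty)
    return mse(y, y_pred) + lam * np.sum(np.abs(coefs))

def elastic_net_loss(y, y_pred, coefs, lam, l1_ratio):
    # weighted mix of the two penalties
    penalty = l1_ratio * np.sum(np.abs(coefs)) + (1 - l1_ratio) * np.sum(coefs ** 2)
    return mse(y, y_pred) + lam * penalty

# quick check with a perfect fit: only the penalty remains
print(ridge_loss(np.array([1., 2.]), np.array([1., 2.]), np.array([3., -4.]), lam=0.1))  # 2.5
```

Everything else in the training loop stays identical.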
Added hyperparameters (this is the real difference)
In plain linear regression, we have no control over the “complexity” of the model.
- Standard linear regression: no hyperparameter
- Ridge: one hyperparameter (lambda)
- Lasso: one hyperparameter (lambda)
- Elastic Net: two hyperparameters
- one for the overall regularization strength
- one for the L1 vs L2 mix
So:
- Standard linear regression does not need tuning
- Penalized regressions do
This is why standard linear regression is often seen as “not really machine learning”, while penalized regressions clearly are.
Implementation with Gradient Descent
We keep the Gradient Descent of OLS regression as a reference, and for Ridge regression, we add the gradient of the penalty term on the coefficient.
We'll use a simple dataset I generated (the same one we already used for linear regression).
We have 3 models that differ in their coefficients. The goal of this chapter is to apply Gradient Descent to all three models and compare them.

Ridge with a penalized gradient
First, we can implement Ridge, where we only have to change the gradient of a.
Note that this does not mean the value of b is unchanged, because at each step the gradient of b depends on a.
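The article implements this step in Excel; here is an equivalent Python sketch (the learning rate, lambda, and data are illustrative, and the intercept b is left unpenalized, as is conventional):

```python
import numpy as np

def ridge_gd(x, y, lam=0.1, lr=0.05, n_iter=2000):
    """Gradient descent for y = a*x + b with an L2 penalty on a."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        y_pred = a * x + b
        # MSE gradient plus the penalty gradient 2*lam*a (b is not penalized)
        grad_a = (-2 / n) * np.sum(x * (y - y_pred)) + 2 * lam * a
        grad_b = (-2 / n) * np.sum(y - y_pred)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

x = np.linspace(0, 1, 50)
y = 3 * x + 1
a_ols, b_ols = ridge_gd(x, y, lam=0.0)  # lam=0 recovers plain OLS
a_pen, b_pen = ridge_gd(x, y, lam=0.5)  # the penalty shrinks a
print(a_ols, a_pen)
```

With lam=0 the loop converges to the OLS slope; with lam>0 the slope is pulled towards zero.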

Lasso with a penalized gradient
Then we can do the same with the lasso.
And the only difference is the gradient of the penalty term: the sign of the coefficient instead of the coefficient itself.
For each model, we can also calculate the MSE and the penalized loss. It's satisfying to see how they shrink over the iterations.
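The same loop works for the Lasso with the L1 subgradient (a sketch under the same illustrative settings; `np.sign` plays the role of the derivative of the absolute value away from zero):

```python
import numpy as np

def lasso_gd(x, y, lam=0.1, lr=0.05, n_iter=2000):
    """Gradient descent for y = a*x + b with an L1 penalty on a."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        y_pred = a * x + b
        # penalty gradient is lam * sign(a): the sign of the coefficient,
        # not the coefficient itself as in Ridge
        grad_a = (-2 / n) * np.sum(x * (y - y_pred)) + lam * np.sign(a)
        grad_b = (-2 / n) * np.sum(y - y_pred)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

x = np.linspace(0, 1, 50)
y = 3 * x + 1
a_ols, _ = lasso_gd(x, y, lam=0.0)
a_pen, _ = lasso_gd(x, y, lam=0.2)
print(a_ols, a_pen)  # the penalized slope is visibly smaller
```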

Comparing coefficients
Now, we can visualize the coefficient a for all three models. To make the difference visible, we use fairly large lambdas.

The lambda effect
With a large value of lambda, we can see that the coefficient a becomes small.
And if the Lasso's lambda becomes too large, we get a value of exactly 0. Numerically, we have to handle the gradient at the origin, since the absolute value is not differentiable at zero.
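This sweep can be reproduced with scikit-learn (illustrative one-feature data; note that scikit-learn calls lambda `alpha`):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=100)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    r = Ridge(alpha=alpha).fit(x, y).coef_[0]
    l = Lasso(alpha=alpha).fit(x, y).coef_[0]
    # Ridge shrinks smoothly; Lasso eventually reaches exactly zero
    print(f"alpha={alpha:>5}: ridge={r:.4f}  lasso={l:.4f}")
```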

Logistic regression?
We saw logistic regression yesterday, and one question we can ask is whether it can also be penalized. If so, what are the penalized versions called?
The answer is yes, logistic regression can be penalized.
Of course the same idea applies.
A logistic regression can also be:
- L1-penalized
- L2-penalized
- Elastic-Net-penalized
But there are no special names like “Ridge Regression” in common usage.
Why?
Because the idea is no longer new.
In practice, libraries like scikit-learn simply let you specify:
- the loss function
- the type of penalty
- the regularization strength
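For instance, with scikit-learn (a sketch; note that `C` is the inverse of the regularization strength, and the data is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only two features matter

l2_model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)
enet = LogisticRegression(penalty="elasticnet", l1_ratio=0.5, C=1.0,
                          solver="saga", max_iter=5000).fit(X, y)
print(l1_model.coef_)  # the L1 penalty can zero out the irrelevant features
```

The penalty is just a parameter choice, not a differently named model.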
Special names appear when an idea is new.
Today, penalization is just a standard option.
Some questions we can ask:
- Is regularization always helpful?
- How does feature scaling affect penalized linear regression?
Conclusion
Ridge and Lasso do not change the model itself; they change the way the coefficients are learned. By adding a penalty, they produce smaller, more stable, and sometimes sparser solutions, especially when the features are correlated. Seeing the process step by step in Excel makes it clear that these methods are not more complicated, just more controlled.



