Machine Learning “Advent Calendar” Day 16: Kernel Trick in Excel

After the previous article about SVM, the next natural step is Kernel SVM.
At first glance, it looks like a completely different model. Training is written in a different (dual) form, we stop talking about slope and intercept, and suddenly it's all about the “kernel”.
In today's article, I will make the word “kernel” concrete by visualizing what it actually does.
There are many good ways to introduce Kernel SVM. If you've read my previous articles, you know that I like to start with something simple that you already know.
The classic way to introduce Kernel SVM is this: SVM is a linear model. If the relationship between the features and the target is not linear, a straight line will not separate the classes well. So we build new features. Polynomial regression is still a linear model; we simply add polynomial terms (x, x², x³, …). According to this view, a polynomial kernel performs an implicit polynomial regression, and the RBF kernel can be seen as using an infinite series of polynomial terms…
Maybe one day we will follow this path, but today we will take a different one: we start with KDE.
Yes, Kernel Density Estimation.
Let's get started.
1. KDE as the sum of individual densities
I introduced KDE in an article about LDA and QDA, and at the time I said we would use it again later. This is the time.
We see the word “kernel” in KDE, and we see it again in Kernel SVM. This is not a coincidence: there is a real link.
The idea of KDE is simple:
Next to each data point, we place a small distribution (the kernel).
Then, we add all of these individual densities together to obtain the global density.
Keep this idea in mind. It will be the key to understanding Kernel SVM.
We can also adjust one parameter to control how smooth the global density is, from local to very smooth, as shown in the GIF below.
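If you prefer formulas to pictures, here is a minimal Python sketch of this idea (a stand-in for the Excel version; the data values are made up for illustration):

```python
import math

def gaussian_kernel(x, center, bandwidth):
    """One small bell-shaped density centered on a data point."""
    z = (x - center) / bandwidth
    return math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2 * math.pi))

def kde(x, data, bandwidth):
    """Global density = average of the individual bumps."""
    return sum(gaussian_kernel(x, xi, bandwidth) for xi in data) / len(data)

# Made-up 1-feature dataset: a small cluster plus one isolated point
data = [1.0, 1.5, 2.0, 5.0]

# The estimated density is higher inside the cluster than near the lone point
print(kde(1.5, data, bandwidth=0.5) > kde(5.0, data, bandwidth=0.5))  # True
```

The `bandwidth` parameter is exactly the smoothness knob mentioned above: smaller values give a spiky, local density; larger values give a very smooth one.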

As you know, KDE is a distance- or density-based model, so here we will build a bridge between two models from two different families.
2. Converting KDE into a model
Now we reuse the same idea: we construct a function around each point, and this function can be used to classify.
Remember that in weight-based models, the decision function is really a regression function, because its output is always continuous? We only do the thresholding part after computing the decision function f(x).
2.1. (Still) using a simple dataset
Someone once asked me why I always use 10 data points to describe machine learning, saying it makes no sense.
I strongly disagree.
If someone can't explain how a Machine Learning model works in 10 points (or less) and one feature, then they really don't understand how this model works.
So this will not surprise you. Of course, I will still use this very simple dataset, which I have already used for logistic regression and SVM. I know this dataset is linearly separable, but it is interesting to compare the results of the models.
I also generated another dataset with data points that are not linearly separable, to visualize how the kernelized model works.

2.2. The RBF kernel is point-centered
Now let's apply the KDE view to our dataset.
For each data point, we place a bell-shaped curve centered on its x value. For now, we don't care about the classification yet. We do only one simple thing: create one local bump for each data point.
This bump has a Gaussian shape, but here it has a specific name: RBF, for Radial Basis Function.
In this picture, we can see the RBF (Gaussian) kernel centered at the point x₇.

The name sounds technical, but the concept is actually very simple.
Once you see RBFs as “distance-based bumps”, the term ceases to be a mystery.
How to read this intuitively:
- x is any position on the x-axis
- x₇ is the center of the bump (point 7)
- γ (gamma) controls the width of the bump
So the bump reaches its maximum exactly at x₇.
As x moves away from x₇, the value decreases smoothly to 0.
The role of γ (gamma)
- A small γ means a broad bump (smooth, global influence)
- A large γ means a narrow bump (very local influence)
So γ plays the same role as bandwidth in KDE.
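As a quick sanity check, here is the RBF bump as a tiny Python function (the center x₇ and the γ values are made-up numbers for illustration):

```python
import math

def rbf(x, center, gamma):
    """RBF bump: equals 1 at the center, decays smoothly to 0 away from it."""
    return math.exp(-gamma * (x - center) ** 2)

x7 = 3.0
print(rbf(x7, x7, gamma=1.0))        # 1.0 : maximum exactly at the center

# A larger gamma makes the bump narrower, so the value drops faster away from x7
print(rbf(4.0, x7, gamma=0.1))       # broad bump  -> still fairly high
print(rbf(4.0, x7, gamma=10.0))      # narrow bump -> almost zero
```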
At this stage, nothing is classified yet. We are just assembling the building blocks.
2.3. Combining instruments and class labels
In the figures below, you begin to see the individual bumps, each centered on a data point.
Once this is clear, we move on to the next step: combining the bumps.
In this case, each bump is multiplied by its label yᵢ.
As a result, some bumps push up and others pull down, creating influences in two opposite directions.
This is the first step toward the classification task.

And we can see all the contributions from each data point being added together in Excel to get the final score.
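This label-signed sum can be sketched in a few lines of Python (the 1-feature dataset below is made up, not the one from the article):

```python
import math

def rbf(x, center, gamma):
    return math.exp(-gamma * (x - center) ** 2)

# Made-up 1-feature dataset: class -1 on the left, class +1 on the right
X = [1.0, 2.0, 6.0, 7.0]
y = [-1, -1, 1, 1]

def score(x, gamma=0.5):
    """Sum of label-signed bumps: -1 bumps pull down, +1 bumps push up."""
    return sum(yi * rbf(x, xi, gamma) for xi, yi in zip(X, y))

print(score(1.5) < 0)  # True: near the -1 points, the score is negative
print(score(6.5) > 0)  # True: near the +1 points, the score is positive
```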

This of course looks very similar to KDE.
But we're not done yet.
2.4. From equal weights to learned weights
We mentioned earlier that SVM belongs to the family of weight-based models. So the next natural step is to introduce weights.
In distance-based models, one major limitation is that all data points are considered equally important when computing the score. Yes, we can scale the features, but this is often manually tuned and imperfect.
Here, we take a different approach.
Instead of simply summing all the bumps, we assign a weight to each data point, then multiply each bump by this weight.

At this point, the model is still linear, but linear in feature space, not in the original input space.
To make this concrete, we can assume that the coefficients αᵢ are already known and directly program the resulting function in Excel. Each data point contributes its own weighted bump, and the final score is the sum of all these contributions.
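Here is a sketch of the weighted version (the dataset and the αᵢ values below are made up for illustration, not coefficients actually learned by an SVM solver):

```python
import math

def rbf(x, center, gamma):
    return math.exp(-gamma * (x - center) ** 2)

# Made-up 1-feature dataset and made-up dual coefficients
X = [1.0, 2.0, 6.0, 7.0]
y = [-1, -1, 1, 1]
alphas = [0.0, 0.8, 0.8, 0.0]   # points with alpha = 0 drop out of the model
b = 0.0                          # bias term

def decision(x, gamma=0.5):
    """Weighted sum of label-signed bumps, plus a bias."""
    return sum(a * yi * rbf(x, xi, gamma)
               for a, xi, yi in zip(alphas, X, y)) + b

print(decision(1.5) < 0)  # True: classified as -1
print(decision(6.5) > 0)  # True: classified as +1
```

Notice that only the points with a nonzero α actually contribute; this is exactly the mechanism that the hinge loss will exploit in the next section.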

When we apply this to a dataset that is not linearly separable, we clearly see what Kernel SVM does: it fits the data by combining local influences, instead of trying to draw a single straight line.

3. Loss function: where SVM really starts
So far, we have only talked about the kernel part of the model. We built the bumps, weighted them, and summed them.
But our model is called Kernel SVM, not just a “kernel model”.
The SVM part comes from the loss function.
And as you may already know, SVM is defined by the hinge loss.
3.1 Hinge loss and support vectors
Hinge loss has a very important property.
If a point is:
- correctly classified, and
- far enough from the decision boundary,
then its loss is zero.
As a direct result, its coefficient αᵢ becomes zero.
Only a few data points remain active.
These points are called support vectors.
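The hinge loss itself fits in one line of Python; here is a small sketch showing the three regimes (the f(x) values are made-up examples):

```python
def hinge_loss(y, fx):
    """Zero when the point is correctly classified with a margin of at least 1."""
    return max(0.0, 1.0 - y * fx)

print(hinge_loss(+1, 2.5))   # 0.0 : correct side, far from the boundary
print(hinge_loss(+1, 0.5))   # 0.5 : correct side, but inside the margin
print(hinge_loss(+1, -1.0))  # 2.0 : wrong side of the boundary
```

The first case is the important one: a zero loss means the point puts no pressure on the model, so its α can be zero and the point disappears from the decision function.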
So although we started with one bump per data point, in the final model only a few bumps are left.
In the example below, you can see that for some points (for example points 5 and 8), the coefficient αᵢ is zero. These points are not support vectors and do not contribute to the decision function.
Depending on how much we penalize the violation (with parameter C), the number of support vectors can increase or decrease.
This is an important advantage of SVM.
If the dataset is large, storing one parameter per data point can be expensive. Thanks to the hinge loss, SVM produces a sparse model, where only a small set of points is stored.

3.2 Kernel ridge regression: same kernels, different losses
If we keep the same kernels but replace the hinge loss with the squared loss, we get kernel ridge regression:
Same kernels.
Same bumps.
Different losses.
This leads to a very important conclusion:
The kernel defines the representation.
The loss function defines the model.
For kernel ridge regression, the model must store all training data points.
Since the squared loss does not force any coefficient to zero, each data point retains a non-zero weight and contributes to the prediction.
In contrast, Kernel SVM produces a smaller solution: only the support vectors are kept, all other points disappear from the model.
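A quick numeric illustration of this difference (with made-up values): the hinge loss is exactly zero for a confidently correct point, while the squared loss still penalizes it.

```python
def hinge_loss(y, fx):
    return max(0.0, 1.0 - y * fx)

def squared_loss(y, fx):
    return (y - fx) ** 2

# A point classified correctly and far from the boundary
y, fx = +1, 2.5
print(hinge_loss(y, fx))    # 0.0  -> the point can drop out of the model
print(squared_loss(y, fx))  # 2.25 -> still penalized, so it keeps a nonzero weight
```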

3.3 Quick link with LASSO
There is an interesting parallel with LASSO.
For linear regression, LASSO applies the L1 penalty to the original coefficients. This penalty favors sparsity, and some coefficients become exactly zero.
In SVM, the hinge loss plays the same role, but in a different place.
- LASSO creates sparsity in the original coefficients
- SVM creates sparsity in the dual coefficients αᵢ
Different methods, same result: only the important parameters are kept.
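For the curious, the L1 penalty acts through the soft-thresholding operator (shown here as a standalone sketch with made-up weights), which is what snaps small coefficients exactly to zero in LASSO's coordinate-descent updates:

```python
def soft_threshold(w, lam):
    """Shrink a coefficient toward 0, and snap it exactly to 0 if it is small."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

weights = [3.0, 0.4, -0.2, -2.5]
print([soft_threshold(w, 0.5) for w in weights])  # [2.5, 0.0, 0.0, -2.0]
```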
Conclusion
Kernel SVM is not just about the kernel.
- The kernel creates a rich, implicit representation.
- The hinge loss keeps only the critical data points.
The result is a model that is both flexible and sparse, which is why SVM remains a powerful and efficient tool.
Tomorrow, we'll look at another model that deals with nonlinearity. Stay tuned.



