
Machine Learning “Advent Calendar” Day 9: LOF in Excel

Yesterday, we worked with Isolation Forest, a method for anomaly detection.

Today, we look at another algorithm with the same purpose. But unlike Isolation Forest, it does not build trees.

It is called LOF, or Local Outlier Factor.

People often sum up LOF in one sentence: does this point live in a region that is less dense than the regions of its neighbors?

This sentence is actually tricky to understand. I struggled with it for a long time.

However, there is one part that is easy to grasp,
and we will see that it becomes the key idea:
the notion of a neighborhood.

And since we are talking about neighbors,
we naturally come back to neighbor-based models.

We will explain this algorithm in 3 steps.

To keep things very simple, we will use this data:

1, 2, 3, 9

Do you remember that I have a copyright on this data? We already built an Isolation Forest with it, and we will use it again, so we can compare the two results.

LOF in Excel in 3 steps – all images made by the author

All Excel files are available through this Ko-fi link. Your support means a lot to me. The price will go up during the month, so early backers get a great deal.

All Excel / Google Sheets files for ML and DL

Step 1 – k neighbors and k-distance

LOF starts with something very simple:

Look at the distances between the points.
Then find the k nearest neighbors of each point.

Let's take k = 2 to keep things small.

Nearest neighbors for each point

  • Point 1 → neighbors: 2 and 3
  • Point 2 → neighbors: 1 and 3
  • Point 3 → neighbors: 2 and 1
  • Point 9 → neighbors: 3 and 2

Right away, we see a clear structure:

  • 1, 2, and 3 form a tight group
  • 9 lives alone, far away from the others
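As a cross-check outside the spreadsheet, here is a minimal Python sketch of this step (the data and k = 2 follow the article; the helper `k_nearest` is just illustrative code, not the author's Excel formulas):

```python
# Step 1 sketch: find the k = 2 nearest neighbors of each point
# in the toy dataset 1, 2, 3, 9.
data = [1, 2, 3, 9]
k = 2

def k_nearest(p, points, k):
    """Return the k nearest neighbors of p (excluding p itself)."""
    others = [q for q in points if q != p]
    return sorted(others, key=lambda q: abs(q - p))[:k]

for p in data:
    print(p, "->", k_nearest(p, data, k))
# 1 -> [2, 3]   2 -> [1, 3]   3 -> [2, 1]   9 -> [3, 2]
```

The output reproduces the neighbor lists above: 1, 2, and 3 all pick each other, while 9 has to reach all the way back to 3 and 2.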

k-distance: the radius of the neighborhood

The k-distance is simply the largest of the distances to the k nearest neighbors, i.e., the distance to the k-th nearest neighbor.

And this is actually the key point.

Because this one number tells you something very concrete:
the radius of the neighborhood around the point.

If the k-distance is small, the point is in a dense area.
If the k-distance is large, the point is in a sparse area.

With this one measurement, you already have a first sign of “isolation”.

Here, we use the concept of “nearest neighbors”, which reminds us of k-NN (classifier or regressor).
The context here is different, but the calculation is the same.

And in case you were thinking of k-means:
the “k” in k-means is not related to the “k” here.

Calculating the k-distance

For point 1, the two nearest neighbors are 2 and 3 (distances 1 and 2), so k-distance(1) = 2.

For point 2, the neighbors are 1 and 3 (both at distance 1), so k-distance(2) = 1.

For point 3, the two nearest neighbors are 2 and 1 (distances 1 and 2), so k-distance(3) = 2.

For point 9, the neighbors are 3 and 2 (distances 6 and 7), so k-distance(9) = 7. This is huge compared to all the others.

In Excel, we can create a distance matrix to find the k-distance of each point.

LOF in Excel – Image By Author
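The same distance-matrix idea can be sketched in plain Python (a minimal illustration under the article's setup, not the spreadsheet itself):

```python
# Build the full distance matrix for 1, 2, 3, 9, then read off each
# point's k-distance: the distance to its k-th nearest neighbor.
data = [1, 2, 3, 9]
k = 2

# Distance matrix as a dict: dist[(p, q)] = |p - q|
dist = {(p, q): abs(p - q) for p in data for q in data}

def k_distance(p):
    d = sorted(dist[(p, q)] for q in data if q != p)
    return d[k - 1]  # k-th smallest distance to another point

print({p: k_distance(p) for p in data})  # {1: 2, 2: 1, 3: 2, 9: 7}
```

In the spreadsheet, the same "k-th smallest distance" can be read from a row of the matrix with a formula like `SMALL`.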

Step 2 – Reachability distance

For this step, I'll just explain the calculation here, then show the formulas in Excel. Because, to be honest, I have never found a truly intuitive way to explain it.

So, what is the “reachability distance”?

For a point p and a neighbor o, we define the reachability distance as:

reach-dist(p, o) = max(k-distance(o), distance(p, o))

Why take the maximum of the two?

The purpose of the reachability distance is to smooth out density comparisons.

If a point p is very close to a neighbor o, we do not let the raw distance shrink below o's own k-distance: the k-distance of o acts as a floor.

Concretely, for point 2:

  • Distance to 1 = 1, but k-distance(1) = 2 → reach-dist(2, 1) = 2
  • Distance to 3 = 1, but k-distance(3) = 2 → reach-dist(2, 3) = 2

Both neighbors push the reachability distance of point 2 up to 2.

In Excel, we keep the matrix format to show the reachability distances: each point compared with all the others.

LOF in Excel – Image By Author
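The reach-dist formula can also be sketched in a few lines of Python (same toy data and k = 2; the helper names are mine, not the author's):

```python
# Reachability distance: reach_dist(p, o) = max(k_distance(o), dist(p, o)).
data = [1, 2, 3, 9]
k = 2

def dist(p, q):
    return abs(p - q)

def k_distance(p):
    # k-th smallest distance from p to any other point
    return sorted(dist(p, q) for q in data if q != p)[k - 1]

def reach_dist(p, o):
    return max(k_distance(o), dist(p, o))

print(reach_dist(2, 1))  # max(k_distance(1) = 2, 1) = 2
print(reach_dist(2, 3))  # max(k_distance(3) = 2, 1) = 2
```

This reproduces the worked example for point 2: both of its reachability distances are lifted from 1 up to 2 by its neighbors' k-distances.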

Average reachability distance

For each point, we can now compute an average value, which answers: on average, how far do I need to travel to reach my neighbors?

And now, you see something odd: point 2 ends up with a higher average reachability distance (2) than points 1 and 3 (1.5), even though it sits right in the middle of the group.

At first, this made no sense to me!

Step 3 – lrd and LOF score

The last step is a kind of “normalization” that turns these distances into an anomaly score.

First, we define the lrd, the local reachability density, which is simply the inverse of the average reachability distance.

And the final LOF score is calculated as the average ratio between the lrd of the neighbors and the lrd of the point itself:

LOF(p) = average over neighbors o of [ lrd(o) / lrd(p) ]

In other words, LOF compares the density around a point to the density around its neighbors.

Translation:

  • If lrd(p) ≈ lrd(neighbors), then LOF ≈ 1 → normal point
  • If lrd(p) is small compared to its neighbors → LOF >> 1 → p is in a sparse region
  • If lrd(p) is large compared to its neighbors → LOF < 1 → p is in a denser region than its neighbors
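Putting the three steps together, here is a self-contained sketch of the whole LOF computation on the article's dataset (plain Python, not the Excel formulas; function names are mine):

```python
# Full LOF sketch for 1, 2, 3, 9 with k = 2:
#   lrd(p) = 1 / average reachability distance to p's neighbors
#   LOF(p) = average of lrd(o) / lrd(p) over p's neighbors o
data = [1, 2, 3, 9]
k = 2

def dist(p, q):
    return abs(p - q)

def neighbors(p):
    return sorted((q for q in data if q != p), key=lambda q: dist(p, q))[:k]

def k_distance(p):
    return dist(p, neighbors(p)[-1])

def lrd(p):
    nb = neighbors(p)
    return len(nb) / sum(max(k_distance(o), dist(p, o)) for o in nb)

def lof(p):
    nb = neighbors(p)
    return sum(lrd(o) for o in nb) / (len(nb) * lrd(p))

for p in data:
    print(p, round(lof(p), 3))
# 1 and 3 score 0.875, 2 scores ≈ 1.333, and 9 scores ≈ 3.792
```

Point 9 gets by far the largest LOF, and point 2, despite being in the middle of the cluster, scores slightly above 1, which is exactly the counter-intuitive effect noted in Step 2.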

I also made an improved version, with shorter formulas.

Understanding the meaning of “anomaly” in unsupervised models

In unsupervised learning, there is no ground truth. And that's exactly where things get tricky.

We have no labels.
We don't have a “right answer”.
We only have the structure of the data.

Take this small sample:

1, 2, 3, 7, 8, 12
(I own the copyright to it.)

When you look at it, which one feels like an anomaly?

Personally, I would have said 12.

Now let's look at the results. LOF says the outlier is the point 7.

(And you can see that, with the k-distance alone, we would instead say it is 12.)

LOF in Excel – Image By Author
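This surprising result can be checked with the same minimal LOF sketch as before, just swapping in the new dataset (illustrative Python, not the spreadsheet):

```python
# LOF on 1, 2, 3, 7, 8, 12 with k = 2: which point gets flagged?
data = [1, 2, 3, 7, 8, 12]
k = 2

def dist(p, q):
    return abs(p - q)

def neighbors(p):
    return sorted((q for q in data if q != p), key=lambda q: dist(p, q))[:k]

def k_distance(p):
    return dist(p, neighbors(p)[-1])

def lrd(p):
    nb = neighbors(p)
    return len(nb) / sum(max(k_distance(o), dist(p, o)) for o in nb)

def lof(p):
    nb = neighbors(p)
    return sum(lrd(o) for o in nb) / (len(nb) * lrd(p))

scores = {p: round(lof(p), 3) for p in data}
print(scores)
print("highest LOF:", max(scores, key=scores.get))      # 7
print("largest k-distance:", max(data, key=k_distance))  # 12
```

The highest LOF score indeed lands on 7 (≈ 1.78): its neighbors 8 and 3 sit in locally denser spots than it does. Meanwhile 12 has the largest k-distance, which is why raw-distance intuition picks it instead.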

Now, we can compare Isolation Forest and LOF side by side.

On the left, with the dataset 1, 2, 3, 9, both methods agree:
9 is a clear outlier.
Isolation forest gives the lowest score,
and LOF gives it the highest LOF value.

If we look closer, the Isolation Forest scores for 1, 2, and 3 are almost indistinguishable, while LOF gives point 2 a noticeably higher score. This is the effect we already saw above.

With the dataset 1, 2, 3, 7, 8, 12, the story changes.

  • Isolation Forest points to 12 as the outlier.
    This matches intuition: 12 is far from everyone.
  • LOF, however, highlights 7 instead.

LOF in Excel – Image By Author

So who is right?

It's hard to say.

In practice, we first need to agree with the business teams on what “anomaly” actually means, given the nature of our data.

Because in unsupervised learning, there is no single truth.

There is only the definition of “anomaly” used by each algorithm.

This is why it is so important to understand
how the algorithm works, and what kind of anomalies it is designed to detect.

Only then can you decide whether LOF, the k-distance, or Isolation Forest is the right option for your particular situation.

And this is the whole message of unsupervised learning:

Different algorithms look at the data differently.
None of them is “the truth”.
Each score is just an interpretation of what “anomalous” means for that model.

This is why understanding how an algorithm works
is more important than the final score it produces.

Conclusion

LOF and Isolation Forest both detect anomalies, but they look at the data through completely different lenses.

  • The k-distance captures how far a point has to go to find its neighbors.
  • LOF compares local densities.
  • Isolation Forest isolates points using random splits.

Even on very simple datasets, these methods can disagree.
One algorithm may flag a point as an outlier, while another highlights something completely different.

And this is the main message:

With unsupervised learning, nothing is “true”.
Each algorithm interprets anomalies according to its own perspective.

That's why understanding how a method works is more important than the number it produces.
Only then can you choose the right algorithm for the right situation, and interpret the results with confidence.
