Machine Learning

Machine Learning Advent Calendar Day 7: Decision Tree Classifier

Yesterday, we saw that a Decision Tree Regressor chooses its optimal split by minimizing the Mean Squared Error (MSE).

Today, on Day 7 of the Machine Learning Advent Calendar, we continue in the same spirit, but with the Decision Tree Classifier, the classification counterpart of yesterday's model.

A quick intuition test with two simple datasets

Let's start with the smallest toy dataset I could make, with one feature and one target variable with two classes: 0 and 1.

The idea is to cut the data into two parts, based on a single rule. But the question is: what should this rule be? What criterion tells us which split is better?

Now, even if we don't know the math yet, we can just look at the data and guess the split points.

And at first glance, the split should happen at 8 or at the other obvious boundary, right?

But the question is which one is better numerically.

Decision tree classifier in Excel – image by Author

If we think about it:

  • With a split at 8:
    • left side: no misclassification
    • right side: one misclassified point
  • With a split at the other candidate:
    • right side: no misclassification
    • left side: more misclassified points

Clearly, splitting at 8 feels better.

Now, let's look at an example with three classes. I added some random data points and made 3 classes.

Here the labels are 0, 1, and 2, and I assigned them directly.

But we must be careful: these numbers are just class names, not numerical values. They should not be interpreted as ordered.

So the question is still: how pure is each region after the split?

But the finer distinctions are hard to judge by eye.

Now, we need a mathematical way to express this idea.

This is the topic of the next section.

Measuring impurity as a splitting criterion for classification

For the Decision Tree Regressor, we already know:

  • The prediction for a region is the mean of the targets.
  • The split quality is measured by the MSE.

In the case of a classifier tree:

  • The prediction for a region is the majority class of the region.
  • The split quality is measured by an impurity measure: Gini impurity or entropy.

Both are common in textbooks, and both are available in scikit-learn; Gini is used by default.
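As a quick sanity check of the `criterion` parameter (a minimal sketch using scikit-learn, which the article mentions; the toy data below is made up):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy 1-D dataset: one feature, two well-separated classes (made-up numbers)
X = [[1], [2], [3], [9], [10], [11]]
y = [0, 0, 0, 1, 1, 1]

# "gini" is the default criterion; "entropy" is the other built-in option
clf_gini = DecisionTreeClassifier(criterion="gini").fit(X, y)
clf_entropy = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# On this easy dataset both criteria separate the classes the same way
print(clf_gini.predict([[2.5], [9.5]]))
print(clf_entropy.predict([[2.5], [9.5]]))
```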

But what is impurity, really?

If you look at the curves of Gini and entropy, both behave in the same way:

  • They are 0 when a node is pure (all samples have the same class).
  • They reach their maximum when the classes are evenly mixed (50% / 50%).
  • The curve is concave, symmetric, and increases with disorder.

This is the key property of any impurity measure:

Impurity is low when groups are pure, and high when groups are mixed.

Decision tree classifier in Excel – Gini and Entropy – image by Author
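These properties are easy to verify numerically. Here is a small sketch (plain Python, with base-2 entropy) of the two curves for a binary node whose class-1 proportion is p:

```python
from math import log2

def gini(p):
    """Gini impurity of a binary node with class proportions (1-p, p)."""
    return 1 - (p**2 + (1 - p)**2)

def entropy(p):
    """Entropy (base 2) of a binary node; 0 * log(0) is taken as 0."""
    if p in (0, 1):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

print(gini(0.0), gini(0.5))        # pure node -> 0.0, evenly mixed -> 0.5
print(entropy(0.0), entropy(0.5))  # pure node -> 0.0, evenly mixed -> 1.0
print(gini(0.3), gini(0.7))        # symmetric around 0.5
```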

So we will use these measures to decide which split to make.

Splitting on a single continuous feature

As with the regression tree, we will follow the same procedure.

List all possible splits

As in the regressor version, with a single numeric feature, the only locations we need to check are the midpoints between the sorted x values.
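A sketch of this enumeration (plain Python; the x values are made up):

```python
# Sorted feature values of a toy dataset (made-up numbers)
x = [1, 3, 5, 6, 9, 12]

# Candidate split points: midpoints between consecutive sorted values
candidates = [(a + b) / 2 for a, b in zip(x, x[1:])]
print(candidates)  # [2.0, 4.0, 5.5, 7.5, 10.5]
```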

For each split, compute the impurity on each side

Let's take a concrete value, for example, x = 5.5.

We divide the dataset into two regions:

  • Region L: x < 5.5
  • Region R: x ≥ 5.5

In each region:

  1. We count the number of observations of each class
  2. We compute the Gini impurity
  3. Finally, we combine both sides into the weighted impurity of the split
Decision tree classifier in Excel – image by Author
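The three steps above can be sketched like this (plain Python; the toy dataset is an assumption, chosen so that x = 5.5 is a candidate split):

```python
# Toy dataset of (feature, class) pairs -- made-up numbers
data = [(1, 0), (3, 0), (5, 0), (6, 1), (9, 1), (12, 1)]

def gini_of_labels(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

split = 5.5
left = [c for x, c in data if x < split]    # region L: x < 5.5
right = [c for x, c in data if x >= split]  # region R: x >= 5.5

# Weighted impurity of the split: each side weighted by its share of points
n = len(data)
weighted = (len(left) * gini_of_labels(left)
            + len(right) * gini_of_labels(right)) / n
print(weighted)  # 0.0 -- both sides are pure here
```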

Choose the split with the lowest impurity

As in the regression case:

  • List all possible splits
  • Compute the impurity of each split
  • The best split is the one with the lowest impurity
Decision tree classifier in Excel – image by Author

A table for evaluating all splits

To automate everything in Excel,
we arrange all the calculations in one table, where:

  • Each row corresponds to one split,
  • For each row, we include:
    • the Gini of the left region,
    • the Gini of the right region,
    • the overall weighted Gini of the split.

This table provides a clean, compact overview of all possible splits,
and the best split is simply the row with the lowest value in the last column.

Decision tree classifier in Excel – image by Author
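In Python, the same table can be sketched as a loop over all candidate splits (the dataset below is a made-up toy example of the same kind):

```python
# Toy dataset of (feature, class) pairs -- made-up numbers
data = [(1, 0), (2, 0), (4, 1), (6, 1), (9, 1), (12, 1)]

def gini_of_labels(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

xs = sorted(x for x, _ in data)
candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

# One row per split: (split point, Gini left, Gini right, weighted Gini)
table = []
for s in candidates:
    left = [c for x, c in data if x < s]
    right = [c for x, c in data if x >= s]
    w = (len(left) * gini_of_labels(left)
         + len(right) * gini_of_labels(right)) / len(data)
    table.append((s, gini_of_labels(left), gini_of_labels(right), w))

# Best split = row with the lowest value in the last column
best = min(table, key=lambda row: row[3])
print(best)
```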

Multi-class classification

Until now, we were working with two classes. But Gini impurity naturally extends to three classes, and the splitting procedure stays the same.

Nothing changes in the structure of the algorithm:

  • We enumerate all possible splits,
  • We compute the impurity on each side,
  • We take a weighted average,
  • We choose the split with the lowest impurity.

The only change is that the Gini formula gains an extra term.

Gini impurity with three classes

If a region contains the three classes in proportions p1, p2, p3, then the Gini impurity is:

Gini = 1 − (p1² + p2² + p3²)

Same idea as before:
impurity is 0 in a "pure" state where one class dominates completely,
and impurity is higher when the classes are mixed.

Left and right regions

For each split:

  • Region L contains some observations of classes 1, 2, and 3
  • Region R contains the remaining observations

In each region:

  1. Count how many points belong to each class
  2. Compute the proportions p1, p2, p3
  3. Compute the Gini impurity using the formula above

Everything is exactly the same as in the binary case, just with one extra term.
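A sketch of the three-class computation for one region (plain Python; the class counts are made up):

```python
# Made-up counts in one region: 8 points of class 0, 1 of class 1, 1 of class 2
counts = [8, 1, 1]
n = sum(counts)

# Step 2: proportions p1, p2, p3
props = [c / n for c in counts]

# Step 3: Gini = 1 - (p1^2 + p2^2 + p3^2)
gini = 1 - sum(p**2 for p in props)
print(round(gini, 3))  # 0.34 -- low, since class 0 dominates
```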

Summary table for 3-class splits

As before, we collect all the calculations in one table:

  • each row is a single split
  • we list the counts of class 1, class 2, class 3 on the left
  • we list the counts of class 1, class 2, class 3 on the right
  • we include Gini (left), Gini (right), and the weighted Gini

The split with the lowest weighted impurity is the one chosen by the decision tree.

Decision tree classifier in Excel – image by Author

We can easily generalize the algorithm to K classes, using the general formulas to compute Gini or entropy.
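For a node containing K classes in proportions p_1, …, p_K, the standard formulas (written here in LaTeX) are:

```latex
\mathrm{Gini} = 1 - \sum_{k=1}^{K} p_k^2
\qquad
\mathrm{Entropy} = -\sum_{k=1}^{K} p_k \log_2 p_k
```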

Decision tree classifier in Excel – image by Author

How different are the impurity measures, really?

Now, we keep mentioning Gini and entropy as alternatives, but are they really different? Looking at the mathematical formulas, one might think so.

The answer is no.

In practice, in almost all cases:

  • Gini and entropy choose the same split
  • the tree structure is almost identical
  • the predictions are the same

Why?

Because their curves look very similar.

Both peak at a 50 percent mix and drop to zero at purity.

The only difference is the shape of the curve:

  • Gini is a quadratic function, so it penalizes misclassification a bit more sharply.
  • Entropy is a logarithmic function, so it penalizes uncertainty slightly more around 0.5.

But the difference is small in practice, and you can verify it in Excel!

Other impurity measures?

Another natural question: can we invent or use other impurity measures?

Yes, you can define your own, as long as it:

  • equals 0 when the node is pure
  • is maximal when the classes are evenly mixed
  • is symmetric and grows with disorder

Example: impurity = 4 · p0 · p1

This is another acceptable impurity measure. And it actually matches the Gini impurity, rescaled, whenever there are only two classes.
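A quick numerical check (plain Python) of this equivalence: for two classes, Gini = 1 − p0² − p1² = 2·p0·p1, so 4·p0·p1 is exactly twice the Gini and ranks every candidate split the same way.

```python
for p1 in [0.0, 0.1, 0.25, 0.5, 0.9]:
    p0 = 1 - p1
    gini = 1 - (p0**2 + p1**2)
    custom = 4 * p0 * p1
    # The custom measure is exactly 2 * Gini (up to float rounding)
    print(p1, round(custom, 10) == round(2 * gini, 10))
```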

So again, it gives similar splits. If you're not sure, you can check it yourself.

Here are some other impurity measures that can be used as well.

Decision tree classifier in Excel – multiple impurity measures – image by Author

Exercises in Excel

Testing other parameters and features

Once you've created the first split, you can extend your file:

  • Try entropy instead of Gini
  • Try adding more features
  • Try building the next split
  • Try changing the tree depth and observe under- and over-fitting
  • Try creating a confusion matrix for the predictions

These simple experiments already give you a good sense of how real decision trees behave.
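A sketch of two of these experiments with scikit-learn (the toy data is made up; `criterion` and `max_depth` are the actual parameter names):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Made-up 1-D toy data with three well-separated classes
X = [[1], [2], [3], [6], [7], [8], [11], [12], [13]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Entropy instead of Gini, plus a depth limit to provoke under-fitting
shallow = DecisionTreeClassifier(criterion="entropy", max_depth=1).fit(X, y)
deep = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

# A depth-1 tree has a single split, so it cannot separate three classes
print(confusion_matrix(y, shallow.predict(X)))
print(confusion_matrix(y, deep.predict(X)))  # the deeper tree fits the data
```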

Implementing decision rules on the Titanic Survival Dataset

The next natural exercise is to re-implement well-known decision rules on the Titanic Survival Dataset (CC0 / Public Domain).

First, we can start with just two features: sex and age.

Implementing the rules in Excel is tedious and repetitive, but that is exactly the point: it forces you to see what the learned rules look like.

A decision tree is nothing but a sequence of if/else statements, applied one after another.

This is the true nature of a decision tree: simple rules, stacked on top of each other.
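As a sketch, such a rule sequence might look like this in Python (the thresholds and outcomes below are illustrative assumptions, not the actual rules learned from the dataset):

```python
def predict_survival(sex, age):
    """Hand-written if/else decision rules (hypothetical thresholds)."""
    if sex == "female":
        return 1          # predicted: survived
    else:
        if age < 10:      # hypothetical age threshold for young boys
            return 1      # predicted: survived
        else:
            return 0      # predicted: did not survive

print(predict_survival("female", 30))  # 1
print(predict_survival("male", 5))     # 1
print(predict_survival("male", 40))    # 0
```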

Decision tree classifier in Excel for the Titanic Survival Dataset (CC0 / Public Domain) – image by Author

Conclusion

Building a decision tree classifier in Excel is surprisingly simple.

With a few formulas, it reveals the heart of the algorithm:

  • list the splits
  • compute the impurity
  • choose the split with the lowest impurity
Decision tree classifier in Excel – image by Author

This simple mechanism is the basis of many advanced tree-based models, which we will discuss later in this series.

So stay tuned for Day 8 tomorrow!
