
Machine Learning “Advent Calendar” Day 6: Regression Trees

During the first 5 days of this machine learning “Advent Calendar”, we tested 5 models (or algorithms), all based on distances (the Euclidean geometric distance, or the Mahalanobis global distance).

So it's time to change course, right? We will return to the concept of distance later.

For today, we will see something completely different: decision trees!

An introduction with a simple dataset

Let's use a simple dataset with one continuous feature.

As always, the idea is that you can first visualize the result yourself. Then you have to think about how to make the computer do it.

Decision tree on a simple dataset in Excel (created by myself) – image by Author

We can guess that for the first split, there are two candidate values, one around 5.5 and the other around 12.

Now the question is, which one do we choose?

This is exactly what we will find out: how do we determine the value of the first split, with an implementation in Excel?

Once we have determined the value of the first split, we can use the same process for the next split.

That is why we will only implement the first split in Excel.

The machine learning framework for the tree regressor

I wrote an article breaking machine learning down into three steps to learn it effectively, and we will apply this framework to the regression tree.

So first, we have a “true” machine learning model, with non-trivial steps for all three.

What is the model?

The model here is a set of rules that partition the data into groups, and for each group, we assign a value. Which value? The average value of y over all observations in the same group.

So while kNN predicts the mean value of the nearest neighbors (observations that are similar in terms of features), the regression tree predicts the mean value of a group of observations (similar in terms of the feature variable).

The fitting or training process

For a decision tree, this process is also called growing the tree. In the extreme case of a fully grown tree, each leaf contains only one observation, and therefore has an MSE of zero.

Growing a tree consists of iteratively partitioning the input data into smaller and smaller regions. In each region, a prediction can be computed.

In the case of regression, the prediction is the mean of the target variable in the region.

At each step of the growing process, the algorithm selects the feature and the threshold value that optimize a criterion; in the case of the regressor, this criterion is the mean squared error (MSE) between the actual values and the predictions.
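This split criterion can be sketched in a few lines of Python. The values of `x` and `y` below are illustrative (a step in y between x = 5 and x = 6), not the article's actual Excel data:

```python
import numpy as np

# Illustrative data with one continuous feature, in the spirit of the
# article's Excel dataset (the exact values are made up).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2, 3, 2, 3, 2, 9, 10, 9, 10, 9], dtype=float)

def split_mse(x, y, threshold):
    """MSE of predicting the per-region mean after splitting at `threshold`."""
    left, right = y[x < threshold], y[x >= threshold]
    sq_errors = np.concatenate([(left - left.mean()) ** 2,
                                (right - right.mean()) ** 2])
    return sq_errors.mean()

print(split_mse(x, y, 5.5))  # the split right at the step gives a low MSE
```

The tree-growing algorithm simply evaluates this quantity for every candidate threshold and keeps the best one.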

The regularization or pruning process

For decision trees, the standard way of regularizing the model is also called pruning, which can be seen as removing nodes and leaves from a fully grown tree.

It is equivalent to saying that the growing process stops when a criterion is met, such as a maximum depth or a minimum number of samples per leaf. These are the hyperparameters that can be tuned through cross-validation.

The inference process

Once the decision tree has been built, it can be used to predict the target variable for new inputs by applying the rules, traversing the tree from the root node down to a leaf node.

The predicted value for an input sample is the mean of the target values of the training samples that fall into the same region.
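For a depth-1 tree, inference is a single rule. Here is a minimal sketch; the threshold and the two leaf means are illustrative values, not taken from the article's data:

```python
# A depth-1 regression "tree" is just one rule. The threshold and leaf
# means below are hypothetical, corresponding to a split at x = 5.5.
def predict(x_new, threshold=5.5, left_mean=2.4, right_mean=9.4):
    """Route the sample to a leaf and return that leaf's training mean."""
    return left_mean if x_new < threshold else right_mean

print(predict(3.0))  # falls in the left region, gets the left leaf's mean
print(predict(8.0))  # falls in the right region, gets the right leaf's mean
```

A deeper tree just chains several such comparisons before reaching a leaf.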

Excel implementation of the first split

Here are the steps we will follow:

  • List all possible splits
  • For each split, calculate the MSE (mean squared error)
  • Choose the split that minimizes the MSE as the next split

All possible splits

First, we must list all the possible splits, which are the midpoints between two consecutive values. There is no need to check any other values.
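Listing the candidate splits is straightforward: sort the unique feature values and take the midpoints of consecutive pairs. The `x` values below are illustrative:

```python
import numpy as np

# Hypothetical feature values; candidate splits are the midpoints of
# consecutive sorted unique values, as described above.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
xs = np.unique(x)                     # sort and deduplicate
candidates = (xs[:-1] + xs[1:]) / 2   # midpoints of consecutive values
print(candidates)                     # [1.5 2.5 3.5 ... 9.5]
```

With n distinct values there are only n − 1 candidates to evaluate, which is why an exhaustive search is feasible.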

Decision tree in Excel with the possible splits – image by Author

Calculating the MSE for each split

As a baseline, we can calculate the MSE before any split. In that case, the prediction is just the mean value of y, and the MSE is equal to the variance of y.

Now, the idea is to find a split such that the MSE after the split is lower than before. It is possible that no split improves performance (i.e., reduces the MSE); in that case the final tree is trivial: it simply predicts the mean value of y.
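The baseline can be checked in two lines. With no split, the best constant prediction is the overall mean, and the resulting MSE is exactly the (population) variance of y. The `y` values are illustrative:

```python
import numpy as np

y = np.array([2, 3, 2, 3, 2, 9, 10, 9, 10, 9], dtype=float)  # made-up values

# With no split, we predict the overall mean of y for every observation;
# the resulting MSE is by definition the population variance of y.
baseline_mse = np.mean((y - y.mean()) ** 2)
print(baseline_mse)  # identical to np.var(y)
```

Any candidate split is then judged by how far it pushes the MSE below this baseline.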

For each split, we can then calculate the MSE (mean squared error). The figure below shows the calculation for the first possible split, x = 2.

MSE calculation in Excel for one possible split – image by Author

We can see the calculation details:

  1. Cut the dataset into two regions: for the split value x = 2, we define two regions, x < 2 and x > 2, so the x-axis is cut into two parts.
  2. Calculate the prediction: for each region, we calculate the mean of y. That is the prediction for the region.
  3. Calculate the error: we then compare the prediction to the actual value of y.
  4. Calculate the squared error: for each observation, we can calculate the squared error.

Decision tree in Excel with all possible splits – image by Author

Choosing the best split

For each split, we do the same calculation to obtain the MSE. In Excel, we can copy and paste the formulas; the only value that changes is the split value of x.

Decision tree in Excel with all the splits – image by Author

Then we can plot the MSE on the y-axis and the possible splits on the x-axis, and we see that the MSE is minimized at x = 5.5, which is the same result obtained directly with Python code.

Evolution of the MSE as a function of the split value – image by Author

Exercises you can try yourself

Now, you can play with the Google Sheet:

  • You can change the data; the MSE will be updated, and you will see the best split
  • You can introduce a second feature
  • You can try to find the next split
  • You can change the criterion: instead of MSE, you can use absolute_error, poisson, or friedman_mse, as shown in the DecisionTreeRegressor documentation
  • You can change the target variable to a binary variable; formally, this becomes a classification task, but 0 and 1 are also numbers, so the MSE criterion still works. However, if you want to build a proper classifier, you should use the entropy or Gini criterion. That is for the next article.

Conclusion

Using Excel, it is possible to implement one split and gain real insight into how regression trees work. Even if we didn't build a full tree, it's still instructive, because the essential part is finding the best split among all the possible splits.

One more thing

Have you noticed anything interesting about how features are handled by distance-based models versus decision trees?

In distance-based models, everything must be numeric. Continuous features stay continuous, and categorical features must be converted to numbers. The model compares points in a space, so everything has to live on a numerical axis.

Decision trees do something different: they cut. Features are grouped. A continuous feature is cut into intervals. A categorical feature lives in its categories.

And missing values? They are just another group. No need to impute first. A tree can handle them naturally by sending all the “missing” values to one branch, just like any other group.
