Explained: How does L1 regularization perform feature selection?

Feature selection is the process of choosing an appropriate subset of features from the set of available ones, such that the selected subset best supports the model's performance on the task at hand.
Feature selection can be explicit, as done with filter or wrapper methods. In these methods, features are added or removed based on a quantitative measure that reflects the feature's relevance in making the prediction. The measure can be information gain, variance, or the chi-squared statistic, and the algorithm decides to accept or reject a feature by comparing the measure against a predefined threshold. Note that these methods are not part of the model training phase and are applied before it.
Embedded methods, by contrast, perform feature selection automatically, without relying on any predefined criteria, deriving it from the training data itself. Here, feature selection is part of the model training phase: the model learns to select features and to make accurate predictions at the same time. In a later section, we will examine the role L1 regularization plays in doing this.
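To make the distinction concrete, here is a minimal sketch (assuming scikit-learn and synthetic data; the setup is illustrative, not from the original discussion) contrasting a filter method, which scores features before training, with an embedded method (Lasso, i.e., L1-regularized linear regression), which selects features during training:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                    # 5 candidate features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)   # only 2 matter

# Filter method: score features BEFORE training, keep the top-k.
selector = SelectKBest(score_func=mutual_info_regression, k=2).fit(X, y)
print("filter keeps features:", np.flatnonzero(selector.get_support()))

# Embedded method: L1-regularized regression zeroes out coefficients DURING training.
lasso = Lasso(alpha=0.1).fit(X, y)
print("lasso coefficients:", np.round(lasso.coef_, 3))  # irrelevant ones land at 0
```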
Regularization and model complexity
Regularization is the process of constraining a model's complexity to avoid overfitting and achieve better generalization on the task at hand.
Here, model complexity corresponds to the model's capacity to fit the training data. Consider a simple polynomial model in x with degree d: as we increase the degree d of the polynomial, the model gains greater flexibility to capture patterns in the observed data.
Overfitting and underfitting
If we fit a polynomial model with d = 2 to a set of training samples generated from a cubic polynomial plus some noise, the model will not be able to fit the samples well. The model is simply not flexible enough to capture data generated from a degree-3 (or higher-order) polynomial. We say that such a model underfits the training data.
Continuing the same example, suppose we now have a model with d = 6. With this extra flexibility, it should be easy to match the original cubic polynomial used to generate the data (e.g., by driving the coefficients of all terms with exponent > 3 to 0). Unless the training process is stopped at the right time, however, the model will go on to use its surplus flexibility to reduce the error on the noisy samples themselves. This lowers the training error further, but the model is now overfitting the training data. The noise will be different in a real-world setting (or in the evaluation phase), and any predictions based on it will be off, resulting in a higher test error.
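The same effect can be seen numerically. Below is a quick sketch (assuming NumPy; the true cubic, the sample sizes, and the noise level are illustrative choices) that fits polynomials of degree 2, 3, and 6 to noisy cubic data and compares train and test error:

```python
import numpy as np

rng = np.random.default_rng(42)
true_fn = lambda x: x**3 - 2.0 * x                       # underlying cubic
x_train = np.linspace(-2, 2, 15)
x_test = np.linspace(-2, 2, 100)
y_train = true_fn(x_train) + rng.normal(scale=0.5, size=x_train.size)
y_test = true_fn(x_test) + rng.normal(scale=0.5, size=x_test.size)

for d in (2, 3, 6):                                      # underfit, right fit, overfit
    coeffs = np.polyfit(x_train, y_train, deg=d)         # least-squares polynomial fit
    mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {d}: train MSE = {mse(x_train, y_train):.3f}, "
          f"test MSE = {mse(x_test, y_test):.3f}")
```

The degree-2 model shows high error on both splits (underfitting), while the degree-6 model typically achieves the lowest training error but a worse test error than degree 3 (overfitting).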
How do you determine the right complexity for the model?
In practical settings, we have little knowledge of the data-generating process or the true data distribution. Finding a model with the right complexity, so that it neither underfits nor overfits, is therefore challenging.
One approach is to begin with a powerful model and reduce its complexity through feature selection: the fewer the features, the lower the complexity of the model.
As discussed in the previous section, feature selection can be explicit (filter, wrapper methods) or embedded. Irrelevant features that contribute little to the prediction should be removed, so the model avoids latching onto spurious patterns. Regularization, in effect, does the same job. So, how does embedded feature selection connect to the broader goal of finding the right complexity?
L1 regularization as a feature selector
Continuing with our polynomial model, we represent it as f, with input x, parameters θ, and degree d,

$$f(x; \theta) = \sum_{i=0}^{d} \theta_i\, x^i = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_d x^d$$

In this polynomial model, each power of the input, x^i, may be viewed as a feature, giving a feature vector of the form,

$$\mathbf{x} = \begin{bmatrix} 1 & x & x^2 & \cdots & x^d \end{bmatrix}$$
We also define the objective function J, whose minimization yields the optimal parameters θ*; it includes an L1 penalty term to penalize model complexity,

$$\theta^{*} = \arg\min_{\theta}\, J(\theta), \qquad J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \big(f(x_n; \theta) - y_n\big)^2 + \lambda \sum_{j=0}^{d} \lvert \theta_j \rvert$$
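For concreteness, the objective translates directly into code (a minimal NumPy sketch; `l1_objective` is a hypothetical helper written for this post):

```python
import numpy as np

def l1_objective(theta, x, y, lam):
    """MSE plus L1 penalty for a polynomial model f(x; theta) = sum_i theta_i * x**i."""
    powers = np.vander(x, N=len(theta), increasing=True)  # feature columns: 1, x, x^2, ...
    preds = powers @ theta                                # model predictions
    mse = np.mean((preds - y) ** 2)
    return mse + lam * np.sum(np.abs(theta))              # loss + L1 penalty
```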
To minimize this objective, we need to examine all of its critical points, that is, points where its derivative is zero or undefined.
The partial derivative of J with respect to a parameter θ_j can be written as,

$$\frac{\partial J}{\partial \theta_j} = \frac{2}{N} \sum_{n=1}^{N} \big(f(x_n; \theta) - y_n\big)\, x_n^{\,j} + \lambda\, \mathrm{sgn}(\theta_j)$$
where the sgn function is defined as,

$$\mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$
Note: the true derivative of the absolute-value function is different from the sgn function specified above. The true derivative is undefined at x = 0; instead, subgradients are used in ML frameworks that include automatic differentiation. See this thread on the PyTorch forum.
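You can check what a framework actually returns at the kink with a tiny experiment (assuming PyTorch; the chosen values are arbitrary). PyTorch picks 0 as the subgradient of |x| at x = 0, which matches the sgn function above:

```python
import torch

for val in (-1.5, 0.0, 2.0):
    x = torch.tensor(val, requires_grad=True)
    x.abs().backward()                      # d|x|/dx via autograd
    print(f"x = {val:+.1f} -> grad = {x.grad.item():+.1f}")
# x = -1.5 -> grad = -1.0
# x = +0.0 -> grad = +0.0   (one valid subgradient at the kink)
# x = +2.0 -> grad = +1.0
```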
By differentiating the objective with respect to a single parameter θ_j and setting the result to zero, we obtain an equation relating the optimal value of θ_j to the predictions, targets, and features,

$$\frac{2}{N} \sum_{n=1}^{N} \big(f(x_n; \theta^{*}) - y_n\big)\, x_n^{\,j} = -\,\lambda\, \mathrm{sgn}(\theta_j^{*})$$
Let's examine the equation above. If we assume that the inputs and targets are mean-centered (i.e., the data was standardized in a preprocessing step), the LHS term is effectively the covariance between the j-th feature and the difference between the predicted values and the targets (the residual).
Mathematically, the covariance between two variables quantifies how much one variable varies with the other (and vice versa).
The sign function on the RHS forces the covariance on the LHS to take one of only three values, ±λ or 0 (since the sign function returns only -1, 0, and 1). If a feature varies independently of the residual and contributes nothing to the predictions, the covariance will be almost zero, driving the corresponding optimal parameter θ_j* to zero. This effectively removes the feature from the model.
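Here is a small numerical sketch of that argument (NumPy, with illustrative synthetic data). Starting from a model with all parameters at zero, the relevant feature has a large covariance with the residual, which pulls its parameter away from zero, while the irrelevant feature's covariance is nearly zero, so L1 regularization keeps its parameter pinned at zero:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
x = rng.normal(size=n)                        # relevant feature
noise_feature = rng.normal(size=n)            # irrelevant feature
y = 2.0 * x + rng.normal(scale=0.3, size=n)   # target depends on x only

preds = np.zeros(n)                           # model with all parameters at zero
residual = preds - y

print("cov(x, residual)     =", round(np.cov(x, residual)[0, 1], 3))             # large in magnitude
print("cov(noise, residual) =", round(np.cov(noise_feature, residual)[0, 1], 3)) # ~ 0
```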
Think of the sign function as a canyon carved by a river. You can descend into the canyon (i.e., onto the river bed), but to get out of it you have to climb its steep walls. L1 regularization carves a similar canyon into the gradient landscape of the loss: the gradient must be strong enough to climb over the walls, or else the parameter settles at the bottom, finally being brought to zero.
For a more supportive example, consider data consisting of samples drawn from a straight line (a polynomial with two coefficients) plus additive noise. An appropriate model should contain no more than two parameters; with any more, it becomes able to fit the noise present in the data (using the extra freedom of the higher powers in the polynomial). Varying the higher-power parameters in such a polynomial model does little to reduce the difference between the targets and the predictions, so their covariance with the residual stays small, and L1 regularization prunes those features.
During training, a constant step is added to or subtracted from the gradient step of the loss function. If the gradient of the loss (the MSE) is smaller than this constant step, the parameter is eventually driven down to 0. Consider the equation below, showing how a parameter is updated with gradient descent (α is the learning rate),

$$\Delta\theta_j = \underbrace{-\,\alpha\, \frac{\partial\, \mathrm{MSE}}{\partial \theta_j}}_{\text{loss-gradient term}} \;\; \underbrace{-\,\lambda \alpha\, \mathrm{sgn}(\theta_j)}_{\text{regularization step}}$$
If the loss-gradient term above is smaller in magnitude than λα, Δθ_j behaves approximately like the constant step λα. The direction of this step (the regularization step) is given by sgn(θ_j), so it depends on θ_j: if θ_j is greater than 0, sgn(θ_j) equals 1, and Δθ_j is approximately equal to -λα, pushing it toward zero. For example, with α = 0.01, λ = 1, and a loss gradient of 0.1, the update is -0.001 - 0.01 = -0.011, dominated by the regularization step.
For a parameter to resist this constant pull (the regularization step) toward zero, the gradient of the loss (the loss-gradient term) must be larger than the constant step. And for the loss gradient to be large, the feature's value must meaningfully affect the model's output.
This is how a feature, or more precisely its corresponding parameter, whose contribution does not align with reducing the model's error, is zeroed out by L1 regularization.
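Putting the whole argument together, here is a minimal subgradient-descent sketch (NumPy; the learning rate, λ, and data are illustrative choices, not from any particular library). The data comes from a straight line plus noise, while the model is a degree-5 polynomial; the constant ±λα step drives the higher-power coefficients to (or very near) zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
x = rng.uniform(-1, 1, size=n)
y = 1.5 * x + 0.5 + rng.normal(scale=0.1, size=n)       # line + noise

X = np.vander(x, N=d + 1, increasing=True)              # features: 1, x, ..., x^5
theta = rng.normal(scale=0.1, size=d + 1)
alpha, lam = 0.05, 0.1

for _ in range(5000):
    residual = X @ theta - y
    grad_mse = (2.0 / n) * X.T @ residual               # loss-gradient term
    theta -= alpha * (grad_mse + lam * np.sign(theta))  # plus the regularization step

print(np.round(theta, 2))  # coefficients for x^2 .. x^5 end up at (or near) zero
```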
Further reading and conclusion
- To find more perspectives on the topic, I posted a question on the r/MachineLearning subreddit, and the resulting thread contains several explanations you may want to read.
- Madiyar Aitbayev has an interesting blog post that covers the same question, but with a geometric interpretation.
- Brian's blog explains the same idea from a different point of view.
- This CrossValidated thread explains why L1 regularization encourages sparse models. A detailed blog post by Ranjan explains why L1 regularization pushes parameters to become zero while L2 regularization does not.
"L1 regularization performs feature selection" is a simple statement that most ML practitioners agree with, without digging into the depths of how it actually works. This blog post is an attempt to present my understanding and mental model, so that students can answer the question in a rigorous way. For suggestions and doubts, you can find my email on my website. Keep reading, and have a good day ahead!