
LossVal Explained: Efficient Data Valuation for Neural Networks | by Tim Wibiral | January, 2025

Not all data is created equal: some training data points influence the training of a machine learning model more than others. Estimating the influence of each data point is often computationally expensive, because it typically relies on repeatedly retraining the model. LossVal takes a new approach: it efficiently integrates the data valuation process into the loss function of an artificial neural network.

Machine learning models are often trained on large datasets. In many cases, not all training samples in such a dataset are equally useful or informative for the model. For example, if a data point is noisy or mislabeled, it carries little useful information for your machine learning model. In one of the experiments in our paper, we trained a machine learning model on a car crash test dataset to predict how dangerous a crash would be for a passenger, based on some parameters of the car. Some of the data points come from cars built in the 80s and 90s! You can imagine that very old cars may not be that important for the model's predictions about today's cars.

The process of understanding the impact of each training sample on a machine learning model is called data valuation, where an importance score is assigned to each training sample. Data valuation is a growing field connected to data markets, explainable AI, active learning, and much more. Several methods have been proposed, such as Data Shapley, Influence Functions, or LAVA. To learn more about this, you can check out my recent blog post that introduces different data valuation methods and applications.

The basic idea behind LossVal is to "learn" the importance score of each sample while training the model, similar to how the model weights are learned. This saves us from rerunning the model training multiple times and from having to keep track of all model weight updates during training.

To achieve this, we can modify common loss functions such as the mean squared error (MSE) and the cross-entropy loss. We add instance-specific weights to the loss and multiply it by a weighted distance function. In general, the LossVal loss function has the following form:
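Reconstructed from the description above and from the return statement of the implementation further down (weighted_loss * dist_loss**2), the general form can be written as follows. The argument names X_train and X_val for the training and validation features are assumptions.

```latex
\mathrm{LossVal} = \mathcal{L}_w(y, \hat{y}) \cdot \mathrm{OT}_w(X_{train}, X_{val})^2
```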

where ℒ_w denotes the weighted target loss (weighted MSE or cross-entropy) and OT_w denotes the weighted distribution distance (OT stands for optimal transport). This results in a new loss function that can be used like any other loss function when training a neural network. However, during each training step, the weights w in the loss are updated using gradient descent.

We illustrate this for regression using the MSE and for classification using the cross-entropy loss. After that, we take a closer look at the distribution distance OT_w.

LossVal for Regression

Let's start with the MSE. The standard MSE is the squared difference between the model prediction ŷ and the correct target y (with n being the index of the training sample):
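In its usual form, averaged over the training samples (the symbol N for the number of samples is an assumption):

```latex
\mathrm{MSE}(y, \hat{y}) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2
```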

In LossVal, we adjust the MSE in two steps: First, a weight wₙ is included for each training instance n. Second, the entire MSE is multiplied by the weighted distribution distance function.
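Putting both steps together gives the loss below. This is a reconstruction from the two steps above and from the reference code further down, which sums over the batch instead of averaging, so the constant 1/N does not appear. The names N, X_train, and X_val are assumptions.

```latex
\mathrm{LossVal}_{\mathrm{MSE}}
  = \left[ \sum_{n=1}^{N} w_n \, (y_n - \hat{y}_n)^2 \right]
    \cdot \mathrm{OT}_w(X_{train}, X_{val})^2
```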

LossVal for Classification

The cross-entropy loss is often expressed as:
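A common way to write it, for one-hot labels y and predicted class probabilities ŷ (the symbols N and K for the number of samples and classes, and the class index k, are assumptions):

```latex
\mathrm{CE}(y, \hat{y}) = - \sum_{n=1}^{N} \sum_{k=1}^{K} y_{n,k} \, \log(\hat{y}_{n,k})
```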

We can modify the cross-entropy loss in the same way as the MSE:
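Applying the same two steps (a per-sample weight wₙ, then multiplication by the squared weighted distance) gives the following reconstruction. As before, N, K, X_train, and X_val are assumed names, and the squared distance follows the reference code further down.

```latex
\mathrm{LossVal}_{\mathrm{CE}}
  = \left[ - \sum_{n=1}^{N} w_n \sum_{k=1}^{K} y_{n,k} \, \log(\hat{y}_{n,k}) \right]
    \cdot \mathrm{OT}_w(X_{train}, X_{val})^2
```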

Optimal Transport Distance

The optimal transport distance is the minimum effort needed to transform one distribution into another. It is also known as the earth mover's distance, a name that comes from the analogy of finding the most efficient way to fill a hole with a pile of dirt. OT can be defined as:
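The definition can be reconstructed from the symbols explained in the next paragraph (the cost c, the transport plan γ, and the set Π(w, 1)). The summation limits N and J for the sizes of the training and validation sets are assumptions.

```latex
\mathrm{OT}_w(X_{train}, X_{val})
  = \min_{\gamma \in \Pi(w, 1)} \sum_{n=1}^{N} \sum_{j=1}^{J} c(x_n, x_j) \, \gamma_{n,j}
```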

Here, c is the cost of moving the point xₙ to xⱼ. Each γ is a possible transport plan, which describes how the points are moved. The optimal transport plan γ* is the one that requires the least effort (the smallest distribution distance). Note that the weights w enter the distance through the set of joint distributions Π(w, 1). In other words, OT_w is the weighted distance between the training set and the validation set. You can find an in-depth explanation of optimal transport here.

In practical terms, minimizing OT_w by changing the weights assigns higher weights to the training data points that are similar to the validation data. Effectively, noisy samples receive a smaller weight.
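A toy example can make this effect concrete. The sketch below is not the Sinkhorn distance used by LossVal. It assumes one-dimensional data and the cost |x − y|, where the earth mover's distance equals the area between the two weighted cumulative distribution functions. The function name weighted_emd_1d and the numbers are made up for illustration.

```python
def weighted_emd_1d(xs, x_weights, ys, y_weights):
    """Exact 1-D earth mover's distance between two weighted point sets."""
    total_x, total_y = sum(x_weights), sum(y_weights)
    # Each point shifts the gap between the two CDFs by its normalised weight:
    # upwards for the first set, downwards for the second set.
    events = sorted(
        [(x, w / total_x) for x, w in zip(xs, x_weights)]
        + [(y, -w / total_y) for y, w in zip(ys, y_weights)]
    )
    distance, cdf_gap, previous = 0.0, 0.0, events[0][0]
    for position, delta in events:
        # Area between the two CDFs over the interval since the previous point
        distance += abs(cdf_gap) * (position - previous)
        cdf_gap += delta
        previous = position
    return distance


train = [0.0, 0.0, 10.0]   # the last training point is an outlier
val = [0.0, 0.0]

uniform = weighted_emd_1d(train, [1, 1, 1], val, [1, 1])
downweighted = weighted_emd_1d(train, [1, 1, 0], val, [1, 1])
print(uniform, downweighted)
```

With uniform weights, a third of the training mass has to travel a distance of 10. Once the outlier's weight is set to zero, the two distributions coincide and the distance vanishes, which is the direction a gradient step on the weights would push in.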

Our implementation and all datasets are available on GitHub. The code below shows the implementation of LossVal for the mean squared error.

import torch
from geomloss import SamplesLoss  # provides the Sinkhorn distance used below


def LossVal_mse(train_X: torch.Tensor,
                train_y_true: torch.Tensor, train_y_pred: torch.Tensor,
                val_X: torch.Tensor, sample_ids: torch.Tensor,
                weights: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Select the weights corresponding to the sample_ids of this batch
    weights = weights.index_select(0, sample_ids)

    # Step 1: Compute the weighted MSE loss
    # (for cross-entropy, replace the next line with the per-sample cross-entropy)
    loss = torch.sum((train_y_true - train_y_pred) ** 2, dim=1)  # one value per sample
    weighted_loss = torch.sum(weights * loss)  # weights and loss are both vectors

    # Step 2: Compute the Sinkhorn distance between the training and validation distributions
    sinkhorn_distance = SamplesLoss(loss="sinkhorn")
    dist_loss = sinkhorn_distance(weights, train_X, torch.ones(val_X.shape[0], requires_grad=True).to(device), val_X)

    # Step 3: Combine MSE and Sinkhorn distance
    return weighted_loss * dist_loss**2

This loss function works like any other loss function in PyTorch, with a few peculiarities: its parameters include the validation set, the sample weights, and the indices of the samples in the batch. The indices are needed to select the correct weights for the batched samples when computing the weighted loss. Keep in mind that this implementation relies on PyTorch's automatic gradient calculation. This means that the sample weight vector needs to be part of the model parameters. In this way, the weights are optimized by the optimizer, such as Adam. Alternatively, one could also update the weights manually, using the gradient of the loss with respect to each weight. The cross-entropy implementation works analogously; only the line in Step 1 that computes the per-sample squared error needs to be replaced.
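The wiring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our training script: RegressorWithSampleWeights and stand_in_loss are hypothetical names, the data is random, and stand_in_loss replaces the Sinkhorn distance with the squared distance between the weighted training mean and the validation mean, so that the sketch runs with PyTorch alone. In practice, LossVal_mse would be called in its place.

```python
import torch
from torch import nn

torch.manual_seed(0)


class RegressorWithSampleWeights(nn.Module):
    """Hypothetical model that stores one learnable weight per training sample."""

    def __init__(self, n_features: int, n_train: int):
        super().__init__()
        self.net = nn.Linear(n_features, 1)
        # Registering the weights as a parameter lets the optimizer update them.
        self.sample_weights = nn.Parameter(torch.ones(n_train))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def stand_in_loss(train_X, y_true, y_pred, val_X, sample_ids, weights):
    # ASSUMPTION: simplified placeholder, NOT the Sinkhorn-based LossVal loss.
    w = weights.index_select(0, sample_ids)
    per_sample = torch.sum((y_true - y_pred) ** 2, dim=1)
    weighted_loss = torch.sum(w * per_sample)
    weighted_mean = (w.unsqueeze(1) * train_X).sum(dim=0) / w.sum()
    distance = torch.sum((weighted_mean - val_X.mean(dim=0)) ** 2)
    return weighted_loss * distance


train_X, train_y = torch.randn(32, 4), torch.randn(32, 1)
val_X = torch.randn(8, 4)
model = RegressorWithSampleWeights(n_features=4, n_train=32)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for step in range(5):
    sample_ids = torch.randperm(32)[:16]  # indices of the samples in this batch
    batch_X, batch_y = train_X[sample_ids], train_y[sample_ids]
    optimizer.zero_grad()
    loss = stand_in_loss(batch_X, batch_y, model(batch_X), val_X,
                         sample_ids, model.sample_weights)
    loss.backward()   # gradients flow to the network and to the sample weights
    optimizer.step()

data_values = model.sample_weights.detach()  # one learned score per training sample
```

Keeping the weights inside the module means a single optimizer handles both the network and the data values. After training, the weight vector can be read out directly as the importance scores.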

Benchmark comparison of different data valuation methods for noisy sample detection. Higher is better. (Image by author.)

The figure above shows a comparison between different data valuation methods on the noisy sample detection task. This task is defined by the OpenDataVal benchmark. First, noise is added to p% of the training data, and then the data valuation method is used to find the noisy samples. Better methods find more of the noisy samples and therefore achieve a higher F1 score. The graph above shows the average over 6 datasets for classification and 6 datasets for regression. We tested 3 different noise types: noisy labels, noisy features, and mixed noise. In the case of mixed noise, part of the noisy samples have feature noise and the other part have label noise. In noisy sample detection, LossVal outperforms all other methods for label noise and mixed noise. However, LAVA works best for feature noise.
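To make the scoring concrete, here is a small sketch of how an F1 score for noisy sample detection could be computed from a list of data values. It assumes that the p% lowest-valued samples are flagged as noisy; the thresholding used by the benchmark itself may differ. The function name noisy_detection_f1 and the example values are made up.

```python
def noisy_detection_f1(values, noisy_ids, noise_rate):
    """F1 score when the lowest-valued fraction of samples is flagged as noisy."""
    n_flagged = round(noise_rate * len(values))
    # Indices sorted from the lowest to the highest data value
    ranked = sorted(range(len(values)), key=lambda i: values[i])
    flagged = set(ranked[:n_flagged])
    true_positives = len(flagged & set(noisy_ids))
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(flagged)
    recall = true_positives / len(noisy_ids)
    return 2 * precision * recall / (precision + recall)


values = [0.9, 0.1, 0.8, 0.2, 0.7, 0.95]   # one data value per training sample

# Samples 1 and 3 are noisy and also have the two lowest values.
perfect = noisy_detection_f1(values, noisy_ids={1, 3}, noise_rate=1 / 3)
# Sample 4 is noisy but received a high value, so only one of two is found.
partial = noisy_detection_f1(values, noisy_ids={1, 4}, noise_rate=1 / 3)
print(perfect, partial)
```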

The experimental setup for the point removal experiment (image below) is similar. However, here the goal is to remove the highest-valued data points from the training set and see how a model trained on the remaining data performs. This means that a better data valuation method leads to a faster decline in model performance, because it removes the important data points earlier. We found that LossVal is on par with state-of-the-art methods.

Benchmark comparison of different data valuation methods for the removal of high-value data points. Lower is better. (Image by author.)

For more detailed results, see our paper.

The idea behind LossVal is simple: use gradient descent to find an optimal weight for each data point. The weight then indicates the importance of the data point.

Our experiments show that LossVal achieves state-of-the-art performance on the OpenDataVal benchmark. LossVal has the lowest time complexity among the model-based methods we tested and shows the most robust performance across different types of noise and tasks.

Overall, LossVal presents an efficient alternative to other state-of-the-art data valuation methods for neural networks.
