Machine Learning

Real-World Use Cases: Strategies to Bridge the Gap Between Development and Production | by Hampus Gustavsson | January 2025

Photo courtesy of Dall-e. All images and visuals in this article are created by the author.

Data science shows its value when applied to real-world challenges. This article shares insights gained from hands-on machine learning projects.

In my experience in machine learning and data science, going from development to production is a critical and challenging phase. This process usually occurs in iterative steps, gradually refining the product until it reaches acceptable standards. Along the way, I've seen recurring pitfalls that often delay the journey to production.

This article examines some of these challenges, focusing on the pre-release process. A separate article will cover the post-production life cycle of a project in more detail.

I believe that the iterative cycle is an inherent part of the development process, and my goal is to streamline it, not eliminate it. To make the concepts more tangible, I will use the Kaggle Fraud Detection dataset (DbCL license) as an example. For modeling, I will use TabNet, together with Optuna for hyperparameter optimization. For an in-depth explanation of these tools, please refer to my previous article.
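As a rough sketch of how such a setup could look, the snippet below wires pytorch-tabnet and Optuna together. The file name, search space, epoch count, and trial count are illustrative assumptions, not the settings used for the results in this article.

```python
import optuna
import pandas as pd
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.metrics import f1_score

# Hypothetical file name; the Kaggle credit card fraud data ships as a CSV with a "Class" label.
df = pd.read_csv("creditcard.csv").sort_values("Time")
X = df.drop(columns=["Class"]).values
y = df["Class"].values
split = int(0.8 * len(df))
X_train, y_train, X_test, y_test = X[:split], y[:split], X[split:], y[split:]

def objective(trial):
    # Illustrative search space, not the exact one used for the article's results.
    clf = TabNetClassifier(
        n_d=trial.suggest_int("n_d", 8, 64),
        n_a=trial.suggest_int("n_a", 8, 64),
        gamma=trial.suggest_float("gamma", 1.0, 2.0),
    )
    clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], max_epochs=20)
    return f1_score(y_test, clf.predict(X_test))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```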

Designing Loss Functions and Metrics for Impact

When starting a new project, it is important to clearly define the end goal. For example, in fraud detection, the qualitative objective, catching fraudulent transactions, must be translated into statistical terms that guide the model-building process.

The default choice is often the F1 score for measuring results and the unweighted binary cross-entropy loss, BCE loss, for classification problems. And for good reason: these are excellent, robust choices for model training and evaluation. This approach remains a valid starting point even for imbalanced datasets, as shown later in this section.

To illustrate, we will establish a baseline model trained with unweighted BCE loss and evaluated with the F1 score (a minimal sketch of the evaluation step follows). The resulting confusion matrix is shown below.
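A minimal sketch of this evaluation step; the label and prediction arrays are placeholders standing in for the real test labels and baseline model outputs.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Placeholder arrays; in practice these are the test labels and the baseline model's predictions.
y_test = [0, 0, 0, 1, 1, 0, 1, 0]
baseline_preds = [0, 0, 0, 1, 0, 0, 1, 1]

print(confusion_matrix(y_test, baseline_preds))
print("F1 score:", f1_score(y_test, baseline_preds))
```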

Confusion matrix showing the results of a model trained with unweighted (0.5) BCE loss and evaluated with the F1 score.

The model shows reasonable performance but struggles to detect fraud, missing 13 cases while producing only one false positive. From a business perspective, letting a fraudulent transaction go through may be worse than incorrectly flagging a legitimate one. Adjusting the loss function and evaluation metric to align with business priorities can lead to a more suitable model.

To guide model selection toward prioritizing a certain class, we adopted the F-beta metric. Looking at our model selection metric, F-beta, we can make the following derivation.

Rewriting the F-beta metric to obtain the desired statistic. Image by the author.

Here, a false negative is weighted as heavily as beta-squared false positives. Determining the right balance between false positives and false negatives is an inexact process, often tied to qualitative business objectives. In the next article, we'll dive deeper into how to derive beta from high-level business goals. For illustration, we will use a beta equal to the square root of 200, meaning that 200 extra false flags are accepted for every additional fraudulent transaction caught. Another thing to note is that as FN and FP go to zero, the metric goes to one, regardless of the choice of beta.
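For reference, the F-beta score can be written as F_beta = (1 + beta^2)·TP / ((1 + beta^2)·TP + beta^2·FN + FP), which makes the beta-squared weighting of false negatives explicit. Below is a minimal sketch of computing it with scikit-learn; the label and prediction arrays are placeholders.

```python
import numpy as np
from sklearn.metrics import fbeta_score

# beta**2 = 200: one missed fraud is weighted like 200 false alarms.
beta = np.sqrt(200)

# Placeholder labels and predictions (1 = fraud, 0 = legitimate).
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]

print(f"F-beta (beta = {beta:.1f}):", fbeta_score(y_true, y_pred, beta=beta))
```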

For our loss function, we somewhat arbitrarily chose a weight of 0.995 for fraudulent data points and 0.005 for non-fraudulent ones.
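A minimal sketch of how such class weights can be encoded in a PyTorch loss; the weight values are the ones chosen above, while the dummy logits and labels are placeholders. If you are using pytorch-tabnet, a custom loss like this can typically be passed to the classifier's fit method via its loss_fn argument, but check the version you are running.

```python
import torch
import torch.nn as nn

# Index 0 = legitimate, index 1 = fraud; weights mirror the 0.005 / 0.995 choice above.
class_weights = torch.tensor([0.005, 0.995])
weighted_loss = nn.CrossEntropyLoss(weight=class_weights)

# Dummy logits (4 samples, 2 classes) and labels, just to show the call.
logits = torch.randn(4, 2)
labels = torch.tensor([0, 0, 1, 0])
print(weighted_loss(logits, labels))
```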

Confusion matrix showing the results of the model trained with BCE loss weighted at 0.995 and evaluated with the F-beta score (beta = √200 ≈ 14).

Results from the updated model on the test set are shown above. Compared to the baseline, our second model trades 16 additional false positives for two fewer false negatives. This trade-off is in line with the shift we were hoping for.

Prioritize Representative Metrics Over Inflated Ones

In data science, competition for resources is common, and presenting inflated results can be tempting. While this may win short-term approval, it often leads to stakeholder frustration and unrealistic expectations.

Instead, presenting metrics that accurately represent the current state of the model fosters better long-term relationships and realistic project planning. Here is a straightforward way to do that.

Split the data appropriately.

Partition the dataset to reflect real-world conditions as closely as possible. If your data has a temporal dimension, use it to create meaningful splits. I covered this in a previous article, for those who want to see more examples.

For the Kaggle dataset, we will assume the data is ordered by time, using the Time column. We do a train-test-validation split of 80%, 10%, and 10%. These sets can be thought of as follows: you train on the training set, optimize hyperparameters on the test set, and present the metrics obtained on the validation set.
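A minimal sketch of such a chronological split, assuming the data is available as a CSV with a Time column (as in the Kaggle credit card fraud dataset); the file name is an assumption.

```python
import pandas as pd

# Hypothetical file name; the dataset has a "Time" column and a "Class" label (1 = fraud).
df = pd.read_csv("creditcard.csv").sort_values("Time").reset_index(drop=True)

n = len(df)
train_end = int(0.8 * n)
test_end = int(0.9 * n)

train_df = df.iloc[:train_end]         # fit the model here
test_df = df.iloc[train_end:test_end]  # tune hyperparameters here
val_df = df.iloc[test_end:]            # report final metrics here
```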

Note that in the previous section we looked at results on the test set, i.e. the one used to optimize the hyperparameters. We will now turn to the held-out validation set.

Confusion matrix on the validation dataset, with beta = 1 and unweighted loss. Image by the author.
Confusion matrix on the validation dataset, with beta ≈ 14 and weighted loss. Image by the author.

We see a drop in recall from 75% to 68% for the baseline model and from 79% to 72% for the weighted model. This is expected, as a degree of overfitting to the test set occurs during model selection. The validation set, however, provides a more reliable estimate of performance.

Be Aware of Model Uncertainty

As with human decision-making, some data points are more difficult than others to assess, and the same holds from a modeling perspective. Handling this uncertainty explicitly can make model deployment smoother. Ask, for the business purpose at hand: do we need to classify every data point? Is a point estimate required, or is a range sufficient? Perhaps start by focusing only on limited, high-confidence predictions.

These are two possible scenarios, with their respective solutions.

Classification

If the task is classification, consider applying a threshold to the model's output probabilities. That way, only labels the model is confident about are returned; for the remaining data points, the model abstains instead of outputting a label. I have covered this in depth in this article.
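A minimal sketch of such an abstaining classifier, assuming the model outputs a probability for the fraud class; the threshold value is an illustrative assumption and should come from the business use case.

```python
import numpy as np

def predict_with_abstention(probs: np.ndarray, threshold: float = 0.9):
    """Return 1/0 labels only where the model is confident; None (abstain) otherwise.

    `probs` is the predicted probability of the positive (fraud) class.
    """
    labels = []
    for p in probs:
        if p >= threshold:
            labels.append(1)     # confident fraud
        elif p <= 1 - threshold:
            labels.append(0)     # confident legitimate
        else:
            labels.append(None)  # too uncertain: defer, e.g. to manual review
    return labels

print(predict_with_abstention(np.array([0.97, 0.40, 0.02])))  # [1, None, 0]
```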

Regression

The regression counterpart of thresholding in the classification case is to present a confidence interval rather than a point estimate. The width of the interval is determined by the business use case, but the trade-off is, of course, between predictive precision and predictive certainty. This topic is discussed in a previous article.
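One way to produce such an interval (not necessarily the approach referred to above) is quantile regression. Below is a minimal sketch with scikit-learn's GradientBoostingRegressor on toy data, where the 5th and 95th percentile models bound a nominal 90% interval.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy data standing in for a regression problem; in practice X, y come from the project.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

# One quantile model per interval bound (5th and 95th percentiles).
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, y)

X_new = np.array([[2.5]])
print("interval:", lower.predict(X_new)[0], "to", upper.predict(X_new)[0])
```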

Model Explainability

Incorporating explainability into the model is advisable whenever possible. While the concept is model-agnostic, its implementation can vary depending on the type of model.

The importance of model explainability is twofold. The first is building trust. Machine learning is still met with skepticism in some circles, and transparency helps reduce this skepticism by making the model's behavior understandable and its decisions justifiable.

The second is detecting overfitting. If the model's decision-making process does not align with domain knowledge, it may be a sign of overfitting to noisy training data. Such a model risks generalizing poorly when exposed to new data in production. Conversely, the explanations can reveal surprising details that add to the subject-matter expertise.

In our use case, we will examine feature importances to get a clearer understanding of the model's behavior. Feature importance scores indicate how much each feature, on average, contributes to the model's predictions.

These are the average importance scores across all features in the dataset, indicating how much each one is used, on average, to determine the class label.
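A minimal sketch of extracting global feature importances from a pytorch-tabnet classifier; the file name and the short epoch count are assumptions made to keep the example small, and attribute names may differ between library versions.

```python
import pandas as pd
from pytorch_tabnet.tab_model import TabNetClassifier

# Hypothetical file name; columns follow the Kaggle credit card fraud schema.
df = pd.read_csv("creditcard.csv")
feature_cols = [c for c in df.columns if c != "Class"]

clf = TabNetClassifier()
clf.fit(df[feature_cols].values, df["Class"].values, max_epochs=10)

# pytorch-tabnet exposes global importances (normalized to sum to 1) after fitting.
importances = pd.Series(clf.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False).head(10))
```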

Bear in mind that this dataset is anonymized, so there is little for a domain expert to interpret here. But I've been on projects where feature importance analysis provided insights into marketing effectiveness and surfaced key predictors in technical systems, such as within predictive maintenance projects. However, a common reaction from subject matter experts (SMEs) is often a reassuring “Yes, these findings make sense to us.”

An in-depth article exploring various model explainability techniques and their implementations is forthcoming.

Data and Label Drift in Production Systems

A common but dangerous assumption is that the data and distribution of labels will remain static over time. Based on my experience, this assumption rarely holds, except for certain highly regulated technical applications. Data drift — changes in the distribution of features or labels over time — is a natural phenomenon. Instead of resisting it, we should embrace it and incorporate it into the design of our system.

A few things to consider: we can try to build a model that is more robust to change, or we can set up a drift monitoring system, act on its results, and make a plan for when and how to retrain the model. An in-depth article on drift detection and modeling techniques will follow, covering both data and label drift as well as retraining and monitoring strategies.

For our example, we will use the Python library Deepchecks to analyze feature drift in the Kaggle dataset. In particular, we will inspect the feature with the highest Kolmogorov-Smirnov (KS) score, i.e. the one showing the greatest drift. We measure the drift between the training and test sets.
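A minimal sketch of running such a drift check with Deepchecks; the file name and split follow the earlier assumptions, and the check class is named FeatureDrift in recent Deepchecks releases (TrainTestFeatureDrift in older ones), with Kolmogorov-Smirnov among the supported drift measures for numeric features.

```python
import pandas as pd
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import FeatureDrift  # TrainTestFeatureDrift in older versions

# Hypothetical file name; chronological 80/10 train/test split as described earlier.
df = pd.read_csv("creditcard.csv").sort_values("Time").reset_index(drop=True)
n = len(df)
train_ds = Dataset(df.iloc[: int(0.8 * n)], label="Class")
test_ds = Dataset(df.iloc[int(0.8 * n) : int(0.9 * n)], label="Class")

# Run the per-feature drift check between the training and test sets.
result = FeatureDrift().run(train_dataset=train_ds, test_dataset=test_ds)
result.show()  # or result.save_as_html("drift_report.html")
```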

Although it is difficult to predict how data will change in the future, we can be sure that it will change. Planning for this inevitability is critical to maintaining robust and reliable machine learning systems.

Summary

Bridging the gap between machine learning development and production is no small feat; it's an iterative journey full of pitfalls and learning opportunities. This article dove into the critical pre-production phase, focusing on aligning metrics with business goals, managing model uncertainty, and ensuring transparency. By aligning technical choices with business priorities, we explored strategies such as adjusting loss functions, applying confidence thresholds, and monitoring data drift. After all, a model is only as good as its ability to adapt, much like us humans.

Thanks for taking the time to check out this article.

I hope this article has provided valuable insight and inspiration. If you have any other ideas or questions, please get in touch. You can also connect with me on LinkedIn.
