Machine Learning

How to Build a Benchmark for Your Models

I have been a data science consultant for the last three years, and I have had the opportunity to work on many projects across different industries. Still, I noticed one common denominator among most of the clients I worked with:

They rarely have a clear idea of the purpose of the project.

This is one of the main issues data scientists face, especially now that Gen AI is taking over every domain.

But let's suppose that after some back and forth, the purpose becomes clear. We managed to narrow it down to a specific question to answer. For example:

I want to classify my customers into two groups according to their probability to churn: “High Likelihood to Churn” and “Low Likelihood to Churn”

Well, now what? Easy, let's start building some models!

Wrong!

If having a clear goal is rare, having a reliable benchmark is even rarer.

In my opinion, one of the most important steps in delivering a data science project is defining and agreeing on a set of benchmarks with the client.

In this blog post, I will explain:

  • What a benchmark is,
  • Why it is important to have one,
  • How I would build one, and
  • Some potential drawbacks to keep in mind

What is a benchmark?

A benchmark is a standardized way to evaluate a model's performance. It provides a reference point against which new models can be compared.

A benchmark needs two key components to be considered complete:

  1. A set of evaluation metrics
  2. A set of simple models to use as baselines

The core idea is simple: every time I develop a new model, I compare it against both the previous versions and the baseline models. This way I can ensure improvements are real and tracked.

It is essential to understand that this baseline shouldn't be model- or dataset-specific, but rather business-case-specific. It should be a general benchmark for a given business case.

If I encounter new data, but the business case stays the same, this benchmark should remain a reliable point of reference.


Why building a benchmark is important

Now that we have defined what a benchmark is, let's dive into why I believe it is worth spending the extra project time on building a strong one.

  1. Without a benchmark you're navigating blind – If you work without a clear reference point, any result loses meaning. “My model has a MAE of 30,000.” Is that good? Who knows! Maybe with a simple mean predictor you would get a MAE of 25,000. By comparing your model to a baseline, you can measure both performance and improvement.
  2. Improves communication with clients – Clients and business teams might not immediately understand the standard output metrics of a model. However, by engaging them with simple baselines from the start, it becomes easier to demonstrate improvements later. In many cases benchmarks can come directly from the business in different shapes or forms.
  3. Helps in model selection – A benchmark gives a starting point to compare multiple models fairly. Without it, you might waste time testing models that aren't worth considering.
  4. Model drift detection and monitoring – Models can degrade over time. By having a benchmark you may be able to intercept drift early, by comparing new models' outputs against past benchmarks and baselines.
  5. Consistency between different datasets – Datasets evolve over time. By having a fixed set of metrics and models, you ensure that performance comparisons remain valid.
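Point 1 can be made concrete with a quick sketch: comparing a model against a naive mean-prediction baseline using MAE (all numbers here are illustrative, not from the churn dataset):

```python
import numpy as np

# Toy ground truth and "real" model predictions (illustrative numbers only)
y_true = np.array([100_000, 150_000, 90_000, 120_000])
model_preds = np.array([130_000, 120_000, 110_000, 140_000])

# Naive baseline: always predict the mean of the observed targets
baseline_preds = np.full_like(y_true, y_true.mean(), dtype=float)

def mae(y, p):
    return float(np.mean(np.abs(y - p)))

print(f"Model MAE:    {mae(y_true, model_preds):,.0f}")     # 25,000
print(f"Baseline MAE: {mae(y_true, baseline_preds):,.0f}")  # 20,000
```

Here the model's MAE of 25,000 is actually worse than the naive mean's 20,000, which is exactly the kind of situation a baseline exposes.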

With a clear benchmark, every step of the model development process will provide immediate feedback, making the whole process more intentional and data-driven.


How to build a benchmark

I hope I have convinced you of the importance of having a benchmark. Now, let's actually build one.

Let's start from the business question we presented at the very beginning of this blog post:

I want to classify my customers into two groups according to their probability to churn: “High Likelihood to Churn” and “Low Likelihood to Churn”

For simplicity, we will assume there are no other business constraints, but in real-world scenarios, constraints often exist.

For this example, I am using this dataset (CC0: Public Domain). The data contains some attributes of a company's customers (e.g. age, gender, number of products, …) along with their churn status.

Now that we have something to work on, let's build the benchmark:

1. Defining the metrics

We are dealing with churn, which in particular is a binary classification problem. Thus the main metrics that we could use are:

  • Precision – Percentage of correctly predicted churners among all predicted churners
  • Recall – Percentage of actual churners correctly identified
  • F1 score – Balances precision and recall
  • True Positives, False Positives, True Negatives and False Negatives

These are some of the simple metrics that could be used to evaluate a model's output.
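As a quick illustration, all of these standard metrics are available out of the box in scikit-learn (the toy labels below are made up for demonstration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Toy ground truth and predictions (1 = churner, 0 = non-churner)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

print("Precision:", precision_score(y_true, y_pred))  # 0.75
print("Recall:   ", recall_score(y_true, y_pred))     # 0.75
print("F1:       ", f1_score(y_true, y_pred))         # 0.75

# Confusion-matrix cells in sklearn's (tn, fp, fn, tp) order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")             # TP=3 FP=1 TN=3 FN=1
```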

However, this is not an exhaustive list: standard metrics aren't always enough. In many use cases, it might be useful to build custom metrics.

Let's assume that in our business case the customers labeled as “high likelihood to churn” are offered a discount. This creates:

  • A cost ($250) when offering the discount to a customer who wasn't going to churn
  • A gain ($1,000) when retaining a customer who was going to churn

Following this definition, we can build a custom metric that will be crucial in our scenario:

# Defining the business case-specific reference metric
import numpy as np

def financial_gain(y_true, y_pred):
    # Cost of discounting a customer who wasn't going to churn (false positive)
    loss_from_fp = np.sum(np.logical_and(y_pred == 1, y_true == 0)) * 250
    # Gain from retaining a customer who was going to churn (true positive)
    gain_from_tp = np.sum(np.logical_and(y_pred == 1, y_true == 1)) * 1000
    return gain_from_tp - loss_from_fp
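A quick sanity check of the metric on toy labels (the function is repeated here, with its import, so the snippet runs standalone):

```python
import numpy as np

def financial_gain(y_true, y_pred):
    loss_from_fp = np.sum(np.logical_and(y_pred == 1, y_true == 0)) * 250
    gain_from_tp = np.sum(np.logical_and(y_pred == 1, y_true == 1)) * 1000
    return gain_from_tp - loss_from_fp

y_true = np.array([1, 0, 1, 0])  # actual churners
y_pred = np.array([1, 1, 0, 0])  # one true positive, one false positive

print(financial_gain(y_true, y_pred))  # 1 * 1000 - 1 * 250 = 750
```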

When the metric is driven by the business case, it is usually the most relevant one. Such a metric could take any shape or form: financial gain, minimum precision requirements, percentage of coverage, and more.

2. Defining the baselines

Now that we have defined our metrics, we can define a set of baseline models to use as a reference.

At this stage, you should define a list of simple-to-implement models in their simplest possible setup. There is no reason at this point to spend time and resources on the performance of these models; my mindset is:

If I had 15 minutes, how would I implement this model?

In later stages of the project, you can add new baseline models as the work proceeds.

In this case, I will use the following models:

  • A random model – assigns labels at random, weighted by the churn rate observed in the training data
  • A majority model – always predicts the most frequent class
  • A simple XGB
  • A simple KNN

import numpy as np
import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier

class BinaryMean():
    @staticmethod
    def run_benchmark(df_train, df_test):
        # Random labels, weighted by the churn rate observed in training
        np.random.seed(21)
        return np.random.choice(a=[1, 0], size=len(df_test), p=[df_train['y'].mean(), 1 - df_train['y'].mean()])

class SimpleXbg():
    @staticmethod
    def run_benchmark(df_train, df_test):
        # Out-of-the-box XGBoost on the numeric features only
        model = xgb.XGBClassifier()
        model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])
        return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))

class MajorityClass():
    @staticmethod
    def run_benchmark(df_train, df_test):
        # Always predict the most frequent class seen in training
        majority_class = df_train['y'].mode()[0]
        return np.full(len(df_test), majority_class)

class SimpleKNN():
    @staticmethod
    def run_benchmark(df_train, df_test):
        # Out-of-the-box k-nearest neighbours on the numeric features only
        model = KNeighborsClassifier()
        model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])
        return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))

And, just as with the metrics, we can build custom benchmarks.

Let's assume that in our business case the marketing team contacts every client who:

  • Is over 50 years old, and
  • Is no longer an active member

Following this rule, we can build the following model:

# Defining the business case-specific benchmark
class BusinessBenchmark():  
    @staticmethod  
    def run_benchmark(df_train, df_test):  
        df = df_test.copy()  
        df.loc[:,'y_hat'] = 0  
        df.loc[(df['IsActiveMember'] == 0) & (df['Age'] >= 50), 'y_hat'] = 1  
        return df['y_hat']
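To see the business rule in action, here is a standalone run on a handful of made-up rows (the class is repeated so the snippet is self-contained; the rule ignores df_train entirely):

```python
import pandas as pd

class BusinessBenchmark():
    @staticmethod
    def run_benchmark(df_train, df_test):
        df = df_test.copy()
        df.loc[:, 'y_hat'] = 0
        # Flag inactive members aged 50 or over as churners
        df.loc[(df['IsActiveMember'] == 0) & (df['Age'] >= 50), 'y_hat'] = 1
        return df['y_hat']

# Made-up test rows: only the first one matches both conditions
df_test = pd.DataFrame({'Age': [62, 45, 55], 'IsActiveMember': [0, 0, 1]})
print(BusinessBenchmark.run_benchmark(None, df_test).tolist())  # [1, 0, 0]
```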

Running the benchmark

To run the benchmark I will use the following class. The entry point is the method compare_pred_with_benchmark() which, given a prediction, runs all the baseline models and computes all the metrics.

import numpy as np

class ChurnBinaryBenchmark():
    def __init__(self, metrics=[], benchmark_models=[]):
        self.metrics = metrics
        self.benchmark_models = benchmark_models

    def compare_pred_with_benchmark(self, df_train, df_test, my_predictions):
        # Metrics for the model under evaluation
        output_metrics = {
            'Prediction': self._calculate_metrics(df_test['y'], my_predictions)
        }
        dct_benchmarks = {}

        # Metrics for every baseline model
        for model in self.benchmark_models:
            dct_benchmarks[model.__name__] = model.run_benchmark(df_train=df_train, df_test=df_test)
            output_metrics[f'Benchmark - {model.__name__}'] = self._calculate_metrics(df_test['y'], dct_benchmarks[model.__name__])

        return output_metrics

    def _calculate_metrics(self, y_true, y_pred):
        return {getattr(func, '__name__', 'Unknown'): func(y_true=y_true, y_pred=y_pred) for func in self.metrics}
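Before plugging in the real models, the runner can be sanity-checked end to end on toy data. The sketch below repeats the class so it runs standalone; the accuracy metric, the AlwaysZero baseline, and the tiny DataFrames are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

class ChurnBinaryBenchmark():
    def __init__(self, metrics=[], benchmark_models=[]):
        self.metrics = metrics
        self.benchmark_models = benchmark_models

    def compare_pred_with_benchmark(self, df_train, df_test, my_predictions):
        output_metrics = {'Prediction': self._calculate_metrics(df_test['y'], my_predictions)}
        for model in self.benchmark_models:
            preds = model.run_benchmark(df_train=df_train, df_test=df_test)
            output_metrics[f'Benchmark - {model.__name__}'] = self._calculate_metrics(df_test['y'], preds)
        return output_metrics

    def _calculate_metrics(self, y_true, y_pred):
        return {getattr(func, '__name__', 'Unknown'): func(y_true=y_true, y_pred=y_pred) for func in self.metrics}

# A toy metric and a trivial baseline, for demonstration only
def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

class AlwaysZero():
    @staticmethod
    def run_benchmark(df_train, df_test):
        return np.zeros(len(df_test), dtype=int)

df_train = pd.DataFrame({'y': [0, 0, 1, 1]})
df_test = pd.DataFrame({'y': [0, 1, 0, 1]})
my_predictions = np.array([0, 1, 0, 0])

bench = ChurnBinaryBenchmark(metrics=[accuracy], benchmark_models=[AlwaysZero])
res = bench.compare_pred_with_benchmark(df_train, df_test, my_predictions)
print(pd.DataFrame(res))
```

The resulting table shows the prediction at 0.75 accuracy against the trivial baseline's 0.50, which is exactly the kind of comparison the full benchmark produces at scale.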

Now all we need is a prediction. For this example, I did some quick feature engineering and hyperparameter tuning.

The last step is simply to run the benchmark:

import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# tp, tn, fp, fn are assumed to be small helper metrics counting
# confusion-matrix cells (not shown here); financial_gain is defined above
binary_benchmark = ChurnBinaryBenchmark(
    metrics=[f1_score, precision_score, recall_score, tp, tn, fp, fn, financial_gain],
    benchmark_models=[BinaryMean, SimpleXbg, MajorityClass, SimpleKNN, BusinessBenchmark]
)

res = binary_benchmark.compare_pred_with_benchmark(
    df_train=df_train,
    df_test=df_test,
    my_predictions=preds,
)

pd.DataFrame(res)
Benchmark metrics comparison across models | Image by the author

This generates a comparison table of all models across all metrics. From this table, it is possible to draw concrete conclusions about the model's predictions and make informed decisions about the next steps of the process.


Some drawbacks

As we have seen, there are plenty of reasons why it is useful to have a benchmark. However, even though benchmarks are incredibly useful, there are some pitfalls to keep in mind:

  1. Non-informative benchmark – When the metrics or models are poorly defined, the marginal impact of having a benchmark decreases. Always define meaningful baselines.
  2. Misinterpretation by stakeholders – Communication with the client is essential; it is important to state clearly what the metrics measure. The best model might not be the best on all the defined metrics.
  3. Overfitting to the benchmark – You might end up engineering features that are too specific: they may beat the benchmark, but they don't generalize well at prediction time. Don't focus on beating the benchmark; focus on creating the best solution to the problem.
  4. Change of objective – Objectives may change, due to miscommunication or changes in plans. Keep your benchmark flexible so it can adapt when needed.

Final thoughts

Benchmarks provide clarity, ensure improvements are measurable, and create a shared reference point between data scientists and clients. They help avoid the trap of assuming a model is performing well without proof, and ensure that every iteration brings real value.

They also serve as a communication tool, making it easier to explain progress to clients. Instead of simply presenting numbers, you can show clear comparisons that highlight improvements.

Here you can find a notebook with the full implementation from this blog post.
