Machine Learning

Model Calibration, Explained: A Visual Guide with Code Examples for Beginners | by Samy Baladram | January 2025

MODEL EVALUATION & OPTIMIZATION

When all models have the same accuracy, then what?


You've trained several classification models, and they all seem to perform well with high accuracy scores. Congratulations!

But hold on – is one model really better than the others? Accuracy alone doesn't tell the whole story. What if one model consistently overstates its confidence, while another understates it? This is where model calibration comes in.

Here, we'll see what model calibration is and explore how to assess the reliability of your models' probability estimates – using visuals and working code examples to show you how to identify calibration problems. Get ready to go beyond accuracy and unlock the true potential of your machine learning models!

All visuals: Created by author using Canva Pro. Optimized for mobile; may appear too large on desktop.

Model calibration measures how well a model's predicted probabilities match its actual performance. A model that assigns a 70% probability to a prediction should be correct about 70% of the time for predictions with that score. In other words, the model's probability scores should reflect the true likelihood that its predictions are correct.
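
To make that statement concrete, here is a minimal sketch of how you could check it for one probability level. The labels and scores below are made up for illustration only: among predictions scored around 0.7, roughly 70% should turn out to be correct.

import numpy as np

# Hypothetical labels and predicted probabilities, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
y_prob = np.array([0.68, 0.72, 0.70, 0.69, 0.71, 0.73, 0.67, 0.70, 0.72, 0.69])

# Take all predictions scored near 0.7 and compare to the observed success rate
band = (y_prob >= 0.65) & (y_prob < 0.75)
observed = y_true[band].mean()
print(f"Average predicted: {y_prob[band].mean():.2f}, observed rate: {observed:.2f}")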

Why Calibration Matters

While accuracy tells us how often the model is correct, calibration tells us whether we can trust its probability scores. Two models may both have 90% accuracy, but one may give realistic probability scores while the other gives overconfident predictions. In many real applications, having reliable probability scores is just as important as having correct predictions.

Two models that are equally accurate (70% correct) show different levels of confidence in their predictions. Model A uses moderate probability scores (0.3 and 0.7) while Model B uses only extreme probabilities (0.0 and 1.0), meaning it is completely certain one way or the other about each prediction.

Perfect Calibration vs. Reality

A well-calibrated model will show an exact match between its predicted probabilities and its actual success rate: if it predicts with 90% probability, it should be correct 90% of the time. The same applies at every probability level.

However, most models are not perfectly calibrated. They can be:

  • Overconfident: giving probability scores that are too high for their actual performance
  • Underconfident: giving probability scores that are too low for their actual performance
  • Both: overconfident in some ranges and underconfident in others

The four models with the same accuracy (70%) show different calibration patterns. The overconfident model makes extreme predictions (0.0 or 1.0), while the underconfident model stays close to 0.5. The over- and under-confident model alternates between extreme and intermediate values. The well-calibrated model uses reasonable probabilities (0.3 for 'NO' and 0.7 for 'YES') that match its true performance.

This mismatch between predicted probabilities and actual accuracy can lead to poor decision-making when these models are used in real applications. This is why understanding and improving model calibration is necessary for building reliable machine learning systems.

To examine model calibration, we will continue with the same dataset used in my previous articles on classification: predicting whether someone will play golf based on weather conditions.

Columns: 'Outlook' (one-hot encoded into 3 columns), 'Temperature' (in Fahrenheit), 'Humidity' (in %), 'Wind' (Yes/No) and 'Play' (Yes/No, target feature)
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Create and prepare dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy',
'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast',
'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True,
True, False, True, True, False, False, True, False, True, True, False,
True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes',
'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
# Prepare data
df = pd.DataFrame(dataset_dict)

Before training our models, we standardized the numerical weather features with standard scaling and transformed the categorical features with one-hot encoding. These preprocessing steps ensure that all models can use the data effectively while keeping the comparison between them fair.

from sklearn.preprocessing import StandardScaler
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Rearrange columns
column_order = ['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']
df = df[column_order]

# Prepare features and target
X,y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Scale numerical features
scaler = StandardScaler()
X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])
X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])

Models and Training

In this experiment, we trained four classification models, all reaching the same accuracy score:

  • K nearest neighbors (kNN)
  • Bernoulli Naive Bayes
  • Logistic Regression
  • Multi-Layer Perceptron (MLP)

For those who want to know how these algorithms make predictions and compute their probabilities, you can check this article:

Although these models achieved similar accuracy for this simple problem, they calculated their prediction probabilities differently.

Although all four models are correct 85.7% of the time, they show different levels of confidence in their predictions. Here, the MLP model tends to be very confident in its answers (giving values close to 1.0), while the kNN model is more cautious, giving more varied confidence scores.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB

# Initialize the models with the found parameters
knn = KNeighborsClassifier(n_neighbors=4, weights='distance')
bnb = BernoulliNB()
lr = LogisticRegression(C=1, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(4, 2),random_state=42, max_iter=2000)

# Train all models
models = {
    'KNN': knn,
    'BNB': bnb,
    'LR': lr,
    'MLP': mlp
}

for name, model in models.items():
    model.fit(X_train, y_train)

# Collect the predicted probability of 'Yes' from each model
results_dict = {
    'True Labels': y_test
}

for name, model in models.items():
    # results_dict[f'{name} Pred'] = model.predict(X_test)
    results_dict[f'{name} Prob'] = model.predict_proba(X_test)[:, 1]

# Create results dataframe
results_df = pd.DataFrame(results_dict)

# Print predictions and probabilities
print("\nPredictions and Probabilities:")
print(results_df)

# Print accuracies
print("\nAccuracies:")
for name, model in models.items():
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracy:.3f}")

With this distinction in mind, let's explore why we need to look beyond accuracy.

To assess how well a model's predicted probabilities match its actual performance, we use several methods and metrics. These measurements help us understand whether our model's confidence levels are reliable.

Brier Score

The Brier Score measures the mean squared difference between predicted probabilities and actual outcomes. It ranges from 0 to 1, where lower scores indicate better calibration. This score is particularly useful because it considers both calibration and accuracy together.

The score (0.148) shows how well the model's confidence matches its actual performance. It is obtained by comparing the model's predicted probabilities with what actually happened (0 for 'NO', 1 for 'YES'), where a smaller difference means a better prediction.
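
As a quick illustration of the calculation, here is a minimal sketch with made-up labels and probabilities (not the article's results). scikit-learn's brier_score_loss, used later in the article, computes the same quantity.

import numpy as np
from sklearn.metrics import brier_score_loss

# Hypothetical labels and predicted probabilities, for illustration only
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.9, 0.6, 0.3, 0.8])

# Brier score = mean squared difference between predicted probability and outcome
manual = np.mean((y_prob - y_true) ** 2)
print(manual, brier_score_loss(y_true, y_prob))  # both print 0.068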

Log Loss

Log Loss calculates the negative log-likelihood of the correct class. This metric is especially sensitive to confident but incorrect predictions: if the model is 90% sure but wrong, it receives a much larger penalty than if it were 60% sure and wrong. Lower values indicate better calibration.

For each prediction, it looks at how much probability the model assigned to the correct answer. If the model is very confident but wrong (as in index 26), it receives a large penalty. The final score of 0.455 is the average of all these penalties, where lower numbers mean better predictions.
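
The sketch below illustrates that penalty asymmetry with made-up numbers (not the article's results); scikit-learn's log_loss, used later in the article, averages these per-sample penalties.

import numpy as np
from sklearn.metrics import log_loss

# Per-sample penalty is -log(probability assigned to the correct class)
print(-np.log(0.1))   # ~2.30: the model was 90% sure of the wrong class
print(-np.log(0.4))   # ~0.92: the model was 60% sure of the wrong class

# log_loss averages these penalties over all samples
y_true = [1, 0]        # true classes
y_prob = [0.1, 0.6]    # predicted probability of class 1 in each case (both wrong)
print(log_loss(y_true, y_prob))  # ~1.61, the mean of the two penalties above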

Expected Calibration Error (ECE)

ECE measures the average difference between predicted probability and actual accuracy (taken as the fraction of positive labels), weighted by how many predictions fall into each probability bin. This metric helps us understand whether our model has a systematic bias in its probability estimates.

The predictions are sorted into 5 bins based on how confident the model was. For each bin, we compare the model's average confidence with how often it was actually correct. The final result (0.1502) tells us how well these match, with lower numbers being better.
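
Here is the small helper that performs this computation; it is the calculate_ece function used in the code below and in the full listings at the end of the article. It bins the predictions, measures the gap between average confidence and observed accuracy in each bin, and weights each gap by the number of predictions in the bin.

import numpy as np

def calculate_ece(y_true, y_prob, n_bins=5):
    # Equal-width probability bins between 0 and 1
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0
    for bin_lower, bin_upper in zip(bins[:-1], bins[1:]):
        # Predictions whose probability falls into this bin
        mask = (y_prob >= bin_lower) & (y_prob < bin_upper)
        if np.sum(mask) > 0:
            bin_conf = np.mean(y_prob[mask])  # average predicted probability
            bin_acc = np.mean(y_true[mask])   # observed fraction of positives
            ece += np.abs(bin_conf - bin_acc) * np.sum(mask)
    return ece / len(y_true)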

Reliability Diagrams

Similar to ECE, a reliability diagram (or calibration curve) shows a model's calibration by binning the predictions and comparing them with the actual results. While ECE gives us a single number summarizing the calibration error, the reliability diagram shows the same information visually. We use the same binning approach and calculate the actual frequency of positive outcomes in each bin. When plotted, these points show us exactly where our model's predictions deviate from perfect calibration, which appears as a diagonal line.

As with ECE, predictions are grouped into 5 bins based on confidence level. Each dot shows how often the model was correct (y-axis) compared to how confident it was (x-axis). The dotted line shows a perfect match; the model's curve shows that it sometimes thinks it is better or worse than it really is.
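
Those per-bin points are exactly what scikit-learn's calibration_curve returns, which the plotting code below relies on. A minimal sketch with made-up values (not the article's results):

import numpy as np
from sklearn.calibration import calibration_curve

# Hypothetical labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_prob = np.array([0.10, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95])

# prob_pred: average predicted probability per bin (x-axis of the diagram)
# prob_true: observed fraction of positives per bin (y-axis of the diagram)
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5, strategy='uniform')
print(prob_pred, prob_true)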

Comparing Calibration Metrics

Each of these metrics reflects a different aspect of calibration problems:

  • A high Brier Score suggests poor overall probability estimates.
  • A high Log Loss points to overconfident incorrect predictions.
  • A high ECE indicates a systematic bias in the probability estimates.

Together, these metrics give us a complete picture of how well our model's probabilities reflect its true performance.

Our Models

For our models, let's calculate these calibration metrics (using the calculate_ece helper defined above) and draw their calibration curves:

from sklearn.metrics import brier_score_loss, log_loss
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Initialize models
models = {
    'k-Nearest Neighbors': KNeighborsClassifier(n_neighbors=4, weights='distance'),
    'Bernoulli Naive Bayes': BernoulliNB(),
    'Logistic Regression': LogisticRegression(C=1.5, random_state=42),
    'Multilayer Perceptron': MLPClassifier(hidden_layer_sizes=(4, 2), random_state=42, max_iter=2000)
}

# Get predictions and calculate metrics
metrics_dict = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_test)[:, 1]
    metrics_dict[name] = {
        'Brier Score': brier_score_loss(y_test, y_prob),
        'Log Loss': log_loss(y_test, y_prob),
        'ECE': calculate_ece(y_test, y_prob),
        'Probabilities': y_prob
    }

# Plot calibration curves
fig, axes = plt.subplots(2, 2, figsize=(8, 8), dpi=300)
colors = ['orangered', 'slategrey', 'gold', 'mediumorchid']

for idx, (name, metrics) in enumerate(metrics_dict.items()):
    ax = axes.ravel()[idx]
    prob_true, prob_pred = calibration_curve(y_test, metrics['Probabilities'],
                                             n_bins=5, strategy='uniform')

    ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
    ax.plot(prob_pred, prob_true, color=colors[idx], marker='o',
            label='Calibration curve', linewidth=2, markersize=8)

    title = f'{name}\nBrier: {metrics["Brier Score"]:.3f} | Log Loss: {metrics["Log Loss"]:.3f} | ECE: {metrics["ECE"]:.3f}'
    ax.set_title(title, fontsize=11, pad=10)
    ax.grid(True, alpha=0.7)
    ax.set_xlim([-0.05, 1.05])
    ax.set_ylim([-0.05, 1.05])
    ax.spines[['top', 'right', 'left', 'bottom']].set_visible(False)
    ax.legend(fontsize=10, loc='upper left')

plt.tight_layout()
plt.show()

Now, let's analyze each model's calibration performance based on these metrics:

The k-Nearest Neighbors (kNN) model does a good job of estimating how confident it should be in its predictions. Its curve stays close to the dotted line, indicating good calibration. It has strong scores: a Brier score of 0.148 and an excellent ECE of 0.090. Although it sometimes shows overconfidence in the middle range, it generally makes reliable estimates of its certainty.

The Bernoulli Naive Bayes model shows an unusual staircase pattern in its curve, meaning it jumps between different levels of certainty instead of changing smoothly. Although it has the same Brier score as kNN (0.148), its higher ECE of 0.150 shows that it is less accurate in estimating its certainty. The model alternates between being overconfident and underconfident.

The Logistic Regression model shows clear problems with its predictions. Its curve strays far from the dotted line, which means its confidence often does not match reality. It has a very poor ECE (0.181) and a poor Brier score (0.164). The model consistently shows too much confidence in its predictions, which makes it unreliable.

The Multilayer Perceptron has a different problem. Despite the best Brier score (0.129), its curve reveals that it mostly makes extreme predictions, either very certain or very uncertain, with little in between. Its high ECE (0.167) and the flat line in the middle range indicate that it struggles to make accurate estimates of its confidence.

After testing all four models, the k-Nearest Neighbors model performs best at estimating the certainty of its predictions. It maintains consistent performance across different confidence levels and shows the most reliable pattern in its predictions. While other models may score well on a particular metric (such as the Multilayer Perceptron's Brier score), their curves reveal that they cannot be relied upon when we need to trust their probability estimates.

When choosing between models, we need to consider both accuracy and calibration quality. A model with slightly lower accuracy but better calibration may be more valuable than a more accurate model with poorly calibrated probability estimates.

By understanding calibration and its importance, we can build more reliable machine learning systems that users can trust not only for their predictions, but also for the confidence behind those predictions.
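
The article focuses on diagnosing calibration rather than fixing it, but it is worth noting that scikit-learn also provides post-hoc calibration. The sketch below is not part of the original workflow: it uses synthetic data (the golf dataset is too small for this to be meaningful) and wraps a classifier in CalibratedClassifierCV, which learns a mapping from the model's raw scores to calibrated probabilities using internal cross-validation.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)

# Uncalibrated model vs. the same model with Platt scaling ('sigmoid');
# 'isotonic' is the other built-in option
raw = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                                    method='sigmoid', cv=5).fit(X_tr, y_tr)

# Compare calibration quality with the Brier score (lower is better)
print(brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1]))
print(brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1]))

Platt scaling fits a logistic curve to the model's scores on held-out folds, while isotonic regression fits a monotonic step function. The full code listings for the article follow below.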

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Define ECE
def calculate_ece(y_true, y_prob, n_bins=5):
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0
    for bin_lower, bin_upper in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= bin_lower) & (y_prob < bin_upper)
        if np.sum(mask) > 0:
            bin_conf = np.mean(y_prob[mask])
            bin_acc = np.mean(y_true[mask])
            ece += np.abs(bin_conf - bin_acc) * np.sum(mask)
    return ece / len(y_true)

# Create dataset and prepare data
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast','sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy','sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast','rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True,True, False, True, True, False, False, True, False, True, True, False,True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes','Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes','Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}

# Prepare and encode data
df = pd.DataFrame(dataset_dict)
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
df = df[['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']]

# Split and scale data
X, y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
scaler = StandardScaler()
X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])
X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])

# Train model and get predictions
model = BernoulliNB()
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

# Calculate metrics
metrics = {
    'Brier Score': brier_score_loss(y_test, y_prob),
    'Log Loss': log_loss(y_test, y_prob),
    'ECE': calculate_ece(y_test, y_prob)
}

# Plot calibration curve
plt.figure(figsize=(6, 6), dpi=300)
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=5, strategy='uniform')

plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
plt.plot(prob_pred, prob_true, color='slategrey', marker='o',
label='Calibration curve', linewidth=2, markersize=8)

title = f'Bernoulli Naive Bayes\nBrier: {metrics["Brier Score"]:.3f} | Log Loss: {metrics["Log Loss"]:.3f} | ECE: {metrics["ECE"]:.3f}'
plt.title(title, fontsize=11, pad=10)
plt.grid(True, alpha=0.7)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.gca().spines[['top', 'right', 'left', 'bottom']].set_visible(False)
plt.legend(fontsize=10, loc='lower right')

plt.tight_layout()
plt.show()

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Define ECE
def calculate_ece(y_true, y_prob, n_bins=5):
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0
    for bin_lower, bin_upper in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= bin_lower) & (y_prob < bin_upper)
        if np.sum(mask) > 0:
            bin_conf = np.mean(y_prob[mask])
            bin_acc = np.mean(y_true[mask])
            ece += np.abs(bin_conf - bin_acc) * np.sum(mask)
    return ece / len(y_true)

# Create dataset and prepare data
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast','sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy','sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast','rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True,True, False, True, True, False, False, True, False, True, True, False,True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes','Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes','Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}

# Prepare and encode data
df = pd.DataFrame(dataset_dict)
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
df = df[['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']]

# Split and scale data
X, y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
scaler = StandardScaler()
X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])
X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])

# Initialize models
models = {
    'k-Nearest Neighbors': KNeighborsClassifier(n_neighbors=4, weights='distance'),
    'Bernoulli Naive Bayes': BernoulliNB(),
    'Logistic Regression': LogisticRegression(C=1.5, random_state=42),
    'Multilayer Perceptron': MLPClassifier(hidden_layer_sizes=(4, 2), random_state=42, max_iter=2000)
}

# Get predictions and calculate metrics
metrics_dict = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_test)[:, 1]
    metrics_dict[name] = {
        'Brier Score': brier_score_loss(y_test, y_prob),
        'Log Loss': log_loss(y_test, y_prob),
        'ECE': calculate_ece(y_test, y_prob),
        'Probabilities': y_prob
    }

# Plot calibration curves
fig, axes = plt.subplots(2, 2, figsize=(8, 8), dpi=300)
colors = ['orangered', 'slategrey', 'gold', 'mediumorchid']

for idx, (name, metrics) in enumerate(metrics_dict.items()):
    ax = axes.ravel()[idx]
    prob_true, prob_pred = calibration_curve(y_test, metrics['Probabilities'],
                                             n_bins=5, strategy='uniform')

    ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
    ax.plot(prob_pred, prob_true, color=colors[idx], marker='o',
            label='Calibration curve', linewidth=2, markersize=8)

    title = f'{name}\nBrier: {metrics["Brier Score"]:.3f} | Log Loss: {metrics["Log Loss"]:.3f} | ECE: {metrics["ECE"]:.3f}'
    ax.set_title(title, fontsize=11, pad=10)
    ax.grid(True, alpha=0.7)
    ax.set_xlim([-0.05, 1.05])
    ax.set_ylim([-0.05, 1.05])
    ax.spines[['top', 'right', 'left', 'bottom']].set_visible(False)
    ax.legend(fontsize=10, loc='upper left')

plt.tight_layout()
plt.show()

Technical Environment

This article uses Python 3.7 and scikit-learn 1.5. While the concepts discussed apply generally, specific code implementations may differ slightly between versions.

About Images

Unless otherwise noted, all images are created by the author, including licensed design elements from Canva Pro.

