Model Calibration, Explained: A Visual Guide with Code Examples for Beginners | by Samy Baladram | January 2025
MODEL EVALUATION & OPTIMIZATION
When all models have the same accuracy, then what?
You've trained several classification models, and they all seem to perform well with high accuracy scores. Congratulations!
But hold on: is one model really better than the others? Accuracy alone doesn't tell the whole story. What if one model consistently overstates its confidence, while another understates it? This is where model calibration comes in.
Here, we'll see what model calibration is and explore how to assess the reliability of your models' probability scores, using visuals and working code examples to show you how to identify calibration problems. Get ready to go beyond accuracy and unlock the true potential of your machine learning models!
Model calibration measures how well a model's predicted probabilities match its actual performance. A model that gives a 70% probability score should be correct 70% of the time for such predictions. In other words, its probability scores should reflect the true likelihood that its predictions are correct.
Why Calibration Is Important
While accuracy tells us how often the model is correct overall, calibration tells us whether we can trust its probability scores. Two models may both have 90% accuracy, yet one may provide realistic probability scores while the other provides overconfident predictions. In many real applications, having reliable probability scores is just as important as having correct predictions.
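To make this concrete, here is a small sketch with made-up numbers (the labels and probabilities below are purely illustrative): two models that make identical class predictions, and therefore share the same accuracy, can still attach very different probability scores to those predictions.
import numpy as np

# Hypothetical labels and probability scores for ten test samples (illustration only)
y_true = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0])
probs_a = np.array([0.70, 0.30, 0.80, 0.70, 0.20, 0.70, 0.80, 0.30, 0.70, 0.60])  # moderate scores
probs_b = np.array([0.99, 0.01, 0.99, 0.99, 0.01, 0.99, 0.99, 0.01, 0.99, 0.99])  # extreme scores

# Both models predict the same classes at a 0.5 threshold, so their accuracy is identical
print(np.mean((probs_a >= 0.5) == y_true))  # 0.9
print(np.mean((probs_b >= 0.5) == y_true))  # 0.9
Accuracy cannot separate these two models, but the calibration metrics introduced below can: the second model's near-certain score on the sample it gets wrong will cost it dearly.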
Perfect Calibration vs. Reality
A well-calibrated model shows a close match between its predicted probabilities and its actual success rate: if it predicts something with 90% probability, it should be correct 90% of the time. The same holds for every probability level.
However, most models are not completely calibrated. They can be:
- Overconfident: giving probability scores that are too high for their actual performance
- Underconfident: giving probability scores that are too low for their actual performance
- Both: overconfident in some probability ranges and underconfident in others
This mismatch between predicted probabilities and actual accuracy can lead to poor decisions when these models are used in real applications. This is why understanding and improving model calibration is essential for building reliable machine learning systems.
To examine model calibration, we'll continue with the same dataset used in my previous articles on classification: predicting whether someone will play golf or not based on weather conditions.
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Create and prepare dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy',
'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast',
'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True,
True, False, True, True, False, False, True, False, True, True, False,
True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes',
'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
# Prepare data
df = pd.DataFrame(dataset_dict)
Before training our models, we normalized the numerical weather measurements with standard scaling and transformed the categorical features with one-hot encoding. These preprocessing steps ensure that all models can use the data effectively while keeping the comparison between them fair.
from sklearn.preprocessing import StandardScaler
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Rearrange columns
column_order = ['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']
df = df[column_order]
# Prepare features and target
X,y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
# Scale numerical features
scaler = StandardScaler()
X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])
X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])
Models and Training
In this exploration, we trained four classification models that reach similar accuracy scores:
- K nearest neighbors (kNN)
- Bernoulli Naive Bayes
- Logistic Regression
- Multi-Layer Perceptron (MLP)
For those who want to know how these algorithms make predictions and compute their probabilities, you can check this article:
Although these models achieved similar accuracy for this simple problem, they calculated their prediction probabilities differently.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB

# Initialize the models with the found parameters
knn = KNeighborsClassifier(n_neighbors=4, weights='distance')
bnb = BernoulliNB()
lr = LogisticRegression(C=1, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(4, 2),random_state=42, max_iter=2000)
# Train all models
models = {
'KNN': knn,
'BNB': bnb,
'LR': lr,
'MLP': mlp
}
for name, model in models.items():
    model.fit(X_train, y_train)
# Create predictions and probabilities for each model
results_dict = {
'True Labels': y_test
}
for name, model in models.items():
    # results_dict[f'{name} Pred'] = model.predict(X_test)
    results_dict[f'{name} Prob'] = model.predict_proba(X_test)[:, 1]
# Create results dataframe
results_df = pd.DataFrame(results_dict)
# Print predictions and probabilities
print("nPredictions and Probabilities:")
print(results_df)
# Print accuracies
print("nAccuracies:")
for name, model in models.items():
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"{name}: {accuracy:.3f}")
With these differences in mind, let's explore why we need to look beyond accuracy.
To assess how well a model's predicted probabilities match its actual performance, we use several methods and metrics. These measurements help us understand whether our model's confidence levels are reliable.
Brier Score
The Brier Score measures the mean squared difference between predicted probabilities and actual outcomes. It ranges from 0 to 1, where lower scores indicate better calibration. This score is particularly useful because it captures both calibration and accuracy together.
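As a quick sketch with illustrative numbers (assuming binary labels coded 0/1 and predicted positive-class probabilities), the Brier Score can be computed by hand or with scikit-learn's brier_score_loss:
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([1, 0, 1, 1, 0])            # illustrative labels
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4])  # illustrative predicted probabilities

# Mean squared difference between predicted probabilities and actual outcomes
print(np.mean((y_prob - y_true) ** 2))   # 0.092
print(brier_score_loss(y_true, y_prob))  # 0.092, same value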
Log Loss
Log Loss calculates the negative log-likelihood of the actual outcomes under the predicted probabilities. This metric is especially sensitive to confident but incorrect predictions: if the model is 90% sure but wrong, it receives a much larger penalty than if it were 60% sure and wrong. Lower values indicate better calibration.
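A minimal sketch of that asymmetry, again with made-up probabilities: a confident mistake costs far more than a hesitant one.
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0])                # one positive, one negative sample (illustrative)
confident_wrong = np.array([0.1, 0.2])   # only 10% sure on the positive sample: confidently wrong
hesitant_wrong = np.array([0.4, 0.2])    # 40% sure on the same sample: wrong, but less confident

print(log_loss(y_true, confident_wrong))  # ~1.26, heavy penalty
print(log_loss(y_true, hesitant_wrong))   # ~0.57, milder penalty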
Expected Calibration Error (ECE)
ECE measures the average difference between predicted probability and the observed frequency of positives, weighted by how many predictions fall into each probability bin. This metric helps us understand whether our model has a systematic bias in its probability estimates.
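Concretely, this is the binning approach implemented by the calculate_ece helper used in the full code at the end of the article; it is reproduced here (with added comments) so the plotting code below runs as written.
import numpy as np

def calculate_ece(y_true, y_prob, n_bins=5):
    # Split the [0, 1] probability range into equal-width bins
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0
    for bin_lower, bin_upper in zip(bins[:-1], bins[1:]):
        # Select the predictions whose probability falls in this bin
        mask = (y_prob >= bin_lower) & (y_prob < bin_upper)
        if np.sum(mask) > 0:
            bin_conf = np.mean(y_prob[mask])  # average predicted probability in the bin
            bin_acc = np.mean(y_true[mask])   # observed frequency of positives in the bin
            # Weight the gap by the number of predictions in the bin
            ece += np.abs(bin_conf - bin_acc) * np.sum(mask)
    return ece / len(y_true)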
Reliability Diagrams
Closely related to ECE, a reliability diagram (or calibration curve) visualizes a model's calibration by binning its predictions and comparing them with the actual outcomes. While ECE summarizes the calibration error in a single number, the reliability diagram shows the same information graphically. We use the same binning approach and compute the observed frequency of positives in each bin. When plotted, these points show exactly where the model's predictions deviate from perfect calibration, which appears as the diagonal line.
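In scikit-learn, the binned points of a reliability diagram come from calibration_curve, which the plotting code below relies on. Assuming y_prob holds one model's predicted positive-class probabilities on the test set, the call looks like this:
from sklearn.calibration import calibration_curve

# prob_pred: mean predicted probability per bin; prob_true: observed frequency of positives per bin
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=5, strategy='uniform')
Plotting prob_pred against prob_true and comparing the result with the diagonal gives the reliability diagram.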
Comparing Calibration Metrics
Each of these metrics captures a different aspect of calibration problems:
- A high Brier Score indicates poor overall probability estimates.
- High Log Loss points to overconfident incorrect predictions.
- A high ECE indicates a systematic bias in the probability estimates.
Together, these metrics give us a complete picture of how well our model's probabilities reflect its true performance.
Our Models
For our models, let's calculate the calibration metrics and draw their calibration curves:
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Initialize models
models = {
'k-Nearest Neighbors': KNeighborsClassifier(n_neighbors=4, weights='distance'),
'Bernoulli Naive Bayes': BernoulliNB(),
'Logistic Regression': LogisticRegression(C=1.5, random_state=42),
'Multilayer Perceptron': MLPClassifier(hidden_layer_sizes=(4, 2), random_state=42, max_iter=2000)
}
# Get predictions and calculate metrics
metrics_dict = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_test)[:, 1]
    metrics_dict[name] = {
        'Brier Score': brier_score_loss(y_test, y_prob),
        'Log Loss': log_loss(y_test, y_prob),
        'ECE': calculate_ece(y_test, y_prob),  # helper defined in the ECE section above
        'Probabilities': y_prob
    }
# Plot calibration curves
fig, axes = plt.subplots(2, 2, figsize=(8, 8), dpi=300)
colors = ['orangered', 'slategrey', 'gold', 'mediumorchid']
for idx, (name, metrics) in enumerate(metrics_dict.items()):
    ax = axes.ravel()[idx]
    prob_true, prob_pred = calibration_curve(y_test, metrics['Probabilities'],
                                             n_bins=5, strategy='uniform')
    ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
    ax.plot(prob_pred, prob_true, color=colors[idx], marker='o',
            label='Calibration curve', linewidth=2, markersize=8)
    title = f'{name}\nBrier: {metrics["Brier Score"]:.3f} | Log Loss: {metrics["Log Loss"]:.3f} | ECE: {metrics["ECE"]:.3f}'
    ax.set_title(title, fontsize=11, pad=10)
    ax.grid(True, alpha=0.7)
    ax.set_xlim([-0.05, 1.05])
    ax.set_ylim([-0.05, 1.05])
    ax.spines[['top', 'right', 'left', 'bottom']].set_visible(False)
    ax.legend(fontsize=10, loc='upper left')
plt.tight_layout()
plt.show()
Now, let's analyze the calibration of each model based on these metrics:
The k-Nearest Neighbors (kNN) model does a good job of estimating how confident it should be in its predictions. Its calibration curve stays close to the diagonal, indicating good calibration. It posts solid scores: a Brier score of 0.148 and an excellent ECE of 0.090. Although it is sometimes overconfident in the middle range, it generally gives reliable estimates of its certainty.
The Bernoulli Naive Bayes model shows an unusual staircase pattern in its curve, jumping between levels of certainty instead of changing smoothly. Although it has the same Brier score as kNN (0.148), its higher ECE of 0.150 shows it is less accurate at estimating its certainty, alternating between overconfidence and underconfidence.
The Logistic Regression model shows clear calibration problems. Its curve strays far from the diagonal, meaning its stated confidence often does not match its actual performance. It has the worst ECE (0.181) and a poor Brier score (0.164). The model is consistently overconfident in its predictions, which makes it unreliable.
The Multilayer Perceptron presents a different problem. Despite having the best Brier score (0.129), its curve reveals that it mostly makes extreme predictions, either very certain or very uncertain, with little in between. Its high ECE (0.167) and the flat curve in the middle range indicate that it struggles to produce accurate confidence estimates.
After evaluating all four models, k-Nearest Neighbors does the best job of estimating the certainty of its predictions. It maintains consistent calibration across different probability levels and shows the most reliable pattern in its predictions. While other models may score well on a particular metric (such as the Multilayer Perceptron's Brier score), their curves reveal that they cannot be trusted when we need accurate probability estimates.
When choosing between models, we need to consider both accuracy and calibration quality. A model with slightly lower accuracy but better calibration may be more valuable than a more accurate model with unreliable probability estimates.
By understanding calibration and its importance, we can build more reliable machine learning systems that users can trust not only for their predictions, but also for the confidence attached to those predictions.
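Finally, if a model you otherwise like turns out to be poorly calibrated, its probabilities can often be improved after the fact. As a hedged sketch (not part of the comparison above), scikit-learn's CalibratedClassifierCV wraps an estimator and recalibrates its probabilities using Platt scaling or isotonic regression:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# Wrap a classifier and learn a sigmoid (Platt) mapping on cross-validation folds
base = LogisticRegression(C=1, random_state=42)
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=3)
calibrated.fit(X_train, y_train)

# Probabilities from the wrapped model; compare their Brier/Log Loss/ECE to the raw model's
y_prob_calibrated = calibrated.predict_proba(X_test)[:, 1]
method='isotonic' is the non-parametric alternative, although it typically needs more data than this toy dataset provides. For reference, the full code for evaluating a single model and for comparing all four models follows.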
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Define ECE
def calculate_ece(y_true, y_prob, n_bins=5):
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0
    for bin_lower, bin_upper in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= bin_lower) & (y_prob < bin_upper)
        if np.sum(mask) > 0:
            bin_conf = np.mean(y_prob[mask])
            bin_acc = np.mean(y_true[mask])
            ece += np.abs(bin_conf - bin_acc) * np.sum(mask)
    return ece / len(y_true)
# Create dataset and prepare data
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast','sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy','sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast','rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True,True, False, True, True, False, False, True, False, True, True, False,True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes','Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes','Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
# Prepare and encode data
df = pd.DataFrame(dataset_dict)
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
df = df[['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']]
# Split and scale data
X, y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
scaler = StandardScaler()
X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])
X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])
# Train model and get predictions
model = BernoulliNB()
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]
# Calculate metrics
metrics = {
'Brier Score': brier_score_loss(y_test, y_prob),
'Log Loss': log_loss(y_test, y_prob),
'ECE': calculate_ece(y_test, y_prob)
}
# Plot calibration curve
plt.figure(figsize=(6, 6), dpi=300)
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=5, strategy='uniform')
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
plt.plot(prob_pred, prob_true, color='slategrey', marker='o',
label='Calibration curve', linewidth=2, markersize=8)
title = f'Bernoulli Naive Bayes\nBrier: {metrics["Brier Score"]:.3f} | Log Loss: {metrics["Log Loss"]:.3f} | ECE: {metrics["ECE"]:.3f}'
plt.title(title, fontsize=11, pad=10)
plt.grid(True, alpha=0.7)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.gca().spines[['top', 'right', 'left', 'bottom']].set_visible(False)
plt.legend(fontsize=10, loc='lower right')
plt.tight_layout()
plt.show()
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Define ECE
def calculate_ece(y_true, y_prob, n_bins=5):
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0
    for bin_lower, bin_upper in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= bin_lower) & (y_prob < bin_upper)
        if np.sum(mask) > 0:
            bin_conf = np.mean(y_prob[mask])
            bin_acc = np.mean(y_true[mask])
            ece += np.abs(bin_conf - bin_acc) * np.sum(mask)
    return ece / len(y_true)
# Create dataset and prepare data
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast','sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy','sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast','rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True,True, False, True, True, False, False, True, False, True, True, False,True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes','Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes','Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
# Prepare and encode data
df = pd.DataFrame(dataset_dict)
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
df = df[['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']]
# Split and scale data
X, y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
scaler = StandardScaler()
X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])
X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])
# Initialize models
models = {
'k-Nearest Neighbors': KNeighborsClassifier(n_neighbors=4, weights='distance'),
'Bernoulli Naive Bayes': BernoulliNB(),
'Logistic Regression': LogisticRegression(C=1.5, random_state=42),
'Multilayer Perceptron': MLPClassifier(hidden_layer_sizes=(4, 2), random_state=42, max_iter=2000)
}
# Get predictions and calculate metrics
metrics_dict = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_test)[:, 1]
    metrics_dict[name] = {
        'Brier Score': brier_score_loss(y_test, y_prob),
        'Log Loss': log_loss(y_test, y_prob),
        'ECE': calculate_ece(y_test, y_prob),
        'Probabilities': y_prob
    }
# Plot calibration curves
fig, axes = plt.subplots(2, 2, figsize=(8, 8), dpi=300)
colors = ['orangered', 'slategrey', 'gold', 'mediumorchid']
for idx, (name, metrics) in enumerate(metrics_dict.items()):
    ax = axes.ravel()[idx]
    prob_true, prob_pred = calibration_curve(y_test, metrics['Probabilities'],
                                             n_bins=5, strategy='uniform')
    ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
    ax.plot(prob_pred, prob_true, color=colors[idx], marker='o',
            label='Calibration curve', linewidth=2, markersize=8)
    title = f'{name}\nBrier: {metrics["Brier Score"]:.3f} | Log Loss: {metrics["Log Loss"]:.3f} | ECE: {metrics["ECE"]:.3f}'
    ax.set_title(title, fontsize=11, pad=10)
    ax.grid(True, alpha=0.7)
    ax.set_xlim([-0.05, 1.05])
    ax.set_ylim([-0.05, 1.05])
    ax.spines[['top', 'right', 'left', 'bottom']].set_visible(False)
    ax.legend(fontsize=10, loc='upper left')
plt.tight_layout()
plt.show()
Technical Environment
This article uses Python 3.7 and scikit-learn 1.5. While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions.
About Images
Unless otherwise noted, all images are created by the author, including licensed design elements from Canva Pro.
See more Model Evaluation & Optimization methods here:
Model Evaluation & Optimization
You might also like: