Stop Blind Retraining: Use PSI to Build a Smart Monitoring Pipeline

You cleaned the data, engineered a few features, trained a model, then released it for the client to use.
That's a lot of work for a data scientist. But the job isn't done when the model hits the real world.
Everything looks perfect on your dashboard. But under the hood, something is wrong. Most models don't fail loudly. They don't “crash” like a buggy app. Instead, they just… drift.
Remember, you still need to monitor the model to ensure its results stay accurate.
One of the easiest ways to do that is to check whether the data is shifting.
In other words, you measure how the distribution of the new data hitting your model compares to the distribution of the data used to train it.
Why Models Don't Scream
When you deploy a model, you are betting that the future will look like the past. You expect the new data to follow patterns similar to the data used to train it.
Let's think about that for a minute: if I trained my model to recognize apples and oranges, what would happen if suddenly all my model gets are pineapples?
Yes, real-world data is messy. User behavior changes. Economic conditions shift. Even a small change to your data pipeline can mess things up.
If you wait for metrics like accuracy or RMSE to drop, you're already behind. Why? Because labels usually take weeks or months to arrive. You need a way to catch a problem before damage is done.
PSI: A Smoke Detector for Your Data
The Population Stability Index (PSI) is an old tool. It was born in the world of credit risk to monitor loan models.
The Population Stability Index (PSI) is a statistical measure based on information theory that quantifies how much a probability distribution differs from a reference probability distribution. [1]
It doesn't care about the accuracy of your model. It only cares about one thing: Is the data coming in today different from the data used during training?
This metric measures how much “weight” has moved between buckets. If your training data has 10% of users in a certain age group, but production has 30%, PSI will flag it.
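As a quick back-of-the-envelope check, here is what that single bucket would contribute to the overall score (the numbers below are just the 10% vs. 30% example, not real data):

```python
import math

ref_pct = 0.10  # 10% of training users fall in the bucket
new_pct = 0.30  # 30% of production users fall in the bucket

# One bucket's contribution to PSI: (new - ref) * ln(new / ref)
contribution = (new_pct - ref_pct) * math.log(new_pct / ref_pct)
print(f"Bucket contribution to PSI: {contribution:.4f}")  # ≈ 0.2197
```

That single bucket alone already contributes a sizable chunk to the total score.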
Translating the Score: What the Numbers Tell You
We generally follow these rules-of-thumb thresholds:
- PSI < 0.10: All is well. Your data is stable.
- 0.10 ≤ PSI < 0.25: Something is changing. You should probably investigate.
- PSI ≥ 0.25: A big change. Your model may be making wrong guesses.
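These thresholds are easy to wire into an alerting step. The helper below is an illustrative sketch (the function name and labels are mine, not a standard):

```python
def psi_alert(psi_value: float) -> str:
    """Map a PSI score to a rule-of-thumb alert level (illustrative)."""
    if psi_value < 0.10:
        return "stable"
    elif psi_value < 0.25:
        return "investigate"
    return "significant shift"

print(psi_alert(0.05))  # stable
print(psi_alert(0.18))  # investigate
print(psi_alert(0.40))  # significant shift
```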
The Code
The Python function below performs the following steps:
- Divide the data into “buckets” (quantiles).
- Calculate the percentage of data in each bucket for both your training set and your production set.
- Compare these percentages. If they are almost the same, PSI stays close to zero. The further apart they are, the higher the score.
Here is the code for the PSI calculation function.
import numpy as np

def psi(ref, new, bins=10):
    # Convert inputs to arrays
    ref, new = np.array(ref), np.array(new)
    # Generate `bins` equal-frequency buckets between 0% and 100%
    quantiles = np.linspace(0, 1, bins + 1)
    breakpoints = np.quantile(ref, quantiles)
    # Count the number of samples in each bucket
    ref_counts = np.histogram(ref, breakpoints)[0]
    new_counts = np.histogram(new, breakpoints)[0]
    # Convert counts to percentages
    ref_pct = ref_counts / len(ref)
    new_pct = new_counts / len(new)
    # If any bucket is empty, substitute a very small number
    # to prevent division by zero and log(0)
    ref_pct = np.where(ref_pct == 0, 1e-6, ref_pct)
    new_pct = np.where(new_pct == 0, 1e-6, new_pct)
    # Calculate PSI and return
    return np.sum((ref_pct - new_pct) * np.log(ref_pct / new_pct))
It's fast, cheap, and doesn't require “true” labels, meaning you don't have to wait weeks to accumulate enough outcomes to calculate metrics like RMSE. That is why it is a favorite in production.
PSI checks whether the data your model currently receives has changed significantly compared to the data used to build it. Comparing today's data to the baseline helps ensure that your model remains stable and reliable.
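A quick sanity check of that idea, using a compact version of the same PSI logic: two samples drawn from the same distribution should score near zero, while a sample whose mean has shifted should blow past the 0.25 threshold.

```python
import numpy as np

def psi(ref, new, bins=10):
    # Compact version of the PSI function from earlier
    breakpoints = np.quantile(np.asarray(ref), np.linspace(0, 1, bins + 1))
    ref_pct = np.histogram(ref, breakpoints)[0] / len(ref)
    new_pct = np.histogram(new, breakpoints)[0] / len(new)
    ref_pct = np.where(ref_pct == 0, 1e-6, ref_pct)
    new_pct = np.where(new_pct == 0, 1e-6, new_pct)
    return np.sum((ref_pct - new_pct) * np.log(ref_pct / new_pct))

rng = np.random.default_rng(42)
ref = rng.normal(0, 1, 5000)
same = rng.normal(0, 1, 5000)      # same distribution, new sample
shifted = rng.normal(1, 1, 5000)   # mean shifted by one std

print(f"PSI same:    {psi(ref, same):.4f}")     # close to 0
print(f"PSI shifted: {psi(ref, shifted):.4f}")  # well above 0.25
```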
When PSI Shines
- It's simple to implement yourself.
- It's cheap enough to run every day on every feature.
Where It Falls Short
- It depends on how you choose your buckets.
- It doesn't tell you why the data changed, only that it did.
- It looks at the features one by one.
- It may miss subtle interactions between multiple variables.
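The last two points are worth a small demonstration. In the toy setup below (invented for illustration), each feature's marginal distribution is unchanged, so per-feature PSI stays near zero, even though the relationship between the two features has completely reversed:

```python
import numpy as np

def psi(ref, new, bins=10):
    # Compact version of the PSI function from earlier
    breakpoints = np.quantile(np.asarray(ref), np.linspace(0, 1, bins + 1))
    ref_pct = np.histogram(ref, breakpoints)[0] / len(ref)
    new_pct = np.histogram(new, breakpoints)[0] / len(new)
    ref_pct = np.where(ref_pct == 0, 1e-6, ref_pct)
    new_pct = np.where(new_pct == 0, 1e-6, new_pct)
    return np.sum((ref_pct - new_pct) * np.log(ref_pct / new_pct))

rng = np.random.default_rng(0)
x1_ref = rng.normal(0, 1, 5000)
x2_ref = x1_ref + rng.normal(0, 0.1, 5000)    # x2 tracks x1

x1_new = rng.normal(0, 1, 5000)
x2_new = -x1_new + rng.normal(0, 0.1, 5000)   # relationship flipped

print(f"PSI x1: {psi(x1_ref, x1_new):.4f}")   # near zero
print(f"PSI x2: {psi(x2_ref, x2_new):.4f}")   # near zero
print(f"corr ref: {np.corrcoef(x1_ref, x2_ref)[0, 1]:.2f}")
print(f"corr new: {np.corrcoef(x1_new, x2_new)[0, 1]:.2f}")
```

Per-feature PSI gives both variables a clean bill of health, yet a model trained on the old relationship would be badly wrong.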
How Pro Teams Use It
Senior teams don't just look at a single PSI value. They track the trend over time.
A single spike can be a fluke. A steady upward crawl is a sign that it's time to retrain your model. Pair PSI with good old summary statistics (mean, variance) for the full picture.
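A minimal sketch of that habit, assuming you already log one PSI score per feature per day (the numbers and the alert rule are made up for illustration):

```python
# Hypothetical daily PSI scores for one feature (illustrative numbers)
daily_psi = [0.03, 0.04, 0.03, 0.35, 0.04,   # one-off spike on day 4
             0.05, 0.08, 0.12, 0.17, 0.24]   # steady upward crawl

def persistent_drift(scores, threshold=0.10, streak=3):
    """Flag only when PSI stays above threshold for `streak` days in a row."""
    run = 0
    for s in scores:
        run = run + 1 if s > threshold else 0
        if run >= streak:
            return True
    return False

print(persistent_drift(daily_psi[:5]))  # False: single spike, likely noise
print(persistent_drift(daily_psi))      # True: sustained drift, retrain
```

The streak rule ignores one-off spikes but catches the slow crawl that actually warrants retraining.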
Let's take a quick look at a toy example with simulated data. First, we generate random data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# 1. Generate Reference Data
X, y = make_regression(n_samples=1000, n_features=3, noise=5, random_state=42)
df = pd.DataFrame(X, columns=['var1', 'var2', 'var3'])
df['y'] = y
# Separate X and y
X_ref, y_ref = df.drop('y', axis=1), df.y
# View data head
df.head()
Then, we train the model.
# 2. Train Regression Model
model = LinearRegression().fit(X_ref, y_ref)
Now, let's generate the drifted data.
# 3. Generate the Drift Data
X, y = make_regression(n_samples=500, n_features=3, noise=5, random_state=42)
df2 = pd.DataFrame(X, columns=['var1', 'var2', 'var3'])
df2['y'] = y
# Add the drift: shift and rescale var1
# (slice X_ref.var1 so the lengths match the 500-row frame)
df2['var1'] = 5 + 1.5 * X_ref.var1[:len(df2)] + np.random.normal(0, 5, len(df2))
# Separate X and y
X_new, y_new = df2.drop('y', axis=1), df2.y
# View
df2.head()
Next, we use our psi function to calculate the score for each feature. You should notice a much larger PSI for one variable.
# 4. Calculate PSI for each feature
for v in df.columns[:-1]:
    psi_value = psi(X_ref[v], X_new[v])
    print(f"PSI Score for Feature {v}: {psi_value:.4f}")
PSI Score for Feature var1: 2.3016
PSI Score for Feature var2: 0.0546
PSI Score for Feature var3: 0.1078
And, finally, let's examine the impact the drift has on the predictions.
# 5. Generate predictions to see the impact
preds_ref = model.predict(X_ref[:5])
preds_drift = model.predict(X_new[:5])
print("\nSample Predictions (Reference vs Drifted):")
print(f"Ref Preds: {preds_ref.round(2)}")
print(f"Drift Preds: {preds_drift.round(2)}")
Sample Predictions (Reference vs Drifted):
Ref Preds: [-104.22 -57.58 -32.69 -18.24 24.13]
Drift Preds: [ 508.33 621.61 -241.88 13.19 433.27]
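To put a number on that damage, one could compare the model's RMSE on reference versus drifted inputs. The snippet below is a self-contained recreation of the example above (same seeds for the data, a generator of my choosing for the drift noise), not part of the original walkthrough:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# Recreate the reference data and model
X, y = make_regression(n_samples=1000, n_features=3, noise=5, random_state=42)
model = LinearRegression().fit(X, y)

# Drift var1 (column 0) the same way the example does
rng = np.random.default_rng(42)
X_drift = X.copy()
X_drift[:, 0] = 5 + 1.5 * X[:, 0] + rng.normal(0, 5, len(X))

rmse_ref = mean_squared_error(y, model.predict(X)) ** 0.5
rmse_drift = mean_squared_error(y, model.predict(X_drift)) ** 0.5
print(f"RMSE on reference data: {rmse_ref:.2f}")
print(f"RMSE on drifted data:   {rmse_drift:.2f}")
```

The error on drifted inputs is far worse, even though nothing in the model itself changed.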
We can also visualize the difference in the distributions. Let's create a simple function to plot overlaid histograms.
import matplotlib.pyplot as plt

def drift_plot(ref, new):
    # Overlay the two histograms: reference first, new in red on top
    plt.hist(ref)
    plt.hist(new, color='r', alpha=0.5)
    plt.show()

# Calculate PSI and plot each feature
for v in df.columns[:-1]:
    psi_value = psi(X_ref[v], X_new[v])
    print(f"PSI Score for Feature {v}: {psi_value:.4f}")
    drift_plot(X_ref[v], X_new[v])
Here are the results.

The difference is clearly greater for var1!
Before You Go
We've seen how easy it is to calculate PSI, and how it can show us where drift is occurring. We quickly identified var1 as the problematic variable. Monitoring your model without monitoring your data leaves a big blind spot.
We must ensure that the data distribution seen during training is still valid, so the model can keep applying the patterns it learned from the reference data to the new data.
Production ML is less about building the “perfect” model and more about maintaining consistency with reality.
The best models don't just predict well. They know when the world has changed.
If you liked this content, you can find more on my website.
GitHub Repository
The full code for this post.
References
[1] PSI Definition
[2] NumPy Histogram
[3] NumPy Linspace
[4] NumPy Where
[5] Make Regression data



