Why MLOps Retraining Schedules Fail — Models Don’t Forget, They Get Shocked

Most production ML models don’t decay smoothly — they fail in sudden, unpredictable shocks. When we fit an exponential “forgetting curve” to 555,000 production-like fraud transactions, it returned R² = −0.31, meaning it performed worse than predicting the mean.

Before you set (or trust) any retraining schedule, run this 3-line diagnostic on your existing weekly metrics:

report = tracker.report()
print(report.forgetting_regime)   # "smooth" or "episodic"
print(report.fit_r_squared)       # < 0.4 → abandon schedule assumptions
  • R² ≥ 0.4 → Smooth regime → scheduled retraining works
  • R² < 0.4 → Episodic regime → use shock detection, not calendars

If your R² is below 0.4, your model isn’t “decaying” — and everything derived from a half-life estimate is likely misleading.
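Under the hood, that diagnostic is a bounded exponential fit plus a coefficient of determination. Here is a minimal, self-contained version using SciPy — a sketch of the idea, not the tracker's actual internals:

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_forgetting_curve(weekly_metrics):
    """Fit R(t) = R0 * exp(-lam * t) to weekly metrics; return (lam, R_squared)."""
    t = np.arange(len(weekly_metrics), dtype=float)
    y = np.asarray(weekly_metrics, dtype=float)

    def decay(t, r0, lam):
        return r0 * np.exp(-lam * t)

    # Bound lam >= 0 so the fit is a genuine decay curve, not a growth curve
    (r0, lam), _ = curve_fit(decay, t, y, p0=[y[0], 0.01],
                             bounds=([0.0, 0.0], [1.5, 1.0]))
    residuals = y - decay(t, r0, lam)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return lam, 1.0 - ss_res / ss_tot
```

On a smoothly decaying series this returns an R² near 1; on a zigzag series like the fraud recalls below, the best decay curve barely beats a flat line and R² collapses toward zero or below.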

It Started With One Week

It was Week 7.

Recall dropped from 0.9375 to 0.7500 in seven days flat. No alert fired. The aggregate monthly metric moved a few points — well within tolerance. The dashboard showed green.

That single week erased three weeks of model improvement. Dozens of fraud cases that the model used to catch walked straight through undetected. And the standard exponential decay model — the mathematical backbone of every retraining schedule I had ever built — did not just fail to predict it.

It predicted the opposite.

R² = −0.31. Worse than a flat line.

That number broke something in how I think about MLOps. Not dramatically. Quietly. The kind of break that makes you go back and re-examine an assumption you have been carrying for years without ever questioning it.

This article is about that assumption, why it is wrong for an entire class of production ML systems, and what to do instead — backed by real numbers on a public dataset you can reproduce in an afternoon.

Full code:

The Assumption Nobody Questions

The entire retraining-schedule industry is built on a single idea borrowed from a 19th-century German psychologist.

In 1885, Hermann Ebbinghaus conducted a series of self-experiments on human memory — memorising lists of nonsense syllables, measuring his own retention at fixed intervals, and plotting the results over time [1]. What he documented was a clean exponential relationship:

R(t) = e^(−t/S)

where R is memory retention, t is elapsed time, and S is the relative strength of the memory.
Memory fades smoothly. Predictably. At a rate proportional to how much memory remains. The curve became one of the most replicated findings in cognitive psychology and remains a foundational reference in memory research to this day.

A century later, the machine learning community borrowed it wholesale. The logic felt sound: a production model is exposed to new patterns over time that it was not trained on, so its performance degrades gradually and continuously. Set a retraining cadence based on the decay rate. Estimate a half-life. Schedule accordingly.
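The half-life arithmetic behind that logic is one line: if performance decays as exp(−λt), the half-life is ln 2 / λ. A tiny helper (hypothetical name, standard math):

```python
import math

def half_life_days(decay_rate_per_day: float) -> float:
    """Half-life of an exponential decay R(t) = R0 * exp(-lam * t), in days."""
    return math.log(2.0) / decay_rate_per_day

# The diagnostic report later in this article fits lam = 0.000329 per day,
# which is where its 2,107-day half-life figure comes from:
print(half_life_days(0.000329))   # ≈ 2107 days
```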

Every major MLOps platform, every “retrain every 30 days” rule of thumb, every automated decay calculator, is downstream of this assumption.

The problem is that nobody verified it against production data.

So I did.

The Experiment

I used the Kaggle Credit Card Fraud Detection dataset created by Kartik Shenoy [2] — a synthetic dataset of 1.85 million transactions generated using the Sparkov Data Generation tool [3], covering the period January 2019 to December 2020. The test split contains 555,719 transactions spanning June to December 2020, with 2,145 confirmed fraud cases (0.39% prevalence).

The simulation was designed to mirror a realistic production deployment:

  • Model: LightGBM [4] trained once on historical data, never retrained during the test period
  • Primary metric: Recall — in fraud detection, a missed fraud costs orders of magnitude more than a false alarm, making recall the operationally correct objective [5]
  • Evaluation: Weekly rolling windows on the hold-out test set, each window containing between 15,000 and 32,000 transactions
  • Quality filters: Windows with fewer than 30 confirmed fraud cases were excluded — below that threshold, weekly recall estimates are statistically unreliable due to the extreme class imbalance

The baseline was established using the mean of the top-3 recall values across the first six qualifying weeks — a method designed to ignore early warm-up noise while tracking near-peak performance.
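In code, the windowing, quality filter, and baseline look roughly like this. This is a sketch: the column names ts, is_fraud, and pred are assumptions for illustration, not the dataset's exact schema, and the helpers are hypothetical rather than the tracker's internals:

```python
import pandas as pd

def weekly_recall_windows(df: pd.DataFrame, min_fraud: int = 30) -> pd.DataFrame:
    """Weekly recall windows with the minimum-fraud quality filter.

    Assumes columns: ts (datetime), is_fraud (0/1 ground truth),
    pred (0/1 model flag).
    """
    d = df.assign(hit=((df["is_fraud"] == 1) & (df["pred"] == 1)).astype(int))
    g = d.groupby(pd.Grouper(key="ts", freq="W"))
    out = pd.DataFrame({"n": g.size(),
                        "fraud": g["is_fraud"].sum(),
                        "caught": g["hit"].sum()})
    out["recall"] = out["caught"] / out["fraud"]
    # Windows with too few confirmed frauds give unreliable recall estimates
    return out[out["fraud"] >= min_fraud]

def top3_baseline(weekly_recalls, window: int = 6) -> float:
    """Mean of the top-3 recall values across the first `window` weeks."""
    top = sorted(list(weekly_recalls)[:window], reverse=True)[:3]
    return sum(top) / len(top)
```

Applied to the first six weekly recalls in the table below, top3_baseline lands within rounding of the 0.8807 baseline shown in the diagnostic report.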

26 Weeks of Production Performance

Here is what the full simulation produced across 26 qualifying weekly windows:

Week  1  [2020-06-21]  n=19,982   fraud=  68  R=0.7647
Week  2  [2020-06-28]  n=20,025   fraud= 100  R=0.8300
Week  3  [2020-07-05]  n=20,182   fraud=  83  R=0.7831
Week  4  [2020-07-12]  n=19,777   fraud=  52  R=0.8462
Week  5  [2020-07-19]  n=19,898   fraud=  99  R=0.8586
Week  6  [2020-07-26]  n=19,733   fraud=  64  R=0.9375   ← peak
Week  7  [2020-08-02]  n=20,023   fraud= 152  R=0.7500   ← worst shock (−0.1875)
Week  8  [2020-08-09]  n=19,637   fraud=  82  R=0.7439
Week  9  [2020-08-16]  n=19,722   fraud=  59  R=0.7966
Week 10  [2020-08-23]  n=19,605   fraud= 102  R=0.8922
Week 11  [2020-08-30]  n=18,081   fraud=  84  R=0.8690
Week 12  [2020-09-06]  n=16,180   fraud=  67  R=0.7910
Week 13  [2020-09-13]  n=16,087   fraud=  63  R=0.8413
Week 14  [2020-09-20]  n=15,893   fraud=  90  R=0.7444
Week 15  [2020-09-27]  n=16,009   fraud=  81  R=0.8272
Week 16  [2020-10-04]  n=15,922   fraud= 121  R=0.8264
Week 17  [2020-10-11]  n=15,953   fraud= 111  R=0.8559
Week 18  [2020-10-18]  n=15,883   fraud=  53  R=0.9245   ← recovery
Week 19  [2020-10-25]  n=15,988   fraud=  73  R=0.8630
Week 20  [2020-11-01]  n=15,921   fraud=  70  R=0.7429   ← second shock
Week 21  [2020-11-08]  n=16,098   fraud=  59  R=0.9322   ← recovery
Week 22  [2020-11-15]  n=15,835   fraud=  63  R=0.9206
Week 23  [2020-11-22]  n=15,610   fraud=  91  R=0.9121
Week 24  [2020-11-29]  n=30,246   fraud=  57  R=0.8596   ← volume doubles
Week 25  [2020-12-06]  n=31,946   fraud= 114  R=0.7895
Week 26  [2020-12-13]  n=31,789   fraud=  67  R=0.8507

Two windows were excluded: the week of December 20 (only 20 fraud cases) and December 27 (zero fraud cases recorded — a data artefact consistent with the holiday period).


The exponential decay model fits worse than a flat mean line — R² = −0.309 is the operative number here, not the slope. Image by author.

What −0.31 Actually Means

R² — the coefficient of determination — measures how much variance in the observed data is explained by the fitted model [6].

  • R² = 1.0: Perfect fit. The model explains all observed variance.
  • R² = 0.0: The model does no better than predicting the mean of the data for every point.
  • R² < 0.0: The model is actively harmful — it introduces more prediction error than a flat mean line would.

When the exponential decay model returned R² = −0.3091 on this dataset, it was not fitting poorly. It was fitting backwards. The model predicts a gentle slope declining from a stable peak. The data shows repeated sudden drops and recoveries with no consistent directional trend.

This is not a decay curve. It is a seismograph.
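Negative R² falls straight out of the definition. A minimal computation on synthetic numbers (not the article's data) makes the three cases concrete:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

observed  = [0.90, 0.75, 0.92, 0.74]     # zigzag, like the weekly recalls
decay_fit = [0.93, 0.91, 0.89, 0.87]     # a gentle downward slope
flat_mean = [sum(observed) / 4] * 4

print(r_squared(observed, decay_fit))    # negative: worse than the mean
print(r_squared(observed, flat_mean))    # ≈ 0.0 by construction
```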

Two Regimes of Model Forgetting

After observing this pattern, I formalised a classification framework based on the R² of the exponential fit. Two regimes emerge cleanly:

[Figure: side-by-side comparison of the two forgetting regimes. Left panel, Smooth: exponential decay curve, R² ≥ 0.4, calendar retraining fix. Right panel, Episodic: step-drop pattern with shock markers, R² < 0.4, shock detection fix.]
The R² of the exponential fit is the only number you need to decide which toolbox to open — everything else follows from it. Image by author.

The smooth regime is the world Ebbinghaus described. Feature distributions shift gradually — demographic changes, slow economic cycles, seasonal behaviour patterns that evolve over months. The exponential model fits the observed data well. The half-life estimate is actionable. A scheduled retraining cadence is the correct operational response.

The episodic regime is what fraud detection, content recommendation, supply chain forecasting, and any domain with sudden external discontinuities actually looks like in production. Performance does not decay — it switches. A new fraud pattern emerges overnight. A platform policy change flips user behaviour. A competitor exits the market and their customers arrive with different characteristics. A regulatory change alters the transaction mix.

These are not points on a decay curve. They are discontinuities. And the R² diagnostic identifies which regime you are in before you commit to an operations strategy built on the wrong assumption.
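The branching logic is deliberately trivial once the R² is in hand (the 0.4 cutoff is the article's proposed boundary, not a universal constant, and the function name is illustrative):

```python
def classify_regime(fit_r_squared: float, cutoff: float = 0.4) -> str:
    """Classify a model's forgetting regime from its exponential-fit R-squared."""
    return "smooth" if fit_r_squared >= cutoff else "episodic"

print(classify_regime(0.72))    # smooth: trust the half-life, schedule retrains
print(classify_regime(-0.31))   # episodic: deploy shock detection instead
```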

This pattern extends beyond fraud detection — similar episodic behaviour appears in recommendation systems, demand forecasting, and user behaviour modelling.


[Figure: weekly retention ratio as a percentage of baseline recall, zigzagging between roughly 85% and 107%, with red markers on the weeks that fell below the 93% retrain threshold.]
Six threshold breaches across 26 weeks — each followed by a spontaneous recovery, with zero retraining in between. Image by author.

Why Fraud Detection Is Always Episodic

The Week 7 collapse was not random noise. Let us look at what the data actually shows.

Week 6 (July 26): 64 fraud cases. Recall = 0.9375. The model is near peak performance.

Week 7 (August 2): 152 fraud cases — 2.4 times more fraud than the previous week — and recall collapses to 0.7500. The model missed 38 frauds it would have detected seven days earlier.

A 137% increase in fraud volume in a single week does not represent a gradual distribution shift. It represents a regime change — a new fraud ring, a newly exploited vulnerability, or an organised campaign that the model had never encountered in its training data. The model’s learned patterns became suddenly insufficient, not gradually insufficient.

Then consider Week 24 (November 29). Transaction volume nearly doubles — from roughly 16,000 transactions per week to 30,246 — as the Thanksgiving and Black Friday period begins. Simultaneously, the fraud count drops to 57, giving a fraud rate of 0.19%, the lowest in the entire test period. The model encounters a volume-to-fraud ratio it has never seen. Recall holds at 0.860, but only because the absolute fraud count is low. The precision simultaneously collapses, flooding any downstream review queue with false positives.

Neither of these events is a point on a decay curve. Neither would be predicted by a retraining schedule. Both would be caught immediately by a single-week shock detector.

The Diagnostic Framework

The classification is a three-step process that can be applied to any existing performance log.

Step 1: Fit the forgetting curve and compute R²

from model_forgetting_curve import ModelForgettingTracker

tracker = ModelForgettingTracker(
    metric_name="Recall",
    baseline_method="top3",   # mean of top-3 early weeks
    baseline_window=6,        # weeks used to establish baseline
    retrain_threshold=0.07    # alert threshold: 7% drop from baseline
)

# log your existing weekly metrics
for week_recall in your_weekly_metrics:
    tracker.log(week_recall)

report = tracker.report()
print(f"Regime : {report.forgetting_regime}")
print(f"R²     : {report.fit_r_squared:.3f}")

Step 2: Branch on regime

if report.fit_r_squared >= 0.4:
    # SMOOTH — exponential model is valid
    print(f"Schedule retrain in {report.predicted_days_to_threshold:.0f} days")
    print(f"Half-life: {report.half_life_days:.1f} days")
else:
    # EPISODIC — exponential model is invalid, use shock detection
    print("Deploy shock detection. Abandon calendar schedule.")

Step 3: If episodic, replace the schedule with these three mechanisms

import pandas as pd
import numpy as np

recall_series = pd.Series(your_weekly_metrics)
fraud_counts  = pd.Series(your_weekly_fraud_counts)

# Mechanism 1 — single-week shock detector
rolling_mean  = recall_series.rolling(window=4).mean()
shock_flags   = recall_series < (rolling_mean * 0.92)

# Mechanism 2 — volume-weighted recall (more reliable than raw recall)
weighted_recall = np.average(recall_series, weights=fraud_counts)

# Mechanism 3 — two-consecutive-week trigger (reduces false retrain alerts)
breach = recall_series < (recall_series.mean() * (1 - 0.07))
retrain_trigger = breach & breach.shift(1, fill_value=False)

print(f"Shock weeks detected      : {shock_flags.sum()}")
print(f"Volume-weighted recall    : {weighted_recall:.4f}")
print(f"Retrain trigger activated : {retrain_trigger.any()}")

The threshold of 0.92 in the shock detector (alert when recall drops more than 8% below the 4-week rolling mean) and the retrain threshold of 0.07 relative to the long-run baseline are starting points, not fixed rules. Calibrate both against your domain’s cost asymmetry — the ratio of missed-fraud cost to false-alarm cost — and your labelling latency.


[Figure: rolling decay rate λ over a sliding 5-week window, with two clusters of spikes around days 60 and 100 separated by a stretch of near-zero bars.]
In a smooth-decay system, these bars climb steadily. Here they spike, collapse, and spike again — the visual signature of episodic drift. Image by author.

The Full Diagnostic Report

============================================================
  FORGETTING CURVE REPORT
============================================================
  Baseline Recall               : 0.8807  [top3]
  Current  Recall               : 0.8507
  Retention ratio               : 96.6%
  Decay rate  lambda            : 0.000329
  Half-life                     : 2107.7 days    ← statistically meaningless
  Forgetting speed              : STABLE
  Forgetting regime             : EPISODIC
  Curve fit R-squared           : -0.3091         ← the operative number
  Snapshots logged              : 26
  Retrain recommended NOW       : False
  Days until retrain alert      : 45.7            ← unreliable in episodic regime
  Recommended retrain date      : 2026-05-20      ← disregard in episodic regime
  Worst single-week drop        : Week 7  (−0.1875)
============================================================

The tension visible in this report is intentional and important. The system simultaneously reports Forgetting speed: STABLE (decay rate λ = 0.000329 implies a half-life of 2,107 days) and Forgetting regime: EPISODIC (R² = −0.31). Both are correct.

The average performance across 26 weeks is stable — current recall of 0.8507 against a baseline of 0.8807 gives a retention ratio of 96.6%, comfortably above the 93% retrain threshold. An aggregate monthly dashboard would show a model operating well within acceptable bounds.

The week-to-week behaviour is violently unstable. The worst single shock dropped recall by 18.75 points in seven days. Three separate weeks dropped below 75% recall. These events are entirely invisible in aggregate metrics and entirely unpredictable by a decay model with a 2,107-day half-life.

This is the core failure mode of calendar-based retraining: the granularity at which monitoring happens determines what is visible. Episodic shocks only become visible at weekly or sub-weekly resolution, with per-window fraud counts used to weight window quality.

[Figure: dashboard gauge showing 46 days until a recommended retrain, alongside the report statistics and a purple Regime: EPISODIC badge.]
The gauge says 46 days. The EPISODIC badge says ignore it. The half-life of 2,107 days makes the countdown number statistically meaningless. Image by author.

On the Choice of Baseline Method

One calibration decision materially affects the diagnostic: how the baseline recall is computed from the first N qualifying weeks.

Three methods are available, each with different sensitivity characteristics:

"mean" — arithmetic mean of the first N weeks. Appropriate when early-week performance is consistent. Sensitive to warm-up noise when the model has not yet encountered the full test distribution, which is common in fraud detection where label arrival is delayed.

"max" — peak performance in the first N weeks. The most conservative option: any subsequent drop below the historical peak is immediately visible. Risk: a single anomalously good week permanently inflates the baseline, generating false retrain alerts for weeks that are performing normally.

"top3" — mean of the top three values in the first N weeks. The method used in this simulation. It filters warm-up noise while preserving proximity to true peak performance. Recommended for imbalanced classification problems with delayed labelling.

The choice of baseline_window — how many weeks are included in baseline estimation — matters equally. Six weeks is the minimum for statistical stability at typical fraud prevalence rates. Fewer than six weeks risks a baseline dominated by early distributional artefacts.
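A sketch of all three methods side by side — a hypothetical helper that mirrors the tracker's baseline_method option, not its actual source:

```python
def compute_baseline(recalls, method: str = "top3", window: int = 6) -> float:
    """Baseline recall from the first `window` qualifying weeks."""
    early = list(recalls)[:window]
    if method == "mean":
        return sum(early) / len(early)
    if method == "max":
        return max(early)
    if method == "top3":
        top = sorted(early, reverse=True)[:3]
        return sum(top) / len(top)
    raise ValueError(f"unknown baseline method: {method!r}")
```

On the first six weekly recalls from the table above, "max" pins the baseline to Week 6's 0.9375 peak, while "top3" yields roughly 0.8808, close to the report's 0.8807 and far less sensitive to a single lucky week.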

What This Means for MLOps Practice

The practical implications break cleanly along regime lines.

In the smooth regime, calendar-based retraining is valid — but the cadence should be derived from the empirical decay rate, not from convention. A model with a half-life of 180 days should be retrained every 120 to 150 days. A model with a half-life of 30 days needs weekly retraining. The specific schedule should be calibrated to the point where retention falls to the threshold — not picked because monthly feels reasonable.
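In the smooth regime, the cadence falls out of the fitted decay rate directly: solve exp(−λt) = threshold for t. A minimal helper — only meaningful when R² says the exponential fit is valid, and the function name is an assumption:

```python
import math

def days_until_threshold(decay_rate_per_day: float,
                         retention_threshold: float = 0.93) -> float:
    """Days until retention exp(-lam * t) first falls to the threshold."""
    return math.log(1.0 / retention_threshold) / decay_rate_per_day

# With the report's fitted lam = 0.000329 per day and a 93% threshold:
print(days_until_threshold(0.000329))   # ≈ 221 days from full retention
```

Note that the tolerance drives the answer as much as the decay rate does; a tighter threshold shortens the cadence sharply, which is why the threshold should be a deliberate cost decision rather than a default.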

In the episodic regime, calendar-based retraining is operationally wasteful. A model that experiences sudden shocks but recovers will trigger scheduled retrains during stable recovery periods, wasting compute and labelling budget, while the actual shock events — the weeks that matter — occur between scheduled dates and go unaddressed until the next calendar trigger.

The replacement is not more frequent scheduled retraining. It is event-driven retraining triggered by the shock detection mechanisms described above: a sudden drop below the rolling mean, sustained across two consecutive windows, with confirmation that the drop is not a data artefact (volume check, fraud rate floor, labelling delay indicator).

This is the distinction that the R² diagnostic makes actionable: it tells you which toolbox to open.

Limitations

This analysis has several boundaries that should be stated explicitly.

The dataset is synthetic. The Kaggle fraud dataset used here was generated using the Sparkov simulation tool [3] and does not represent real cardholder or merchant data. The fraud patterns reflect the simulation’s generative model, not actual fraud ring behaviour. The episodic shocks observed may differ in character from those encountered in live production systems, where shocks often involve novel attack vectors that have no prior representation in training data.

A single domain. The two-regime classification framework is proposed based on analysis of fraud detection data and informal observation of other domains. A systematic study across multiple production ML systems, including healthcare risk models, content recommendation engines, and demand forecasting systems, would be required to validate the R² cutoff of 0.4 as a robust regime boundary.

Label availability assumption. The simulation assumes weekly recall is computable from ground-truth labels available within the same week. In many fraud systems, confirmed fraud labels arrive with delays of days to weeks as investigations complete. The shock detection mechanisms described here require adaptation for delayed labelling environments — specifically, the rolling window should be constructed from label-available transactions only, and the shock threshold should be widened to account for label-arrival variance.

The retrain threshold of 7%. This is a starting point, not a universal recommendation. The operationally correct threshold is a function of the cost ratio between missed fraud and false alarms in a specific deployment, which varies significantly across merchant categories, transaction values, and review team capacity.

Reproducing This Analysis

The complete implementation — ModelForgettingTracker, the fraud simulation, all four diagnostic charts, and the live tracking dashboard — is available at .

Requirements:

pip install pandas numpy scikit-learn matplotlib scipy lightgbm

Dataset: Download fraudTrain.csv and fraudTest.csv from the Kaggle dataset [2] and place them in the same directory as fraud_forgetting_demo.py.

Run:

python fraud_forgetting_demo.py

To apply the tracker to an existing performance log without the Kaggle dependency:

from model_forgetting_curve import load_from_dataframe

tracker = load_from_dataframe(
    df,
    metric_col="weekly_recall",
    metric_name="Recall",
    baseline_method="top3",
    retrain_threshold=0.07,
)

report = tracker.report()
print(f"Regime: {report.forgetting_regime}")
print(f"R²:     {report.fit_r_squared:.3f}")

figs = tracker.plot(save_dir="./charts", dark_mode=True)

Conclusion

The Ebbinghaus forgetting curve is a foundational result in cognitive psychology. As an assumption about production ML system behaviour, it is unverified for an entire class of domains where performance is driven by external events rather than gradual distributional drift.

The R² diagnostic presented here is a one-time, zero-infrastructure check that classifies a system’s forgetting regime from existing weekly performance logs. If R² ≥ 0.4, the exponential model is valid and a retraining schedule is the correct tool. If R² < 0.4, the model is in the episodic regime, the half-life is meaningless, and the retraining schedule should be replaced with event-driven shock detection.

On 555,000 real-synthetic transactions spanning six months of simulated production, the fraud detection model returned R² = −0.31. The exponential decay model performed worse than predicting the mean. The worst shock dropped recall by 18.75 points in seven days with no aggregate-level signal.

The conclusion is precise: scheduled retraining is a symptom of not knowing which regime you are in. Run the diagnostic first. Then decide whether a schedule makes sense at all.

Disclosure

The author has no financial relationship with Kaggle, the dataset creator, or any of the software libraries referenced in this article. All tools used — LightGBM, scikit-learn, SciPy, pandas, NumPy, and Matplotlib — are open-source projects distributed under their respective licences. The dataset used for this analysis is a publicly available synthetic dataset distributed under the database contents licence (DbCL) on Kaggle [2]. No real cardholder, merchant, or financial institution data was used. The ModelForgettingTracker implementation described and linked in this article is original work by the author, released under the MIT licence.

References

[1] Ebbinghaus, H. (1885). Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie. Duncker & Humblot, Leipzig. English translation: Ebbinghaus, H. (1913). Memory: A Contribution to Experimental Psychology (H. A. Ruger & C. E. Bussenius, Trans.). Teachers College, Columbia University.

[2] Shenoy, K. (2020). Credit Card Transactions Fraud Detection Dataset [Data set]. Kaggle. Retrieved from

Distributed under the Database Contents Licence (DbCL) v1.0.

[3] Mullen, B. (2019). Sparkov Data Generation [Software]. GitHub.

[4] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, 3146–3154.

[5] Dal Pozzolo, A., Caelen, O., Johnson, R. A., & Bontempi, G. (2015). Calibrating probability with undersampling for unbalanced classification. 2015 IEEE Symposium Series on Computational Intelligence, 159–166.

[6] Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78(3), 691–692.

[7] Tsymbal, A. (2004). The problem of concept drift: Definitions and related work (Technical Report TCD-CS-2004-15). Department of Computer Science, Trinity College Dublin.

[8] Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4), 44:1–44:37.

[9] Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., & Zhang, G. (2019). Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12), 2346–2363.

[10] Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … SciPy 1.0 Contributors. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272.

[11] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesneau, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

[12] Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95.
