Your Churn Threshold Is a Pricing Decision

When a churn model says “this customer’s probability of leaving is 0.4” and your code does predict_proba(X)[:, 1] >= 0.5, you have just made a pricing decision: you decided that the cost of sending a retention offer to a customer who would have stayed is exactly equal to the cost of losing one who would have left. On the IBM Telco dataset (arguably the most-recycled churn dataset on Kaggle and GitHub), that decision is wrong by a factor of 13.

I assembled a corpus of 36 publicly available IBM Telco churn analyses (Kaggle notebooks, GitHub repositories, blog posts, peer-reviewed papers), and the reporting pattern is striking: roughly nine in ten report classification accuracy or F1, about one in seven report a profit curve, and none use survival analysis to compute lifetime value.

The result is a literature where the same dataset has been re-modelled hundreds of times, and every default-threshold model leaves money on the floor: about $86 per customer in avoidable burn on the standard 20% test split. Scaled to a 100,000-subscriber book with the same churn profile, that would represent $8.6 million in recoverable cost. The IBM Telco churn rate (26.6% annual) is unusually high, and a healthier B2C SaaS book with 5–8% annual churn would see the per-customer figure drop by roughly 3–4×. What stays invariant across any cost-sensitive setting is not the headline dollar amount but the asymmetry: it is 13× more expensive to miss a churner than to over-treat a loyalist.

Image by author.

This piece lays out three things, in order: first, what the IBM Telco literature reports and what it leaves out; second, how to compute the dollar cost of a misclassification using public 2026 B2C SaaS benchmarks and Kaplan-Meier survival analysis, with no hand-waved CAC; third, why the textbook Bayes-optimal threshold formula loses to a brute-force sweep when the model is trained on SMOTE-balanced data, and what to do about it.

Every number in this article is reproducible from the scripts described in the note at the end.

1. The 36-article gap

The IBM Telco Customer Churn dataset is small (7,032 cleaned rows), tidy, labelled, and has been the canonical introductory churn dataset on Kaggle for nearly a decade.

To get a feel for what the public corpus actually measures, I indexed 36 analyses across Kaggle, GitHub, and the major data-science blogs, scoring each one on ten reporting dimensions ranging from F1 score to a CAC-and-LTV-grounded profit curve.

The pattern that emerged is in Figure 2.

Coverage of ten reporting dimensions across 36 publicly available IBM Telco churn analyses. The high bars are saturated, and the low bars are where the field has work to do. — Image by author.

Three findings worth pulling out:

Saturated: F1, accuracy, AUC, confusion-matrix screenshots, and SMOTE-vs-no-SMOTE comparisons appear in 80–90% of the corpus, with hyperparameter tuning via Optuna or grid search a near-universal trope.
Uncommon: a profit curve (total dollar cost of misclassifications as a function of decision threshold) appears in fewer than 15% of the analyses I reviewed, and when it does appear, the FN/FP cost numbers are usually picked from a textbook example without anchoring them to real CAC or LTV.
Absent: none of the 36 analyses I indexed compute customer lifetime value via survival analysis on tenure; most either skip LTV entirely or use the steady-state Skok formula LTV = ARPU / monthly_churn_rate, which assumes a homogeneous customer base — a strong claim for a dataset where contract type, payment method, and tenure all materially shift retention.

Skipping survival analysis matters because the threshold decision is a function of LTV: if you misjudge LTV by 2x you misjudge the cost of a missed churner by 2x, and the cost-optimal threshold moves with it.

The next two sections build the missing piece, then plug it back into the threshold problem.

2. The cost of an error, in dollars

Three numbers determine the dollar cost of every prediction (ARPU, gross margin, and CAC): ARPU comes straight out of the dataset, while gross margin and CAC come from public industry benchmarks.

import pandas as pd

df = pd.read_csv("telco.csv")
# TotalCharges is read as text; blank strings become NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
# drop the rows without TotalCharges, leaving 7,032 cleaned rows
df = df.dropna().reset_index(drop=True)

arpu = df["MonthlyCharges"].mean()                  # $64.80 per month
mean_tenure = df["tenure"].mean()                   # 32.42 months
churn_rate = (df["Churn"] == "Yes").mean()          # 0.2658, i.e. 26.58%
realised_ltv_churned = df.loc[df["Churn"] == "Yes",
                              "TotalCharges"].mean() # $1,531.80

ARPU and tenure are dataset-native: mean monthly charge is $64.80, mean observed tenure is 32.4 months, and the realised LTV of churners is $1,531.80 (just the average TotalCharges over customers who already left), so holding ARPU fixed, three common LTV framings give very different ceilings:

Three lifetime-value framings, using the dataset’s mean ARPU:

Framing | Definition | LTV
Simple | ARPU × mean tenure | $2,100.87
Skok steady-state | ARPU ÷ monthly churn rate | $7,904.41
Realized | Mean TotalCharges of the already-churned cohort | $1,531.80
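The three framings can be reproduced from the rounded inputs quoted above. This is a minimal sketch, not the author's script: it assumes the monthly churn rate behind the Skok figure is the observed churn share divided by mean tenure, which is one reading that lands close to the quoted number and is not stated in the text.

```python
# Inputs quoted in the article (rounded to the printed precision).
ARPU = 64.80             # mean MonthlyCharges, dollars per month
MEAN_TENURE = 32.42      # mean observed tenure, months
CHURN_SHARE = 0.2658     # share of customers with Churn == "Yes"
REALISED_LTV = 1531.80   # mean TotalCharges of already-churned customers

# ASSUMPTION: monthly churn is approximated by spreading the observed
# churn share evenly over the mean tenure.
monthly_churn = CHURN_SHARE / MEAN_TENURE

framings = {
    "simple": ARPU * MEAN_TENURE,       # ARPU x mean tenure
    "skok": ARPU / monthly_churn,       # ARPU / monthly churn rate
    "realised": REALISED_LTV,           # observed, churned cohort only
}

for name, value in framings.items():
    print(f"{name:>8}: ${value:,.2f}")
```

Because the inputs are rounded, the simple and Skok values land within about a dollar of the table rather than matching it to the cent.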

None of these three is the right answer, and the first one is actively wrong. ARPU × mean_tenure is the framing most tutorials reach for, and it is broken at the root. ARPU is not a property of a customer; it is the joint outcome of every feature that also drives churn — contract type, payment method, product bundle, household structure. The revenue leaves with the customer, so ARPU and tenure are not independent quantities, and the textbook decomposition LTV ≈ E[ARPU] × E[lifetime] is only valid when Cov(ARPU, lifetime) = 0. In any churn dataset worth modelling that covariance is non-zero — if it were zero, the model would have nothing to predict — and a customer with $80/month Monthly Charges and 8 months of tenure is not a “high-revenue customer”; their $80 and their 8 months are two readings of the same underlying risk profile.

Layer on the fact that ARPU varies within a single customer’s lifetime — a promo-onboarding cohort paying $30/month for the first three months and $80/month thereafter, a long-tenured loyalty customer grandfathered into a $50/month rate while new customers pay $90/month for the same product — and multiplying the mean of one distribution by the mean of another describes a customer who exists nowhere in the data.
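The covariance argument above can be checked on a handful of made-up customers. The five (ARPU, lifetime) pairs below are hypothetical, chosen so that high payers leave early; the sketch only illustrates the identity E[ARPU × lifetime] = E[ARPU] × E[lifetime] + Cov(ARPU, lifetime).

```python
from statistics import fmean

# Hypothetical toy customers: (monthly ARPU in dollars, lifetime in months).
# High payers leave early here, so the covariance is negative.
customers = [(90.0, 4), (80.0, 8), (70.0, 20), (50.0, 48), (40.0, 70)]

arpu = [a for a, _ in customers]
life = [m for _, m in customers]

mean_arpu, mean_life = fmean(arpu), fmean(life)
true_ltv = fmean(a * m for a, m in customers)   # E[ARPU x lifetime]
naive_ltv = mean_arpu * mean_life               # E[ARPU] x E[lifetime]

# Population covariance: exactly the gap between the two quantities.
cov = fmean((a - mean_arpu) * (m - mean_life) for a, m in customers)

print(f"true {true_ltv:.2f}, naive {naive_ltv:.2f}, covariance {cov:.2f}")
```

On these numbers the naive product overstates the average revenue per customer by the size of the covariance, which is the error the mean-times-mean framing bakes in.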

The Skok formula at least uses a steady-state churn rate, but it assumes that rate is constant forever. The realised LTV is a real number, but it only describes the customers you have already lost.

None of the three tells you when a new customer breaks even on acquisition cost — for that you need a real retention curve, which is the next section, but first a word on CAC.

For B2C SaaS in the 2026 benchmarks, CAC ranges from about $68 (eCommerce) to over $200 (fintech), with mid-market subscription products clustering around $150 ([1], [2]); telecom subscriber acquisition cost is materially higher ($300+ once handset subsidies are amortised), so $150 is a conservative anchor for this dataset, and picking a higher number would only make the burn calculation in Section 4 larger.

Where the $150 anchor sits on the 2026 B2C subscription CAC spectrum. Telecom (the actual industry of the dataset) clusters at $300–$500, so a $150 CAC is the conservative floor. — Image by author. Data: Genesys Growth Marketing (2026); Proven SaaS (2026).

Gross margin for B2C SaaS sits in the 70–85% band, with 75% as the usual midpoint that matches David Skok’s modelling assumptions for steady-state SaaS economics ([3]).

That gives us the building blocks for the cost of a single prediction error.

CAC = 150
ARPU = 64.80
REMAINING_TENURE_MONTHS = 18

FN_COST = CAC + ARPU * REMAINING_TENURE_MONTHS   # $1,316.40
FP_COST = 100              # typical campaign cost (midpoint)
ratio = FN_COST / FP_COST                        # 13.2 : 1

A false negative (predicting that a customer will stay when they actually leave) costs you the new acquisition spend ($150) plus 18 months of foregone revenue ($64.80 × 18 = $1,166.40), for $1,316.40 total. A false positive (flagging a customer as a churn risk when they were going to stay) costs roughly $100 of campaign and discount expense, which leaves a cost ratio of 13.2 to 1.

A note on what the framework is and is not. The $150 CAC and the $100 false-positive cost in this article are placeholders; CAC varies materially by acquisition channel, and the $100 is shorthand for whatever your real retention intervention costs — a discount, a CSM call, a bundle upgrade, a product investigation. None of these are interchangeable, and a blanket discount is not a retention strategy: it is a deferral mechanism that retains customers only until the discount expires while paying to retain customers who were never going to leave (and, worse, training them to expect the next discount). Real retention strategy maps churn drivers — a customer leaving because backup_online keeps failing is retained by fixing backup_online, not by a 10% off email — and allocates budget toward product improvement, with the campaign cost as a short-term bridge while engineering catches up. The profit curve here is a threshold-setting tool that operates after you have decided your retention playbook (what intervention applies to whom, at what real cost); it is not a substitute for that decision. Treat the $150 and the $100 as a single representative pair; segment them, and the framework segments with them.

That ratio is the entire reason threshold = 0.5 is the wrong default: the decision boundary should reflect the asymmetry, and we will get to the exact formula, but first comes the LTV piece.

3. The LTV profit curve

Most churn writeups treat lifetime value as a static dollar number you multiply by a hazard rate.

Survival analysis does better.

It measures retention directly from the data and turns LTV into a curve: the cumulative contribution margin per customer as a function of months since acquisition, starting at −CAC on day zero (you’ve paid to acquire the customer and earned nothing) and climbing as each surviving month adds ARPU × gross_margin × P(still alive) to the balance.

The Kaplan-Meier estimator does the heavy lifting, with tenure as the duration and Churn == "Yes" as the event, producing the overall curve in Figure 4.

from lifelines import KaplanMeierFitter
import numpy as np

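CAC, GROSS_MARGIN, HORIZON = 150, 0.75, 72   # ARPU is reused from Section 2

# Kaplan-Meier fit: tenure is the duration, Churn == "Yes" marks the event
kmf = KaplanMeierFitter().fit(df["tenure"], (df["Churn"] == "Yes").astype(int))
months = np.arange(0, HORIZON + 1)
S = kmf.survival_function_at_times(months).values   # P(still a customer) per month

monthly = ARPU * GROSS_MARGIN * S                   # expected margin earned each month
ltv_curve = np.cumsum(monthly) - CAC                # cumulative margin net of CAC
ltv_curve[0] = -CAC              # day zero, only CAC is sunk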

Figure 4 — Kaplan-Meier-derived LTV profit curve: cumulative contribution margin minus CAC, with -10% and -20% churn-reduction scenarios, and breakeven (cumulative margin crosses zero) sitting at month 3. — Image by author.
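For readers who want to see what the estimator computes, here is a dependency-free sketch of the Kaplan-Meier product-limit rule. It is an illustration on a hypothetical seven-customer cohort, not a replacement for lifelines: at each time with at least one churn event, survival is multiplied by (1 - events / customers still at risk), and censored customers simply leave the risk set.

```python
def kaplan_meier(durations, events):
    """Return {t: S(t)} for every time t at which at least one event occurred."""
    at_risk = len(durations)
    survival, s = {}, 1.0
    for t in sorted(set(durations)):
        # Customers whose tenure ends at t: churned (event) or censored.
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            s *= 1.0 - deaths / at_risk
            survival[t] = s
        # Both churned and censored customers drop out of the risk set.
        at_risk -= leaving
    return survival


# Hypothetical toy cohort: tenure in months, 1 = churned, 0 = still active.
tenure = [1, 2, 2, 3, 5, 5, 6]
churned = [1, 1, 0, 1, 0, 1, 0]

curve = kaplan_meier(tenure, churned)
for month, s in curve.items():
    print(f"month {month}: S = {s:.3f}")
```

The customer still active at month 6 never lowers the curve, which is how censoring keeps current subscribers from being counted as churners.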

Three readings worth pulling out:

Breakeven at month 3 (in expectation, not per customer): across the original acquired cohort, the survival-weighted cumulative contribution covers the $150 acquisition cost by month 3. CAC payback under twelve months is the David Skok rule of thumb, and Telco beats it by a factor of four. This is the right number for budgeting cohort-level retention spend, but it is a cohort average that hides bimodal variance: a customer who churns in month 1 individually contributes one month of margin ($48.60) and never recoups their CAC, while a 70-month survivor contributes about $3,400 in margin (70 × $48.60). The Kaplan-Meier weighting bakes those early losses in correctly; they just do not get a star on the curve.
LTV at the 72-month horizon ≈ $2,527 per customer, net of CAC: adding the $150 back gives about $2,677 of cumulative margin, an LTV:CAC ratio of about 17.8:1, well above the 3:1 floor most SaaS investors look for, and a useful sanity check that the dataset describes a healthy unit-economics business rather than a death-spiral one.
Churn-reduction uplift is modest at the cohort level: a 10% reduction in churn lifts terminal LTV by ~2.8% and a 20% reduction lifts it by ~5.7%, so the lift is real but not heroic, and the acquisition decision matters more than the retention intervention.
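The breakeven month and the churn-reduction scenarios can be sketched without any survival library once a survival curve is in hand. The text does not say how the -10% and -20% scenarios were built, so the sketch below assumes one common choice: scaling the hazard by (1 - reduction), which is the same as raising S to the power (1 - reduction). The survival curve is a hypothetical constant 2% monthly churn, not the Telco curve.

```python
def ltv_curve(survival, arpu, margin, cac):
    """Cumulative survival-weighted margin minus CAC, one value per month.

    survival[k] is P(still a customer at month k). Month 0 is pinned to -cac,
    mirroring the convention of the article's own curve.
    """
    curve, total = [], 0.0
    for s in survival:
        total += arpu * margin * s
        curve.append(total - cac)
    curve[0] = -cac
    return curve


def breakeven_month(curve):
    """First month with a non-negative cumulative balance, or None."""
    return next((m for m, v in enumerate(curve) if v >= 0), None)


def reduce_churn(survival, reduction):
    """ASSUMED scenario model: scale the hazard, i.e. S ** (1 - reduction)."""
    return [s ** (1.0 - reduction) for s in survival]


# Hypothetical survival curve: 2% monthly churn over months 0..72.
base = [0.98 ** k for k in range(73)]

for r in (0.0, 0.10, 0.20):
    curve = ltv_curve(reduce_churn(base, r), arpu=64.80, margin=0.75, cac=150)
    print(f"churn -{r:.0%}: breakeven month {breakeven_month(curve)}, "
          f"LTV at 72 months ${curve[-1]:,.0f}")
```

Swapping `base` for the Kaplan-Meier survival values would give the cohort-level version; the uplift percentages depend on the scenario model, so they need not match the figure if the author used a different one.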

Segmenting the same calculation by Contract (Figure 5) is where the framework earns its keep.

Figure 5 — LTV profit curve segmented by contract type: same CAC, same ARPU, with the difference being pure retention. — Image by author.

A two-year-contract customer is worth about $3,372 over the 72-month horizon, while a month-to-month customer is worth $1,620 (less than half), with the same ARPU and the same CAC, so the entire delta is retention. From a marketing-spend perspective, the right acquisition target (the customer you should be willing to pay more to acquire) is the contract-locked one, even though they look “less profitable per month” in any given snapshot.

This is the kind of decision the standard IBM Telco analysis cannot make, because it never computes survival-conditional LTV in the first place.

4. The classification profit curve

With FN cost, FP cost, and survival-based LTV in hand, the threshold question becomes a one-dimensional optimization: train a model, get predicted probabilities on the test set, sweep the threshold from 0 to 1, compute total dollar cost at each threshold, and pick the minimum.

The model here is a tuned XGBoost trained with SMOTE on the train fold only, the standard Telco recipe.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

y = (df["Churn"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["customerID", "Churn"]),
                   drop_first=True).astype(float)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
X_tr_b, y_tr_b = SMOTE(random_state=42).fit_resample(X_tr_s, y_tr)

model = XGBClassifier(n_estimators=400, max_depth=5, 
                     learning_rate=0.05,
                     subsample=0.9, colsample_bytree=0.9,
                     random_state=42, eval_metric="logloss")
model.fit(X_tr_b, y_tr_b)
probs = model.predict_proba(X_te_s)[:, 1]

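thresholds = np.linspace(0.01, 0.99, 99)
totals = []
for t in thresholds:
    pred = (probs >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    # dollar cost at this threshold, using the Section 2 error costs
    totals.append(fn * FN_COST + fp * FP_COST)

best_t = thresholds[int(np.argmin(totals))]   # empirical cost-minimising threshold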

The result is in Figure 6.

Figure 6 — Classification profit curve on the IBM Telco test set: false-negative cost (red) dominates because of the 13:1 ratio, and the default 0.5 threshold sits well to the right of the cost-minimum. — Image by author.

The numbers, on a 1,407-row test set:

Cost of three classification thresholds on the 1,407-row test set:

Threshold | Total cost | False negatives | False positives | Note
0.50 (default) | $199,575 | 139 | 166 | Balanced, but the wrong metric
0.07 (Bayes-optimal) | $87,076 | not reported | not reported | What theory predicts should win
0.03 (empirical minimum) | $78,415 | 7 | 692 | Beats the theoretical optimum by $8,661

Moving from 0.5 to the empirical minimum recovers $121,160 on the test set, which is $86.11 per customer, and applying that to a 100,000-subscriber book gives the headline $8.6M; your mileage will vary with your CAC, your ARPU, and your retention curve, but the multiplier (13× more expensive to miss a churner than to over-treat a loyalist) is what makes the gap large.
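The headline arithmetic is short enough to check by hand. The sketch below only re-derives the figures quoted above from the table totals; the book size is the 100,000-subscriber scenario used in the text.

```python
TEST_ROWS = 1_407
COST_AT_DEFAULT = 199_575   # total cost at threshold 0.50 (from the table)
COST_AT_MINIMUM = 78_415    # total cost at threshold 0.03 (from the table)
BOOK_SIZE = 100_000         # subscriber count used for the headline figure

saved_on_test = COST_AT_DEFAULT - COST_AT_MINIMUM
saved_per_customer = saved_on_test / TEST_ROWS
saved_on_book = saved_per_customer * BOOK_SIZE

print(f"${saved_on_test:,} on the test set")
print(f"${saved_per_customer:.2f} per customer")
print(f"${saved_on_book / 1e6:.1f}M on a {BOOK_SIZE:,}-subscriber book")
```

The extrapolation is linear, so it holds only for a book with the same churn profile and the same error costs as the test split.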

When the textbook formula loses to the threshold sweep

Open any cost-sensitive classification reference (Provost and Fawcett’s Data Science for Business is the canonical one [4]) and you will find the Bayes-optimal threshold formula:

t* = C_FP / (C_FP + C_FN)

Plug in our cost ratio: t* = 100 / (100 + 1316.40) ≈ 0.0706. The math is correct, and on a model with calibrated probabilities that threshold minimizes expected cost. But the sweep gives t = 0.03, and at that threshold the test-set cost is $8,661 lower than at 0.07, so where does the gap come from?
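The formula is a break-even condition: treat a customer when the expected cost of doing nothing, p × C_FN, is at least the expected cost of a wasted offer, (1 - p) × C_FP. The sketch below makes that explicit, assuming (as the sweep does) that correct predictions cost nothing and that p is a calibrated churn probability. The function names are illustrative.

```python
def treat_cost(p, c_fp):
    """Expected cost of sending the offer: wasted only if the customer stays."""
    return (1.0 - p) * c_fp


def skip_cost(p, c_fn):
    """Expected cost of doing nothing: incurred only if the customer leaves."""
    return p * c_fn


def bayes_threshold(c_fp, c_fn):
    """Probability at which the two expected costs are equal."""
    return c_fp / (c_fp + c_fn)


C_FP, C_FN = 100.0, 1316.40
t_star = bayes_threshold(C_FP, C_FN)

print(f"t* = {t_star:.4f}")
for p in (0.03, t_star, 0.40):
    print(f"p = {p:.3f}: treat ${treat_cost(p, C_FP):.2f} "
          f"vs skip ${skip_cost(p, C_FN):.2f}")
```

Below t* skipping is cheaper, above it treating is cheaper, and at t* the two costs coincide. The rule is only as good as the probabilities fed into it.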

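The Bayes-optimal formula assumes the model’s predicted probabilities are calibrated: a prediction of 0.5 should correspond to a 50% true churn probability. Our model is trained on a SMOTE-balanced set, which inflates the minority class to 50% during training, so its scores are not anchored to the 26.6% base rate, and boosted trees can distort them further at the extremes. Resampling alone would push scores upward, which would move the break-even score above 0.07 rather than below it. The fact that the sweep lands at 0.03 means that customers scored between 0.03 and 0.07 churned often enough on this test set to be worth treating, so in that low band the scores understate the real risk. The textbook formula isn’t wrong, it is being applied to an out-of-spec input.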

There are two clean fixes:

Calibrate the probabilities first: apply Platt scaling or isotonic regression on a held-out set, then use the Bayes-optimal threshold on the calibrated output, with scikit-learn’s CalibratedClassifierCV doing it in one line.
Skip calibration and sweep: it is cheap, it tolerates calibration drift, and on a small dataset like Telco the test-set sweep is more reliable than a calibration model fit on hundreds of held-out rows.
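The first fix can be sketched with scikit-learn as follows. This is an illustration on synthetic stand-in data with a logistic-regression base model, not the article's SMOTE-plus-XGBoost pipeline; the point is only the order of operations: calibrate on data with the natural class balance, then apply the cost-based threshold to the calibrated probabilities.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 25% positives (NOT the Telco data).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Any scikit-learn-compatible classifier can stand in for the base model.
base = LogisticRegression(max_iter=1000)

# Isotonic calibration, fitted with 5-fold cross-validation on the train fold.
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)

# Cost-based threshold applied to the calibrated probabilities.
C_FP, C_FN = 100.0, 1316.40
t_star = C_FP / (C_FP + C_FN)
probs = calibrated.predict_proba(X_te)[:, 1]
flagged = probs >= t_star

print(f"threshold {t_star:.4f} flags {flagged.mean():.1%} of the test rows")
```

If resampling is part of the model, it would need to live inside the estimator being calibrated (for example an imbalanced-learn pipeline) so that the calibration folds keep their natural class balance; that wiring is left out here.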

In practice, for production systems with regular re-training, the sweep is what most teams ship; the formula is the right thing to teach (with calibration as the asterisk), and both should appear in any honest writeup of a cost-sensitive churn model.

Neither one shows up in the IBM Telco corpus I indexed.

5. What the next IBM Telco article should report

Four concrete shifts would make the next 36 IBM Telco analyses more useful than the last 36:

Report a profit curve, not a confusion matrix. F1 at threshold 0.5 is a tournament metric (useful for ranking models when you have to pick one, useless for deciding how to ship one), and the curve in Figure 6 has more decision-relevant information than every accuracy comparison in the corpus combined.

Anchor LTV in survival analysis, not steady-state assumptions. Kaplan-Meier on tenure is 30 lines of Python; the breakeven number, the LTV at horizon, and the contract-segmented curve give marketing operations a usable budget to spend on retention plus a defensible answer to “which customer should we acquire harder?”, and the Skok formula remains a fine sanity check rather than the load-bearing LTV estimate.

Disclose the calibration assumption when you quote a Bayes-optimal threshold. Either calibrate first or note explicitly that the threshold reported is the empirical minimum from a sweep. Wang et al. ([5]) make a closely related argument with a more elaborate metric (e-Profits) that uses survival analysis end-to-end and rests on the same core idea.

Segment the intervention, not just the score. A profit curve assumes a single FP cost (the price of “treating” one false alarm), but in practice the cheapest intervention for a high-value loyalist is different from the cheapest for an at-risk new account: a bundle upgrade for one, a price-pain investigation for the other, do-nothing for the third; segment-aware FP costs (and segment-aware thresholds) are the natural follow-up to the framework here.
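A segment-aware version of the threshold rule might look like the sketch below. The segment names and dollar costs are hypothetical placeholders, not estimates from the dataset, and the rule assumes calibrated churn probabilities.

```python
# HYPOTHETICAL segments and costs, purely for illustration.
SEGMENT_COSTS = {
    "two_year_loyalist": {"fp": 40.0, "fn": 2500.0},
    "one_year_mid": {"fp": 70.0, "fn": 1500.0},
    "month_to_month_new": {"fp": 100.0, "fn": 900.0},
}


def segment_threshold(costs):
    """Break-even churn probability for one segment's FP and FN costs."""
    return costs["fp"] / (costs["fp"] + costs["fn"])


thresholds = {seg: segment_threshold(c) for seg, c in SEGMENT_COSTS.items()}


def should_treat(segment, churn_probability):
    """Treat when the calibrated churn probability clears the segment threshold."""
    return churn_probability >= thresholds[segment]


for seg, t in thresholds.items():
    print(f"{seg}: treat above {t:.3f}")
```

The same 8% churn score would trigger an intervention for the loyalist segment but not for the new month-to-month account, because the two segments carry different error costs.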

I went into this expecting the gap would be in the modeling. It isn’t.

The IBM Telco dataset has been mined to bedrock for predictive accuracy, and what it can still teach is whether our pipelines lead to good decisions, not just accurate predictions.

That requires three things: dollar costs on errors, real retention curves on customers, and an honest threshold on the classifier — four scripts and a Kaplan-Meier fit get you there.

References

[1] Genesys Growth Marketing, Customer Acquisition Cost Benchmarks: 44 Statistics Every Marketing Leader Should Know in 2026 (2026), genesysgrowth.com.

[2] Proven SaaS, CAC Payback Benchmarks 2026: SaaS Customer Acquisition Cost (2026), proven-saas.com.

[3] D. Skok, SaaS Metrics 2.0: Detailed Definitions (2014, updated 2024), forentrepreneurs.com.

[4] F. Provost and T. Fawcett, Data Science for Business (2013), O’Reilly Media, ch. 7–8.

[5] Y. Wang, S. Albrecht, et al., e-Profits: A Business-Aligned Evaluation Metric for Profit-Sensitive Customer Churn Prediction (2025), arXiv:2507.08860.

[6] C. Davidson-Pilon, lifelines: survival analysis in Python (2019), Journal of Open Source Software 4(40), 1317.

[7] S. Höppner, E. Stripling, B. Baesens, S. vanden Broucke and T. Verdonck, Profit driven decision trees for churn prediction (2020), European Journal of Operational Research, 284(3).

[8] N. El Attar and M. El-Hajj, A systematic review of customer churn prediction approaches in telecommunications (2026), Frontiers in Artificial Intelligence.

Code, data, and reproducible scripts for every figure are available on request. The dataset is the IBM Telco Customer Churn dataset, fully synthetic sample data published by IBM in its official repository (github.com/IBM/telco-customer-churn-on-icp4d) under the Apache License 2.0, which permits use, derivative analysis, and publication with attribution. The data is synthetic and contains no real customers or PII.