
Correlation vs. Causation: Estimating the True Effect with Propensity Score Matching

Understanding cause and effect is a constant part of our work in Data Science, especially when we run A/B tests to understand the results of given variations across groups.

The problem is that the world is just… well, real. It's nice to imagine a controlled environment where we can isolate one variable and measure its effect. But what often happens is that life overtakes everything, and the next thing you know, your boss is asking you to compare the results of the latest campaign on customer spending.

But you didn't prepare the test data. All you have is ongoing data before and after the campaign.

Enter Propensity Score Matching

In simple words, Propensity Score Matching (PSM) is a statistical method used to determine whether a particular action (the "treatment") actually caused an effect.

Because we can't go back in time and see what would have happened if someone had made a different choice, we find a "twin" in the data: someone who looks almost exactly like the treated person but didn't take the treatment, and we compare their results instead. Finding these "statistical twins" lets us compare customers fairly, even when we didn't run a properly randomized experiment.

The Problem with Measurements

Naive estimates assume that the groups are identical at the start. When you compare the simple average of a treated group to that of a control group, you are measuring all the pre-existing differences that led people to choose the treatment in the first place, not just the treatment effect.

Let's say we want to test a new energy gel for runners. If we just compare everyone who used the gel to everyone who didn't, we ignore important factors like the experience levels and knowledge of the runners. The people who bought the gel may be more experienced, have better shoes, or have trained harder and been supervised by a professional. They were already “ready” to run fast anyway.

PSM acknowledges these differences and works like a scout:

  • Inspection Report: For every runner who used the gel, the scout writes down their stats: age, years of experience, and average training miles.
  • Finding a Twin: The scout then searches the group of runners who didn't use the gel for a "twin" with nearly identical stats.
  • Comparison: Now you compare the finishing times of these "twins".

Notice how we are now comparing similar groups: high achievers versus high achievers, low achievers versus low achievers. That way, we can separate out the other factors that could cause the desired effect (the confounders) and measure the real effect of the energy gel.

Good. Let's now learn how to apply this method step by step.

Step by step for PSM

We will now go over the steps needed to apply PSM to our data. This matters because it builds intuition about the logical steps to take when applying this technique to any dataset.

  1. The first step is to fit a simple Logistic Regression model. This well-known classification model predicts the probability that a subject belongs to the treatment group. In simple words: how inclined is this person to take the action being studied?
  2. From the first step, we add the propensity score (that predicted probability) as a new column in the dataset.
  3. Next, we use a nearest neighbors algorithm to scan the control group and find, for each treated user, the person with the closest score.
  4. As a "quality filter", we set a threshold (a caliper). If even the closest match is farther away than the caliper, we discard the pair. It is better to have a small, unbiased sample than a large, biased one.
  5. Finally, we test the matched pairs using the Standardized Mean Difference (SMD) to check whether the two groups are really comparable.

So let's code!

Data set

For the purpose of this exercise, I will generate a dataset of 1000 rows with the following variables:

  • Age of a person
  • Past expenses with this company
  • A binary flag indicating mobile phone usage
  • A binary flag indicating whether the person saw the ad
   age   past_spend  is_mobile  saw_ad
0   29   557.288206          1       1
1   45   246.829612          0       1
2   24   679.609451          0       0
3   67  1039.030017          1       1
4   20   323.241117          0       1

You can find the code that generated this dataset in the GitHub repository.
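The repository code is the source of truth for the dataset; purely as an illustration, a frame with this shape could be simulated along these lines (the distributions, coefficients, and seed below are my own assumptions, not the repository's actual parameters):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

age = rng.integers(18, 70, size=n)
is_mobile = rng.integers(0, 2, size=n)

# Hypothetical: treatment assignment depends on the covariates,
# which is exactly the self-selection bias PSM is meant to correct
logit = -3.0 + 0.04 * age + 0.8 * is_mobile
p_treat = 1 / (1 + np.exp(-logit))
saw_ad = rng.binomial(1, p_treat)

past_spend = rng.normal(500, 250, size=n).clip(min=0)

df = pd.DataFrame({
    "age": age,
    "past_spend": past_spend,
    "is_mobile": is_mobile,
    "saw_ad": saw_ad,
})
print(df.head())
```

Because `saw_ad` is driven by `age` and `is_mobile`, a naive comparison of the two groups is biased by construction, which is what makes this a useful testbed for PSM.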

Using the Code

Next, we will implement PSM using Python. Let's start importing modules.

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

Now, we can start by creating a propensity score.

Step 1: Calculating Propensity Scores

In this step, we simply fit a LogisticRegression model that takes the age, past_spend, and is_mobile variables and estimates the likelihood that a person saw the ad.

Our goal is not 99% forecasting accuracy; it is to balance the covariates, ensuring that the treated and control groups have similar average characteristics (such as age and past spend) so that any difference in outcome can be attributed to the treatment rather than to pre-existing differences.

# Step 1: Calculate the Propensity Scores

# Define covariates and treatment
covariates = ['age', 'past_spend', 'is_mobile']
treatment_col = 'saw_ad'

# 1. Estimate Propensity Scores (Probability of treatment)
lr = LogisticRegression()
X = df[covariates]
y = df[treatment_col]

# Fit a Logistic Regression
lr.fit(X, y)

# Store the probability of being in the 'Treatment' group
df['pscore'] = lr.predict_proba(X)[:, 1]

So, after fitting the model, we slice the predict_proba() output to keep only the second column: the probability of being in the treatment group (i.e., the prediction of saw_ad == 1).

Propensity score added to the dataset. Image by the author.

Next, we will separate the data into controls and tests.

  • Control: people who didn't see the ad.
  • Treatment: people who saw the ad.
# 2. Split into Treatment and Control
treated = df[df[treatment_col] == 1].copy()
control = df[df[treatment_col] == 0].copy()
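Before matching, it can be worth quantifying how imbalanced the raw groups actually are; a simple groupby on the covariates makes the self-selection visible. Below is a sketch with a tiny made-up frame standing in for the article's df:

```python
import pandas as pd

# Tiny made-up frame standing in for the article's df
df = pd.DataFrame({
    "age":        [25, 62, 40, 58, 30, 65],
    "past_spend": [200, 900, 400, 850, 250, 950],
    "saw_ad":     [0,   1,   0,   1,   0,   1],
})

# Mean of each covariate per group: large gaps reveal self-selection
raw_balance = df.groupby("saw_ad")[["age", "past_spend"]].mean()
print(raw_balance)
```

In this toy frame the ad viewers are older and already spend more, so a naive mean comparison would wildly overstate the ad's effect; that gap is what the matching step is there to close.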

Time to find statistical twins in this data.

Step 2: Finding matching pairs

In this step, we will use NearestNeighbors from Scikit-Learn to find similar pairs for us. The idea is simple.

  • Every observation now has a propensity score: its estimated probability of belonging to the treatment group, given the confounding variables.
  • For each observation in the treatment dataset, we take the one observation from the control dataset that is most similar to it.
  • We use pscore and age for the matching. It could have been the propensity score alone, but after inspecting the matched pairs, I realized that adding age gives better matches.
# 3. Use Nearest Neighbors to find matches
# We use a 'caliper', or a threshold to ensure matches aren't too far apart
caliper = 0.05
nn = NearestNeighbors(n_neighbors=1, radius=caliper)
nn.fit(control[['pscore', 'age']])

# Find the matching pairs
distances, indices = nn.kneighbors(treated[['pscore', 'age']])
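For clarity, kneighbors returns two arrays, each of shape (n_treated, 1): the distance from each treated point to its closest control point, and that point's positional index in the control frame. A tiny standalone example with made-up scores:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up propensity scores, one column as NearestNeighbors expects
control_scores = np.array([[0.10], [0.30], [0.52], [0.90]])
treated_scores = np.array([[0.31], [0.88]])

nn = NearestNeighbors(n_neighbors=1).fit(control_scores)
distances, indices = nn.kneighbors(treated_scores)

print(distances)  # distance from each treated score to its closest control
print(indices)    # positional index of that control row
```

Here the first treated unit (0.31) matches control row 1 (0.30) at distance 0.01, and the second (0.88) matches row 3 (0.90) at distance 0.02; both would survive a 0.05 caliper.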

Now that we have candidate pairs, we can apply the caliper to discard those that are not close enough.

Step 3: Filtering the Matches

The code snippet below filters distances and indices based on the caliper to identify valid matches, and collects the Pandas indices of the successfully matched control and treated observations. Any pair whose distance exceeds the limit is discarded.

We then simply concatenate both sets of observations that passed quality control.

# 4. Filter out matches that are outside our 'caliper' (quality control)
matched_control_idx = [control.index[i[0]] for d, i in zip(distances, indices) if d[0] <= caliper]
matched_treated_idx = [treated.index[i] for i, d in enumerate(distances) if d[0] <= caliper]

# Combine the matched pairs into a new balanced dataframe
matched_df = pd.concat([df.loc[matched_treated_idx], df.loc[matched_control_idx]])
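One detail the snippet above glosses over: this matching is done with replacement, so the same control unit can be the "twin" of several treated units, and matched_df may contain duplicate control rows. A quick way to gauge how much reuse occurred, sketched with a hypothetical index list standing in for matched_control_idx:

```python
import pandas as pd

# Hypothetical matched control indices; in the article this list comes
# from the nearest-neighbor step, which matches WITH replacement
matched_control_idx = [10, 42, 42, 7, 42, 10]

# How many times each control unit was reused as a "twin"
reuse = pd.Series(matched_control_idx).value_counts()
print(reuse)

n_total, n_unique = len(matched_control_idx), reuse.size
print(f"{n_total} matches use {n_unique} distinct control units")
```

If heavy reuse is a concern, matching without replacement (removing each matched control from the pool as it is used) is a common alternative, at the cost of weaker matches for the treated units processed last.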

That's it. We have a dataset of similar pairs of customers, some of whom saw the ad and some of whom didn't. And the best part is that we can now compare like groups and isolate the effect of the advertising campaign.

print(matched_df.saw_ad.value_counts())

saw_ad
1    532
0    532
Name: count, dtype: int64

Let's check if our model gives a good match.

Step 4: Testing

To test a PSM model, the best metrics are:

  • Standardized Mean Difference (SMD)
  • Check the standard deviation of the Propensity Score.
  • Visualize data overlap

Let's start by looking at propensity score statistics.

# Check standard deviation (variance around the mean) of the Propensity Score
matched_df[['pscore']].describe().T
Propensity score statistics. Image by the author.

These statistics suggest that our propensity score matching procedure created a dataset in which the treated and control groups have very similar propensity scores. The small standard deviation and the concentrated interquartile range (25%-75%) show good overlap and balance of the propensity scores. This is a good sign that the matching succeeded in bringing the covariate distributions of the treated and control groups closer together.

Moving on, to compare the means of the other covariates, like age and is_mobile, after matching on the propensity score, we can use the Standardized Mean Difference (SMD). A small SMD (usually less than 0.1, or 0.05 for a stricter standard) indicates that the covariate means are well balanced between the treated and control groups, suggesting a successful match.
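For reference, the standardized mean difference for a covariate uses the usual definition:

```latex
\mathrm{SMD} = \frac{\bar{x}_{T} - \bar{x}_{C}}{\sqrt{\tfrac{1}{2}\left(s_{T}^{2} + s_{C}^{2}\right)}}
```

where \(\bar{x}_{T}\), \(\bar{x}_{C}\) are the covariate means and \(s_{T}\), \(s_{C}\) the standard deviations in the treated and control groups.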

We will calculate the SMD using a custom function that takes the means and standard deviations of a given covariate in each group and computes the metric.

def calculate_smd(df, covariate, treatment_col):
    treated_group = df[df[treatment_col] == 1][covariate]
    control_group = df[df[treatment_col] == 0][covariate]

    mean_treated = treated_group.mean()
    mean_control = control_group.mean()
    std_treated = treated_group.std()
    std_control = control_group.std()

    # Pooled standard deviation
    pooled_std = np.sqrt((std_treated**2 + std_control**2) / 2)

    if pooled_std == 0:
        return 0 # Avoid division by zero if there's no variance
    else:
        return (mean_treated - mean_control) / pooled_std

# Calculate SMD for each covariate
smd_results = {}
for cov in covariates:
    smd_results[cov] = calculate_smd(matched_df, cov, treatment_col)

smd_df = pd.DataFrame.from_dict(smd_results, orient='index', columns=['SMD'])

# Interpretation of SMD values
for index, row in smd_df.iterrows():
    smd_value = row['SMD']
    if abs(smd_value) < 0.05:
        interpretation = "well-balanced (excellent)"
    elif abs(smd_value) < 0.1:
        interpretation = "reasonably balanced (good)"
    elif abs(smd_value) < 0.2:
        interpretation = "moderately balanced"
    else:
        interpretation = "poorly balanced"
    print(f"The covariate '{index}' has an SMD of {smd_value:.4f}, indicating it is {interpretation}.")
	SMD
age	        0.000000
past_spend	0.049338
is_mobile	0.000000

The covariate 'age' has an SMD of 0.0000, indicating it is well-balanced (excellent).
The covariate 'past_spend' has an SMD of 0.0493, indicating it is well-balanced (excellent).
The covariate 'is_mobile' has an SMD of 0.0000, indicating it is well-balanced (excellent).

SMD < 0.05 or < 0.1: this is generally considered well-balanced or excellent balance. Most researchers aim for an SMD below 0.1, and preferably below 0.05.

We can see that all our covariates pass this test!

Finally, let's examine the distribution overlap between control and treatment.

# Control and Treatment Distribution Overlays
# Map the hue column to readable labels so the legend cannot be mislabeled
plot_df = matched_df.assign(
    group=matched_df['saw_ad'].map({0: 'Control (0)', 1: 'Treated (1)'})
)

plt.figure(figsize=(10, 6))
sns.histplot(data=plot_df, x='past_spend', hue='group', kde=True, alpha=.4)
plt.title('Distribution of Past Spend for Treated vs. Control Groups')
plt.xlabel('Past Spend')
plt.ylabel('Count')
plt.show()
Distribution overlay: the curves should sit on top of each other with the same shape. Image by the author.

It looks good: the distributions overlap almost perfectly and have the same shape.

This is a sample of matched pairs. You can find the code to build this on GitHub.

A sample dataset of matched pairs. Image by the author.

With that, I believe we can conclude that this model works well, and we can continue to evaluate the results.

Results

Okay, now that we have the same groups and distribution, let's move on to the results. We will examine the following:

  • Difference in means between the two groups
  • A t-test to check statistical significance
  • Cohen's d to measure the effect size

Here are the statistics for the simulated dataset.

Statistics on the final dataset. Image by the author.

After Propensity Score matching, the estimated causal effect of seeing the ad (saw_ad) on past_spend can be derived from the difference in means between the matched treated and control groups.

# Difference of averages
avg_past_spend_treated = matched_df[matched_df['saw_ad'] == 1]['past_spend'].mean()
avg_past_spend_control = matched_df[matched_df['saw_ad'] == 0]['past_spend'].mean()

past_spend_difference = avg_past_spend_treated - avg_past_spend_control

print(f"Average past_spend (Treated): {avg_past_spend_treated:.2f}")
print(f"Average past_spend (Control): {avg_past_spend_control:.2f}")
print(f"Difference in average past_spend: {past_spend_difference:.2f}")
  • Average past_spend (Treated Group): 541.97
  • Average past_spend (Control group): 528.14
  • Difference in Average past_spend (Treated – Control): 13.82

This shows that, on average, users who saw the ad (treated) spent approximately 13.82 more than users who did not see the ad (control), after accounting for the observed covariates.

Let's check if the difference is statistically significant.

# T-Test
treated_spend = matched_df[matched_df['saw_ad'] == 1]['past_spend']
control_spend = matched_df[matched_df['saw_ad'] == 0]['past_spend']

t_stat, p_value = stats.ttest_ind(treated_spend, control_spend, equal_var=False)

print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.3f}")

if p_value < 0.05:
    print("The difference in past_spend between treated and control groups is statistically significant (p < 0.05).")
else:
    print("The difference in past_spend between treated and control groups is NOT statistically significant (p >= 0.05).")
T-statistic: 0.805
P-value: 0.421
The difference in past_spend between treated and control groups 
is NOT statistically significant (p >= 0.05).

The difference is not significant, given that the standard deviation within each group is still very high (~280) relative to the gap between the group means.

Let's also do an effect size calculation using Cohen's D.
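The function below uses the classic pooled-standard-deviation form of Cohen's d:

```latex
d = \frac{\bar{x}_{T} - \bar{x}_{C}}{s_{p}},
\qquad
s_{p} = \sqrt{\frac{(n_{T} - 1)\,s_{T}^{2} + (n_{C} - 1)\,s_{C}^{2}}{n_{T} + n_{C} - 2}}
```

Note that, unlike the SMD's simple average of the two variances, the pooled standard deviation here weights each group's variance by its sample size.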

# Cohen's D Effect measurement

def cohens_d(df, outcome_col, treatment_col):
    treated_group = df[df[treatment_col] == 1][outcome_col]
    control_group = df[df[treatment_col] == 0][outcome_col]

    mean1, std1 = treated_group.mean(), treated_group.std()
    mean2, std2 = control_group.mean(), control_group.std()
    n1, n2 = len(treated_group), len(control_group)

    # Pooled standard deviation
    s_pooled = np.sqrt(((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / (n1 + n2 - 2))

    if s_pooled == 0:
        return 0 # Avoid division by zero
    else:
        return (mean1 - mean2) / s_pooled

# Calculate Cohen's d for 'past_spend'
d_value = cohens_d(matched_df, 'past_spend', 'saw_ad')

print(f"Cohen's d for past_spend: {d_value:.3f}")

# Interpret Cohen's d
if abs(d_value) < 0.2:
    interpretation = "negligible effect"
elif abs(d_value) < 0.5:
    interpretation = "small effect"
elif abs(d_value) < 0.8:
    interpretation = "medium effect"
else:
    interpretation = "large effect"

print(f"This indicates a {interpretation}.")
Cohen's d for past_spend: 0.049
This indicates a negligible effect.

The effect is small, suggesting a negligible treatment effect on past_spend in this matched sample.

With that, we conclude this article.

Before You Go

Causal Inference is the area of Data Science that gives us the reasons why something is happening, instead of merely telling us that two things tend to happen together.

Many times, you will face the challenge of understanding why something works (or doesn't) in a business. Companies love this kind of insight, even more so when it can save money or increase sales.

Just remember the basic steps to create your model.

  1. Run Logistic Regression to calculate propensity scores
  2. Separate data into Control and Treatment
  3. Run Nearest Neighbors to find good matches between the control and treatment groups, so you can isolate the true effect.
  4. Evaluate the match quality using SMD
  5. Calculate your results.
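The recap above can be condensed into one small function. This is only a sketch under my own assumptions (matching on the propensity score alone, a synthetic demo frame, and a 0.05 caliper), not the article's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def propensity_match(df, covariates, treatment_col, caliper=0.05):
    """Minimal PSM sketch: score, split, 1-NN match, caliper filter."""
    # 1. Propensity scores from a logistic regression
    lr = LogisticRegression(max_iter=1000).fit(df[covariates], df[treatment_col])
    df = df.assign(pscore=lr.predict_proba(df[covariates])[:, 1])

    # 2. Split into treatment and control
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]

    # 3. One nearest neighbor on the propensity score (with replacement)
    nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
    dist, idx = nn.kneighbors(treated[["pscore"]])

    # 4. Keep only pairs whose distance is within the caliper
    keep = dist[:, 0] <= caliper
    return pd.concat([treated[keep], control.iloc[idx[keep, 0]]])

# Hypothetical demo data: older users are more likely to see the ad
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(18, 70, 500),
    "is_mobile": rng.integers(0, 2, 500),
})
demo["saw_ad"] = rng.binomial(1, 1 / (1 + np.exp(-(demo["age"] - 44) / 15)))

matched = propensity_match(demo, ["age", "is_mobile"], "saw_ad")
counts = matched["saw_ad"].value_counts()
print(counts)  # one matched control per surviving treated unit
```

The returned frame pairs each surviving treated unit with exactly one control, so the two group counts are equal by construction; the SMD and outcome comparisons from the article can then be run on it unchanged.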

If you liked this content, find out more about me on my website.

GitHub Repository

