Building Effective Metrics for Defining Users | by Vladimir Zhyvov | January, 2025
Imagine you are an e-commerce platform that aims to personalize your email campaigns based on user activity over the past week. If the user is less active compared to previous weeks, you plan to send them a discount offer.
You have collected user stats and noticed the following for a user named John:
- John visited the platform for the first time 15 days ago.
- In the first 7 days (days 1-7), he made 9 visits.
- In the next 7-day window (days 2-8), he made 8 visits.
- Sliding the window one day at a time up to days 9-15 gives 9 values in total.
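The sliding 7-day windows described above can be sketched as follows. The `daily_visits` array is hypothetical: it is made up so that the window sums reproduce the numbers used in this article, and real data would come from your own event logs.

```python
import numpy as np

# Hypothetical daily visit counts for John's first 15 days (illustrative only)
daily_visits = np.array([2, 2, 1, 0, 2, 0, 2, 1, 0, 0, 3, 0, 2, 1, 0])

# Sum over every 7-day window: days 1-7, 2-8, ..., 9-15
window = np.ones(7, dtype=int)
weekly_visits = np.convolve(daily_visits, window, mode="valid")

visits = weekly_visits[:-1]               # the 8 earlier windows
num_visits_last_week = weekly_visits[-1]  # the latest window (days 9-15)
print(weekly_visits)
```

A 15-day history yields 15 - 7 + 1 = 9 windows, which is where the 9 values come from.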
Now, you want to check how extreme the latest value is compared to the previous ones.
import numpy as np
visits = np.array([9, 8, 6, 5, 8, 6, 8, 7])
num_visits_last_week = 6
Let's create the CDF of these values.
import numpy as np
import matplotlib.pyplot as plt

# Sorted unique visit counts and how often each one occurs
values, counts = np.unique(visits, return_counts=True)
probabilities = counts / counts.sum()
cdf = np.cumsum(probabilities)
plt.scatter(values, cdf, color='black', linewidth=10)
Now we need to recover a continuous function from these points. We will use spline interpolation.
from scipy.interpolate import make_interp_spline

x_new = np.linspace(values.min(), values.max(), 300)
spline = make_interp_spline(values, cdf, k=3)
cdf_smooth = spline(x_new)
plt.plot(x_new, cdf_smooth, label='Spline CDF', color='black', linewidth=4)
plt.scatter(values, cdf, color='black', linewidth=10)
plt.scatter(values[-2:], cdf[-2:], color='#f95d5f', linewidth=10, zorder=5)
plt.show()
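It's not bad, but there is a small problem between the red dots: a CDF must be monotonically non-decreasing, and a cubic spline does not guarantee that. Let's solve this with a Piecewise Cubic Hermite Interpolating Polynomial (PCHIP), which preserves the monotonicity of the data.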
from scipy.interpolate import PchipInterpolator

spline_monotonic = PchipInterpolator(values, cdf)
cdf_smooth = spline_monotonic(x_new)
plt.plot(x_new, cdf_smooth, color='black', linewidth=4)
plt.scatter(values, cdf, color='black', linewidth=10)
plt.show()
Okay, now it's complete.
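As a quick sanity check, the two interpolants can be tested numerically instead of by eye. This is a minimal sketch; `is_valid_cdf` is a hypothetical helper introduced here, not part of SciPy, and the tolerance is an arbitrary choice to absorb floating-point noise.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator, make_interp_spline

visits = np.array([9, 8, 6, 5, 8, 6, 8, 7])
values, counts = np.unique(visits, return_counts=True)
cdf = np.cumsum(counts) / counts.sum()

x_new = np.linspace(values.min(), values.max(), 300)
cubic = make_interp_spline(values, cdf, k=3)(x_new)
pchip = PchipInterpolator(values, cdf)(x_new)

def is_valid_cdf(y, tol=1e-12):
    """True if y never decreases and stays within [0, 1] (up to tol)."""
    monotone = np.all(np.diff(y) >= -tol)
    bounded = y.min() >= -tol and y.max() <= 1 + tol
    return bool(monotone and bounded)

print("cubic spline:", is_valid_cdf(cubic))
print("PCHIP:", is_valid_cdf(pchip))
```

PCHIP is designed to preserve monotonicity of the input points, so its check should pass for any non-decreasing set of CDF values.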
To calculate the p-value for our current observation (6 visits during the last week), we evaluate the CDF at that point. This equals the area under the probability distribution up to 6 visits.
To do so, let's create a simple function calculate_p_value:
def calculate_p_value(x):
    # Outside the observed range the empirical CDF is 0 or 1
    if x < values.min():
        return 0.0
    elif x > values.max():
        return 1.0
    else:
        return float(spline_monotonic(x))

p_value = calculate_p_value(num_visits_last_week)
print(f"Probability of getting at most {num_visits_last_week} visits equals: {p_value}")
Probability of getting at most 6 visits equals: 0.375
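Because 6 is one of the observed values, the interpolated CDF passes exactly through that point, so the result can be cross-checked against the raw data without any interpolation. The sketch below also shows that the CDF value counts weeks with 6 visits or fewer, not strictly fewer than 6.

```python
import numpy as np

visits = np.array([9, 8, 6, 5, 8, 6, 8, 7])
num_visits_last_week = 6

# Fraction of past weeks with at most 6 visits (the weeks with 6, 5 and 6)
p_at_most = np.mean(visits <= num_visits_last_week)
# Fraction of past weeks with strictly fewer than 6 visits (the week with 5)
p_strictly_less = np.mean(visits < num_visits_last_week)

print(p_at_most)        # 0.375
print(p_strictly_less)  # 0.125
```

The interpolation only matters for values that fall between the observed ones.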
This probability is well above our threshold (0.1, for example), so John's latest week is not unusually low and we decide not to send him the discount. The same calculation has to be repeated for every user.
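Repeating the calculation for every user could look roughly like this. It is a sketch under assumptions: the `users` dictionary, the user "jane" and the helper `low_activity_p_value` are hypothetical, and the 0.1 threshold is just the example value mentioned above.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def low_activity_p_value(history, latest):
    """Estimate P(weekly visits <= latest) from a user's past weekly counts."""
    values, counts = np.unique(history, return_counts=True)
    if latest < values.min():
        return 0.0
    if latest >= values.max():
        return 1.0
    # Reaching this point means at least two distinct values exist,
    # so the PCHIP interpolant is well defined
    cdf = np.cumsum(counts) / counts.sum()
    return float(PchipInterpolator(values, cdf)(latest))

# Hypothetical data: (past weekly visit counts, latest weekly count)
users = {
    "john": ([9, 8, 6, 5, 8, 6, 8, 7], 6),
    "jane": ([12, 11, 13, 12, 10, 12, 11, 13], 4),
}
THRESHOLD = 0.1  # example threshold from the text

send_discount = {
    name: low_activity_p_value(history, latest) < THRESHOLD
    for name, (history, latest) in users.items()
}
print(send_discount)  # {'john': False, 'jane': True}
```

Each user is compared only against their own history, so a naturally infrequent visitor is not flagged just for being less active than others.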