Building Effective Metrics for Defining Users | by Vladimir Zhyvov | January, 2025

Imagine you run an e-commerce platform and want to personalize your email campaigns based on user activity over the past week. If a user is less active than in previous weeks, you plan to send them a discount offer.

You have collected user stats and noticed the following for a user named John:

  • John visited the platform for the first time 15 days ago.
  • In the first 7 days (days 1-7), he made 9 visits.
  • In the next 7-day window (days 2-8), he made 8 visits.
  • In total, we have 9 values, one for each 7-day window.
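
The nine weekly totals can be derived from daily visit counts with a 7-day sliding window. Below is a minimal sketch, assuming a hypothetical `daily_visits` array: the article gives no daily data, so these numbers are invented so that the window sums match John's weekly totals used in the code further down.

```python
import numpy as np

# Hypothetical daily visit counts for John's 15 days on the platform
# (invented for illustration; the article only gives weekly totals).
daily_visits = np.array([2, 2, 1, 0, 2, 1, 1, 1, 0, 0, 3, 0, 3, 0, 0])

window = 7
# Convolving with a vector of ones sums each run of 7 consecutive days:
# days 1-7, days 2-8, ..., days 9-15.
weekly_totals = np.convolve(daily_visits, np.ones(window, dtype=int), mode="valid")

history = weekly_totals[:-1]  # the 8 earlier windows
latest = weekly_totals[-1]    # the most recent window

print(len(weekly_totals))     # 15 - 7 + 1 = 9 windows
print(history, latest)
```

Splitting the result into `history` and `latest` mirrors the article's setup, where the latest window is compared against all earlier ones.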

Now, you want to check how extreme the latest value is compared to the previous ones.

import numpy as np
visits = np.array([9, 8, 6, 5, 8, 6, 8, 7])
num_visits_last_week = 6

Let's create the CDF of these values.

import numpy as np
import matplotlib.pyplot as plt

values = np.array(sorted(set(visits)))
# Count how many times each unique value occurs in visits
counts = np.array([np.sum(visits == x) for x in values])
probabilities = counts / counts.sum()
cdf = np.cumsum(probabilities)

plt.scatter(values, cdf, color='black', linewidth=10)
plt.show()

CDF, image by Author

Now we need to reconstruct a continuous function from these points. We will use spline interpolation.

from scipy.interpolate import make_interp_spline

x_new = np.linspace(values.min(), values.max(), 300)
spline = make_interp_spline(values, cdf, k=3)
cdf_smooth = spline(x_new)

plt.plot(x_new, cdf_smooth, label='Spline CDF', color='black', linewidth=4)
plt.scatter(values, cdf, color='black', linewidth=10)
plt.scatter(values[-2:], cdf[-2:], color='#f95d5f', linewidth=10, zorder=5)
plt.show()

CDF with spline interpolation, image by Author

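It's not bad. But there is a small problem between the red dots: a CDF must be monotonically non-decreasing, and a cubic spline does not guarantee that. Let's solve this with a Piecewise Cubic Hermite Interpolating Polynomial (PCHIP), which preserves the monotonicity of the data.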

from scipy.interpolate import PchipInterpolator

spline_monotonic = PchipInterpolator(values, cdf)
cdf_smooth = spline_monotonic(x_new)

plt.plot(x_new, cdf_smooth, color='black', linewidth=4)
plt.scatter(values, cdf, color='black', linewidth=10)
plt.show()

CDF with PCHIP interpolation, image by Author

Okay, now it's complete.
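
The difference between the two interpolants can also be checked numerically rather than by eye. The sketch below is an illustration, not part of the original workflow: it hard-codes the unique values and CDF that follow from John's data, and the helper `is_non_decreasing` is a name introduced here.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator, make_interp_spline

# Unique visit counts and their empirical CDF for John's 8 earlier weeks.
values = np.array([5, 6, 7, 8, 9])
cdf = np.array([0.125, 0.375, 0.5, 0.875, 1.0])

x_new = np.linspace(values.min(), values.max(), 300)
cubic = make_interp_spline(values, cdf, k=3)(x_new)
pchip = PchipInterpolator(values, cdf)(x_new)

def is_non_decreasing(y, tol=1e-12):
    """True if y never drops by more than a tiny floating-point tolerance."""
    return bool(np.all(np.diff(y) >= -tol))

print("cubic spline non-decreasing:", is_non_decreasing(cubic))
print("PCHIP non-decreasing:", is_non_decreasing(pchip))
# A maximum above 1 would not be a valid probability.
print("cubic spline max:", cubic.max())
print("PCHIP max:", pchip.max())
```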

To calculate the p-value for our current observation (6 visits during the last week), we need the area of the shaded region, which is the value of the CDF at that point.

Critical region, image by Author

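To do so, let's create a simple function calculate_p_value:

def calculate_p_value(x):
    # Below the smallest observed value the CDF is 0; above the largest it is 1
    if x < values.min():
        return 0
    elif x > values.max():
        return 1
    else:
        # Inside the observed range, evaluate the monotonic interpolant
        return spline_monotonic(x)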

p_value = calculate_p_value(num_visits_last_week)
print(f"Probability of getting {num_visits_last_week} visits or fewer equals: {p_value}")

Probability of getting 6 visits or fewer equals: 0.375

This probability is too high (we might compare it to a threshold of 0.1, for example), so we decide not to send the discount to John. We need to repeat the same calculation for every user.
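
One way to run the per-user calculation is to wrap the steps above in a single function and loop over users. This is a rough sketch under stated assumptions: the function name `lower_tail_probability`, the `users` dictionary, and Alice's numbers are all hypothetical, and the 0.1 threshold is only the example value mentioned above.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def lower_tail_probability(history, latest):
    """Smoothed empirical CDF of `history`, evaluated at `latest`."""
    values, counts = np.unique(history, return_counts=True)
    cdf = np.cumsum(counts) / counts.sum()
    if latest < values[0]:
        return 0.0
    if latest >= values[-1]:
        return 1.0
    # Reaching this point means there are at least two unique values,
    # which is what PchipInterpolator needs.
    return float(PchipInterpolator(values.astype(float), cdf)(latest))

THRESHOLD = 0.1  # example threshold; tune for your own campaign

# Hypothetical users: (earlier weekly totals, latest weekly total).
users = {
    "john": ([9, 8, 6, 5, 8, 6, 8, 7], 6),           # John's numbers from above
    "alice": ([12, 11, 13, 12, 10, 12, 11, 13], 4),  # made-up data
}

for name, (history, latest) in users.items():
    p = lower_tail_probability(history, latest)
    print(f"{name}: p = {p:.3f}, send discount: {p < THRESHOLD}")
```

The early returns also cover users whose history contains a single repeated value, where no interpolation is possible.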
