The Data Science Problem: Answering the “What If?” Questions Without Trials
Now that we have the features for our model, we will split our data into three sets:
1 - Training dataset: the data on which we train the model.
2 - Test dataset: the data used to evaluate the performance of the model.
3 - After-modification dataset: the data used to compute the uplift with the model.
import datetime as dt
from sklearn.model_selection import train_test_split

start_modification_date = dt.datetime(2024, 2, 1)
X_before_modification = X[X.index < start_modification_date]
y_before_modification = y[y.index < start_modification_date].kpi
X_after_modification = X[X.index >= start_modification_date]
y_after_modification = y[y.index >= start_modification_date].kpi
X_train, X_test , y_train , y_test = train_test_split(X_before_modification, y_before_modification, test_size= 0.25, shuffle = False)
Note: you can use a fourth subset of the data to select a particular model. Here we do not compare many models, so it does not matter much. It will matter once you start choosing your model among ten candidates.
Note 2: cross-validation is also possible, and recommended (a minimal sketch is given right after these notes).
Note 3: I recommend splitting the data without shuffling (shuffle=False). It allows you to observe the temporal evolution of your model's performance.
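As a minimal sketch of Note 2 (the estimator and the scoring metric here are illustrative assumptions, not settings from this article), time-series cross-validation can be done with scikit-learn's TimeSeriesSplit, which keeps the temporal order so that each fold is trained on the past and scored on the following days:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Expanding-window cross-validation on the pre-modification data
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(
    RandomForestRegressor(min_samples_split=4),
    X_before_modification,
    y_before_modification,
    cv=cv,
    scoring='neg_mean_absolute_error',  # assumption: any regression error metric works here
)
print(scores.mean(), scores.std())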
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(min_samples_split=4)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
And here you train your forecaster. We use a random forest regressor for simplicity, because it handles non-linearities, missing data, and outliers. Gradient-boosted tree algorithms are also very good for this application.
Most papers on synthetic control use a linear regression here, but we do not think it is necessary, because we are not really interested in interpreting the model. Moreover, interpreting such regressions can be tricky.
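For reference, here is a minimal sketch of the gradient-boosted alternative mentioned above, using scikit-learn's HistGradientBoostingRegressor; the hyperparameters are illustrative assumptions, not tuned values.

from sklearn.ensemble import HistGradientBoostingRegressor

# Gradient-boosted trees as a drop-in replacement for the random forest.
# This implementation also handles missing values natively.
gb_model = HistGradientBoostingRegressor(max_depth=6, learning_rate=0.05)  # illustrative settings
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)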
Counterfactual Test
We first evaluate our prediction on the test set. The main hypothesis we make is that the performance of the model will remain the same when we compute the uplift. That is why we tend to use a lot of data in the test set. We consider three important indicators to evaluate the quality of the counterfactual forecast:
1 - Bias: bias measures a systematic gap between your counterfactual forecast and the real data. It puts a hard limit on your measurement power, because it cannot be reduced by waiting longer after the modification.
bias = float((y_pred - y_test).mean()/(y_before_modification.mean()))
bias
> 0.0030433481322823257
We usually express the bias as a percentage of the mean value of the KPI. Here it is below 1%, so we should not expect to reliably measure effects smaller than that. If your bias is too large, you should check for a temporal drift (and add a trend to your estimate). You can also correct your forecast by subtracting the estimated bias, as long as you control the effect of this correction on new data, as sketched below.
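To make that last point concrete, here is a minimal, hedged sketch of a controlled bias correction; the half-and-half split and the variable names are assumptions for illustration, not a method taken from this article.

# Estimate the bias on the first half of the test period...
n_half = len(y_test) // 2
bias_estimate = float((y_pred[:n_half] - y_test.iloc[:n_half]).mean())

# ...then check on the second half that subtracting it really reduces the bias.
residual_bias = float((y_pred[n_half:] - bias_estimate - y_test.iloc[n_half:]).mean() / y_before_modification.mean())
residual_bias

# If the check is satisfactory, the same constant can be subtracted from the
# after-modification forecast before computing the uplift.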
2 - Standard deviation σ: we also want to control how much the predictions are scattered around the true values. For that we use the standard deviation, also expressed as a percentage of the mean value of the KPI.
sigma = float((y_pred - y_test).std()/(y_before_modification.mean()))
sigma
> 0.0780972738325956
The good news is that the uncertainty coming from the standard deviation shrinks as the number of data points increases. We prefer an unbiased forecast, so it may be worth accepting a higher standard deviation if that helps limit the bias.
It is also interesting to look at the bias and the variance through the distribution of the prediction errors. This helps to check whether our estimates of the bias and standard deviation are valid, or whether they are driven by outliers and extreme values.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(8, 6))
sns.histplot(pd.DataFrame((y_pred - y_test)/y_before_modification.mean()), x = 'kpi', bins = 35, kde = True, stat = 'probability')
f.suptitle('Relative Error Distribution')
ax.set_xlabel('Relative Error')
plt.show()
3 - Autocorrelation α: in general, errors are auto-correlated. It means that if your prediction is above the true value on a given day, it is more likely to be above it on the next day as well. This is a problem because most classical statistical tools require independence between the observations: what happens on one day should not affect the next. We use the autocorrelation of the errors (at lag 1 in the code below) as a measure of the dependence between one day and the next.
df_test = pd.DataFrame(zip(y_pred, y_test), columns = ['Prevision','Real'], index = y_test.index)
df_test = df_test.assign(
ecart = df_test.Prevision - df_test.Real)
alpha = df_test.ecart.corr(df_test.ecart.shift(1))
alpha
> 0.24554635095548982
High autocorrelation is problematic but manageable. Possible causes include unobserved covariates. If, for instance, the store whose KPI we want to measure organized a special event, it may increase its sales for several days, creating an unexpected streak of days above the forecast.
df_test = pd.DataFrame(zip(y_pred, y_test), columns = ['Prevision','Reel'], index = y_test.index)

f, ax = plt.subplots(figsize=(15, 6))
sns.lineplot(data = df_test, x = 'date', y= 'Reel', label = 'True Value')
sns.lineplot(data = df_test, x = 'date', y= 'Prevision', label = 'Forecasted Value')
ax.axvline(start_modification_date, ls = '--', color = 'black', label = 'Start of the modification')
ax.legend()
f.suptitle('KPI TX_1')
plt.show()
In the figure above, you can see an illustration of the autocorrelation phenomenon: at the end of April 2023, the predicted values are above the actual values for several consecutive days. The errors are not independent of each other.
Calculation of Impact
Now we can compute the effect of the modification. We compare the forecast after the modification with the actual values. As always, the result is expressed as a percentage of the mean value of the KPI.
y_pred_after_modification = model.predict(X_after_modification)
uplift =float((y_after_modification - y_pred_after_modification).mean()/y_before_modification.mean())
uplift
> 0.04961773643584396
We find a relative uplift of 4.9%. The “true” value (the data was artificially modified) was 3.0%, so we are not far from it. And indeed, the actual values are often above the forecast:
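To visualize this, here is a minimal sketch, assuming the same plotting conventions as the earlier chart (including an index named 'date'), comparing the actual values with the counterfactual forecast after the modification.

# Assumed plotting code, mirroring the earlier chart.
df_after = pd.DataFrame(zip(y_pred_after_modification, y_after_modification), columns = ['Prevision','Reel'], index = y_after_modification.index)

f, ax = plt.subplots(figsize=(15, 6))
sns.lineplot(data = df_after, x = 'date', y = 'Reel', label = 'True Value')
sns.lineplot(data = df_after, x = 'date', y = 'Prevision', label = 'Forecasted Value')
ax.legend()
f.suptitle('KPI TX_1 after the modification')
plt.show()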
We can calculate a confidence interval for this value. If our predictor has no bias, the size of its confidence interval can be expressed as:
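Written out to match the code below (the notation is a reconstruction, with û denoting the estimated uplift), the 68% interval is:

$$ IC_{68\%} \;=\; \left[\; \hat{u} - \frac{\sigma}{(1-\alpha)\sqrt{N}}\;,\;\; \hat{u} + \frac{\sigma}{(1-\alpha)\sqrt{N}} \;\right] $$

and the 95% interval uses twice that half-width.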
Where σ is the standard deviation of the forecast errors, α is their autocorrelation, and N is the number of days after the modification.
from math import sqrt

N = y_after_modification.shape[0]
ec = sigma / (sqrt(N) * (1 - alpha))

print('68%% IC : [%.2f %% , %.2f %%]' % (100 * (uplift - ec), 100 * (uplift + ec)))
print('95%% IC : [%.2f %% , %.2f %%]' % (100 * (uplift - 2 * ec), 100 * (uplift + 2 * ec)))
> 68% IC : [3.83 % , 6.09 %]
> 95% IC : [2.70 % , 7.22 %]
The width of the 95% CI is around 4.5% for 84 days. This is workable in many applications, because it is often possible to run a trial or a proof of concept for 3 months.
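As a rough rule of thumb, the same formula can be inverted to estimate how many post-modification days are needed for a target precision; in the sketch below, the ±2% target at the 95% level is an arbitrary assumption.

from math import ceil

# Invert the 95% half-width 2*sigma / ((1 - alpha) * sqrt(N)) for N,
# given a target half-width (here +/- 2%, an arbitrary assumption).
target_half_width = 0.02
n_days_needed = ceil((2 * sigma / ((1 - alpha) * target_half_width)) ** 2)
n_days_needed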
Note: the confidence interval is very sensitive to the standard deviation of the initial forecast. That is why it is worth spending some time on model selection (using the training set only) before settling on a model.
Mathematical formulation of the model
So far, we have tried to avoid formulas in order to keep things easy to follow. In this section, we present the mathematical framework behind the method.