A Method for Robust Variable Selection in Credit Scoring Models

Most scoring models fail for one reason: bad variable selection. You choose variables that fit your training data. They break down on new data. The model looks good in development and breaks in production.
There is a better way. This article shows you how to select variables that are stable, interpretable, and robust, regardless of how you split the data.
Core Idea: Stability Over Performance
A variable is robust if it remains important on every subset of your data, not just on the full dataset.
To test this, we split the training data into 4 folds using stratified cross-validation. Stratifying on the default target ensures that each fold is representative of the full population.
from sklearn.model_selection import StratifiedKFold

# Assign each row of the training set to one of 4 stratified folds
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
train_imputed["fold"] = -1
for fold, (_, test_idx) in enumerate(skf.split(train_imputed, train_imputed["def_year"])):
    train_imputed.loc[test_idx, "fold"] = fold
Then we create four (train, test) pairs. Each pair uses three folds for training and one fold for testing. We apply every selection rule to the training set only, never to the test set. This prevents data leakage.
folds = build_and_save_folds(train_imputed, fold_col="fold", save_dir="folds/")
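build_and_save_folds is a project helper. As an illustration of what the pairing step might look like (the function name below is hypothetical, and the saving step is omitted; this sketch only builds the pairs in memory from the fold column created above):

```python
import pandas as pd

def build_folds(df: pd.DataFrame, fold_col: str = "fold"):
    """Build (train, test) pairs: each fold serves once as the test set."""
    pairs = []
    for f in sorted(df[fold_col].unique()):
        test = df[df[fold_col] == f].drop(columns=fold_col)
        train = df[df[fold_col] != f].drop(columns=fold_col)
        pairs.append((train, test))
    return pairs
```

With 4 folds, this yields four pairs, each training on roughly 75% of the rows.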
A variable survives selection only if it passes the criteria in all four folds. One weak fold is enough to eliminate it.
Dataset
We use the Credit Scoring Dataset from Kaggle. It consists of 32,581 loans issued to individual borrowers.
Loans cover medical, personal, educational, and professional needs – as well as debt consolidation. Loan amounts range from $500 to $35,000.
The dataset has two types of variables:
- Features of the contract: loan amount, interest rate, loan purpose, credit grade, time since origination
- Borrower characteristics: age, income, years of experience, housing situation
We identified 7 continuous variables:
- person_income
- person_age
- person_emp_length
- loan_amnt
- loan_int_rate
- loan_percent_income
- cb_person_cred_hist_length
We identified 4 categorical variables:
- person_home_ownership
- cb_person_default_on_file
- loan_intent
- loan_grade
The target is default: 1 if the borrower defaulted, 0 otherwise.
We covered missing values and outliers in a previous article. Here, we focus on variable selection.
How to Filter: The Four Rules
The filtering method relies on statistical measures of association. It does not require a predictive model. It is fast, readable, and easy to explain to non-technical stakeholders.
We use four rules in sequence. Each rule feeds its output to the next.
Rule 1: Drop continuous variables that are not linked to defaults
We perform a Kruskal-Wallis test between each continuous variable and the default target. If the p-value exceeds 5% in at least one fold, we discard the variable: it is not reliably linked to default.
rule1_vars = filter_uncorrelated_with_target(
    folds=folds,
    variables=continuous_vars,
    target="def_year",
    pvalue_threshold=0.05,
)
Result: All continuous variables pass Rule 1. Each shows a significant association with default across all four folds.
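filter_uncorrelated_with_target is the author's helper. A minimal sketch of the underlying per-fold check, using scipy's Kruskal-Wallis test (the function name passes_rule1 is illustrative):

```python
import pandas as pd
from scipy.stats import kruskal

def passes_rule1(folds, var, target, alpha=0.05):
    """Keep `var` only if Kruskal-Wallis finds a significant link to the
    target on the training set of every fold."""
    for train, _ in folds:
        # One sample of `var` per target class (here: defaulters vs. non-defaulters)
        groups = [g[var].dropna() for _, g in train.groupby(target)]
        _, pvalue = kruskal(*groups)
        if pvalue > alpha:  # not reliably linked to default in this fold
            return False
    return True
```

A single fold with p > 5% is enough to return False, which matches the survive-everywhere rule.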
Rule 2: Drop categorical variables weakly linked to default
We calculate Cramér's V between each categorical variable and the default target. Cramér's V measures the association between two categorical variables. It ranges from 0 (no link) to 1 (perfect link).
We drop a variable if its Cramér's V falls below 10% in at least one fold. A strong association corresponds to a V above 50%.
rule2_vars = filter_categorical_variables(
    folds=folds,
    cat_variables=categorical_vars,
    target="def_year",
    low_threshold=0.10,
    high_threshold=0.50,
)
Result: We keep 3 of the 4 categorical variables. loan_intent is dropped; its link to default is too weak in one fold.
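filter_categorical_variables is the author's helper. Cramér's V and the per-fold low-threshold check can be sketched as follows (helper names are illustrative; the chi-square statistic is computed without Yates' continuity correction, a common choice for Cramér's V):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series: 0 = no link, 1 = perfect link."""
    table = pd.crosstab(x, y)
    # Chi-square without continuity correction, as is usual for Cramér's V
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # degrees-of-freedom normalizer
    return float(np.sqrt(chi2 / (n * k)))

def passes_rule2(folds, var, target, low=0.10):
    """Keep `var` only if its Cramér's V with the target reaches `low` in every fold."""
    return all(cramers_v(train[var], train[target]) >= low for train, _ in folds)
```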
Rule 3: Drop redundant continuous variables
Two continuous variables that carry the same information damage the model. They create multicollinearity.
We calculate the Spearman correlation between each pair of continuous variables. If the correlation reaches 60% or more in at least one fold, we discard one variable of the pair. We retain the one with the stronger link to default, measured by the lower Kruskal-Wallis p-value.
selected_continuous = filter_correlated_variables_kfold(
    folds=folds,
    variables=rule1_vars,
    target="def_year",
    threshold=0.60,
)
Result: We keep 5 continuous variables. We drop loan_amnt and cb_person_cred_hist_length; both were strongly correlated with retained variables. This is consistent with what we found in a previous article.
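filter_correlated_variables_kfold is the author's helper. The pair-detection step can be sketched with pandas' built-in Spearman correlation (the helper name is illustrative); in the author's procedure, each flagged pair is then resolved by keeping the variable with the lower Kruskal-Wallis p-value against default:

```python
from itertools import combinations
import pandas as pd

def redundant_pairs(train: pd.DataFrame, variables, threshold=0.60):
    """Return the variable pairs whose absolute Spearman correlation
    reaches the threshold on this training set."""
    corr = train[variables].corr(method="spearman").abs()
    return [(a, b) for a, b in combinations(variables, 2)
            if corr.loc[a, b] >= threshold]
```

Running this on every fold's training set and taking the union of flagged pairs implements the "in at least one fold" criterion.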
Rule 4: Drop redundant categorical variables
We apply the same logic to categorical variables. We calculate Cramér's V between all pairs of categorical variables retained after Rule 2. If V reaches 50% or more in at least one fold, we discard the variable with the weaker link to default.
selected_categorical = filter_correlated_categorical_variables(
    folds=folds,
    cat_variables=rule2_vars,
    target="def_year",
    high_threshold=0.50,
)
Result: We keep 2 categorical variables. We drop loan_grade, which is strongly correlated with the other retained variables and has a weaker link to default.
Final Selection: 7 Variables
The filtering method selects 7 variables in total: 5 continuous and 2 categorical. Each is clearly linked to default. None is redundant. And they all hold up in every fold.
This selection is auditable. You can show every decision to regulators or business stakeholders. You can explain why each variable was kept or dropped. That matters in credit scoring.
Each rule applies to the training set of each fold. A variable is eliminated if it fails in any single fold. This is what makes the selection robust.
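This survive-everywhere logic amounts to intersecting the per-fold survivor lists; an illustrative sketch (not the author's code):

```python
def survives_all_folds(per_fold_survivors):
    """Keep only the variables that appear in the survivor list of every fold."""
    kept = set(per_fold_survivors[0])
    for survivors in per_fold_survivors[1:]:
        kept &= set(survivors)  # one missing fold is enough to drop a variable
    return sorted(kept)
```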
In the next article, we will study the monotonicity and temporal stability of these 7 variables. A variable can be significant today and unstable over time. Both properties matter in scoring models.
Key points from the article:
- Most data scientists select variables on the full training data; those variables break on new data. Rule 1 fixes this: we run the Kruskal-Wallis test on each fold separately. The link between a continuous variable and default must be significant in all four folds.
- Categorical variables are the silent killers of scoring models. They appear linked to default in the full dataset, then fall apart on subsets. Rule 2 catches this: we calculate Cramér's V for each fold independently. Below 10% on any one fold, and the variable is out.
- Two continuous variables that carry the same information do not double your signal. They destabilize your model. Rule 3 flags every correlated pair (Spearman ≥ 60%) in every fold. When two variables compete, the one with the weaker link to default loses.
- Categorical redundancy is invisible until your model fails in production. Rule 4 catches it: we calculate Cramér's V between all pairs of retained categorical variables. Above 50% in any fold, one of the pair goes. We keep the one more strongly associated with default.
Did you find this helpful? Star the repo on GitHub and stay tuned for the next post on monotonicity and temporal stability.
How do you select robust variables in your models?
Photo Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise noted.
Data and License
The dataset used in this article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This license allows anyone to share and adapt the dataset for any purpose, including commercial use, as long as proper attribution is given to the source.
For more information, see the official CC BY 4.0 license text.
Disclaimer
Any remaining errors or inaccuracies are the responsibility of the author. Feedback and corrections are welcome.



