5 Variable Discretization Methods

Although continuous variables in real-world datasets carry detailed information, they are not always the most efficient form for modeling and interpretation. This is where variable discretization comes into play.
Understanding variable discretization is important for data science students building solid ML foundations and for AI engineers designing interpretable systems.
Early in my data science journey, I focused on tuning hyperparameters, experimenting with different algorithms, and improving performance metrics.
When experimenting with discretization methods, I noticed how stable and interpretable some ML models became. So, I decided to explain these methods in this article.
What is variable discretization?
Some ML models work better with discrete variables. For example, if we want to train a decision tree model on a dataset with continuous variables, converting these variables into discrete variables can reduce the training time of the model.
Variable discretization is the process of converting continuous variables into discrete variables by creating bins, which are sets of continuous intervals.
Advantages of variable discretization
- Decision trees and Naive Bayes models work well with discrete variables.
- Discrete features are easy to understand and explain.
- Discretization can reduce the impact of skewness and outliers in the data.
In short, discretization simplifies data and allows models to be trained quickly.
Disadvantages of variable discretization
The main disadvantage of discretization is the loss of information caused by binning. We need to find the smallest number of bins that does not lose too much information. The algorithm cannot find this number by itself; the user needs to provide the number of bins as a model hyperparameter. The algorithm then finds the cut points that match the requested number of bins.
Supervised and unsupervised discretization
The main categories of discretization methods are supervised and unsupervised. Unsupervised methods determine the bin boundaries from the underlying distribution of the variable, while supervised methods use the target values to determine these boundaries.
Types of variable discretization
We will discuss the following types of variable discretization.
- Equal-width discretization
- Equal-frequency discretization
- Arbitrary-interval discretization
- K-means clustering-based discretization
- Decision tree-based discretization
Equal-width discretization
As the name suggests, this method creates bins of equal width. The bin width is calculated by dividing the range of the values of the variable, X, by the number of bins, k.
Width = {Max(X) − Min(X)} / k
Here, k is a user-defined hyperparameter.
For example, if the values of X range between 0 and 50 and k=5, the bin width is 10 and the bins are 0–10, 10–20, 20–30, 30–40 and 40–50. If k=2, the bin width is 25 and the bins are 0–25 and 25–50. Therefore, the bin width varies with the value of k. Equal-width discretization may place a different number of data points in each bin, but the bin widths are all the same.
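The arithmetic above can be sketched in a few lines (a minimal illustration using hypothetical values of X, not part of the original article):

```python
import numpy as np

# Hypothetical values of X ranging between 0 and 50
X = np.array([0, 3, 12, 18, 25, 33, 41, 50])
k = 5  # user-defined number of bins

# Equal-width rule: width = (Max(X) - Min(X)) / k
width = (X.max() - X.min()) / k
edges = X.min() + width * np.arange(k + 1)

print(width)  # 10.0
print(edges)  # [ 0. 10. 20. 30. 40. 50.]
```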
Let's apply equal-width discretization to the iris dataset. Setting strategy='uniform' in KBinsDiscretizer() creates bins of equal width.
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
# Initialize
equal_width = KBinsDiscretizer(
n_bins=15,
encode='ordinal',
strategy='uniform'
)
# Discretize the feature into 15 equal-width bins
bins_equal_width = equal_width.fit_transform(X)
# Plot the distribution of the binned values
plt.hist(bins_equal_width, bins=15)
plt.title("Equal Width Discretization")
plt.xlabel(feature)
plt.ylabel("Count")
plt.show()
The histogram shows bins of equal width.
Equal-frequency discretization
This method assigns variable values to bins that contain an equal number of data points. The bin widths are not the same; the boundaries are determined by quantiles, which divide the data into equal-sized groups. Here again, the number of bins is defined by the user as a hyperparameter.
A disadvantage of equal-frequency discretization is that the bin widths depend entirely on the distribution of the data: if the data are skewed, some bins become very wide and lump together very different values, which can lead to a loss of information.
Let's apply equal-frequency discretization to the iris dataset. Setting strategy='quantile' in KBinsDiscretizer() creates balanced bins; each bin contains (approximately) the same number of data points.
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
# Initialize
equal_freq = KBinsDiscretizer(
n_bins=3,
encode='ordinal',
strategy='quantile'
)
bins_equal_freq = equal_freq.fit_transform(X)
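To verify that the bins hold roughly equal numbers of points while their widths differ, we can inspect the fitted bin edges and counts (a small check added here for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = df[['sepal length (cm)']]

equal_freq = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
bins = equal_freq.fit_transform(X)

# The boundaries sit near the 1/3 and 2/3 quantiles of the feature
print(equal_freq.bin_edges_[0])

# Each bin holds roughly 150 / 3 = 50 samples
print(np.bincount(bins.ravel().astype(int)))
```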
Arbitrary-interval discretization
In this method, the user assigns the data points of a variable to bins in whatever way makes sense for the domain (hence "arbitrary"). For example, you might assign temperature values to bins labeled "cold", "normal" and "hot". What matters is that the bins make sense. There is no need for equal bin widths or an equal number of data points per bin.
Here, we create bin boundaries based on domain knowledge.
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
# Define custom bins
custom_bins = [4, 5.5, 6.5, 8]
df['arbitrary'] = pd.cut(
df[feature],
bins=custom_bins,
labels=[0,1,2]
)
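As a variation on the numeric labels above, the bins can carry descriptive names, as in the temperature example. A minimal sketch (the 'short'/'medium'/'long' labels are illustrative assumptions, not from the original):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Same boundaries as above, but with descriptive (hypothetical) labels
df['sepal_length_bin'] = pd.cut(
    df['sepal length (cm)'],
    bins=[4, 5.5, 6.5, 8],
    labels=['short', 'medium', 'long']
)

print(df['sepal_length_bin'].value_counts())
```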
K-means clustering-based discretization
K-means clustering groups similar data points into clusters, and this can be used for variable discretization. In this method, the bins are the clusters identified by the k-means algorithm. Here again, we need to define the number of clusters, k, as a model hyperparameter. There are several ways to determine a suitable value of k, such as the elbow method.
Here, we use the KMeans algorithm to create clusters that act as bins.
# Import libraries
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
# Cluster the feature values into 3 groups; the cluster labels act as bins
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
df['kmeans'] = kmeans.fit_predict(X)
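One caveat worth noting: KMeans assigns cluster ids arbitrarily, so cluster 0 is not necessarily the bin with the smallest values. A small sketch of reordering the labels by cluster center so the bins become ordinal (an addition for illustration, not part of the original code):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = df[['sepal length (cm)']]

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

# Rank the clusters by their center so bin 0 holds the smallest values
order = np.argsort(kmeans.cluster_centers_.ravel())
remap = {old: new for new, old in enumerate(order)}
df['kmeans_bin'] = pd.Series(labels).map(remap)
```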
Decision tree-based discretization
Decision tree-based discretization uses a decision tree to find the bin boundaries. The tree finds suitable cut points automatically, so the user does not need to specify them; the number of bins can still be capped with a parameter such as max_leaf_nodes.
The discretization methods we have discussed so far are unsupervised methods. This method, however, is a supervised method, which means that we use the target values, y, to determine the boundaries.
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
# Get the target values
y = iris.target
tree = DecisionTreeClassifier(
max_leaf_nodes=3,
random_state=42
)
tree.fit(X, y)
# Get leaf node for each sample
df['decision_tree'] = tree.apply(X)
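The cut points the tree learned can also be read directly from its internal nodes, using scikit-learn's tree_ attributes (shown here as an illustration):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = df[['sepal length (cm)']]
y = iris.target

tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=42)
tree.fit(X, y)

# Leaves have children_left == -1; internal nodes carry the cut points
t = tree.tree_
thresholds = sorted(t.threshold[t.children_left != -1])
print(thresholds)  # two boundaries -> three bins
```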
This is an overview of variable discretization methods. The implementation of each method will be discussed in separate articles.
This is the end of today's article.
Please let me know if you have any questions or feedback.
See you in the next article. Happy reading!
Iris dataset information
- Citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
- Source:
- License: R.A. Fisher holds the copyright for this dataset. Michael Marshall contributed it to the public domain under the Creative Commons CC0 (Public Domain Dedication) license. You can read more about the different types of dataset licenses here.
Designed and written by:
Rukshan Pramoditha
2025-03-04



