Although continuous variables in real-world datasets provide detailed information, they are not always the most effective form for modelling and interpretation. This is where variable discretization comes into play.
Understanding variable discretization is essential for data science students building strong ML foundations and AI engineers designing interpretable systems.
Early in my data science journey, I mainly focused on tuning hyperparameters, experimenting with different algorithms, and optimising performance metrics.
When I experimented with variable discretization methods, I noticed how certain ML models became more stable and interpretable. So, I decided to explain these methods in this article.
What is variable discretization?
Some machine learning models work better with discrete variables. For example, if we want to train a decision tree model on a dataset with continuous variables, transforming these variables into discrete ones can reduce the model training time.
Variable discretization is the process of transforming continuous variables into discrete variables by creating bins, which are a set of continuous intervals.
Advantages of variable discretization
- Decision trees and Naive Bayes models work better with discrete variables.
- Discrete features are easy to understand and interpret.
- Discretization can reduce the impact of skewed variables and outliers in data.
In summary, discretization simplifies data and allows models to train faster.
Disadvantages of variable discretization
The main disadvantage of variable discretization is the loss of information that occurs when creating bins. We need to find the minimum number of bins that avoids a significant loss of information. The algorithm can’t find this number itself; the user supplies the number of bins as a model hyperparameter, and the algorithm then finds the cut points to match that number of bins.
Supervised and unsupervised discretization
The main categories of discretization methods are supervised and unsupervised. Unsupervised methods determine the bounds of the bins by using the underlying distribution of the variable, while supervised methods use ground truth values to determine these bounds.
Types of variable discretization
We will discuss the following types of variable discretization.
- Equal-width discretization
- Equal-frequency discretization
- Arbitrary-interval discretization
- K-means clustering-based discretization
- Decision tree-based discretization
Equal-width discretization
As the name suggests, this method creates bins of equal size. The width of a bin is calculated by dividing the range of values of a variable, X, by the number of bins, k.
Width = (Max(X) - Min(X)) / k
Here, k is a hyperparameter defined by the user.
For example, if the values of X range between 0 and 50 and k=5, we get 10 as the bin width and the bins are 0–10, 10–20, 20–30, 30–40 and 40–50. If k=2, the bin width is 25 and the bins are 0–25 and 25–50. So, the bin width differs based on the value of the k hyperparameter. Equal-width discretization assigns a different number of data points to each bin, but the bin widths are the same.
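The width formula above can be checked in a few lines. This is a minimal sketch using NumPy, with the 0–50, k=5 example from the text:

```python
import numpy as np

# Example values of X ranging between 0 and 50
X = np.array([0, 7, 13, 22, 35, 41, 50])
k = 5  # number of bins (user-defined hyperparameter)

# Width = (Max(X) - Min(X)) / k
width = (X.max() - X.min()) / k
print(width)  # 10.0

# The resulting bin edges: 0, 10, 20, 30, 40, 50
edges = np.linspace(X.min(), X.max(), k + 1)
print(edges)
```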
Let’s implement equal-width discretization using the Iris dataset. strategy='uniform' in KBinsDiscretizer() creates bins of equal width.
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
# Initialize
equal_width = KBinsDiscretizer(
n_bins=15,
encode='ordinal',
strategy='uniform'
)
bins_equal_width = equal_width.fit_transform(X)
plt.hist(bins_equal_width, bins=15)
plt.title("Equal Width Discretization")
plt.xlabel(feature)
plt.ylabel("Count")
plt.show()
The histogram shows bins of equal width.
Equal-frequency discretization
This method allocates the values of the variable into bins that contain a similar number of data points, so the bin widths are not the same. The bin boundaries are determined by quantiles, which divide the data into equal-sized groups (quartiles, for example, divide it into four). Here also, the number of bins is defined by the user as a hyperparameter.
The major disadvantage of equal-frequency discretization is that there will be many empty bins or bins with a few data points if the distribution of the data points is skewed. This will result in a significant loss of information.
Let’s implement equal-frequency discretization using the Iris dataset. strategy='quantile' in KBinsDiscretizer() creates balanced bins: each bin contains (approximately) an equal number of data points.
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
# Initialize
equal_freq = KBinsDiscretizer(
n_bins=3,
encode='ordinal',
strategy='quantile'
)
bins_equal_freq = equal_freq.fit_transform(X)
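The snippet above does not show what the fitted bins look like, so here is a sketch (same setup) that inspects the learned bin_edges_ and counts the samples per bin. The edges follow the quantiles, so the widths differ while the counts stay roughly equal:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer

# Load the Iris dataset and select one feature
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = df[['sepal length (cm)']]

# Equal-frequency discretization with 3 bins
equal_freq = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
bins = equal_freq.fit_transform(X)

# Quantile-based edges: unequal widths
print(equal_freq.bin_edges_[0])

# Sample counts per bin: roughly 150 / 3 = 50 each
counts = np.bincount(bins.ravel().astype(int))
print(counts)
```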
Arbitrary-interval discretization
In this method, the user allocates the data points of a variable into bins in whatever way makes sense for the problem (hence “arbitrary”). For example, you may allocate the values of the variable temperature into bins representing “cold”, “normal” and “hot”. Priority is given to domain sense; there is no need to have the same bin width or an equal number of data points in each bin.
Here, we manually define bin boundaries based on domain knowledge.
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
# Define custom bins
custom_bins = [4, 5.5, 6.5, 8]
df['arbitrary'] = pd.cut(
df[feature],
bins=custom_bins,
labels=[0,1,2]
)
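A quick way to check how the custom boundaries split the data is to count the samples per bin. This sketch reuses the boundaries from the snippet above; the labels 'short', 'medium' and 'long' are illustrative names I have chosen, not part of the original example:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset and select one feature
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
feature = 'sepal length (cm)'

# Domain-driven boundaries; bin sizes are unequal by design
custom_bins = [4, 5.5, 6.5, 8]
df['arbitrary'] = pd.cut(
    df[feature],
    bins=custom_bins,
    labels=['short', 'medium', 'long']
)

# How many samples fall into each named bin
print(df['arbitrary'].value_counts())
```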
K-means clustering-based discretization
K-means clustering focuses on grouping similar data points into clusters. This can be used for variable discretization: the bins are the clusters identified by the k-means algorithm. Here also, we need to define the number of clusters, k, as a model hyperparameter. There are several methods, such as the elbow method and the silhouette score, for determining the optimal value of k.
Here, we use the KMeans algorithm to create groups that act as discretized categories.
# Import libraries
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
kmeans = KMeans(n_clusters=3, random_state=42)
df['kmeans'] = kmeans.fit_predict(X)
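One caveat worth knowing: KMeans cluster labels are arbitrary, so label 0 is not necessarily the lowest-value bin. A sketch (my own addition, not part of the snippet above) that reorders the labels by cluster centre so they behave like ordinal bins:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the Iris dataset and select one feature
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = df[['sepal length (cm)']]

# Cluster the feature values into 3 groups
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
raw_labels = kmeans.fit_predict(X)

# Rank clusters by their centre so bin 0 < bin 1 < bin 2
order = np.argsort(kmeans.cluster_centers_.ravel())
remap = np.empty(3, dtype=int)
remap[order] = np.arange(3)
df['kmeans_bin'] = remap[raw_labels]

# The mean of each bin now increases with the bin label
print(df.groupby('kmeans_bin')['sepal length (cm)'].mean())
```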
Decision tree-based discretization
The decision tree-based discretization process uses decision trees to find the bounds of the bins. Unlike the other methods, the tree learns the cut points itself by optimising its splits; the user only caps the number of bins indirectly, for example through the max_leaf_nodes hyperparameter.
The discretization methods that we discussed so far are unsupervised methods. However, this method is a supervised method, meaning that we also use the target values, y, to determine the bounds.
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Select one feature
feature = 'sepal length (cm)'
X = df[[feature]]
# Get the target values
y = iris.target
tree = DecisionTreeClassifier(
max_leaf_nodes=3,
random_state=42
)
tree.fit(X, y)
# Get leaf node for each sample
df['decision_tree'] = tree.apply(X)
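The bin boundaries the tree has learned can be read off its internal split thresholds. A sketch under the same setup; in scikit-learn, tree_.threshold stores -2 for leaf nodes, which we filter out:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset, one feature plus the target
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = df[['sepal length (cm)']]
y = iris.target

# Fit a tree capped at 3 leaves (i.e. 3 bins)
tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=42)
tree.fit(X, y)

# Internal nodes carry the learned cut points; leaves are marked -2
thresholds = tree.tree_.threshold[tree.tree_.threshold != -2]
print(np.sort(thresholds))  # the cut points that define the bins
```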
This is an overview of variable discretization methods. The implementation details of each method will be discussed in separate articles.
This is the end of today’s article.
Please let me know if you have any questions or feedback.
See you in the next article. Happy learning to you!
Iris dataset info
- Citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
- Source: https://archive.ics.uci.edu/ml/datasets/iris
- License: R.A. Fisher holds the copyright of this dataset. Michael Marshall donated this dataset to the public under the Creative Commons Public Domain Dedication License (CC0).
Designed and written by:
Rukshan Pramoditha
2025–03–04