
Data Cleaning and Exploration with Machine Learning

Sometimes, we will want to convert a continuous feature into a categorical feature. The process of creating k intervals that span the minimum to the maximum value of a distribution is called binning or, to use the somewhat less friendly term, discretization. The intervals can be equally spaced (equal-width binning) or chosen so that each holds roughly the same number of observations (equal-frequency binning). Binning can address several important issues with a feature: skew, excessive kurtosis, and the presence of outliers.
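To make the idea concrete, here is a minimal sketch of binning with pandas. It is not taken from the book's code: the data is synthetic (a right-skewed sample generated for illustration), and the names values, k, equal_width, and equal_freq are assumptions. It uses pd.cut for equally spaced intervals and pd.qcut for quantile-based intervals.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed values (an assumption for illustration only;
# this is not the COVID data discussed in the text).
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=8, sigma=1.5, size=1000), name="value")

k = 10  # number of bins

# Equal-width binning: k intervals of identical width between min and max.
# labels=False returns the integer bin index (0 to k - 1) for each value.
equal_width = pd.cut(values, bins=k, labels=False)

# Equal-frequency binning: interval edges are placed at quantiles, so each
# bin holds roughly the same number of observations.
equal_freq = pd.qcut(values, q=k, labels=False)

# Compare how many observations land in each bin under the two schemes.
width_counts = equal_width.value_counts().sort_index()
freq_counts = equal_freq.value_counts().sort_index()
print(width_counts)
print(freq_counts)
```

With skewed data such as this, equal-width bins tend to put most observations in the lowest interval, while equal-frequency bins spread them evenly. Either way, an extreme value simply becomes a member of the top bin, which is how binning blunts the influence of outliers.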
Binning might be a good choice with the COVID case data. Let's try that (this might also be useful with other variables in the dataset, including total deaths and population, but we will only work with total cases for now; total_cases is the target variable in the following code, so it is a column, the only column, on the y_train DataFrame):
First, we need to import EqualFrequencyDiscretiser and EqualWidthDiscretiser from feature_engine. Additionally, we need to create training and testing DataFrames from the COVID data...