
Data Science with Python

Previously, we saw how to combine data from different sources into a unified dataframe. That dataframe now has many columns holding different types of data. Our goal is to transform the data into a format that machine learning algorithms can digest. Because these algorithms are built on mathematical operations, the columns generally need to be converted into a numerical format. Before doing that, let's look at the different types of data we have.
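Broadly, data is classified as either numerical or categorical:
Numerical data is further divided into discrete data (countable values) and continuous data (measured values).
Categorical data is further divided into ordered (ordinal) data, whose categories have a natural ranking, and nominal data, whose categories do not.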
From these different types of data, we will focus on categorical data. In the next section, we'll discuss how to handle categorical data.
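As a rough illustration of these data types, the sketch below builds a small, made-up dataframe and separates its numerical columns from its categorical ones. It assumes the usual split of numerical data into discrete and continuous, and of categorical data into ordered and nominal. The column names and values are hypothetical and are not taken from the book's datasets.

```python
import pandas as pd

# Hypothetical toy data, not taken from the book's datasets.
toy = pd.DataFrame({
    "children": [0, 2, 1],                             # numerical, discrete
    "height_cm": [172.5, 160.1, 181.0],                # numerical, continuous
    "grade": ["2nd Class", "1st Class", "3rd Class"],  # categorical, ordered
    "city": ["Pune", "Oslo", "Lima"],                  # categorical, nominal
})

# An ordered categorical dtype records the ranking explicitly.
toy["grade"] = pd.Categorical(
    toy["grade"],
    categories=["3rd Class", "2nd Class", "1st Class"],
    ordered=True,
)

# Split the column names by dtype, as the exercises later in this section do.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```

The same select_dtypes call is what the exercises use to pick out the columns that need encoding.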
Some algorithms, such as decision trees, can work well with categorical data, but most machine learning algorithms cannot operate on it directly. They require both the input and the output to be in numerical form. If the output to be predicted is categorical, we encode it as numbers for training and convert the predictions back to categories afterward. Let's discuss some key challenges that we face while dealing with categorical data:
Encoding
To address the problems associated with categorical data, we can use encoding. This is the process by which we convert a categorical variable into a numerical form. Here, we will look at three simple methods of encoding categorical data.
Replacing
This is a technique in which we replace the categorical data with a number. This is a simple replacement and does not involve much logical processing. Let's look at an exercise to get a better idea of this.
In this exercise, we will use the student dataset that we saw earlier. We will load the data into a pandas dataframe and simply replace all the categorical data with numbers. Follow these steps to complete this exercise:
The student dataset can be found at this location: https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/student.csv.
import pandas as pd
import numpy as np
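# A GitHub /blob/ URL serves an HTML page; appending ?raw=true requests the CSV file itself
dataset = "https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/student.csv?raw=true"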
df = pd.read_csv(dataset, header = 0)
df_categorical = df.select_dtypes(exclude=[np.number])
df_categorical
The preceding code generates the following output:
df_categorical['Grade'].unique()
The preceding code generates the following output:
df_categorical.Grade.value_counts()
The output of this step is as follows:
df_categorical.Gender.value_counts()
The output of this code is as follows:
df_categorical.Employed.value_counts()
The output of this code is as follows:
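# Assign the result back: replace(..., inplace=True) on a selected column may not update the dataframe in recent pandas versions
df_categorical['Grade'] = df_categorical['Grade'].replace({"1st Class":1, "2nd Class":2, "3rd Class":3})
df_categorical['Gender'] = df_categorical['Gender'].replace({"Male":0,"Female":1})
df_categorical['Employed'] = df_categorical['Employed'].replace({"yes":1,"no":0})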
df_categorical.head()
You have successfully converted the categorical data to numerical data using a simple manual replacement method. We will now move on to look at another method of encoding categorical data.
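The replacement steps above can also be tried without downloading the dataset. The following is a minimal sketch that assumes a few made-up rows with the same Grade, Gender, and Employed values as the exercise. It uses Series.map rather than replace, which is a swapped-in choice for illustration, not the book's method.

```python
import pandas as pd

# Hypothetical rows that mimic the Grade, Gender and Employed columns.
students = pd.DataFrame({
    "Grade": ["1st Class", "3rd Class", "2nd Class"],
    "Gender": ["Male", "Female", "Female"],
    "Employed": ["yes", "no", "yes"],
})

# One mapping per column, mirroring the replacements used in the exercise.
mappings = {
    "Grade": {"1st Class": 1, "2nd Class": 2, "3rd Class": 3},
    "Gender": {"Male": 0, "Female": 1},
    "Employed": {"yes": 1, "no": 0},
}

encoded = students.copy()
for column, mapping in mappings.items():
    # map() turns any value missing from the mapping into NaN,
    # which makes unexpected categories easy to spot.
    encoded[column] = students[column].map(mapping)

print(encoded)
```

Keeping the mappings in one dictionary makes it easy to review which number stands for which category.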
Label Encoding
This technique replaces each value in a categorical column with an integer from 0 to N-1, where N is the number of distinct categories in that column. For example, given a column of employee names, label encoding assigns each distinct name a numeric label. This is not suitable in every case, because a model may interpret the numeric labels as magnitudes or weights. Label encoding is therefore best suited to ordinal data, where the categories have a natural order. The scikit-learn library provides LabelEncoder(), which helps with label encoding. Let's look at an exercise in the next section.
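One point worth checking with ordinal data is which order the labels follow. LabelEncoder numbers the classes in sorted (alphabetical) order, which may not match the real ranking. The sketch below, using a hypothetical "size" column, compares it with scikit-learn's OrdinalEncoder, which accepts an explicit category order. OrdinalEncoder is shown here as an alternative for illustration and is not part of the book's exercise.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal column; the values are illustrative only.
sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# LabelEncoder sorts the classes alphabetically: large=0, medium=1, small=2.
alphabetical = LabelEncoder().fit_transform(sizes["size"])

# OrdinalEncoder accepts an explicit order: small=0, medium=1, large=2.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
ranked = ordinal.fit_transform(sizes[["size"]]).ravel().astype(int)

print(alphabetical.tolist())  # [2, 0, 1, 2]
print(ranked.tolist())        # [0, 2, 1, 0]
```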
In this exercise, we will load the Banking_Marketing.csv dataset into a pandas dataframe and convert categorical data to numeric data using label encoding. Follow these steps to complete this exercise:
The Banking_Marketing.csv dataset can be found here: https://github.com/TrainingByPackt/Master-Data-Science-with-Python/blob/master/Chapter%201/Data/Banking_Marketing.csv.
import pandas as pd
import numpy as np
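# A GitHub /blob/ URL serves an HTML page; appending ?raw=true requests the CSV file itself
dataset = 'https://github.com/TrainingByPackt/Master-Data-Science-with-Python/blob/master/Chapter%201/Data/Banking_Marketing.csv?raw=true'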
df = pd.read_csv(dataset, header=0)
df = df.dropna()
data_column_category = df.select_dtypes(exclude=[np.number]).columns
data_column_category
To understand how the selection looks, refer to the following screenshot:
df[data_column_category].head()
The preceding code generates the following output:
#import the LabelEncoder class
from sklearn.preprocessing import LabelEncoder
#Creating the object instance
label_encoder = LabelEncoder()
for i in data_column_category:
    df[i] = label_encoder.fit_transform(df[i])
print("Label Encoded Data: ")
df.head()
The preceding code generates the following output:
In the preceding screenshot, we can see that all the values have been converted from categorical to numerical. Here, the original values have been transformed and replaced with the newly encoded data.
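Note that the loop above refits a single LabelEncoder for every column, so after the loop only the last column's mapping is still available. If the original categories need to be recovered later, for example to convert predictions back into categories, one option is to keep a separate encoder per column. The following is a minimal sketch on made-up data; the column names and values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample; the column names and values are illustrative only.
bank = pd.DataFrame({
    "job": ["admin.", "technician", "admin.", "services"],
    "marital": ["married", "single", "single", "divorced"],
})

encoders = {}
encoded = bank.copy()
for column in bank.columns:
    # A separate encoder per column keeps each column's mapping available.
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(bank[column])

# Recover the original labels from the numeric codes.
restored_job = encoders["job"].inverse_transform(encoded["job"])

print(encoded)
print(list(restored_job))
```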
You have successfully converted categorical data to numerical data using the LabelEncoder method. In the next section, we'll explore another type of encoding: one-hot encoding.
One-Hot Encoding
In label encoding, categorical data is converted to numerical data by assigning each value an integer label (such as 0, 1, and 2). Predictive models that use this numerical data might mistake these labels for some kind of order (for example, a model might treat a label of 2 as "better" than a label of 0, which is incorrect). To avoid this confusion, we can use one-hot encoding. Here, each label-encoded column is split into n binary columns, where n is the number of unique labels in that column; each row holds a 1 in the column for its own label and 0 elsewhere. For example, if label encoding produces three labels for a column, one-hot encoding splits that column into three columns, so n is 3. Let's look at an exercise to get further clarification.
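To make the idea of n columns concrete, here is a minimal sketch on a made-up column with three unique categories. The "colour" column and its values are hypothetical, and pd.get_dummies is used for brevity.

```python
import pandas as pd

# Hypothetical column with three unique categories, so n = 3.
colours = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# Each category becomes its own 0/1 column; dtype=int gives integers
# instead of booleans.
one_hot = pd.get_dummies(colours, columns=["colour"], dtype=int)

print(one_hot)
```

Each row contains exactly one 1, in the column that corresponds to its original category.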
In this exercise, we will load the Banking_Marketing.csv dataset into a pandas dataframe and convert the categorical data into numeric data using one-hot encoding. Follow these steps to complete this exercise:
The Banking_Marketing dataset can be found here: https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/Banking_Marketing.csv.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# A GitHub /blob/ URL serves an HTML page; appending ?raw=true requests the CSV file itself
dataset = 'https://github.com/TrainingByPackt/Master-Data-Science-with-Python/blob/master/Chapter%201/Data/Banking_Marketing.csv?raw=true'
#reading the data into the dataframe df
df = pd.read_csv(dataset, header=0)
df = df.dropna()
data_column_category = df.select_dtypes(exclude=[np.number]).columns
data_column_category
The preceding code generates the following output:
df[data_column_category].head()
The preceding code generates the following output:
#performing label encoding
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for i in data_column_category:
    df[i] = label_encoder.fit_transform(df[i])
print("Label Encoded Data: ")
df.head()
The preceding code generates the following output:
#Performing Onehot Encoding
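#In scikit-learn 1.2 and later the argument is sparse_output; older releases call it sparse
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(df[data_column_category])
#Creating a dataframe with encoded data with new column name
#get_feature_names_out replaces get_feature_names, which newer scikit-learn releases no longer provide
onehot_encoded_frame = pd.DataFrame(onehot_encoded, columns = onehot_encoder.get_feature_names_out(data_column_category))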
onehot_encoded_frame.head()
The preceding code generates the following output:
onehot_encoded_frame.columns
The preceding code generates the following output:
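data_column_number = df.columns.drop(data_column_category)  #columns that were numeric in the original data
#columns= is passed explicitly because the label-encoded columns are now integers, which get_dummies skips by default
df_onehot_getdummies = pd.get_dummies(df[data_column_category], columns=list(data_column_category), prefix=list(data_column_category))
data_onehot_encoded_data = pd.concat([df_onehot_getdummies,df[data_column_number]],axis = 1)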
data_onehot_encoded_data.columns
The preceding code generates the following output:
You have successfully converted categorical data to numerical data using the OneHotEncoder method.
We will now move on to another data preprocessing step: how to deal with a range of magnitudes in your data.