Data cleaning includes processes such as filling in missing values and handling inconsistencies. It detects corrupt data and replaces or modifies it.
Missing Values
Understanding missing values is essential if you want to manage and interpret data successfully. Consider a table of bank customers in which each row is a separate customer and each column holds details such as age and credit amount. Some cells contain NA or are simply empty: this is missing data. Every piece of information about a customer is crucial to the bank; if any of it is missing, it becomes difficult for the bank to predict the risk of granting that customer a loan.
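To make this concrete, here is a minimal sketch that builds a small DataFrame of made-up customer records containing missing cells and counts them (the real bank data appears in the exercises below):
import pandas as pd
import numpy as np

# hypothetical customer records; np.nan marks the missing cells
customers = pd.DataFrame({
    'age': [35, np.nan, 52, 29],
    'credit_amount': [1200, 3400, np.nan, 870]})
print(customers.isna())        # True flags each missing cell
print(customers.isna().sum())  # number of missing values per column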
Handling Missing Data
Intelligent handling of missing data will result in building a robust model capable of handling complex tasks. There are many ways to handle missing data. Let's now look at some of those ways.
Removing the Data
Checking for missing values is the first and most important step in data pre-processing, because a model cannot accept data with missing values. A very simple and commonly used approach is deletion: we delete a row if it contains missing values, or we delete a column if more than 70%-75% of its data is missing. This threshold is not fixed; it depends on how much data you can afford to lose.
The benefit of this approach is that it is quick and easy to do, and in many cases no data is better than bad data. The drawback is that you may end up losing important information, because you're deleting a whole feature based on a few missing values.
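Here is a minimal sketch of both variants on a toy DataFrame (the column names, values, and the 70% threshold are illustrative):
import pandas as pd
import numpy as np

# toy frame in which column 'b' is 75% missing
toy = pd.DataFrame({'a': [1, 2, np.nan, 4],
                    'b': [np.nan, np.nan, np.nan, 7]})

# variant 1: drop every row that contains a missing value
print(toy.dropna())

# variant 2: drop every column whose share of missing values exceeds 70%
print(toy.loc[:, toy.isna().mean() <= 0.70])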
In this exercise, we will be loading the Banking_Marketing.csv dataset into the pandas DataFrame and handling the missing data. This dataset is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns involved phone calls to clients to try and get them to subscribe to a particular product. The dataset contains the details of each client contacted, and whether they subscribed to the product. Follow these steps to complete this exercise:
The Banking_Marketing.csv dataset can be found at this location: https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/Banking_Marketing.csv.
import pandas as pd
# use the raw file URL; the regular GitHub page URL (with /blob/) returns HTML, not CSV
dataset = 'https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/Banking_Marketing.csv'
# read the data into a DataFrame
df = pd.read_csv(dataset, header=0)
df.dtypes
The preceding code prints the data type of each column, so you can see which columns are numerical and which are categorical.
df.isna().sum()
The output shows that data is missing from three columns, namely age, contact, and duration: there are two NAs in the age column, six NAs in contact, and seven NAs in duration.
# remove all rows containing null values
df = df.dropna()
df.isna().sum()
The output now shows zero missing values in every column.
You have successfully removed all missing data from the DataFrame. In the next section, we'll look at the second method of dealing with missing data, which uses imputation.
Mean/Median/Mode Imputation
In the case of numerical data, we can compute its mean or median and use the result to replace missing values. In the case of categorical (non-numerical) data, we can compute its mode to replace the missing value. This is known as imputation.
The benefit of using imputation, rather than just removing data, is that it prevents data loss. The drawback is that you don't know how accurate using the mean, median, or mode is going to be in a given situation.
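If you would rather not hand-code each replacement, scikit-learn exposes the same strategies through SimpleImputer. Here is a minimal sketch on a toy column (the column name and values are made up):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({'age': [25.0, np.nan, 40.0, 33.0]})
# strategy can be 'mean', 'median', or 'most_frequent'
imputer = SimpleImputer(strategy='mean')
toy['age'] = imputer.fit_transform(toy[['age']]).ravel()
print(toy)  # the NaN is replaced by the mean of the other ages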
Let's look at an exercise in which we use imputation to solve a missing data problem.
In this exercise, we will load the Banking_Marketing.csv dataset into a pandas DataFrame and handle the missing data using imputation. Follow these steps to complete this exercise:
The Banking_Marketing.csv dataset can be found at this location: https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/Banking_Marketing.csv.
import pandas as pd
# again, use the raw file URL so pd.read_csv receives the CSV itself
dataset = 'https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/Banking_Marketing.csv'
df = pd.read_csv(dataset, header=0)
mean_age = df.age.mean()
print(mean_age)
The preceding code prints the mean of the age column.
# replace missing ages with the mean age
df['age'] = df['age'].fillna(mean_age)
median_duration = df.duration.median()
print(median_duration)
# replace missing durations with the median duration
df['duration'] = df['duration'].fillna(median_duration)
mode_contact = df.contact.mode()[0]
print(mode_contact)
# replace missing contact values with the most frequent (mode) value
df['contact'] = df['contact'].fillna(mode_contact)
Unlike the mean and median, a column may have more than one mode, so we take the first mode, at index 0.
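As a small illustration with hypothetical values, when two values tie for the highest count, mode() returns both:
import pandas as pd

s = pd.Series(['cellular', 'telephone', 'cellular', 'telephone'])
print(s.mode())     # returns a Series holding every mode
print(s.mode()[0])  # index 0 selects the first mode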
You have successfully imputed the missing data in different ways and made the data complete and clean.
Another part of data cleaning is dealing with outliers, which will be discussed in the next section.
Outliers
Outliers are values that are very large or very small with respect to the distribution of the other data. We can only find outliers in numerical data. Box plots are one good way to find the outliers in a dataset, as we will see in the exercise below.
An outlier is not always bad data! With the help of business understanding and client interaction, you can discern whether to remove or retain the outlier.
Let's learn how to find outliers using a simple example. Consider a sample dataset of temperatures from a place at different times:
71, 70, 90, 70, 70, 60, 70, 72, 72, 320, 71, 69
First, sort the data in ascending order:
60, 69, 70, 70, 70, 70, 71, 71, 72, 72, 90, 320
Here, the middle terms are 70 and 71 after sorting the list.
The median is (70 + 71) / 2 = 70.5
Q1 is the middle value (median) of the first half of the dataset.
First half of the data = 60, 69, 70, 70, 70, 70
Points 3 and 4 of the bottom 6 are both equal to 70.
The average is (70 + 70) / 2 = 70
Q1 = 70
Q3 is the middle value (median) of the second half of the dataset.
Second half of the data = 71, 71, 72, 72, 90, 320
Points 3 and 4 of the upper 6 are 72 and 72.
The average is (72 + 72) / 2 = 72
Q3 = 72
IQR = Q3 – Q1 = 72 – 70
IQR = 2
Lower fence = Q1 – 1.5 (IQR) = 70 – 1.5(2) = 67
Upper fence = Q3 + 1.5 (IQR) = 72 + 1.5(2) = 75
Boundaries of our fences = 67 and 75
Any data points lower than the lower fence or greater than the upper fence are outliers. Thus, the outliers in our example are 60, 90, and 320.
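You can verify this arithmetic with a short, self-contained sketch. Note that np.percentile uses linear interpolation by default, which can differ slightly from the median-of-halves method above, although for this sample the quartiles agree:
import numpy as np

temps = np.array([71, 70, 90, 70, 70, 60, 70, 72, 72, 320, 71, 69])
q1, q3 = np.percentile(temps, [25, 75])  # 70.0 and 72.0 for this sample
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # 67.0
upper_fence = q3 + 1.5 * iqr  # 75.0
print(temps[(temps < lower_fence) | (temps > upper_fence)])  # [ 90  60 320]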
In this exercise, we will be loading the german_credit_data.csv dataset into the pandas DataFrame and removing the outliers. The dataset contains 1,000 entries with 20 categorical/symbolic attributes prepared by Prof. Hofmann. In this dataset, each entry represents a person who takes credit from a bank. Each person is classified as a good or bad credit risk according to the set of attributes. Follow these steps to complete this exercise:
The link to the german_credit_data.csv dataset can be found here: https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/german_credit_data.csv.
import pandas as pd
import numpy as np
%matplotlib inline
import seaborn as sbn
# use the raw file URL; the regular GitHub page URL returns HTML, not CSV
dataset = 'https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/german_credit_data.csv'
# read the data into a DataFrame
df = pd.read_csv(dataset, header=0)
In the preceding code, %matplotlib inline is a magic function that is essential if we want the plot to be visible in the notebook.
sbn.boxplot(df['Age'])
The resulting boxplot of the Age column shows some data points lying outside the whiskers; these are the outliers.
Q1 = df["Age"].quantile(0.25)
Q3 = df["Age"].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
>>> 15.0
In the preceding code, Q1 is the first quartile and Q3 is the third quartile.
Lower_Fence = Q1 - (1.5 * IQR)
Upper_Fence = Q3 + (1.5 * IQR)
print(Lower_Fence)
print(Upper_Fence)
>>> 4.5
>>> 64.5
df[((df["Age"] < Lower_Fence) |(df["Age"] > Upper_Fence))]
This filters and prints all rows whose Age value falls outside the fences, that is, the outliers.
df = df[~((df ["Age"] < Lower_Fence) |(df["Age"] > Upper_Fence))]
df
The output now shows the DataFrame without the outlier rows.
You have successfully found the outliers using the IQR. In the next section, we will explore another method of pre-processing called data integration.