
Solution
Let's perform various pre-processing tasks on the Bank Marketing Subscription dataset. We'll also be splitting the dataset into training and testing data. Follow these steps to complete this activity:
import pandas as pd
import numpy as np

Link = 'https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/Banking_Marketing.csv'
#append ?raw=true so that GitHub serves the raw CSV file instead of the HTML page
#reading the data into the DataFrame df
df = pd.read_csv(Link + '?raw=true', header=0)
#finding the number of rows and columns
print("Number of rows and columns : ", df.shape)
The preceding code generates the following output:
#Printing all the columns
print(list(df.columns))
The preceding code generates the following output:
#Basic Statistics of each column
df.describe().transpose()
The preceding code generates the following output:
#Basic Information of each column
print(df.info())
The preceding code generates the following output:
In the preceding figure, you can see that none of the columns contains any null values. Also, the type of each column is provided.
#finding the data types of each column and checking for null
null_ = df.isna().any()
dtypes = df.dtypes
sum_na_ = df.isna().sum()
info = pd.concat([null_,sum_na_,dtypes],axis = 1,keys = ['isNullExist','NullSum','type'])
info
Have a look at the output for this in the following figure:
#removing null values
df = df.dropna()
#total number of nulls in each column
print(df.isna().sum())  # no NAs remain
Have a look at the output for this in the following figure:
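As an optional sanity check (not one of the book's steps), you can re-print the shape after dropping the nulls and compare it with the row count printed right after loading the data:
#optional check: row count after dropping nulls, for comparison with the shape printed earlier
print("Shape after dropna: ", df.shape)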
df.education.value_counts()
Have a look at the output for this in the following figure:
df.education.unique()
The output is as follows:
df.education.replace({"basic.9y":"Basic","basic.6y":"Basic","basic.4y":"Basic"},inplace=True)
df.education.unique()
In the preceding figure, you can see that basic.9y, basic.6y, and basic.4y are grouped together as Basic.
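If the data contained more basic.* levels, a slightly more general sketch (hypothetical, not the book's code) collapses every such level in a single regex-based call:
#hypothetical generalization: collapse every level matching "basic.<n>y" into "Basic"
df.education = df.education.str.replace(r'^basic\.\d+y$', 'Basic', regex=True)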
#Select all the non-numeric columns using the select_dtypes function
data_column_category = df.select_dtypes(exclude=[np.number]).columns
print(data_column_category)
The preceding code generates the following output:
cat_vars = data_column_category
for var in cat_vars:
    # one-hot encode the current categorical column and join the dummy columns onto df
    cat_list = pd.get_dummies(df[var], prefix=var)
    data1 = df.join(cat_list)
    df = data1
df.columns
The preceding code generates the following output:
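For comparison only (this is not the path the book takes), the same one-hot encoding can be done in a single call if it is used in place of the loop above, that is, on the DataFrame before the dummy columns were joined: pd.get_dummies with the columns argument both creates the dummy columns and drops the original categorical ones, so the manual column filtering in the next step would not be needed.
#equivalent one-step alternative, applied instead of the loop above (here df means the pre-loop DataFrame)
df_encoded = pd.get_dummies(df, columns=list(data_column_category))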
#Categorical features
cat_vars=data_column_category
#All features
data_vars=df.columns.values.tolist()
#dropping the original categorical columns, which have now been encoded
to_keep = []
for i in data_vars:
    if i not in cat_vars:
        to_keep.append(i)
#selecting only the numerical and encoded categorical columns
data_final = df[to_keep]
data_final.columns
The preceding code generates the following output:
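As a quick optional check (a sketch, not part of the book's steps), you can confirm that only numeric and encoded (boolean/dummy) columns remain in data_final:
#optional check: this should print an empty list of columns
print(data_final.select_dtypes(exclude=[np.number, 'bool']).columns)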
#Segregating the independent and target variables
X = data_final.drop(columns='y')
y = data_final['y']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print("Full Dataset X Shape: ", X.shape)
print("Train Dataset X Shape: ", X_train.shape)
print("Test Dataset X Shape: ", X_test.shape)
The output is as follows:
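To double-check the 80/20 split (an optional sketch, not part of the book's steps), the train and test row counts should add up to the full dataset in roughly a 4:1 ratio:
#optional check: the split fractions should be roughly 0.8 and 0.2
print("Train fraction: ", X_train.shape[0] / X.shape[0])
print("Test fraction: ", X_test.shape[0] / X.shape[0])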