Sign In Start Free Trial
Account

Add to playlist

Create a Playlist

Modal Close icon
You need to login to use this feature.
  • Python Feature Engineering Cookbook
  • Toc
  • feedback
Python Feature Engineering Cookbook

Python Feature Engineering Cookbook

By : Galli
3.6 (9)
close
Python Feature Engineering Cookbook

Python Feature Engineering Cookbook

3.6 (9)
By: Galli

Overview of this book

Feature engineering is invaluable for developing and enriching your machine learning models. In this cookbook, you will work with the best tools to streamline your feature engineering pipelines and techniques and simplify and improve the quality of your code. Using Python libraries such as pandas, scikit-learn, Featuretools, and Feature-engine, you’ll learn how to work with both continuous and discrete datasets and be able to transform features from unstructured datasets. You will develop the skills necessary to select the best features as well as the most suitable extraction techniques. This book will cover Python recipes that will help you automate feature engineering to simplify complex processes. You’ll also get to grips with different feature engineering strategies, such as the box-cox transform, power transform, and log transform across machine learning, reinforcement learning, and natural language processing (NLP) domains. By the end of this book, you’ll have discovered tips and practical solutions to all of your feature engineering problems.
Table of Contents (13 chapters)
close

Highlighting outliers

An outlier is a data point that is significantly different from the remaining data. On occasions, outliers are very informative; for example, when looking for credit card transactions, an outlier may be an indication of fraud. In other cases, outliers are rare observations that do not add any additional information. These cases may also affect the performance of some machine learning models.

"An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism." [D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.]

Getting ready

In this recipe, we will learn how to identify outliers using boxplots and the inter-quartile range (IQR) proximity rule. According to the IQR proximity rule, a value is an outlier if it falls outside these boundaries:

Upper boundary = 75th quantile + (IQR * 1.5)

Lower boundary = 25th quantile - (IQR * 1.5)

Here, IQR is given by the following equation:

IQR = 75th quantile - 25th quantile

Typically, we calculate the IQR proximity rule boundaries by multiplying the IQR by 1.5. However, it is also common practice to find extreme values by multiplying the IQR by 3.

How to do it...

Let's begin by importing the necessary libraries and preparing the dataset:

  1. Import the required Python libraries and the dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston
  1. Load the Boston House Prices dataset from scikit-learn and retain three of its variables in a dataframe:
boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)[['RM', 'LSTAT', 'CRIM']]
  1. Make a boxplot for the RM variable: 
sns.boxplot(y=boston['RM'])
plt.title('Boxplot')

The output of the preceding code is as follows:

We can change the final size of the plot using the figure() method from Matplotlib. We need to call this command before making the plot with seaborn:
plt.figure(figsize=(3,6))
sns.boxplot(y=boston['RM'])
plt.title('Boxplot')

To find the outliers in a variable, we need to find the distribution boundaries according to the IQR proximity rule, which we discussed in the Getting ready section of this recipe.

  1. Create a function that takes a dataframe, a variable name, and the factor to use in the IQR calculation and returns the IQR proximity rule boundaries:
def find_boundaries(df, variable, distance):

IQR = df[variable].quantile(0.75) - df[variable].quantile(0.25)

lower_boundary = df[variable].quantile(0.25) - (IQR * distance)
upper_boundary = df[variable].quantile(0.75) + (IQR * distance)

return upper_boundary, lower_boundary
  1. Calculate and then display the IQR proximity rule boundaries for the RM variable:
upper_boundary, lower_boundary = find_boundaries(boston, 'RM', 1.5)
upper_boundary, lower_boundary

The find_boundaries() function returns the values above and below which we can consider a value to be an outlier, as shown here:

(7.730499999999999, 4.778500000000001)
If you want to find very extreme values, you can use 3 as the distance of find_boundaries() instead of 1.5.

Now, we need to find the outliers in the dataframe.

  1. Create a boolean vector to flag observations outside the boundaries we determined in step 5:
outliers = np.where(boston['RM'] > upper_boundary, True,
np.where(boston['RM'] < lower_boundary, True, False))
  1. Create a new dataframe with the outlier values and then display the top five rows:
outliers_df = boston.loc[outliers, 'RM']
outliers_df.head()

We can see the top five outliers in the RM variable in the following output:

97     8.069
98     7.820
162    7.802
163    8.375
166    7.929
Name: RM, dtype: float64

To remove the outliers from the dataset, execute boston.loc[~outliers, 'RM'].

How it works...

In this recipe, we identified outliers in the numerical variables of the Boston House Prices dataset from scikit-learn using boxplots and the IQR proximity rule. To proceed with this recipe, we loaded the dataset from scikit-learn and created a boxplot for one of its numerical variables as an example. Next, we created a function to identify the boundaries using the IQR proximity rule and used the function to determine the boundaries of the numerical RM variable. Finally, we identified the values of RM that were higher or lower than those boundaries, that is, the outliers.

To load the data, we imported the dataset from sklearn.datasets and used load_boston(). Next, we captured the data in a dataframe using pandas DataFrame(), indicating that the data was stored in the data attribute and that the variable names were stored in the feature_names attribute. To retain only the RM, LSTAT, and CRIM variables, we passed the column names in double brackets [[]] at the back of pandas DataFrame().

To display the boxplot, we used seaborn's boxplot() method and passed the pandas Series with the RM variable as an argument. In the boxplot displayed after step 3, the IQR is delimited by the rectangle, and the upper and lower boundaries corresponding to either, the 75th quantile plus 1.5 times the IQR, or the 25th quantile minus 1.5 times the IQR. This is indicated by the whiskers. The outliers are the asterisks lying outside the whiskers.

To identify those outliers in our dataframe, in step 4, we created a function to find the boundaries according to the IQR proximity rule. The function took the dataframe and the variable as arguments and calculated the IQR and the boundaries using the formula described in the Getting ready section of this recipe. With the pandas quantile() method, we calculated the values for the 25th (0.25) and 75th quantiles (0.75). The function returned the upper and lower boundaries for the RM variable.

To find the outliers of RM, we used NumPy's where() method, which produced a boolean vector with True if the value was an outlier. Briefly, where() scanned the rows of the RM variable, and if the value was bigger than the upper boundary, it assigned True, whereas if the value was smaller, the second where() nested inside the first one and checked whether the value was smaller than the lower boundary, in which case it also assigned True, otherwise False. Finally, we used the loc[] method from pandas to capture only those values in the RM variable that were outliers in a new dataframe.

Unlock full access

Continue reading for free

A Packt free trial gives you instant online access to our library of over 7000 practical eBooks and videos, constantly updated with the latest in tech
bookmark search playlist download font-size

Change the font size

margin-width

Change margin width

day-mode

Change background colour

Close icon Search
Country selected

Close icon Your notes and bookmarks

Delete Bookmark

Modal Close icon
Are you sure you want to delete it?
Cancel
Yes, Delete