Python Feature Engineering Cookbook

By : Galli

Overview of this book

Feature engineering is invaluable for developing and enriching your machine learning models. In this cookbook, you will work with the best tools to streamline your feature engineering pipelines and techniques, and to simplify and improve the quality of your code. Using Python libraries such as pandas, scikit-learn, Featuretools, and Feature-engine, you'll learn how to work with both continuous and discrete datasets and how to transform features from unstructured datasets. You will develop the skills necessary to select the best features as well as the most suitable extraction techniques. This book covers Python recipes that help you automate feature engineering to simplify complex processes. You'll also get to grips with different feature engineering strategies, such as the Box-Cox transform, power transform, and log transform, across machine learning, reinforcement learning, and natural language processing (NLP) domains. By the end of this book, you'll have discovered tips and practical solutions to all of your feature engineering problems.

Pinpointing rare categories in categorical variables

The labels of a categorical variable appear with different frequencies. Some categories are very common among the observations, whereas others appear in only a few. In fact, categorical variables often contain a few dominant labels that account for the majority of the observations and a large number of labels that appear only seldom. Categories that appear in a tiny proportion of the observations are called rare. Typically, we consider a label to be rare when it appears in less than 5% or 1% of the population; the exact threshold is a choice made by the analyst. In this recipe, we will learn how to identify infrequent labels in a categorical variable.

Getting ready

To follow along with this recipe, download the Car Evaluation dataset from the UCI Machine Learning Repository by following the instructions in the Technical requirements section of this chapter.

How to do it...

Let's begin by importing the necessary libraries and getting the data ready:

  1. Import the required Python libraries:
import pandas as pd
import matplotlib.pyplot as plt
  2. Let's load the Car Evaluation dataset, add the column names, and display the first five rows:
data = pd.read_csv('car.data', header=None)
data.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
data.head()

We get the following output when the code is executed from a Jupyter Notebook:

By default, pandas read_csv() uses the first row of the data as the column names. If the column names are not part of the raw data, we need to tell pandas explicitly not to treat the first row as a header by passing the header=None argument.
  3. Let's display the unique categories of the variable class:
data['class'].unique()

We can see the unique values of class in the following output:

array(['unacc', 'acc', 'vgood', 'good'], dtype=object)
  4. Let's calculate the number of cars per category of the class variable and then divide it by the total number of cars in the dataset to obtain the proportion of cars per category. Then, we'll print the result:
label_freq = data['class'].value_counts() / len(data)
print(label_freq)

The output of the preceding code block is a pandas Series with the proportion of cars per category, expressed as a fraction between 0 and 1 (for example, 0.05 corresponds to 5%):

unacc    0.700231
acc      0.222222
good     0.039931
vgood    0.037616
Name: class, dtype: float64
  5. Let's make a bar plot showing the frequency of each category and highlight the 5% mark with a red line:
# plot.bar() returns a Matplotlib Axes object, so we name it ax
ax = label_freq.sort_values(ascending=False).plot.bar()
ax.axhline(y=0.05, color='red')  # the 5% threshold for rare categories
ax.set_ylabel('fraction of cars within each category')
ax.set_xlabel('Variable: class')
ax.set_title('Identifying Rare Categories')
plt.show()

The following is the output of the preceding code block:

The good and vgood categories are present in less than 5% of cars, as indicated by the red line in the preceding plot.
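
The rare categories can also be listed programmatically instead of being read off the plot. The following is a minimal sketch under stated assumptions: the labels Series is a stand-in for data['class'], built from counts that were reconstructed from the frequencies printed in step 4 (assuming 1,728 rows in total), and the threshold and rare_labels names are illustrative rather than part of the recipe.

```python
import pandas as pd

# Stand-in for data['class']: counts reconstructed from the frequencies
# printed in the recipe (assumption: 1,728 rows in total).
counts = {"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}
labels = pd.Series(
    [label for label, n in counts.items() for _ in range(n)], name="class"
)

# Same computation as in the recipe: count per category / number of rows.
label_freq = labels.value_counts() / len(labels)

# Keep only the categories whose frequency falls below the 5% threshold.
threshold = 0.05
rare_labels = label_freq[label_freq < threshold].index.tolist()
print(rare_labels)
```

With the real dataset loaded, replacing labels with data['class'] should give the same two categories that sit below the red line in the plot.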

How it works...

In this recipe, we quantified and plotted the percentage of observations per category, that is, the category frequency in a categorical variable of a publicly available dataset.

To load the data, we used pandas read_csv() and set the header argument to None, since the column names were not part of the raw data. Next, we added the column names manually by passing the variable names as a list to the columns attribute of the dataframe.
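
To see the effect of the header argument without downloading the file, the sketch below reads two comma-separated rows from an in-memory string. The rows are illustrative values in the layout of car.data, and passing the column names through the names parameter of read_csv() is an alternative to assigning data.columns afterward, not the approach used in the recipe.

```python
import io

import pandas as pd

# Two illustrative rows in the layout of car.data (no header line).
raw = "vhigh,vhigh,2,2,small,low,unacc\nvhigh,vhigh,2,2,small,med,unacc\n"
column_names = ['buying', 'maint', 'doors', 'persons',
                'lug_boot', 'safety', 'class']

# Default behaviour: the first row is consumed as the header,
# so one observation is lost.
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every row as data; names= sets the labels in one call.
with_names = pd.read_csv(io.StringIO(raw), header=None, names=column_names)

print(len(with_default), len(with_names))
```

The first dataframe ends up with one row and the second with two, which is why header=None matters for files that contain no column names.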

To determine the frequency of each category in the class variable, we counted the number of cars per category using pandas value_counts() and divided the result by the total number of cars in the dataset, which we obtained with Python's built-in len() function. Applied to a dataframe, len() returns the number of rows. We captured the resulting proportion of cars per category, expressed as a fraction between 0 and 1, in the label_freq variable.
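
As a side note, pandas can perform the division itself: value_counts() accepts a normalize parameter that returns relative frequencies directly. The sketch below compares both routes on a small toy Series; the data and variable names are made up for illustration.

```python
import pandas as pd

# Toy categorical variable: three 'a' labels and one 'b' label.
labels = pd.Series(['a', 'a', 'a', 'b'], name='class')

# Route used in the recipe: counts divided by the number of rows.
manual = labels.value_counts() / len(labels)

# Equivalent route: let pandas normalize the counts.
normalized = labels.value_counts(normalize=True)

print(normalized)
```

Both Series hold the same values, so either form can feed the thresholding and plotting steps.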

To make a plot of the category frequency, we sorted the categories in label_freq from the one with the most cars to the one with the fewest using the pandas sort_values() method. Next, we used plot.bar() to produce a bar plot; this call returns a Matplotlib Axes object. With the axhline() method of that object, we added a horizontal red line at a height of 0.05 to mark the 5% threshold, under which we considered a category rare. We added the x and y labels and a title with the set_xlabel(), set_ylabel(), and set_title() methods of the same Axes object, and displayed the figure with plt.show().
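
A possible variation on the plot is to color the rare categories differently so that they stand out even without the threshold line. The sketch below is one way to do this, not the recipe's code: it draws the bars with Matplotlib's own bar() instead of the pandas plot.bar() wrapper, uses the frequencies printed in step 4, and selects the Agg backend only so that it runs without a display (an assumption that can be dropped inside a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Frequencies copied from the output printed in the recipe.
label_freq = pd.Series(
    {"unacc": 0.700231, "acc": 0.222222, "good": 0.039931, "vgood": 0.037616}
)
threshold = 0.05

# Sort from most to least frequent, as in the recipe.
ordered = label_freq.sort_values(ascending=False)

# One color per bar: orange for rare categories, blue otherwise.
colors = ["tab:orange" if freq < threshold else "tab:blue" for freq in ordered]

fig, ax = plt.subplots()
ax.bar(ordered.index, ordered.values, color=colors)
ax.axhline(y=threshold, color="red")  # the 5% threshold
ax.set_xlabel("Variable: class")
ax.set_ylabel("fraction of cars within each category")
ax.set_title("Identifying Rare Categories")
```

Here fig is a genuine Figure and ax an Axes, so the figure can be written to disk with fig.savefig() or shown with plt.show() when an interactive backend is active.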
