Python Feature Engineering Cookbook

By : Galli

Overview of this book

Feature engineering is invaluable for developing and enriching your machine learning models. In this cookbook, you will work with tools that streamline your feature engineering pipelines and techniques, simplify your code, and improve its quality. Using Python libraries such as pandas, scikit-learn, Featuretools, and Feature-engine, you’ll learn how to work with both continuous and discrete datasets and how to transform features from unstructured datasets. You will develop the skills necessary to select the best features as well as the most suitable extraction techniques. This book covers Python recipes that will help you automate feature engineering to simplify complex processes. You’ll also get to grips with different feature engineering strategies, such as the Box-Cox transform, power transform, and log transform, across machine learning, reinforcement learning, and natural language processing (NLP) domains. By the end of this book, you’ll have discovered tips and practical solutions to all of your feature engineering problems.

Technical requirements

Throughout this book, we will use many open source Python libraries for numerical computing. I recommend installing the free Anaconda Python distribution (https://www.anaconda.com/distribution/), which contains most of these packages. To install the Anaconda distribution, follow these steps:

  1. Visit the Anaconda website: https://www.anaconda.com/distribution/.
  2. Click the Download button.
  3. Download the latest Python 3 distribution that's appropriate for your operating system.
  4. Double-click the downloaded installer and follow the instructions that are provided.

The recipes in this book were written in Python 3.7. However, they should work in Python 3.5 and above. Check that you are using similar or higher versions of the numerical libraries we'll be using, that is, NumPy, pandas, scikit-learn, and others. The versions of these libraries are indicated in the requirement.txt file in the accompanying GitHub repository (https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook).
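
One way to compare your installed versions with those listed in the repository is a short script such as the following. It is a minimal sketch and not part of the book's code: the helper name installed_versions is hypothetical, and it relies on importlib.metadata, which is in the standard library from Python 3.8 onward. On older interpreters you could print each library's __version__ attribute instead.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names (as published on PyPI) of the libraries used in this chapter.
LIBRARIES = ["numpy", "pandas", "matplotlib", "seaborn", "scipy", "scikit-learn"]

print("Python", sys.version.split()[0])
for name, ver in installed_versions(LIBRARIES).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Any library reported as not installed, or with a version lower than the one in the repository, is a candidate for installing or upgrading before you run the recipes.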

In this chapter, we will use pandas, NumPy, Matplotlib, seaborn, SciPy, and scikit-learn. pandas provides high-performance analysis tools. NumPy provides support for large, multi-dimensional arrays and matrices and contains a large collection of mathematical functions to operate over these arrays and over pandas dataframes. Matplotlib and seaborn are the standard libraries for plotting and visualization. SciPy is the standard library for statistics and scientific computing, while scikit-learn is the standard library for machine learning.

To run the recipes in this chapter, I used Jupyter Notebooks since they are great for visualization and data analysis and make it easy to examine the output of each line of code. I recommend that you follow along with Jupyter Notebooks as well, although you can execute the recipes in other interfaces.

The recipe commands can be run as a .py script from a command prompt (such as the Anaconda Prompt or the Mac Terminal), from an IDE such as Spyder or PyCharm, or from Jupyter Notebooks, as in the accompanying GitHub repository (https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook).

In this chapter, we will use two public datasets: the KDD-CUP-98 dataset and the Car Evaluation dataset. Both of these are available at the UCI Machine Learning Repository.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

To download the KDD-CUP-98 dataset, follow these steps:

  1. Visit the following website: https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup98-mld/epsilon_mirror/.
  2. Click the cup98lrn.zip link to begin the download.
  3. Unzip the file and save cup98LRN.txt in the same folder where you'll run the commands of the recipes.
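
If you prefer to script these steps, they could be sketched as follows using only the standard library. This is an illustration under stated assumptions rather than the book's method: the function names download and extract_member are hypothetical, and KDD_URL assumes that the archive sits directly under the mirror folder given in step 1, which may change if the repository is reorganized.

```python
import urllib.request
import zipfile
from pathlib import Path

# Assumption: the zip file is served directly from the mirror folder in step 1.
KDD_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "kddcup98-mld/epsilon_mirror/cup98lrn.zip"
)


def download(url, dest):
    """Save the file at `url` to `dest` and return the destination path."""
    dest = Path(dest)
    urllib.request.urlretrieve(url, dest)
    return dest


def extract_member(archive, member, folder="."):
    """Extract a single file from a zip archive into `folder`."""
    with zipfile.ZipFile(archive) as zf:
        return Path(zf.extract(member, path=folder))


# Usage (left commented out because it needs network access):
# archive = download(KDD_URL, "cup98lrn.zip")
# extract_member(archive, "cup98LRN.txt")
```

Extracting into the current folder keeps cup98LRN.txt next to your notebooks or scripts, which is where the recipes expect to find it.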

To download the Car Evaluation dataset, follow these steps:

  1. Go to the UCI website: https://archive.ics.uci.edu/ml/machine-learning-databases/car/.
  2. Download the car.data file.
  3. Save the file in the same folder where you'll run the commands of the recipes.
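
Once car.data is saved, it can be loaded with pandas. The file has no header row, so column names have to be supplied. The sketch below uses the attribute names from the dataset description on the UCI site; load_car_data is a hypothetical helper, not a function from the book.

```python
import pandas as pd

# Attribute names from the dataset description on the UCI site;
# car.data itself contains no header row.
CAR_COLUMNS = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]


def load_car_data(path_or_buffer="car.data"):
    """Read the comma-separated Car Evaluation file into a dataframe."""
    return pd.read_csv(path_or_buffer, header=None, names=CAR_COLUMNS)


# Usage, assuming car.data is in the working directory:
# data = load_car_data()
# print(data.head())
```

Passing header=None stops pandas from treating the first observation as column names, which would otherwise silently drop one row of data.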

We will also use the Titanic dataset that's available at http://www.openML.org. To download and prepare the Titanic dataset, open a Jupyter Notebook and run the following commands:

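import numpy as np
import pandas as pd


def get_first_cabin(row):
    # Keep only the first cabin when several are listed.
    # Missing values (NaN) are not strings, so return NaN for them.
    try:
        return row.split()[0]
    except (AttributeError, IndexError):
        return np.nan


# Download the Titanic data from OpenML.
url = "https://www.openml.org/data/get_csv/16826755/phpMYEkMl"
data = pd.read_csv(url)

# Missing values are encoded as '?'; replace them with NaN.
data = data.replace('?', np.nan)

# Reduce multi-cabin entries to the first cabin and save a local copy.
data['cabin'] = data['cabin'].apply(get_first_cabin)
data.to_csv('titanic.csv', index=False)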

The preceding code block will download a copy of the data from http://www.openML.org and store it as a titanic.csv file in the directory from which you run the commands.
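
To see what the two cleaning steps do without downloading anything, the same operations can be applied to a small hand-made dataframe. The values below are invented for illustration and are not taken from the Titanic data; the helper mirrors the one in the preceding code block.

```python
import numpy as np
import pandas as pd


def get_first_cabin(row):
    # First cabin for strings, NaN for anything that cannot be split.
    try:
        return row.split()[0]
    except (AttributeError, IndexError):
        return np.nan


# Invented toy rows: one multi-cabin entry and two '?' placeholders.
toy = pd.DataFrame({
    "cabin": ["C22 C26", "E12", "?"],
    "age": ["29", "?", "2"],
})

# Step 1: turn the '?' placeholders into NaN across the whole dataframe.
toy = toy.replace("?", np.nan)

# Step 2: keep only the first cabin for each passenger.
toy["cabin"] = toy["cabin"].apply(get_first_cabin)

print(toy)
```

After these steps, the multi-cabin entry is reduced to its first cabin, and every '?' shows up as a missing value that pandas functions such as isnull() can detect.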

The accompanying GitHub repository includes a Jupyter Notebook with instructions on how to download and prepare the Titanic dataset: https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook/blob/master/Chapter01/DataPrep_Titanic.ipynb.
