Cleaning up raw datasets with fastai

Now that we have explored a variety of datasets that are curated by fastai, there is one more topic left to cover in this chapter: how to clean up datasets with fastai. Cleaning up datasets includes dealing with missing values and converting categorical values into numeric identifiers. We need to apply these cleanup steps to datasets because deep learning models can only be trained with numeric data. If we try to train the model with datasets that contain non-numeric data, including missing values and alphanumeric identifiers in categorical columns, the training process will fail. In this section, we are going to review the facilities provided by fastai to make it easy to clean up datasets, and thus make the datasets ready to train deep learning models.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the cleaning_up_datasets.ipynb notebook in the ch2 directory of your repository.

How to do it…

In this section, you will be running through the cleaning_up_datasets.ipynb notebook to address missing values in the ADULT_SAMPLE dataset and replace categorical values with numeric identifiers.

Once you have the notebook open in your fastai environment, complete the following steps:

Run the first two cells to import the necessary libraries and set up the notebook for fastai.
Recall the Examining tabular datasets with fastai section of this chapter. When you checked which columns in the ADULT_SAMPLE dataset had missing values, you found that some did. We are going to identify those columns, use the facilities of fastai to apply transformations that deal with the missing values in them, and replace the values in the categorical columns with numeric identifiers.
First, let's ingest the ADULT_SAMPLE curated dataset again:
```
path = untar_data(URLs.ADULT_SAMPLE)
```
Now, create a pandas DataFrame for the dataset and check for the number of missing values in each column. Note which columns have missing values:
```
df = pd.read_csv(path/'adult.csv')
df.isnull().sum()
```
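To narrow the output to only the columns that have gaps, the counts can be filtered. The following is a minimal sketch that assumes a small made-up DataFrame; toy_df and its values are hypothetical stand-ins, not the real adult.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for adult.csv; the values are made up for illustration
toy_df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "education-num": [13.0, np.nan, 9.0, np.nan],
    "occupation": ["Sales", None, "Tech-support", "Sales"],
    "salary": ["<50k", ">=50k", "<50k", "<50k"],
})

# Count missing values per column, then keep only the columns with at least one
missing_counts = toy_df.isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0]
print(cols_with_missing)
```

The same filter should work on the df created in this step, since it relies only on standard pandas operations.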
To deal with these missing values (and prepare categorical columns), we will use the fastai TabularPandas class (https://docs.fast.ai/tabular.core.html#TabularPandas). To use this class, we need to prepare the following parameters:
a) procs is the list of transformations that will be applied to the data in the TabularPandas object. Here, we specify that we want missing values to be filled (FillMissing) and values in categorical columns to be replaced with numeric identifiers (Categorify).
b) dep_var specifies which column is the dependent variable; that is, the target that we want to ultimately predict with the model. In the case of ADULT_SAMPLE, the dependent variable is salary.
c) cont and cat are lists of the continuous and categorical columns in the dataset, respectively. Continuous columns contain numeric values, such as integers or floating-point values. Categorical columns contain category identifiers, such as names of US states, days of the week, or colors. We use the cont_cat_split() (https://docs.fast.ai/tabular.core.html#cont_cat_split) function to automatically identify the continuous and categorical columns:
```
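# Transforms to apply: fill missing values and encode categorical values as integers
procs = [FillMissing, Categorify]
dep_var = 'salary'
# The second argument is max_card; dep_var is left out of both lists
cont, cat = cont_cat_split(df, 1, dep_var=dep_var)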
```
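The rule that cont_cat_split() applies can be approximated in plain pandas. The sketch below assumes the rule is roughly this: float columns, and integer columns with more than max_card distinct values, are continuous, and everything else is categorical. split_cont_cat and toy_df are hypothetical names introduced for illustration, and fastai's own implementation may differ in detail:

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def split_cont_cat(df, max_card, dep_var):
    """Hypothetical helper approximating the assumed splitting rule."""
    cont_names, cat_names = [], []
    for col in df.columns:
        if col == dep_var:
            continue  # the target belongs to neither list
        dtype = df[col].dtype
        many_ints = is_integer_dtype(dtype) and df[col].nunique() > max_card
        if many_ints or is_float_dtype(dtype):
            cont_names.append(col)
        else:
            cat_names.append(col)
    return cont_names, cat_names

# Made-up data for illustration only
toy_df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "education-num": [13.0, float("nan"), 9.0, 10.0],
    "occupation": ["Sales", None, "Tech-support", "Sales"],
    "salary": ["<50k", ">=50k", "<50k", "<50k"],
})
cont, cat = split_cont_cat(toy_df, max_card=1, dep_var="salary")
print(cont, cat)
```

Under this assumption, passing 1 as the second argument, as the recipe does, means that nearly every integer column is treated as continuous.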
Now, create a TabularPandas object called df_no_missing using these parameters. This object will contain the dataset with missing values replaced and the values in the categorical columns replaced with numeric identifiers:
```
df_no_missing = TabularPandas(df, procs, cat, cont, y_names = dep_var)
```
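To build some intuition for what filling missing values involves, here is a rough pandas sketch. It assumes a strategy of filling a continuous column with its median and recording which rows were filled in a separate indicator column; this is our understanding of the default behavior of FillMissing, not something this recipe confirms. toy_df and the education-num_na column name are hypothetical:

```python
import pandas as pd

# Made-up continuous column with one gap
toy_df = pd.DataFrame({"education-num": [13.0, float("nan"), 9.0, 10.0]})

col = "education-num"
# Record which rows were missing before filling, so the information is not lost
toy_df[col + "_na"] = toy_df[col].isnull()
# Replace the gaps with the median of the observed values
toy_df[col] = toy_df[col].fillna(toy_df[col].median())
print(toy_df)
```

Keeping the indicator column lets a model distinguish a value that was genuinely observed from one that was filled in.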
Apply the show API to df_no_missing to display samples of its contents. Note that the categorical columns still display their original values when the object is displayed using show(). What about replacing the categorical values with numeric identifiers? Don't worry: we'll see that result in the next step:
Figure 2.21 – The first few records of df_no_missing
Now, display some sample contents of df_no_missing using the items.head() API. This time, the categorical columns contain the numeric identifiers rather than the original values. This is an example of a benefit provided by fastai: the switch between the original categorical values and the numeric identifiers is handled elegantly. If you need to see the original values, you can use the show() API, which transforms the numeric values in categorical columns back into their original values, while the items.head() API shows the actual numeric identifiers in the categorical columns:
Figure 2.22 – The first few records of df_no_missing with numeric identifiers in categorical columns
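The two views described in the last two steps (readable labels versus numeric identifiers) can be imitated in plain pandas. This sketch assumes an encoding that reserves 0 for missing values and labels it #na#; that convention is an assumption made for illustration, and the variable names are hypothetical:

```python
import pandas as pd

# Made-up categorical column with one missing value
occupation = pd.Series(["Sales", None, "Tech-support", "Sales"])

categories = pd.Categorical(occupation)
# pandas uses -1 for missing values, so shifting by 1 reserves 0 for them
codes = categories.codes + 1
vocab = ["#na#"] + list(categories.categories)

# Decoded view: map each numeric identifier back to its label
decoded = [vocab[c] for c in codes]

print(codes.tolist())
print(decoded)
```

Here, codes plays the role of the numeric view that a model would train on, while decoded plays the role of the human-readable view.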
Finally, let's confirm that the missing values were handled correctly by counting the missing values in each column of df_no_missing.items. As you can see, the two columns that originally had missing values no longer have missing values in df_no_missing:

Figure 2.23 – Missing values in df_no_missing

By following these steps, you have seen how fastai makes it easy to prepare a dataset to train a deep learning model. It does this by replacing missing values and converting the values in the categorical columns into numeric identifiers.

How it works…

In this section, you saw several ways that fastai makes it easy to perform common data preparation steps. The TabularPandas class provides a lot of value by making it easy to execute common steps to prepare a tabular dataset (including replacing missing values and dealing with categorical columns). The cont_cat_split() function automatically identifies continuous and categorical columns in your dataset. In conclusion, fastai makes the cleanup process easier and less error-prone than it would be if you had to hand-code all the functions required to accomplish these dataset cleanup steps.

Deep Learning with fastai Cookbook

By : Ryan
