
Deep Learning with fastai Cookbook

Now that we have explored a variety of datasets that are curated by fastai, there is one more topic left to cover in this chapter: how to clean up datasets with fastai. Cleaning up datasets includes dealing with missing values and converting categorical values into numeric identifiers. We need to apply these cleanup steps to datasets because deep learning models can only be trained with numeric data. If we try to train the model with datasets that contain non-numeric data, including missing values and alphanumeric identifiers in categorical columns, the training process will fail. In this section, we are going to review the facilities provided by fastai to make it easy to clean up datasets, and thus make the datasets ready to train deep learning models.
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the cleaning_up_datasets.ipynb notebook in the ch2 directory of your repository.
In this section, you will be running through the cleaning_up_datasets.ipynb notebook to address missing values in the ADULT_SAMPLE dataset and replace categorical values with numeric identifiers.
Once you have the notebook open in your fastai environment, complete the following steps:
Recall that when you checked whether the ADULT_SAMPLE dataset had missing values, you found that some columns did indeed have missing values. We are going to identify the columns in ADULT_SAMPLE that have missing values, use the facilities of fastai to apply transformations that deal with the missing values in those columns, and then replace the values in the categorical columns with numeric identifiers.
Load the ADULT_SAMPLE curated dataset again and count the missing values in each column:

    # Needed only if the notebook's setup cells have not already run
    from fastai.tabular.all import *
    path = untar_data(URLs.ADULT_SAMPLE)
    df = pd.read_csv(path/'adult.csv')
    # Number of missing values per column
    df.isnull().sum()
To deal with the missing values and the categorical columns, we will use the TabularPandas class (https://docs.fast.ai/tabular.core.html#TabularPandas). To use this class, we need to prepare the following parameters:
a) procs is the list of transformations that will be applied to the TabularPandas object. Here, we specify that we want missing values to be filled (FillMissing) and values in categorical columns to be replaced with numeric identifiers (Categorify).
b) dep_var specifies which column is the dependent variable; that is, the target that we ultimately want to predict with the model. In the case of ADULT_SAMPLE, the dependent variable is salary.
c) cont and cat are lists of the continuous and categorical columns in the dataset, respectively. Continuous columns contain numeric values, such as integers or floating-point values. Categorical columns contain category identifiers, such as names of US states, days of the week, or colors. We use the cont_cat_split() (https://docs.fast.ai/tabular.core.html#cont_cat_split) function to automatically identify the continuous and categorical columns:

    procs = [FillMissing, Categorify]
    dep_var = 'salary'
    # The second argument is the cardinality threshold (max_card)
    cont, cat = cont_cat_split(df, 1, dep_var=dep_var)
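To make the role of the second argument to cont_cat_split() more concrete, here is a minimal sketch of the kind of heuristic it applies, written against plain pandas. It is an approximation for illustration, not fastai's actual source: the function name split_cont_cat_sketch and the small DataFrame are hypothetical. The assumption is that float columns are always treated as continuous, integer columns are treated as continuous only when they have more distinct values than a cardinality threshold, and everything else is treated as categorical.

```python
import pandas as pd


def split_cont_cat_sketch(df, max_card, dep_var=None):
    """Hypothetical, simplified stand-in for a cont/cat split heuristic."""
    cont, cat = [], []
    for name in df.columns:
        if name == dep_var:
            continue  # the target column belongs to neither list
        col = df[name]
        is_float = pd.api.types.is_float_dtype(col.dtype)
        is_wide_int = (pd.api.types.is_integer_dtype(col.dtype)
                       and col.nunique(dropna=False) > max_card)
        if is_float or is_wide_int:
            cont.append(name)
        else:
            cat.append(name)
    return cont, cat


# Made-up rows that only imitate the shape of a census-style table.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "hours": [40.0, None, 45.5],
    "workclass": ["Private", "State-gov", "Private"],
    "flag": [1, 1, 1],
    "salary": ["<50k", ">=50k", "<50k"],
})

cont_cols, cat_cols = split_cont_cat_sketch(toy, max_card=1, dep_var="salary")
print(cont_cols)
print(cat_cols)
```

Under these assumptions, a threshold of 1 sends any integer column with more than one distinct value to the continuous list, while a larger threshold would move low-cardinality integer columns into the categorical list.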
Create a TabularPandas object called df_no_missing using these parameters. This object will contain the dataset with missing values replaced and the values in the categorical columns replaced with numeric identifiers:

    df_no_missing = TabularPandas(df, procs, cat, cont, y_names=dep_var)
Apply the show() API to df_no_missing to display samples of its contents. Note that the original values in the categorical columns are still displayed when the object is shown using show(). What about replacing the categorical values with numeric identifiers? We will see that result in the next step.
Figure 2.21 – The first few records of df_no_missing
Now display some sample contents of df_no_missing using the items.head() API. This time, the categorical columns contain the numeric identifiers rather than the original values. This is an example of a benefit provided by fastai: it handles the switch between the original categorical values and the numeric identifiers for you. If you need to see the original values, you can use the show() API, which transforms the numeric values in categorical columns back into their original values, while the items.head() API shows the actual numeric identifiers in the categorical columns.
Figure 2.22 – The first few records of df_no_missing with numeric identifiers in categorical columns
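The relationship between the two views can be sketched without fastai at all. The example below is only an illustration under stated assumptions: the vocabulary, the '#na#' placeholder at index 0, and the helper names encode and decode are hypothetical, chosen to resemble the way a category-to-integer mapping commonly works rather than to reproduce fastai's internals.

```python
# Hypothetical vocabulary for one categorical column; index 0 is reserved
# here for missing or unseen values (an assumption for this sketch).
vocab = ["#na#", "Private", "Self-emp", "State-gov"]
to_id = {label: i for i, label in enumerate(vocab)}


def encode(labels):
    """Map labels to integer identifiers; unknown labels fall back to 0."""
    return [to_id.get(label, 0) for label in labels]


def decode(ids):
    """Map integer identifiers back to their labels for display."""
    return [vocab[i] for i in ids]


raw = ["State-gov", "Private", None, "Private"]
stored = encode(raw)    # what a model would consume
shown = decode(stored)  # what a human-friendly display would print

print(stored)  # [3, 1, 0, 1]
print(shown)   # ['State-gov', 'Private', '#na#', 'Private']
```

In this sketch, stored plays the role of the numeric view and shown plays the role of the decoded, readable view of the same records.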
Finally, check for missing values in df_no_missing.
Figure 2.23 – Missing values in df_no_missing
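As a rough picture of what filling missing values involves, the sketch below fills a numeric column by hand in pandas and then repeats the missing-value count. It is an illustration under stated assumptions, not a description of fastai's internals: the toy DataFrame, the choice of the median as the fill value, and the hours_na indicator column are all choices made for this example.

```python
import pandas as pd

# Made-up data with one missing value in a numeric column.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours": [40.0, None, 45.0, 50.0],
})

missing_before = toy.isnull().sum()

# Remember which rows were missing, then fill with the column median.
toy["hours_na"] = toy["hours"].isnull()
toy["hours"] = toy["hours"].fillna(toy["hours"].median())

missing_after = toy.isnull().sum()
print(missing_before)
print(missing_after)
```

Keeping an indicator column alongside the filled values is one way to preserve the information that a value was originally missing, which a model may find useful.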
By following these steps, you have seen how fastai makes it easy to prepare a dataset to train a deep learning model. It does this by replacing missing values and converting the values in the categorical columns into numeric identifiers.
In this section, you saw several ways that fastai makes it easy to perform common data preparation steps. The TabularPandas class makes it easy to execute the common steps needed to prepare a tabular dataset, including replacing missing values and dealing with categorical columns. The cont_cat_split() function automatically identifies continuous and categorical columns in your dataset. In conclusion, fastai makes the cleanup process easier and less error-prone than it would be if you had to hand-code all the functions required to accomplish these dataset cleanup steps.