Deep Learning with fastai Cookbook

In the previous section, we looked at the whole set of datasets curated by fastai. In this section, we are going to dig into a tabular dataset from the curated list. We will ingest the dataset, look at some example records, and then explore characteristics of the dataset, including the number of records and the number of unique values in each column.
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_tabular_datasets.ipynb notebook in the ch2 directory of your repository.
I am grateful for the opportunity to include the ADULT_SAMPLE dataset featured in this section.
Dataset citation
Ron Kohavi. (1996) Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid (http://robotics.stanford.edu/~ronnyk/nbtree.pdf).
In this section, you will be running through the examining_tabular_datasets.ipynb notebook to examine the ADULT_SAMPLE dataset.
Once you have the notebook open in your fastai environment, complete the following steps:
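1. Run the following cell to download the ADULT_SAMPLE dataset (if it is not already on your system) and define a path object for it:
path = untar_data(URLs.ADULT_SAMPLE)
2. Run path.ls() so that you can examine the directory structure of the dataset:
path.ls()
Figure 2.8 – Output of path.ls()
3. The dataset is in the adult.csv file. Run the following cell to ingest this CSV file into a pandas DataFrame:
df = pd.read_csv(path/'adult.csv')
4. Run the head() command to get a sample of records from the beginning of the dataset:
df.head()
Figure 2.9 – Sample of records from the beginning of the dataset
5. Run the following cell to get the number of records (rows) and columns in the dataset:
df.shape
6. Run the following cell to get the number of unique values in each column:
df.nunique()
7. Run the following cell to get the count of missing values in each column:
df.isnull().sum()
8. Run the following cell to select the records where age is 40 or less and display the first few of them:
df_young = df[df.age <= 40]
df_young.head()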
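The examination steps above can be collected into one small helper. What follows is a minimal sketch rather than code from the notebook: the summarize function and the three-column sample DataFrame are hypothetical stand-ins, so the example runs without downloading anything. With the real dataset, you would pass in the DataFrame loaded from adult.csv instead.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the basic facts examined in the steps above (hypothetical helper)."""
    return {
        "n_rows": df.shape[0],                                   # number of records
        "n_cols": df.shape[1],                                   # number of columns
        "unique_per_column": df.nunique().to_dict(),             # distinct non-missing values
        "missing_per_column": df.isnull().sum().to_dict(),       # missing values per column
    }


# Tiny stand-in DataFrame with made-up values; this is NOT the real adult.csv.
sample = pd.DataFrame({
    "age": [25, 38, 52, 41, 38],
    "workclass": ["Private", "Private", None, "State-gov", "Private"],
    "salary": ["<50k", "<50k", ">=50k", ">=50k", "<50k"],
})

summary = summarize(sample)

# Same kind of row filter as the last step: keep records with age of 40 or less.
young = sample[sample.age <= 40]

print(summary)
print(young)
```

Note that nunique() ignores missing values by default, which is why it is worth looking at it alongside isnull().sum() rather than on its own.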
Congratulations! You have ingested a tabular dataset curated by fastai and done a basic examination of the dataset.
The dataset that you explored in this section, ADULT_SAMPLE, is one of the datasets you would have seen in the source for URLs in the previous section. Note that while the source for URLs identifies which datasets are related to image or NLP (text) applications, it does not explicitly identify the tabular or recommender system datasets. ADULT_SAMPLE is one of the datasets listed under main datasets:
Figure 2.10 – Main datasets from the source for URLs
How did I determine that ADULT_SAMPLE was a tabular dataset? First, the paper by Howard and Gugger (https://arxiv.org/pdf/2002.04688.pdf) identifies ADULT_SAMPLE as a tabular dataset. Second, I ingested it and confirmed that it could be loaded into a pandas DataFrame.
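The "ingest it and try it out" check can be sketched as a small function that looks for CSV files in a dataset directory and tries to load each one with pandas. This is only an illustration under stated assumptions: csv_shapes is a hypothetical helper, and the demonstration uses a throwaway temporary directory with a made-up demo.csv in place of the directory you would get back from untar_data.

```python
from pathlib import Path
import tempfile

import pandas as pd


def csv_shapes(dataset_dir):
    """Try to load every CSV under dataset_dir; map file name -> (rows, cols).

    Hypothetical helper: a value of None means the file did not parse as a table.
    """
    shapes = {}
    for csv_file in sorted(Path(dataset_dir).rglob("*.csv")):
        try:
            shapes[csv_file.name] = pd.read_csv(csv_file).shape
        except (pd.errors.ParserError, pd.errors.EmptyDataError, UnicodeDecodeError):
            shapes[csv_file.name] = None
    return shapes


# Demonstration on a temporary directory standing in for a downloaded dataset.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo.csv").write_text("age,salary\n25,<50k\n52,>=50k\n")
    result = csv_shapes(tmp)

print(result)
```

Loading cleanly into a DataFrame is a useful hint rather than proof. Datasets for other applications can also ship CSV files (for example, files of labels), so it is worth combining this check with a look at the columns and with documentation such as the paper cited above.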
What about the other curated datasets that aren't explicitly categorized in the source for URLs? Here's a summary of the datasets listed in the source for URLs under main datasets:
Tabular:
a) ADULT_SAMPLE
NLP (text):
a) HUMAN_NUMBERS
b) IMDB
c) IMDB_SAMPLE
Recommender system:
a) ML_SAMPLE
b) ML_100k
Image:
a) All of the other datasets listed in URLs under main datasets.
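If you want this summary at hand while working in a notebook, one option is to record it as a small lookup table. The sketch below is an assumption-laden convenience and not part of fastai: the dictionary and the main_dataset_category function are hypothetical names, and the "image" default only makes sense for names that really do appear under main datasets in the source for URLs.

```python
# Categories for the main datasets that are NOT image datasets, per the summary above.
NON_IMAGE_MAIN_DATASETS = {
    "ADULT_SAMPLE": "tabular",
    "HUMAN_NUMBERS": "NLP (text)",
    "IMDB": "NLP (text)",
    "IMDB_SAMPLE": "NLP (text)",
    "ML_SAMPLE": "recommender system",
    "ML_100k": "recommender system",
}


def main_dataset_category(name: str) -> str:
    """Return the application category for a dataset listed under main datasets.

    Hypothetical helper: any main dataset not named in the table is treated as image.
    """
    return NON_IMAGE_MAIN_DATASETS.get(name, "image")


for name in ["ADULT_SAMPLE", "IMDB", "ML_100k", "MNIST_SAMPLE"]:
    print(name, "->", main_dataset_category(name))
```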