In Chapter 1, Getting Started with fastai, you encountered the MNIST dataset and saw how easy it was to make this dataset available to train a fastai deep learning model. You were able to train the model without needing to worry about the location of the dataset or its structure (apart from the names of the folders containing the training and validation datasets). You were able to examine elements of the dataset conveniently.
In this section, we'll take a closer look at the complete set of datasets that fastai curates and explain how you can get additional information about these datasets.
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, so that you have a fastai environment set up. Confirm that you can open the fastai_dataset_walkthrough.ipynb notebook in the ch2 directory of your cloned repository.
In this section, you will be running through the fastai_dataset_walkthrough.ipynb notebook, as well as the fastai dataset documentation, so that you understand the datasets that fastai curates. Once you have the notebook open in your fastai environment, complete the following steps:
Figure 2.1 – Cells to load the libraries, set up the notebook, and define the MNIST dataset
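If you want to reproduce these cells outside the downloaded notebook, a minimal sketch looks like the following (assuming a working fastai v2 install; the wildcard import is the convention the fastai notebooks themselves use):

from fastai.vision.all import *

# Download and unpack the curated MNIST dataset, returning its local path
path = untar_data(URLs.MNIST)
path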
Note the argument passed to untar_data: URLs.MNIST. What is this? Let's try the ?? shortcut to examine the source code for the URLs object:

Figure 2.2 – Source for URLs
In the image classification datasets section of the source code for URLs, we can find the definition of URLs.MNIST:

MNIST = f'{S3_IMAGE}mnist_png.tgz'
By tracing the other definitions in the URLs class, we can get the whole URL for MNIST:

S3 = 'https://s3.amazonaws.com/fast-ai-'
S3_IMAGE = f'{S3}imageclas/'
Substituting these values gives us the complete URL for URLs.MNIST: https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz
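You can verify this composed URL in a notebook cell. A quick check, using the fastai.data.external module where fastai v2 defines URLs:

from fastai.data.external import URLs

# The class attribute holds the fully composed download URL
print(URLs.MNIST)  # https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz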
If you download and unpack this archive yourself, you will see the following directory structure:

mnist_png
├── testing
│   ├── 0
│   ├── 1
│   ├── 2
│   ├── 3
│   ├── 4
│   ├── 5
│   ├── 6
│   ├── 7
│   ├── 8
│   └── 9
└── training
    ├── 0
    ├── 1
    ├── 2
    ├── 3
    ├── 4
    ├── 5
    ├── 6
    ├── 7
    ├── 8
    └── 9
Is there an easier way to see this directory structure once we have used the URL defined in URLs to download the dataset and unpack it? There is: using path.ls():

Figure 2.3 – Using path.ls() to get the dataset's directory structure
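As a sketch of the call behind Figure 2.3 (the exact paths in the output depend on where your environment stores fastai data):

path = untar_data(URLs.MNIST)
path.ls()  # e.g. (#2) [Path('.../mnist_png/testing'), Path('.../mnist_png/training')]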
The output shows the dataset's two subdirectories: training and testing. You can call ls() on the training path to get the structure of the training subdirectory:

Figure 2.4 – The structure of the training subdirectory
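The call behind Figure 2.4 is roughly the following; note that Path objects support the / operator for joining paths:

(path/'training').ls()  # ten subdirectories, one per digit class: 0 through 9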
Beyond what we explored with the ls() function, what else can we learn from the output of ??URLs? The source of URLs lists the curated datasets by group. First, let's look at the datasets listed under main datasets. This list includes tabular datasets (ADULT_SAMPLE), text datasets (IMDB_SAMPLE), recommender system datasets (ML_SAMPLE), and a variety of image datasets (CIFAR, IMAGENETTE, COCO_SAMPLE):

ADULT_SAMPLE = f'{URL}adult_sample.tgz'
BIWI_SAMPLE = f'{URL}biwi_sample.tgz'
CIFAR = f'{URL}cifar10.tgz'
COCO_SAMPLE = f'{S3_COCO}coco_sample.tgz'
COCO_TINY = f'{S3_COCO}coco_tiny.tgz'
HUMAN_NUMBERS = f'{URL}human_numbers.tgz'
IMDB = f'{S3_NLP}imdb.tgz'
IMDB_SAMPLE = f'{URL}imdb_sample.tgz'
ML_SAMPLE = f'{URL}movie_lens_sample.tgz'
ML_100k = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
MNIST_SAMPLE = f'{URL}mnist_sample.tgz'
MNIST_TINY = f'{URL}mnist_tiny.tgz'
MNIST_VAR_SIZE_TINY = f'{S3_IMAGE}mnist_var_size_tiny.tgz'
PLANET_SAMPLE = f'{URL}planet_sample.tgz'
PLANET_TINY = f'{URL}planet_tiny.tgz'
IMAGENETTE = f'{S3_IMAGE}imagenette2.tgz'
IMAGENETTE_160 = f'{S3_IMAGE}imagenette2-160.tgz'
IMAGENETTE_320 = f'{S3_IMAGE}imagenette2-320.tgz'
IMAGEWOOF = f'{S3_IMAGE}imagewoof2.tgz'
IMAGEWOOF_160 = f'{S3_IMAGE}imagewoof2-160.tgz'
IMAGEWOOF_320 = f'{S3_IMAGE}imagewoof2-320.tgz'
IMAGEWANG = f'{S3_IMAGE}imagewang.tgz'
IMAGEWANG_160 = f'{S3_IMAGE}imagewang-160.tgz'
IMAGEWANG_320 = f'{S3_IMAGE}imagewang-320.tgz'
The remaining groups cover image classification, NLP, image localization, audio classification, and medical imaging datasets:

# image classification datasets
CALTECH_101 = f'{S3_IMAGE}caltech_101.tgz'
CARS = f'{S3_IMAGE}stanford-cars.tgz'
CIFAR_100 = f'{S3_IMAGE}cifar100.tgz'
CUB_200_2011 = f'{S3_IMAGE}CUB_200_2011.tgz'
FLOWERS = f'{S3_IMAGE}oxford-102-flowers.tgz'
FOOD = f'{S3_IMAGE}food-101.tgz'
MNIST = f'{S3_IMAGE}mnist_png.tgz'
PETS = f'{S3_IMAGE}oxford-iiit-pet.tgz'

# NLP datasets
AG_NEWS = f'{S3_NLP}ag_news_csv.tgz'
AMAZON_REVIEWS = f'{S3_NLP}amazon_review_full_csv.tgz'
AMAZON_REVIEWS_POLARITY = f'{S3_NLP}amazon_review_polarity_csv.tgz'
DBPEDIA = f'{S3_NLP}dbpedia_csv.tgz'
MT_ENG_FRA = f'{S3_NLP}giga-fren.tgz'
SOGOU_NEWS = f'{S3_NLP}sogou_news_csv.tgz'
WIKITEXT = f'{S3_NLP}wikitext-103.tgz'
WIKITEXT_TINY = f'{S3_NLP}wikitext-2.tgz'
YAHOO_ANSWERS = f'{S3_NLP}yahoo_answers_csv.tgz'
YELP_REVIEWS = f'{S3_NLP}yelp_review_full_csv.tgz'
YELP_REVIEWS_POLARITY = f'{S3_NLP}yelp_review_polarity_csv.tgz'

# Image localization datasets
BIWI_HEAD_POSE = f"{S3_IMAGELOC}biwi_head_pose.tgz"
CAMVID = f'{S3_IMAGELOC}camvid.tgz'
CAMVID_TINY = f'{URL}camvid_tiny.tgz'
LSUN_BEDROOMS = f'{S3_IMAGE}bedroom.tgz'
PASCAL_2007 = f'{S3_IMAGELOC}pascal_2007.tgz'
PASCAL_2012 = f'{S3_IMAGELOC}pascal_2012.tgz'

# Audio classification datasets
MACAQUES = 'https://storage.googleapis.com/ml-animal-sounds-datasets/macaques.zip'
ZEBRA_FINCH = 'https://storage.googleapis.com/ml-animal-sounds-datasets/zebra_finch.zip'

# Medical Imaging datasets
SIIM_SMALL = f'{S3_IMAGELOC}siim_small.tgz'
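If you prefer not to read the source by hand, you can also enumerate these constants programmatically. This is a quick sketch (not part of the book's notebook); it walks the class attributes of URLs and will also pick up the base URL prefixes such as S3:

from fastai.data.external import URLs

# Print every string attribute on URLs that looks like a URL
for name, value in vars(URLs).items():
    if isinstance(value, str) and value.startswith('http'):
        print(f'{name:24} {value}')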
Now that we have seen the full set of datasets defined in URLs, how can we find out more information about them?

a) The fastai documentation (https://course.fast.ai/datasets) documents some of the datasets listed in URLs. Note that this documentation is not consistent with what's listed in the source of URLs: the naming of the datasets differs, and the documentation page does not cover all the datasets. When in doubt, treat the source of URLs as your single source of truth about the fastai curated datasets.
b) Use the path.ls() function to examine the directory structure, as shown in the following example, which lists the directories under the training subdirectory of the MNIST dataset:
Figure 2.5 – Structure of the training subdirectory
c) Check out the file structure that gets installed when you run untar_data. For example, in Gradient, the datasets get installed in storage/data, so you can go into that directory in Gradient to inspect the directories for the curated dataset you're interested in.
d) For example, let's say untar_data is run with URLs.PETS as the argument:
path = untar_data(URLs.PETS)
e) Here, you can find the dataset in storage/data/oxford-iiit-pet, and you can see the directory's structure:
oxford-iiit-pet
├── annotations
│   ├── trimaps
│   └── xmls
└── images
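A quick sanity check from the notebook should show the same two top-level directories (a sketch; the printed format may differ slightly between fastai versions):

path = untar_data(URLs.PETS)
path.ls()  # (#2) [Path('.../oxford-iiit-pet/annotations'), Path('.../oxford-iiit-pet/images')]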
To examine the source code for a function from within a notebook, use ?? followed by the name of the function. For example, to see the definition of the ls() function, you can use ??Path.ls:

Figure 2.6 – Source for Path.ls()
To see the documentation for a function, use the doc() function. For example, the output of doc(Path.ls) shows the signature of the function, along with links to the source code (https://github.com/fastai/fastcore/blob/master/fastcore/xtras.py#L111) and the documentation (https://fastcore.fast.ai/xtras#Path.ls) for this function:

Figure 2.7 – Output of doc(Path.ls)
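Both introspection tools are one-liners. As a sketch (the ?? prefix is IPython syntax, so it works in a notebook or IPython session but not in a plain Python script, and doc() assumes you have already run the fastai wildcard import):

# In one notebook cell: show the source of fastcore's Path.ls extension
??Path.ls

# In another cell: show the signature plus links to source and docs
doc(Path.ls)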
You have now explored the list of oven-ready datasets curated by fastai. You have also learned how to get the directory structure of these datasets, as well as how to examine the source and documentation of a function from within a notebook.
As you saw in this section, fastai defines URLs for each of the curated datasets in the URLs class. When you call untar_data with one of the curated datasets as the argument, the dataset files get downloaded to your filesystem (storage/data in a Gradient instance) if they have not already been copied there. The object you get back from untar_data allows you to examine the directory structure of the dataset and then pass it along to the next stage in the process of creating a fastai deep learning model. By wrapping a large sampling of interesting datasets in such a convenient way, fastai makes it easy for you to create deep learning models with these datasets, and lets you focus your efforts on creating and improving the model rather than fiddling with the details of ingesting the data.
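To make that hand-off concrete, here is a sketch of passing the returned path to the next stage; ImageDataLoaders.from_folder is a fastai factory method, and the train and valid folder names below match the MNIST layout we saw earlier:

from fastai.vision.all import *

path = untar_data(URLs.MNIST)
# Build DataLoaders straight from the unpacked directory structure
dls = ImageDataLoaders.from_folder(path, train='training', valid='testing')
dls.show_batch(max_n=9)  # display a sample batch to confirm ingestion worked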
You might be asking yourself why we went to the trouble of examining the source code for the URLs class to get details about the curated datasets. After all, these datasets are documented at https://course.fast.ai/datasets. The problem is that this documentation page doesn't give a complete list of all the curated datasets, and it doesn't clearly explain what you need to know to make the correct untar_data calls for a particular curated dataset. The incomplete documentation for the curated datasets demonstrates one of the weaknesses of fastai: inconsistent documentation. Sometimes the documentation is complete, but sometimes it is lacking details, and you will need to look at the source code directly to figure out what's going on, as we had to do in this section for the curated datasets. This problem is compounded by Google search returning hits for documentation for earlier versions of fastai. If you are searching for details about fastai, avoid hits for fastai version 1 (https://fastai1.fast.ai/) and stick to the documentation for the current version: https://docs.fast.ai/.