In Chapter 1, Getting Started with fastai, you encountered the MNIST dataset and saw how easy it was to make this dataset available to train a fastai deep learning model. You were able to train the model without needing to worry about the location of the dataset or its structure (apart from the names of the folders containing the training and validation datasets). You were able to examine elements of the dataset conveniently.
In this section, we'll take a closer look at the complete set of datasets that fastai curates and explain how you can get additional information about these datasets.
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, so that you have a fastai environment set up. Confirm that you can open the fastai_dataset_walkthrough.ipynb notebook in the ch2 directory of your cloned repository.
In this section, you will be running through the fastai_dataset_walkthrough.ipynb notebook, as well as the fastai dataset documentation, so that you understand the datasets that fastai curates. Once you have the notebook open in your fastai environment, complete the following steps:
Figure 2.1 – Cells to load the libraries, set up the notebook, and define the MNIST dataset
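If you want to reproduce these cells outside the downloaded notebook, a minimal sketch looks like the following (assuming a working fastai v2 install; the wildcard import is the convention the fastai notebooks themselves use):

from fastai.vision.all import *

# Download and unpack the curated MNIST dataset, returning its local path
path = untar_data(URLs.MNIST)
path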
Note the argument passed to untar_data: URLs.MNIST. What is this? Let's try the ?? shortcut to examine the source code for the URLs object:

Figure 2.2 – Source for URLs
In the image classification datasets section of the source code for URLs, we can find the definition of URLs.MNIST:

MNIST = f'{S3_IMAGE}mnist_png.tgz'
By tracing the other definitions in the URLs class, we can get the whole URL for MNIST:

S3 = 'https://s3.amazonaws.com/fast-ai-'
S3_IMAGE = f'{S3}imageclas/'
Substituting these values gives us the complete URL for URLs.MNIST: https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz
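You can verify this composed URL in a notebook cell. A quick check, using the fastai.data.external module where fastai v2 defines URLs:

from fastai.data.external import URLs

# The class attribute holds the fully composed download URL
print(URLs.MNIST)  # https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz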
If you download and unpack this archive yourself, you will see the following directory structure:

mnist_png
├── testing
│   ├── 0
│   ├── 1
│   ├── 2
│   ├── 3
│   ├── 4
│   ├── 5
│   ├── 6
│   ├── 7
│   ├── 8
│   └── 9
└── training
    ├── 0
    ├── 1
    ├── 2
    ├── 3
    ├── 4
    ├── 5
    ├── 6
    ├── 7
    ├── 8
    └── 9
Is there an easier way to see this directory structure once we have used the URL defined in URLs to download the dataset and unpack it? There is: using path.ls():

Figure 2.3 – Using path.ls() to get the dataset's directory structure
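As a sketch of the call behind Figure 2.3 (the exact paths in the output depend on where your environment stores fastai data):

path = untar_data(URLs.MNIST)
path.ls()  # e.g. (#2) [Path('.../mnist_png/testing'), Path('.../mnist_png/training')]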
The output shows the dataset's two subdirectories: training and testing. You can call ls() on the training path to get the structure of the training subdirectory:

Figure 2.4 – The structure of the training subdirectory
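The call behind Figure 2.4 is roughly the following; note that Path objects support the / operator for joining paths:

(path/'training').ls()  # ten subdirectories, one per digit class: 0 through 9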
Beyond what we explored with the ls() function, what else can we learn from the output of ??URLs? The source of URLs lists the curated datasets by group. First, let's look at the datasets listed under main datasets. This list includes tabular datasets (ADULT_SAMPLE), text datasets (IMDB_SAMPLE), recommender system datasets (ML_SAMPLE), and a variety of image datasets (CIFAR, IMAGENETTE, COCO_SAMPLE):

ADULT_SAMPLE = f'{URL}adult_sample.tgz'
BIWI_SAMPLE = f'{URL}biwi_sample.tgz'
CIFAR = f'{URL}cifar10.tgz'
COCO_SAMPLE = f'{S3_COCO}coco_sample.tgz'
COCO_TINY = f'{S3_COCO}coco_tiny.tgz'
HUMAN_NUMBERS = f'{URL}human_numbers.tgz'
IMDB = f'{S3_NLP}imdb.tgz'
IMDB_SAMPLE = f'{URL}imdb_sample.tgz'
ML_SAMPLE = f'{URL}movie_lens_sample.tgz'
ML_100k = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
MNIST_SAMPLE = f'{URL}mnist_sample.tgz'
MNIST_TINY = f'{URL}mnist_tiny.tgz'
MNIST_VAR_SIZE_TINY = f'{S3_IMAGE}mnist_var_size_tiny.tgz'
PLANET_SAMPLE = f'{URL}planet_sample.tgz'
PLANET_TINY = f'{URL}planet_tiny.tgz'
IMAGENETTE = f'{S3_IMAGE}imagenette2.tgz'
IMAGENETTE_160 = f'{S3_IMAGE}imagenette2-160.tgz'
IMAGENETTE_320 = f'{S3_IMAGE}imagenette2-320.tgz'
IMAGEWOOF = f'{S3_IMAGE}imagewoof2.tgz'
IMAGEWOOF_160 = f'{S3_IMAGE}imagewoof2-160.tgz'
IMAGEWOOF_320 = f'{S3_IMAGE}imagewoof2-320.tgz'
IMAGEWANG = f'{S3_IMAGE}imagewang.tgz'
IMAGEWANG_160 = f'{S3_IMAGE}imagewang-160.tgz'
IMAGEWANG_320 = f'{S3_IMAGE}imagewang-320.tgz'
The remaining groups cover image classification, NLP, image localization, audio classification, and medical imaging datasets:

# image classification datasets
CALTECH_101 = f'{S3_IMAGE}caltech_101.tgz'
CARS = f'{S3_IMAGE}stanford-cars.tgz'
CIFAR_100 = f'{S3_IMAGE}cifar100.tgz'
CUB_200_2011 = f'{S3_IMAGE}CUB_200_2011.tgz'
FLOWERS = f'{S3_IMAGE}oxford-102-flowers.tgz'
FOOD = f'{S3_IMAGE}food-101.tgz'
MNIST = f'{S3_IMAGE}mnist_png.tgz'
PETS = f'{S3_IMAGE}oxford-iiit-pet.tgz'

# NLP datasets
AG_NEWS = f'{S3_NLP}ag_news_csv.tgz'
AMAZON_REVIEWS = f'{S3_NLP}amazon_review_full_csv.tgz'
AMAZON_REVIEWS_POLARITY = f'{S3_NLP}amazon_review_polarity_csv.tgz'
DBPEDIA = f'{S3_NLP}dbpedia_csv.tgz'
MT_ENG_FRA = f'{S3_NLP}giga-fren.tgz'
SOGOU_NEWS = f'{S3_NLP}sogou_news_csv.tgz'
WIKITEXT = f'{S3_NLP}wikitext-103.tgz'
WIKITEXT_TINY = f'{S3_NLP}wikitext-2.tgz'
YAHOO_ANSWERS = f'{S3_NLP}yahoo_answers_csv.tgz'
YELP_REVIEWS = f'{S3_NLP}yelp_review_full_csv.tgz'
YELP_REVIEWS_POLARITY = f'{S3_NLP}yelp_review_polarity_csv.tgz'

# Image localization datasets
BIWI_HEAD_POSE = f"{S3_IMAGELOC}biwi_head_pose.tgz"
CAMVID = f'{S3_IMAGELOC}camvid.tgz'
CAMVID_TINY = f'{URL}camvid_tiny.tgz'
LSUN_BEDROOMS = f'{S3_IMAGE}bedroom.tgz'
PASCAL_2007 = f'{S3_IMAGELOC}pascal_2007.tgz'
PASCAL_2012 = f'{S3_IMAGELOC}pascal_2012.tgz'

# Audio classification datasets
MACAQUES = 'https://storage.googleapis.com/ml-animal-sounds-datasets/macaques.zip'
ZEBRA_FINCH = 'https://storage.googleapis.com/ml-animal-sounds-datasets/zebra_finch.zip'

# Medical Imaging datasets
SIIM_SMALL = f'{S3_IMAGELOC}siim_small.tgz'
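If you prefer not to read the source by hand, you can also enumerate these constants programmatically. This is a quick sketch (not part of the book's notebook); it walks the class attributes of URLs and will also pick up the base URL prefixes such as S3:

from fastai.data.external import URLs

# Print every string attribute on URLs that looks like a URL
for name, value in vars(URLs).items():
    if isinstance(value, str) and value.startswith('http'):
        print(f'{name:24} {value}')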
Now that we have seen the full set of datasets defined in URLs, how can we find out more information about them?

a) The fastai documentation (https://course.fast.ai/datasets) documents some of the datasets listed in URLs. Note that this documentation is not consistent with what's listed in the source of URLs: the naming of the datasets differs, and the documentation page does not cover all the datasets. When in doubt, treat the source of URLs as your single source of truth about the fastai curated datasets.
b) Use the path.ls() function to examine the directory structure, as shown in the following example, which lists the directories under the training subdirectory of the MNIST dataset:
Figure 2.5 – Structure of the training subdirectory
c) Check out the file structure that gets installed when you run untar_data. For example, in Gradient, the datasets get installed in storage/data, so you can go into that directory in Gradient to inspect the directories for the curated dataset you're interested in.
d) For example, let's say untar_data is run with URLs.PETS as the argument:
path = untar_data(URLs.PETS)
e) Here, you can find the dataset in storage/data/oxford-iiit-pet, and you can see the directory's structure:
oxford-iiit-pet
├── annotations
│   ├── trimaps
│   └── xmls
└── images
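A quick sanity check from the notebook should show the same two top-level directories (a sketch; the printed format may differ slightly between fastai versions):

path = untar_data(URLs.PETS)
path.ls()  # (#2) [Path('.../oxford-iiit-pet/annotations'), Path('.../oxford-iiit-pet/images')]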
To examine the source code for a function from within a notebook, use ?? followed by the name of the function. For example, to see the definition of the ls() function, you can use ??Path.ls:

Figure 2.6 – Source for Path.ls()
To see the documentation for a function, use the doc() function. For example, the output of doc(Path.ls) shows the signature of the function, along with links to the source code (https://github.com/fastai/fastcore/blob/master/fastcore/xtras.py#L111) and the documentation (https://fastcore.fast.ai/xtras#Path.ls) for this function:

Figure 2.7 – Output of doc(Path.ls)
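Both introspection tools are one-liners. As a sketch (the ?? prefix is IPython syntax, so it works in a notebook or IPython session but not in a plain Python script, and doc() assumes you have already run the fastai wildcard import):

# In one notebook cell: show the source of fastcore's Path.ls extension
??Path.ls

# In another cell: show the signature plus links to source and docs
doc(Path.ls)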
You have now explored the list of oven-ready datasets curated by fastai. You have also learned how to get the directory structure of these datasets, as well as how to examine the source and documentation of a function from within a notebook.
As you saw in this section, fastai defines URLs for each of the curated datasets in the URLs class. When you call untar_data with one of the curated datasets as the argument, the dataset files get downloaded to your filesystem (storage/data in a Gradient instance) if they have not already been copied there. The object you get back from untar_data allows you to examine the directory structure of the dataset and then pass it along to the next stage in the process of creating a fastai deep learning model. By wrapping a large sampling of interesting datasets in such a convenient way, fastai makes it easy for you to create deep learning models with these datasets, and lets you focus your efforts on creating and improving the model rather than fiddling with the details of ingesting the data.
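To make that hand-off concrete, here is a sketch of passing the returned path to the next stage; ImageDataLoaders.from_folder is a fastai factory method, and the train and valid folder names below match the MNIST layout we saw earlier:

from fastai.vision.all import *

path = untar_data(URLs.MNIST)
# Build DataLoaders straight from the unpacked directory structure
dls = ImageDataLoaders.from_folder(path, train='training', valid='testing')
dls.show_batch(max_n=9)  # display a sample batch to confirm ingestion worked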
You might be asking yourself why we went to the trouble of examining the source code for the URLs class to get details about the curated datasets. After all, these datasets are documented at https://course.fast.ai/datasets. The problem is that this documentation page doesn't give a complete list of all the curated datasets, and it doesn't clearly explain what you need to know to make the correct untar_data calls for a particular curated dataset. The incomplete documentation for the curated datasets demonstrates one of the weaknesses of fastai: inconsistent documentation. Sometimes the documentation is complete, but sometimes it is lacking details, and you will need to look at the source code directly to figure out what's going on, as we had to do in this section for the curated datasets. This problem is compounded by Google search returning hits for documentation for earlier versions of fastai. If you are searching for details about fastai, avoid hits for fastai version 1 (https://fastai1.fast.ai/) and stick to the documentation for the current version: https://docs.fast.ai/.