Deep Learning with fastai Cookbook

By: Ryan

Overview of this book

fastai is an easy-to-use deep learning framework built on top of PyTorch that lets you rapidly create complete deep learning solutions with as few as 10 lines of code. The two predominant low-level deep learning frameworks, TensorFlow and PyTorch, both require a lot of code, even for straightforward applications. In contrast, fastai handles the messy details for you and lets you focus on applying deep learning to actually solve problems. The book begins by summarizing the value of fastai and showing you how to create a simple 'hello world' deep learning application with fastai. You'll then learn how to use fastai for all four application areas that the framework explicitly supports: tabular data, text data (NLP), recommender systems, and vision data. As you advance, you'll work through a series of practical examples that illustrate how to create real-world applications of each type. Next, you'll learn how to deploy fastai models, including creating a simple web application that predicts what object is depicted in an image. The book wraps up with an overview of the advanced features of fastai. By the end of this fastai book, you'll be able to create your own deep learning applications using fastai. You'll also have learned how to use fastai to prepare raw datasets, explore datasets, train deep learning models, and deploy trained models.

Examining text datasets with fastai

In the previous section, we looked at how a curated tabular dataset could be ingested. In this section, we are going to dig into a text dataset from the curated list.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_text_datasets.ipynb notebook in the ch2 directory of your repository.
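
Before you open the notebook, you can optionally run a quick sanity check in a Python session to confirm that the environment is ready. This is a minimal sketch for convenience (it is not part of the book's notebook), assuming a standard fastai v2 installation in which untar_data, URLs, and tokenize_df are exported by fastai.text.all:

    import fastai
    import pandas as pd
    # Importing fastai.text.all should succeed and should bring in untar_data,
    # URLs, and tokenize_df, all of which are used later in this recipe.
    from fastai.text.all import *
    print("fastai version:", fastai.__version__)
    print("pandas version:", pd.__version__)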

I am grateful for the opportunity to use the WIKITEXT_TINY dataset (https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) featured in this section.

Dataset citation

Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. (2016). Pointer Sentinel Mixture Models (https://arxiv.org/pdf/1609.07843.pdf).

How to do it…

In this section, you will be running through the examining_text_datasets.ipynb notebook to examine the WIKITEXT_TINY dataset. As its name suggests, this is a small set of text that's been gleaned from good and featured Wikipedia articles.

Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first two cells to import the necessary libraries and set up the notebook for fastai.
  2. Run the following cell to copy the dataset into your filesystem (if it's not already there) and to define the path for the dataset:
    path = untar_data(URLs.WIKITEXT_TINY)
  3. Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
    Figure 2.11 – Output of path.ls()

  4. There are two CSV files that make up this dataset. Let's ingest each of them into a pandas DataFrame, starting with train.csv:
    df_train = pd.read_csv(path/'train.csv')
  5. When you use head() to check the DataFrame, you'll notice that something's wrong: the CSV file has no header row of column names, but by default read_csv treats the first row as the header, so the first data record gets misinterpreted as column names. In the following screenshot, the first row of output appears in bold, confirming that it is being treated as the header even though it is an ordinary data row (see the first sketch after this list):
    Figure 2.12 – First record in df_train

  6. To fix this problem, rerun the read_csv function, but this time with the header=None parameter, to specify that the CSV file doesn't have a header:
    df_train = pd.read_csv(path/'train.csv',header=None)
  7. Check head() again to confirm that the problem has been resolved:
    Figure 2.13 – Revising the first record in df_train

  8. Ingest test.csv into a DataFrame using the header=None parameter:
    df_test = pd.read_csv(path/'test.csv',header=None)
  9. We want to tokenize the dataset, transforming its text into lists of word tokens. Since we want a common set of tokens for the entire dataset, we will begin by combining the train and test DataFrames (see the second sketch after this list):
    df_combined = pd.concat([df_train,df_test])
  10. Confirm the shapes of the train, test, and combined DataFrames; the number of rows in the combined DataFrame should be the sum of the number of rows in the train and test DataFrames:
    print("df_train: ",df_train.shape)
    print("df_test: ",df_test.shape)
    print("df_combined: ",df_combined.shape)
  11. Now, we're ready to tokenize the DataFrame. The tokenize_df() function takes, as a parameter, the list of columns containing the text we want to tokenize. Since the columns of the DataFrame are not labeled, we need to refer to the column we want to tokenize by its position rather than by name:
    df_tok, count = tokenize_df(df_combined,[df_combined.columns[0]])
  12. Check the contents of the first few records of df_tok, which is the new DataFrame containing the tokenized contents of the combined DataFrame:
    Figure 2.14 – The first few records of df_tok

  13. Check the counts for a few sample words to ensure they are roughly what you'd expect. Pick a very common word, a moderately common word, and a rare word:
    print("very common word (count['the']):", count['the'])
    print("moderately common word (count['prepared']):", count['prepared'])
    print("rare word (count['gaga']):", count['gaga'])

Congratulations! You have successfully ingested, explored, and tokenized a curated text dataset.

How it works…

The dataset that you explored in this section, WIKITEXT_TINY, is one of the datasets you would have seen in the source for URLs in the Getting the complete set of oven-ready fastai datasets section. Here, you can see that WIKITEXT_TINY is in the NLP datasets section of the source for URLs:

Figure 2.15 – WIKITEXT_TINY in the NLP datasets list in the source for URLs
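
If you would rather not open the fastai source file, you can also inspect the URLs class directly from a notebook cell. The following is a small sketch, assuming a standard fastai v2 installation; the exact URL string and the set of attributes can differ between fastai releases:

    # Print the constant that untar_data resolved earlier in this recipe
    print(URLs.WIKITEXT_TINY)
    # List the string constants defined on URLs; most of them are download
    # URLs for the curated datasets, WIKITEXT_TINY among them
    print([name for name in dir(URLs)
           if not name.startswith('_') and isinstance(getattr(URLs, name), str)])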
