In the previous section, we looked at how a curated tabular dataset could be ingested. In this section, we are going to dig into a text dataset from the curated list.
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_text_datasets.ipynb notebook in the ch2 directory of your repository.
I am grateful for the opportunity to use the WIKITEXT_TINY dataset (https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) featured in this section.
Dataset citation
Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. (2016). Pointer Sentinel Mixture Models (https://arxiv.org/pdf/1609.07843.pdf).
In this section, you will be running through the examining_text_datasets.ipynb notebook to examine the WIKITEXT_TINY dataset. As its name suggests, this is a small set of text that has been gleaned from good and featured Wikipedia articles.
Once you have the notebook open in your fastai environment, complete the following steps:
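The steps below assume the notebook's import cell has already been run. A minimal sketch of the imports that cover the calls used in this section (an assumption, not necessarily the notebook's exact cell) would be:
from fastai.text.all import *  # provides untar_data, URLs, and tokenize_df
import pandas as pd            # used to load and combine the CSV files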
Run the following cell to download the dataset (if it has not already been downloaded), untar it, and define the path object pointing to the extracted files:
path = untar_data(URLs.WIKITEXT_TINY)
Run path.ls() so that you can examine the directory structure of the dataset:
path.ls()
Figure 2.11 – Output of path.ls()
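As the following steps show, the extracted directory contains train.csv and test.csv. If you want a quick sanity check before loading them, a short sketch (not part of the original notebook) is:
# confirm the two CSV files used in this section are present
for fname in ['train.csv', 'test.csv']:
    assert (path/fname).exists(), f"expected {fname} under {path}"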
Load train.csv into a pandas DataFrame:
df_train = pd.read_csv(path/'train.csv')
If you run df_train.head() to check the DataFrame, you'll notice that something's wrong – the CSV file has no header with column names, but by default, read_csv assumes the first row is the header, so the first data row gets misinterpreted as a header. As shown in the following screenshot, the first row of the output is in bold, which indicates that it is being interpreted as a header, even though it contains a regular data row:
Figure 2.12 – First record in df_train
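As an aside (not what the notebook does), pandas also lets you supply explicit column names with the names parameter, which likewise prevents the first row from being consumed as a header and gives the single column a readable name:
# passing explicit names implies header=None, so the first row is kept as data
df_train_named = pd.read_csv(path/'train.csv', names=['text'])
The rest of this section sticks with header=None and the default integer column label, since the later tokenization step refers to the column by position.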
Call the read_csv function again, but this time with the header=None parameter, to specify that the CSV file doesn't have a header:
df_train = pd.read_csv(path/'train.csv', header=None)
Run df_train.head() again to confirm that the problem has been resolved:
Figure 2.13 – Revising the first record in df_train
Ingest test.csv into a DataFrame, again using the header=None parameter:
df_test = pd.read_csv(path/'test.csv', header=None)
Combine the train and test DataFrames into a single DataFrame:
df_combined = pd.concat([df_train, df_test])
Print the shapes of the three DataFrames to confirm that the combined DataFrame contains the rows of both inputs:
print("df_train: ", df_train.shape)
print("df_test: ", df_test.shape)
print("df_combined: ", df_combined.shape)
Tokenize the combined DataFrame. The tokenize_df() function takes the list of columns containing the text we want to tokenize as a parameter. Since the columns of the DataFrame are not labeled, we need to refer to the column we want to tokenize by its position rather than by its name:
df_tok, count = tokenize_df(df_combined, [df_combined.columns[0]])
Examine df_tok, which is the new DataFrame containing the tokenized contents of the combined DataFrame:
Figure 2.14 – The first few records of df_tok
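To see what the tokenizer produced for a single record, you can peek at the first few tokens of the first row. With fastai's defaults the tokenized column is named text and the token stream includes special markers such as xxbos (beginning of stream) and xxmaj (next word capitalized); the exact names may vary by fastai version, so treat this as a sketch:
# print the first 15 tokens of the first tokenized record
print(df_tok['text'].iloc[0][:15])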
print("very common word (count['the']):", count['the']) print("moderately common word (count['prepared']):", count['prepared']) print("rare word (count['gaga']):", count['gaga'])
Congratulations! You have successfully ingested, explored, and tokenized a curated text dataset.
The dataset that you explored in this section, WIKITEXT_TINY, is one of the datasets you would have seen in the source for URLs in the Getting the complete set of oven-ready fastai datasets section. Here, you can see that WIKITEXT_TINY is in the NLP datasets section of the source for URLs:
Figure 2.15 – WIKITEXT_TINY in the NLP datasets list in the source for URLs