The TensorFlow Workshop
In this section, you will learn how to load tabular data into a Python development environment so that it can be used for TensorFlow modeling. You will use pandas and scikit-learn, which provide classes and functions that are useful for processing data. You will also explore methods that can be used to preprocess this data.
Tabular data can be loaded into memory by using the pandas read_csv function and passing in the path to the dataset. The function is well suited to loading tabular data and is easy to use:
df = pd.read_csv('path/to/dataset')
In order to normalize the data, you can use a scaler that is available in scikit-learn. There are multiple scalers that can be applied; StandardScaler will normalize the data so that the fields of the dataset have a mean of 0 and a standard deviation of 1. Another common scaler is MinMaxScaler, which will rescale the dataset so that the fields have a minimum value of 0 and a maximum value of 1.
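As a quick illustration of the difference between the two scalers, the following sketch applies both to the same column. The array used here is made up for illustration and is not the chapter's dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# A single made-up numerical column, standing in for a field of a dataset
data = np.array([[20.0], [25.0], [30.0], [35.0], [38.0]])

# StandardScaler: the result has a mean of 0 and a standard deviation of 1
standardized = StandardScaler().fit_transform(data)
print(standardized.mean(), standardized.std())

# MinMaxScaler: the result has a minimum of 0 and a maximum of 1
rescaled = MinMaxScaler().fit_transform(data)
print(rescaled.min(), rescaled.max())
```

Note that both scalers preserve the relative ordering of the values; only the location and spread of the distribution change.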
To use a scaler, it must be initialized and fit to the dataset; the fitted scaler can then transform the dataset. In fact, the fitting and transformation processes can be performed in one step by using the fit_transform method, as follows:
scaler = StandardScaler()
transformed_df = scaler.fit_transform(df)
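To see that fit_transform really is the two steps combined, this small sketch (on a made-up array, not the chapter's dataset) compares the two approaches:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for the DataFrame used in the text
data = np.array([[1.0], [2.0], [3.0], [4.0]])

# Two-step approach: fit learns the column statistics, transform applies them
scaler = StandardScaler()
scaler.fit(data)
two_step = scaler.transform(data)

# One-step approach: fit_transform performs both operations at once
one_step = StandardScaler().fit_transform(data)

print(np.allclose(two_step, one_step))  # True
```

The two-step form is still useful when you need to fit a scaler on training data and later apply the same transformation to unseen data with transform alone.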
In the first exercise, you will learn how to use pandas and scikit-learn to load a dataset and preprocess it so that it is suitable for modeling.
The dataset, Bias_correction_ucl.csv, contains information for bias correction of the next-day maximum and minimum air temperature forecasts for Seoul, South Korea. The fields represent temperature measurements on the given date, the weather station at which the metrics were measured, model forecasts of weather-related metrics such as humidity, and projections for the temperature of the following day. You are required to preprocess the data to make all the columns normally distributed with a mean of 0 and a standard deviation of 1. You will demonstrate the effects with the Present_Tmax column, which represents the maximum temperature on the given date at a given weather station.
Note
The dataset can be found here: https://packt.link/l83pR.
Perform the following steps to complete this exercise:
1. Open a new Jupyter notebook to implement this exercise and save the file as Exercise2-01.ipynb. Then, import the pandas library:
import pandas as pd
Note
You can find the documentation for pandas at the following link: https://pandas.pydata.org/docs/.
2. Create a new pandas DataFrame named df and read the Bias_correction_ucl.csv file into it. Examine whether your data is properly loaded by printing the resultant DataFrame:
df = pd.read_csv('Bias_correction_ucl.csv')
df
Note
Make sure you change the path to the CSV file based on its location on your system. If you're running the Jupyter notebook from the same directory where the CSV file is stored, you can run the preceding code without any modification.
The output will be as follows:
Figure 2.5: The output from printing the DataFrame
3. Drop the Date column using the drop method of the DataFrame, passing in the name of the column. The Date column is dropped because it is a non-numerical field, and rescaling is not possible when non-numerical fields exist. Since you are dropping a column, both the axis=1 argument and the inplace=True argument should be passed:
df.drop('Date', inplace=True, axis=1)
4. Plot a histogram of the Present_Tmax column, which represents the maximum temperature across dates and weather stations within the dataset:
ax = df['Present_Tmax'].hist(color='gray')
ax.set_xlabel("Temperature")
ax.set_ylabel("Frequency")
The output will be as follows:
Figure 2.6: A Temperature versus Frequency histogram of the Present_Tmax column
The resultant histogram shows the distribution of values for the Present_Tmax column. You can see that the temperature values vary from 20 to 38 degrees Celsius. Plotting a histogram of the feature values is a good way to view the distribution of values and understand whether scaling is required as a preprocessing step.
5. Import the StandardScaler class from scikit-learn's preprocessing package. Initialize the scaler, fit it to the dataset, and transform the DataFrame using the scaler's fit_transform method. Create a new DataFrame, df2, from the transformed result, since the output of the fit_transform method is a NumPy array. The standard scaler will transform the numerical fields so that the mean of each field is 0 and the standard deviation is 1:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df2 = scaler.fit_transform(df)
df2 = pd.DataFrame(df2, columns=df.columns)
Note
After the scaler has been fit, the per-column mean and standard deviation it learned from the data can be retrieved from the scaler object.
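For reference, scikit-learn exposes the statistics that a fitted StandardScaler learned through its mean_ and scale_ attributes, as this minimal sketch on a made-up column shows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column data standing in for a numerical field
data = np.array([[10.0], [20.0], [30.0]])

scaler = StandardScaler()
scaler.fit(data)

# The per-column statistics learned during fitting
print(scaler.mean_)   # the mean of each column
print(scaler.scale_)  # the (population) standard deviation of each column
```

These attributes are what the scaler uses when transform is later applied to new data, which is why the same fitted scaler should be reused for training and test sets.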
6. Plot a histogram of the transformed Present_Tmax column:
ax = df2['Present_Tmax'].hist(color='gray')
ax.set_xlabel("Normalized Temperature")
ax.set_ylabel("Frequency")
The output will be as follows:
Figure 2.7: A histogram of the rescaled Present_Tmax column
The resulting histogram shows that the normalized temperature values range from around -3 to 3, as evidenced by the range on the x axis of the histogram. By using the standard scaler, the values will always have a mean of 0 and a standard deviation of 1. Having the features normalized can speed up the model training process.
In this exercise, you successfully imported tabular data using the pandas library and performed some preprocessing using the scikit-learn library. The preprocessing of the data included dropping the Date column and scaling the numerical fields so that they have a mean of 0 and a standard deviation of 1.
In the following activity, you will load tabular data using the pandas library and scale that data using the MinMaxScaler from scikit-learn. You will do so on the same dataset that you used in the prior exercise, which describes the bias correction of air temperature forecasts for Seoul, South Korea.
In this activity, you are required to load tabular data and rescale it using a MinMaxScaler. The dataset, Bias_correction_ucl.csv, contains information for bias correction of the next-day maximum and minimum air temperature forecasts for Seoul, South Korea. The fields represent temperature measurements on the given date, the weather station at which the metrics were measured, model forecasts of weather-related metrics such as humidity, and projections for the temperature of the following day. You are required to scale the columns so that each column has a minimum value of 0 and a maximum value of 1.
Perform the following steps to complete this activity:
1. Open a new Jupyter notebook to implement this activity.
2. Import pandas and load the Bias_correction_ucl.csv dataset into memory using the read_csv function.
3. Drop the Date column of the DataFrame.
4. Plot a histogram of the Present_Tmax column.
5. Initialize a MinMaxScaler and fit it to and transform the feature DataFrame.
6. Plot a histogram of the transformed Present_Tmax column.
You should get an output similar to the following:
Figure 2.8: Expected output of Activity 2.01
Note
The solution to this activity can be found via this link.
One method of converting non-numerical fields, such as categorical or date fields, is to one-hot encode them. The one-hot encoding process creates a new column for each unique value in the provided column; in each row, every new column has a value of 0 except the one that corresponds to that row's value, which has a value of 1. The column headers of the newly created dummy columns correspond to the unique values. One-hot encoding can be achieved by using the get_dummies function of the pandas library and passing in the column to be encoded. An optional prefix argument adds a prefix to the column headers, which can be useful for referencing the columns:
dummies = pd.get_dummies(df['feature1'], prefix='feature1')
Note
When using the get_dummies function, NaN values are converted into all zeros.
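The following sketch illustrates both the prefix argument and the NaN behavior on a small hypothetical categorical column (the column name and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# A hypothetical categorical column containing a missing value
colours = pd.Series(['red', 'blue', np.nan, 'red'])

# One dummy column is created per unique value, with the given prefix
dummies = pd.get_dummies(colours, prefix='colour')
print(dummies)
# The row with NaN has 0 in every dummy column
```

Because the NaN row becomes all zeros rather than raising an error, it is worth checking for missing values before encoding if they carry meaning in your dataset.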
In the following exercise, you'll learn how to preprocess non-numerical fields. You will utilize the same dataset that you used in the previous exercise and activity, which describes the bias correction of air temperature forecasts for Seoul, South Korea.
In this exercise, you will preprocess the Date column by one-hot encoding the year and the month from the Date column using the get_dummies function. You will join the one-hot encoded columns with the original DataFrame and ensure that all the fields in the resultant DataFrame are numerical.
Perform the following steps to complete this exercise:
1. Open a new Jupyter notebook to implement this exercise and save the file as Exercise2-02.ipynb. Then, import the pandas library:
import pandas as pd
2. Create a new pandas DataFrame named df and read the Bias_correction_ucl.csv file into it. Examine whether your data is properly loaded by printing the resultant DataFrame:
df = pd.read_csv('Bias_correction_ucl.csv')
Note
Make sure you change the path to the CSV file based on its location on your system. If you're running the Jupyter notebook from the same directory where the CSV file is stored, you can run the preceding code without any modification.
3. Change the data type of the Date column from object to datetime using the pandas to_datetime function:
df['Date'] = pd.to_datetime(df['Date'])
4. Create dummy columns for the year using the pandas get_dummies function. Pass in the year of the Date column as the first argument and add a prefix to the columns of the resultant DataFrame. Print out the resultant DataFrame:
year_dummies = pd.get_dummies(df['Date'].dt.year, prefix='year')
year_dummies
The output will be as follows:
Figure 2.9: Output of the get_dummies function applied to the year of the date column
The resultant DataFrame contains only 0s and 1s; each 1 corresponds to the year present in the original Date column for that row. Null values will have 0s in all columns of the newly created DataFrame.
5. Repeat the process for the month by creating dummy columns from the month of the Date column. Print out the resulting DataFrame:
month_dummies = pd.get_dummies(df['Date'].dt.month, prefix='month')
month_dummies
The output will be as follows:
Figure 2.10: The output of the get_dummies function applied to the month of the date column
The resultant DataFrame now contains only 0s and 1s for the month in the Date column.
6. Concatenate the original DataFrame with the dummy DataFrames you created for the year and month:
df = pd.concat([df, month_dummies, year_dummies], axis=1)
7. Drop the original Date column since it is now redundant:
df.drop('Date', axis=1, inplace=True)
8. Print out the data types of the columns to verify that they are all numerical:
df.dtypes
The output will be as follows:
Figure 2.11: Output of the dtypes attribute of the resultant DataFrame
Here, you can see that all the data types of the resultant DataFrame are numerical. This means they can now be passed into an ANN for modeling.
In this exercise, you successfully imported tabular data and preprocessed the Date column using the pandas library. You utilized the get_dummies function to convert categorical data into numerical data types.
Note
Another method of obtaining numerical data types from date fields is to use the pandas.Series.dt accessor object. More information about the available options can be found here: https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html.
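As a brief illustration of the .dt accessor, this sketch extracts several numerical components from a couple of made-up dates (the dates themselves are not from the chapter's dataset):

```python
import pandas as pd

# Two made-up dates, standing in for the dataset's Date column
dates = pd.to_datetime(pd.Series(['2013-06-30', '2013-07-01']))

# The .dt accessor exposes numerical components of each datetime
print(dates.dt.year.tolist())       # [2013, 2013]
print(dates.dt.month.tolist())      # [6, 7]
print(dates.dt.dayofweek.tolist())  # Monday is 0; here [6, 0]
```

Components such as dayofweek or dayofyear can be fed directly into a model as integers, or one-hot encoded with get_dummies as shown earlier, depending on whether an ordinal or categorical treatment fits the feature better.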
Processing non-numerical data is an important step in creating performant models. If possible, any domain knowledge should be imparted to the training data features. For example, when forecasting the temperature using the date, like the dataset used in the prior exercises and activity of this chapter, encoding the month would be helpful since the temperature is likely highly correlated with the month of the year. Encoding the day of the week, however, may not be useful as there is likely no correlation between the day of the week and temperature. Using this domain knowledge can aid the model to learn the underlying relationship between the features and the target.
In the next section, you will learn how to process image data so that it can be input into machine learning models.