The TensorFlow Workshop
In this section, you will learn how to load tabular data into a Python development environment so that it can be used for TensorFlow modeling. You will use pandas and scikit-learn, which provide classes and functions that are useful for processing data. You will also explore methods that can be used to preprocess this data.
Tabular data can be loaded into memory by using the pandas read_csv function and passing in the path to the dataset. The function is well suited to loading tabular data and is easy to use:
df = pd.read_csv('path/to/dataset')
In order to normalize the data, you can use a scaler that is available in scikit-learn. There are multiple scalers that can be applied; StandardScaler will normalize the data so that the fields of the dataset have a mean of 0 and a standard deviation of 1. Another common scaler is MinMaxScaler, which will rescale the dataset so that the fields have a minimum value of 0 and a maximum value of 1.
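As a quick illustration of the difference between the two scalers, the following sketch applies both to the same column. The array used here is made up for illustration and is not the chapter's dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# A single made-up numerical column, standing in for a field of a dataset
data = np.array([[20.0], [25.0], [30.0], [35.0], [38.0]])

# StandardScaler: the result has a mean of 0 and a standard deviation of 1
standardized = StandardScaler().fit_transform(data)
print(standardized.mean(), standardized.std())

# MinMaxScaler: the result has a minimum of 0 and a maximum of 1
rescaled = MinMaxScaler().fit_transform(data)
print(rescaled.min(), rescaled.max())
```

Note that both scalers preserve the relative ordering of the values; only the location and spread of the distribution change.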
To use a scaler, it must be initialized and fit to the dataset; the fitted scaler can then transform the dataset. In fact, the fitting and transformation processes can be performed in one step by using the fit_transform method, as follows:
scaler = StandardScaler()
transformed_df = scaler.fit_transform(df)
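To see that fit_transform really is the two steps combined, this small sketch (on a made-up array, not the chapter's dataset) compares the two approaches:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for the DataFrame used in the text
data = np.array([[1.0], [2.0], [3.0], [4.0]])

# Two-step approach: fit learns the column statistics, transform applies them
scaler = StandardScaler()
scaler.fit(data)
two_step = scaler.transform(data)

# One-step approach: fit_transform performs both operations at once
one_step = StandardScaler().fit_transform(data)

print(np.allclose(two_step, one_step))  # True
```

The two-step form is still useful when you need to fit a scaler on training data and later apply the same transformation to unseen data with transform alone.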
In the first exercise, you will learn how to use pandas and scikit-learn to load a dataset and preprocess it so that it is suitable for modeling.
The dataset, Bias_correction_ucl.csv, contains information for bias correction of the next-day maximum and minimum air temperature forecasts for Seoul, South Korea. The fields represent temperature measurements on the given date, the weather station at which the metrics were measured, model forecasts of weather-related metrics such as humidity, and projections for the temperature of the following day. You are required to preprocess the data to make all the columns normally distributed with a mean of 0 and a standard deviation of 1. You will demonstrate the effects with the Present_Tmax column, which represents the maximum temperature on the given date at a given weather station.
Note
The dataset can be found here: https://packt.link/l83pR.
Perform the following steps to complete this exercise:
1. Open a new Jupyter notebook to implement this exercise and save the file as Exercise2-01.ipynb. Then, import the pandas library:
import pandas as pd
Note
You can find the documentation for pandas at the following link: https://pandas.pydata.org/docs/.
2. Create a new pandas DataFrame named df and read the Bias_correction_ucl.csv file into it. Examine whether your data is properly loaded by printing the resultant DataFrame:
df = pd.read_csv('Bias_correction_ucl.csv')
df
Note
Make sure you change the path to the CSV file based on its location on your system. If you're running the Jupyter notebook from the same directory where the CSV file is stored, you can run the preceding code without any modification.
The output will be as follows:
Figure 2.5: The output from printing the DataFrame
3. Drop the Date column using the drop method of the DataFrame, passing in the name of the column. The Date column is dropped because it is a non-numerical field, and rescaling is not possible when non-numerical fields exist. Since you are dropping a column, both the axis=1 argument and the inplace=True argument should be passed:
df.drop('Date', inplace=True, axis=1)
4. Plot a histogram of the Present_Tmax column, which represents the maximum temperature across dates and weather stations within the dataset:
ax = df['Present_Tmax'].hist(color='gray')
ax.set_xlabel("Temperature")
ax.set_ylabel("Frequency")
The output will be as follows:
Figure 2.6: A Temperature versus Frequency histogram of the Present_Tmax column
The resultant histogram shows the distribution of values for the Present_Tmax column. You can see that the temperature values vary from 20 to 38 degrees Celsius. Plotting a histogram of the feature values is a good way to view the distribution of values and understand whether scaling is required as a preprocessing step.
5. Import the StandardScaler class from scikit-learn's preprocessing package. Initialize the scaler, fit it to the dataset, and transform the DataFrame using the scaler's fit_transform method. Create a new DataFrame, df2, from the transformed result, since the output of the fit_transform method is a NumPy array. The standard scaler will transform the numerical fields so that the mean of each field is 0 and the standard deviation is 1:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df2 = scaler.fit_transform(df)
df2 = pd.DataFrame(df2, columns=df.columns)
Note
After the scaler has been fit, the per-column mean and standard deviation it learned from the data can be retrieved from the scaler object.
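For reference, scikit-learn exposes the statistics that a fitted StandardScaler learned through its mean_ and scale_ attributes, as this minimal sketch on a made-up column shows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column data standing in for a numerical field
data = np.array([[10.0], [20.0], [30.0]])

scaler = StandardScaler()
scaler.fit(data)

# The per-column statistics learned during fitting
print(scaler.mean_)   # the mean of each column
print(scaler.scale_)  # the (population) standard deviation of each column
```

These attributes are what the scaler uses when transform is later applied to new data, which is why the same fitted scaler should be reused for training and test sets.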
6. Plot a histogram of the transformed Present_Tmax column:
ax = df2['Present_Tmax'].hist(color='gray')
ax.set_xlabel("Normalized Temperature")
ax.set_ylabel("Frequency")
The output will be as follows:
Figure 2.7: A histogram of the rescaled Present_Tmax column
The resulting histogram shows that the normalized temperature values range from around -3 to 3, as evidenced by the range on the x axis of the histogram. By using the standard scaler, the values will always have a mean of 0 and a standard deviation of 1. Having the features normalized can speed up the model training process.
In this exercise, you successfully imported tabular data using the pandas library and performed some preprocessing using the scikit-learn library. The preprocessing of the data included dropping the Date column and scaling the numerical fields so that they have a mean of 0 and a standard deviation of 1.
In the following activity, you will load tabular data using the pandas library and scale that data using the MinMaxScaler from scikit-learn. You will do so on the same dataset that you used in the prior exercise, which describes the bias correction of air temperature forecasts for Seoul, South Korea.
In this activity, you are required to load tabular data and rescale it using a MinMaxScaler. The dataset, Bias_correction_ucl.csv, contains information for bias correction of the next-day maximum and minimum air temperature forecasts for Seoul, South Korea. The fields represent temperature measurements on the given date, the weather station at which the metrics were measured, model forecasts of weather-related metrics such as humidity, and projections for the temperature of the following day. You are required to scale the columns so that each column has a minimum value of 0 and a maximum value of 1.
Perform the following steps to complete this activity:
1. Open a new Jupyter notebook to implement this activity.
2. Import pandas and load the Bias_correction_ucl.csv dataset into memory using the read_csv function.
3. Drop the Date column of the DataFrame.
4. Plot a histogram of the Present_Tmax column.
5. Initialize a MinMaxScaler and fit it to and transform the feature DataFrame.
6. Plot a histogram of the transformed Present_Tmax column.
You should get an output similar to the following:
Figure 2.8: Expected output of Activity 2.01
Note
The solution to this activity can be found via this link.
One method of converting non-numerical fields, such as categorical or date fields, is to one-hot encode them. The one-hot encoding process creates a new column for each unique value in the provided column; in each row, every new column has a value of 0 except the one that corresponds to that row's value, which has a value of 1. The column headers of the newly created dummy columns correspond to the unique values. One-hot encoding can be achieved by using the get_dummies function of the pandas library and passing in the column to be encoded. An optional prefix argument adds a prefix to the column headers, which can be useful for referencing the columns:
dummies = pd.get_dummies(df['feature1'], prefix='feature1')
Note
When using the get_dummies function, NaN values are converted into all zeros.
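The following sketch illustrates both the prefix argument and the NaN behavior on a small hypothetical categorical column (the column name and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# A hypothetical categorical column containing a missing value
colours = pd.Series(['red', 'blue', np.nan, 'red'])

# One dummy column is created per unique value, with the given prefix
dummies = pd.get_dummies(colours, prefix='colour')
print(dummies)
# The row with NaN has 0 in every dummy column
```

Because the NaN row becomes all zeros rather than raising an error, it is worth checking for missing values before encoding if they carry meaning in your dataset.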
In the following exercise, you'll learn how to preprocess non-numerical fields. You will utilize the same dataset that you used in the previous exercise and activity, which describes the bias correction of air temperature forecasts for Seoul, South Korea.
In this exercise, you will preprocess the Date column by one-hot encoding the year and the month from the Date column using the get_dummies function. You will join the one-hot encoded columns with the original DataFrame and ensure that all the fields in the resultant DataFrame are numerical.
Perform the following steps to complete this exercise:
1. Open a new Jupyter notebook to implement this exercise and save the file as Exercise2-02.ipynb. Then, import the pandas library:
import pandas as pd
2. Create a new pandas DataFrame named df and read the Bias_correction_ucl.csv file into it. Examine whether your data is properly loaded by printing the resultant DataFrame:
df = pd.read_csv('Bias_correction_ucl.csv')
Note
Make sure you change the path to the CSV file based on its location on your system. If you're running the Jupyter notebook from the same directory where the CSV file is stored, you can run the preceding code without any modification.
3. Change the data type of the Date column from object to datetime using the pandas to_datetime function:
df['Date'] = pd.to_datetime(df['Date'])
4. Create dummy columns for the year using the pandas get_dummies function. Pass in the year of the Date column as the first argument and add a prefix to the columns of the resultant DataFrame. Print out the resultant DataFrame:
year_dummies = pd.get_dummies(df['Date'].dt.year, prefix='year')
year_dummies
The output will be as follows:
Figure 2.9: Output of the get_dummies function applied to the year of the date column
The resultant DataFrame contains only 0s and 1s; each 1 corresponds to the year present in the original Date column for that row. Null values will have 0s in all columns of the newly created DataFrame.
5. Repeat the process for the month by creating dummy columns from the month of the Date column. Print out the resulting DataFrame:
month_dummies = pd.get_dummies(df['Date'].dt.month, prefix='month')
month_dummies
The output will be as follows:
Figure 2.10: The output of the get_dummies function applied to the month of the date column
The resultant DataFrame now contains only 0s and 1s for the month in the Date column.
6. Concatenate the original DataFrame with the dummy DataFrames you created for the year and month:
df = pd.concat([df, month_dummies, year_dummies], axis=1)
7. Drop the original Date column since it is now redundant:
df.drop('Date', axis=1, inplace=True)
8. Print out the data types of the columns to verify that they are all numerical:
df.dtypes
The output will be as follows:
Figure 2.11: Output of the dtypes attribute of the resultant DataFrame
Here, you can see that all the data types of the resultant DataFrame are numerical. This means they can now be passed into an ANN for modeling.
In this exercise, you successfully imported tabular data and preprocessed the Date column using the pandas library. You utilized the get_dummies function to convert categorical data into numerical data types.
Note
Another method of obtaining numerical data types from date fields is to use the pandas.Series.dt accessor object. More information about the available options can be found here: https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html.
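As a brief illustration of the .dt accessor, this sketch extracts several numerical components from a couple of made-up dates (the dates themselves are not from the chapter's dataset):

```python
import pandas as pd

# Two made-up dates, standing in for the dataset's Date column
dates = pd.to_datetime(pd.Series(['2013-06-30', '2013-07-01']))

# The .dt accessor exposes numerical components of each datetime
print(dates.dt.year.tolist())       # [2013, 2013]
print(dates.dt.month.tolist())      # [6, 7]
print(dates.dt.dayofweek.tolist())  # Monday is 0; here [6, 0]
```

Components such as dayofweek or dayofyear can be fed directly into a model as integers, or one-hot encoded with get_dummies as shown earlier, depending on whether an ordinal or categorical treatment fits the feature better.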
Processing non-numerical data is an important step in creating performant models. If possible, any domain knowledge should be imparted to the training data features. For example, when forecasting the temperature using the date, like the dataset used in the prior exercises and activity of this chapter, encoding the month would be helpful since the temperature is likely highly correlated with the month of the year. Encoding the day of the week, however, may not be useful as there is likely no correlation between the day of the week and temperature. Using this domain knowledge can aid the model to learn the underlying relationship between the features and the target.
In the next section, you will learn how to process image data so that it can be input into machine learning models.