Dealing with missing values

In this recipe, we’ll cover how to impute missing values in a time series. We’ll discuss different imputation methods and the factors to consider when choosing among them, and we’ll show how to apply each one using pandas.

Getting ready

Missing values are an issue that plagues all kinds of data, including time series. Observations are often unavailable for various reasons, such as sensor failure or annotation errors. In such cases, data imputation can be used to overcome this problem. Data imputation works by assigning a value based on some rule, such as the mean or some predefined value.

How to do it…

We start by simulating missing data. The following code removes 60% of observations from a sample of two years of the solar radiation time series:

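import numpy as np

# Work on a copy of the first two years so the original series stays intact
sample_with_nan = series_daily.head(365 * 2).copy()
# Number of observations to remove (60% of the sample)
size_na = int(0.6 * len(sample_with_nan))
# Draw that many distinct integer positions at random
idx = np.random.choice(a=len(sample_with_nan),
                       size=size_na,
                       replace=False)
# Assign by position; integer keys in plain [] are ambiguous (label or position)
sample_with_nan.iloc[idx] = np.nan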

We use the np.random.choice() function from numpy to draw a random sample of positions in the time series, without replacement. The observations at these positions are then set to a missing value (np.nan).

In datasets without temporal order, it is common to impute missing values using central statistics such as the mean or median. This can be done as follows:

average_value = sample_with_nan.mean()
imp_mean = sample_with_nan.fillna(average_value)

Time series imputation must take into account the temporal nature of observations. This means that the assigned value should follow the dynamics of the series. A more common approach in time series is to impute missing data with the last known observation. This approach is implemented in the ffill() method:

imp_ffill = sample_with_nan.ffill()

Another, less common, approach that uses the order of observations is bfill():

imp_bfill = sample_with_nan.bfill()

The bfill() method imputes missing data with the next available observation in the dataset.
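To make the difference between the three strategies concrete, here is a minimal sketch on a small hand-made series. The series toy and the table comparison are hypothetical names introduced only for illustration; they are not part of the solar radiation dataset used in this recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical six-day series with gaps at the start, middle, and end
toy = pd.Series(
    [np.nan, 2.0, np.nan, np.nan, 5.0, np.nan],
    index=pd.date_range("2021-01-01", periods=6, freq="D"),
)

# Apply each strategy to the same series and line the results up
comparison = pd.DataFrame({
    "original": toy,
    "mean": toy.fillna(toy.mean()),  # same value (3.5) in every gap
    "ffill": toy.ffill(),            # copies the last known value forward
    "bfill": toy.bfill(),            # copies the next known value backward
})
print(comparison)
```

In this toy case, ffill leaves the first value missing and bfill leaves the last value missing, since neither has an observation to copy from on that side.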

How it works…

The following figure shows the reconstructed time series after imputation with each method:

Figure 1.2: Imputing missing data with different strategies

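The mean imputation approach misses the time series dynamics because it assigns the same value to every gap. Both ffill and bfill lead to a reconstructed time series with dynamics similar to those of the original. Usually, ffill is preferable because it respects the temporal order of observations: bfill uses future information to alter (impute) the past, while ffill relies only on values that were already known at each point in time. Note that ffill cannot fill missing values at the very start of the series, and bfill cannot fill those at the very end, because there is no earlier or later observation to copy.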
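One way to check this comparison numerically is to measure how far each reconstruction is from the true values at the positions that were removed. The following is a rough sketch under stated assumptions: it uses a synthetic noisy sine wave with a yearly cycle as a stand-in for the solar radiation series, and the names original, with_nan, candidates, and mae are hypothetical. The actual numbers for the real dataset will differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Assumption: a synthetic daily series with a yearly cycle plus noise
index = pd.date_range("2020-01-01", periods=365 * 2, freq="D")
t = np.arange(len(index))
original = pd.Series(
    np.sin(2 * np.pi * t / 365) + rng.normal(0, 0.1, len(index)),
    index=index,
)

# Remove 60% of the observations at random positions
idx = rng.choice(len(original), size=int(0.6 * len(original)), replace=False)
with_nan = original.copy()
with_nan.iloc[idx] = np.nan
removed = with_nan.isna()

candidates = {
    "mean": with_nan.fillna(with_nan.mean()),
    "ffill": with_nan.ffill(),
    "bfill": with_nan.bfill(),
}

# Mean absolute error on the removed positions only; values that a method
# could not fill (leading or trailing gaps) are skipped by pandas' mean()
mae = {
    name: (imputed - original).abs()[removed].mean()
    for name, imputed in candidates.items()
}
print(mae)
```

On a smooth, slowly varying series like this one, the order-aware methods should show a clearly lower error than mean imputation, which matches the visual impression from the figure.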

There’s more…

The imputation process can also be constrained, for example, by capping the number of consecutive missing values that are filled using the limit parameter of ffill() and bfill(). You can learn more about this in the documentation pages of these functions, for example, https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html.
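As a brief illustration of limiting the number of imputed observations, the sketch below uses the limit parameter of ffill() on a small hypothetical series (gappy and capped are illustrative names). With limit=1, at most one consecutive missing value after each known observation is filled, and longer gaps stay partly missing.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap of three consecutive missing values
gappy = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one missing value after each known observation
capped = gappy.ffill(limit=1)
print(capped)
```

Capping the fill length can be a reasonable choice when a stale value stops being a credible estimate after a few time steps.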

Deep Learning for Time Series Cookbook

By : Cerqueira, Luís Roque