- Import the required Python libraries:
import pandas as pd
import matplotlib.pyplot as plt
- Let's load a few variables from the dataset into a pandas DataFrame and inspect the first five rows:
cols = ['AGE', 'NUMCHLD', 'INCOME', 'WEALTH1', 'MBCRAFT', 'MBGARDEN', 'MBBOOKS', 'MBCOLECT', 'MAGFAML', 'MAGFEM', 'MAGMALE']
data = pd.read_csv('cup98LRN.txt', usecols=cols)
data.head()
After loading the dataset, this is what the output of head() looks like when we run it in a Jupyter Notebook:

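To see what usecols does in isolation, here is a minimal sketch on a tiny in-memory CSV. The data, the STATE column, and the variable names csv_text and toy are made up for illustration and are not taken from cup98LRN.txt; empty fields are read as missing values by default.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for the real file; all values are invented.
csv_text = io.StringIO(
    "AGE,INCOME,STATE,NUMCHLD\n"
    "60,,IL,\n"
    ",5,CA,1\n"
    "46,3,NC,\n"
)

# Only the listed columns are loaded; STATE is left out of the DataFrame.
toy = pd.read_csv(csv_text, usecols=["AGE", "INCOME", "NUMCHLD"])

print(toy.columns.tolist())
print(toy.head())
```

Restricting the columns at read time keeps memory use down when the file is wide, which is the reason for passing cols to read_csv() in the step above.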
- Let's calculate the number of missing values in each variable:
data.isnull().sum()
The number of missing values per variable can be seen in the following output:
AGE 23665
NUMCHLD 83026
INCOME 21286
WEALTH1 44732
MBCRAFT 52854
MBGARDEN 52854
MBBOOKS 52854
MBCOLECT 52914
MAGFAML 52854
MAGFEM 52854
MAGMALE 52854
dtype: int64
- Let's quantify the proportion of missing values in each variable by taking the mean of the Boolean mask returned by isnull():
data.isnull().mean()
The proportion of missing values per variable can be seen in the following output, expressed as a decimal fraction (multiply by 100 to obtain a percentage):
AGE 0.248030
NUMCHLD 0.870184
INCOME 0.223096
WEALTH1 0.468830
MBCRAFT 0.553955
MBGARDEN 0.553955
MBBOOKS 0.553955
MBCOLECT 0.554584
MAGFAML 0.553955
MAGFEM 0.553955
MAGMALE 0.553955
dtype: float64
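The relationship between the two outputs above can be checked on a small example. The following is a sketch on invented data (the toy DataFrame and its values are assumptions, not part of the dataset): isnull() yields a Boolean mask, sum() counts the True values, and mean() divides that count by the number of rows.

```python
import numpy as np
import pandas as pd

# Invented toy data: AGE has 1 missing value out of 4, NUMCHLD has 3 out of 4.
toy = pd.DataFrame({
    "AGE": [60, np.nan, 46, 70],
    "NUMCHLD": [np.nan, 1, np.nan, np.nan],
})

mask = toy.isnull()        # True where a value is missing
counts = mask.sum()        # number of missing values per column
fractions = mask.mean()    # counts divided by the number of rows

# Scale the fractions to percentages for reporting.
percentages = (fractions * 100).round(1)

print(counts)
print(percentages)
```

Reporting percentages rather than raw counts makes variables comparable at a glance, whatever the number of rows.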
- Finally, let's make a bar plot with the percentage of missing values per variable:
data.isnull().mean().plot.bar(figsize=(12,6))
plt.ylabel('Percentage of missing values')
plt.xlabel('Variables')
plt.title('Quantifying missing data')
The bar plot returned by the preceding code block displays the proportion of missing data per variable, with the values on the y axis shown as fractions between 0 and 1:

We can change the figure size using the figsize argument of pandas' plot.bar(), and we can add x and y labels and a title with Matplotlib's plt.xlabel(), plt.ylabel(), and plt.title() functions to make the plot easier to read.
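As a possible variation, the sketch below scales the values to true percentages and sorts the bars so that the variables with the most missing data come first. It runs on invented toy data, and the names toy and pct_missing are illustrative assumptions; it also sets the labels through the Axes object that plot.bar() returns, which is equivalent to the plt functions for a single plot.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs outside a notebook; omit in Jupyter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented toy data, not taken from the dataset.
toy = pd.DataFrame({
    "AGE": [60, np.nan, 46, 70],
    "NUMCHLD": [np.nan, 1, np.nan, np.nan],
    "INCOME": [3, 5, 2, 4],
})

# Fraction of missing values scaled to percent, largest first.
pct_missing = toy.isnull().mean().mul(100).sort_values(ascending=False)

# plot.bar() returns a Matplotlib Axes, which we can label directly.
ax = pct_missing.plot.bar(figsize=(12, 6))
ax.set_ylabel("Missing values (%)")
ax.set_xlabel("Variables")
ax.set_title("Quantifying missing data")
plt.tight_layout()
```

Sorting is a presentation choice; keeping the original column order, as in the recipe, may be preferable when the order of variables carries meaning.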