Exploratory data analysis (EDA) is one of the most important steps in the machine learning workflow, and pandas makes it easy to visualize data in Python. pandas provides a high-level API over the popular matplotlib library, so plots can be constructed directly from DataFrames.
As an example, let's visualize the Iris dataset using pandas to uncover important insights. We'll start with a scatterplot to see how sepal_width is related to sepal_length. We can construct a scatterplot using the DataFrame.plot.scatter() method, which is available on every DataFrame:
import matplotlib.pyplot as plt
# Define marker shapes by class
marker_shapes = ['.', '^', '*']
# Then, plot the scatterplot, drawing each species onto the same axes
ax = plt.axes()
for i, species in enumerate(df['class'].unique()):
    species_data = df[df['class'] == species]
    species_data.plot.scatter(x='sepal_length',
                              y='sepal_width',
                              marker=marker_shapes[i],
                              s=100,
                              title="Sepal Width vs Length by Species",
                              label=species, figsize=(10,7), ax=ax)
We'll get a scatterplot, as shown in the following screenshot:
From the scatterplot, we can notice some interesting insights. First, the relationship between sepal_width and sepal_length depends on the species. Setosa (dots) has a fairly linear relationship between sepal_width and sepal_length, while versicolor (triangles) and virginica (stars) tend to have a much greater sepal_length than Setosa. If we're designing a machine learning algorithm to predict the species of a flower, this suggests that sepal_width and sepal_length are useful features to include in our model.
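To back up what we read off the scatterplot with numbers, we can summarize the sepal measurements per species. The following is a minimal sketch, not part of the original example: the helper name sepal_summary_by_species is hypothetical, and it assumes the DataFrame has the sepal_length, sepal_width, and class columns used above.

```python
import pandas as pd


def sepal_summary_by_species(df: pd.DataFrame, class_col: str = "class") -> pd.DataFrame:
    """Return per-species mean sepal measurements and their correlation.

    Hypothetical helper: assumes columns 'sepal_length' and 'sepal_width'
    plus a species column named by class_col.
    """
    # Mean sepal length and width for each species
    summary = df.groupby(class_col)[["sepal_length", "sepal_width"]].mean()

    # Pearson correlation between sepal length and width within each species
    corr = {
        species: group["sepal_length"].corr(group["sepal_width"])
        for species, group in df.groupby(class_col)
    }
    # Assigning a Series aligns on the species index of the summary
    summary["length_width_corr"] = pd.Series(corr)
    return summary
```

Calling sepal_summary_by_species(df) returns one row per species. A correlation close to 1 for a species would be consistent with the roughly linear pattern seen in the plot, and the mean sepal_length column shows how far apart the species sit along the x axis.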
Next, let's plot a histogram to investigate the distribution of petal lengths. As with scatterplots, pandas provides a built-in plot.hist() method, available both on DataFrames and on individual columns:
df['petal_length'].plot.hist(title='Histogram of Petal Length')
And we can see the output in the following screenshot:
We can see that the distribution of petal lengths is essentially bimodal, which suggests that certain species of flowers have shorter petals than the rest. We can also plot a boxplot of the data. The boxplot is an important visualization tool that data scientists use to understand the distribution of the data in terms of the first quartile, the median, and the third quartile:
df.plot.box(title='Boxplot of Sepal Length & Width, and Petal Length & Width')
The output is given in the following screenshot:
From the boxplot, we can see that the variance of sepal_width is much smaller than that of the other numeric variables, with petal_length having the greatest variance.
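A boxplot shows spread visually, so it can be worth confirming the comparison numerically. The following is a minimal sketch under the assumption that all numeric columns of the DataFrame should be compared. The helper name spread_summary is hypothetical and not part of the original example.

```python
import pandas as pd


def spread_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return variance, median, and interquartile range per numeric column.

    Hypothetical helper: non-numeric columns (such as the species label)
    are ignored. Rows are sorted from smallest to largest variance.
    """
    numeric = df.select_dtypes(include="number")

    # First and third quartiles, the edges of each box in a boxplot
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)

    summary = pd.DataFrame({
        "variance": numeric.var(),   # sample variance (ddof=1)
        "median": numeric.median(),  # the line inside each box
        "iqr": q3 - q1,              # the height of each box
    })
    return summary.sort_values("variance")
```

Calling spread_summary(df) lists the columns from least to most variable, so the first and last rows can be compared directly with the narrowest and widest boxes in the plot.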
We have now seen how convenient it is to visualize data directly with pandas. Keep in mind that EDA is a crucial step in the machine learning pipeline, and we will continue to do it in every project for the rest of the book.