Sign In Start Free Trial
Account

Add to playlist

Create a Playlist

Modal Close icon
You need to login to use this feature.
  • Book Overview & Buying Pandas 1.x Cookbook
  • Table Of Contents Toc
  • Feedback & Rating feedback
Pandas 1.x Cookbook

Pandas 1.x Cookbook

By : Matthew Harrison, Theodore Petrou
4.5 (28)
close
close
Pandas 1.x Cookbook

Pandas 1.x Cookbook

4.5 (28)
By: Matthew Harrison, Theodore Petrou

Overview of this book

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through situations that you are highly likely to encounter. This new updated and revised edition provides you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. Many advanced recipes combine several different features across the pandas library to generate results.
Table of Contents (17 chapters)
close
close
15
Other Books You May Enjoy
16
Index

Summarizing a DataFrame

In the Calling Series methods recipe in Chapter 1, Pandas Foundations, a variety of methods operated on a single column or Series of data. Many of these were aggregation or reducing methods that returned a single scalar value. When these same methods are called from a DataFrame, they perform that operation for each column at once and reduce the results for each column in the DataFrame. They return a Series with the column names in the index and the summary for each column as the value.

In this recipe, we explore a variety of the most common DataFrame attributes and methods with the movie dataset.

How to do it...

  1. Read in the movie dataset, and examine the basic descriptive properties, .shape, .size, and .ndim, along with running the len function:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> movies.shape
    (4916, 28)
    >>> movies.size
    137648
    >>> movies.ndim
    2
    >>> len(movies)
    4916
    
  2. The .count method shows the number of non-missing values for each column. It is an aggregation method as it summarizes every column in a single value. The output is a Series that has the original column names as its index:
    >>> movies.count()
    color                      4897
    director_name              4814
    num_critic_for_reviews     4867
    duration                   4901
    director_facebook_likes    4814
                               ... 
    title_year                 4810
    actor_2_facebook_likes     4903
    imdb_score                 4916
    aspect_ratio               4590
    movie_facebook_likes       4916
    Length: 28, dtype: int64
    
  3. The other methods that compute summary statistics, .min, .max, .mean, .median, and .std, return Series that have the column names of the numeric columns in the index and their aggregations as the values:
    >>> movies.min()
    num_critic_for_reviews        1.00
    duration                      7.00
    director_facebook_likes       0.00
    actor_3_facebook_likes        0.00
    actor_1_facebook_likes        0.00
                                ...   
    title_year                 1916.00
    actor_2_facebook_likes        0.00
    imdb_score                    1.60
    aspect_ratio                  1.18
    movie_facebook_likes          0.00
    Length: 16, dtype: float64
    
  4. The .describe method is very powerful and calculates all the descriptive statistics and quartiles at once. The end result is a DataFrame with the descriptive statistics names as its index. I like to transpose the results using .T as I can usually fit more information on the screen that way:
    >>> movies.describe().T
                   count         mean  ...       75%       max
    num_criti...  4867.0   137.988905  ...    191.00     813.0
    duration      4901.0   107.090798  ...    118.00     511.0
    director_...  4814.0   691.014541  ...    189.75   23000.0
    actor_3_f...  4893.0   631.276313  ...    633.00   23000.0
    actor_1_f...  4909.0  6494.488491  ...  11000.00  640000.0
    ...              ...          ...  ...       ...       ...
    title_year    4810.0  2002.447609  ...   2011.00    2016.0
    actor_2_f...  4903.0  1621.923516  ...    912.00  137000.0
    imdb_score    4916.0     6.437429  ...      7.20       9.5
    aspect_ratio  4590.0     2.222349  ...      2.35      16.0
    movie_fac...  4916.0  7348.294142  ...   2000.00  349000.0
    
  5. It is possible to specify exact quantiles in the .describe method using the percentiles parameter:
    >>> movies.describe(percentiles=[0.01, 0.3, 0.99]).T
                   count         mean  ...       99%       max
    num_criti...  4867.0   137.988905  ...    546.68     813.0
    duration      4901.0   107.090798  ...    189.00     511.0
    director_...  4814.0   691.014541  ...  16000.00   23000.0
    actor_3_f...  4893.0   631.276313  ...  11000.00   23000.0
    actor_1_f...  4909.0  6494.488491  ...  44920.00  640000.0
    ...              ...          ...  ...       ...       ...
    title_year    4810.0  2002.447609  ...   2016.00    2016.0
    actor_2_f...  4903.0  1621.923516  ...  17000.00  137000.0
    imdb_score    4916.0     6.437429  ...      8.50       9.5
    aspect_ratio  4590.0     2.222349  ...      4.00      16.0
    movie_fac...  4916.0  7348.294142  ...  93850.00  349000.0
    

How it works...

Step 1 gives basic information on the size of the dataset. The .shape attribute returns a tuple with the number of rows and columns. The .size attribute returns the total number of elements in the DataFrame, which is just the product of the number of rows and columns. The .ndim attribute returns the number of dimensions, which is two for all DataFrames. When a DataFrame is passed to the built-in len function, it returns the number of rows.

The methods in step 2 and step 3 aggregate each column down to a single number. Each column name is now the index label in a Series with its aggregated result as the corresponding value.

If you look closely, you will notice that the output from step 3 is missing all the object columns from step 2. This method ignores string columns by default.

Note that numeric columns have missing values but have a result returned by .describe. By default, pandas handles missing values in numeric columns by skipping them. It is possible to change this behavior by setting the skipna parameter to False. This will cause pandas to return NaN for all these aggregation methods if there exists at least a single missing value.

The .describe method displays the summary statistics of the numeric columns. You can expand its summary to include more quantiles by passing a list of numbers between 0 and 1 to the percentiles parameter. See the Developing a data analysis routine recipe for more on the .describe method.

There's more...

To see how the .skipna parameter affects the outcome, we can set its value to False and rerun step 3 from the preceding recipe. Only numeric columns without missing values will calculate a result:

>>> movies.min(skipna=False)
num_critic_for_reviews     NaN
duration                   NaN
director_facebook_likes    NaN
actor_3_facebook_likes     NaN
actor_1_facebook_likes     NaN
                          ... 
title_year                 NaN
actor_2_facebook_likes     NaN
imdb_score                 1.6
aspect_ratio               NaN
movie_facebook_likes       0.0
Length: 16, dtype: float64

Create a Note

Modal Close icon
You need to login to use this feature.
notes
bookmark search playlist download font-size

Change the font size

margin-width

Change margin width

day-mode

Change background colour

Close icon Search
Country selected

Close icon Your notes and bookmarks

Delete Bookmark

Modal Close icon
Are you sure you want to delete it?
Cancel
Yes, Delete

Delete Note

Modal Close icon
Are you sure you want to delete it?
Cancel
Yes, Delete

Confirmation

Modal Close icon
claim successful

Buy this book with your credits?

Modal Close icon
Are you sure you want to buy this book with one of your credits?
Close
YES, BUY