
When you are dealing with large amounts of data – for example, when analyzing whole-genome sequencing data – memory usage may become a limitation for your analysis. It turns out that naïve use of pandas is not very efficient from a memory perspective, and we can substantially reduce its consumption.
In this recipe, we are going to revisit our VAERS data and look at several ways to reduce pandas memory usage. The impact of these changes can be massive: in many cases, reducing memory consumption may mean the difference between being able to use pandas and requiring a more complex alternative, such as Dask or Spark.
We will be using the data from the first recipe. If you have run it, you are all set; if not, please follow the steps discussed there. You can find this code in Chapter02/Pandas_Memory.py.
Follow these steps:
Start by loading the data and inspecting the overall memory usage of the DataFrame:

import numpy as np
import pandas as pd

vdata = pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
vdata.info(memory_usage="deep")
Here is an abridged version of the output:
RangeIndex: 654986 entries, 0 to 654985
Data columns (total 35 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   VAERS_ID      654986 non-null  int64
 2   STATE         572236 non-null  object
 3   AGE_YRS       583424 non-null  float64
 6   SEX           654986 non-null  object
 8   SYMPTOM_TEXT  654828 non-null  object
 9   DIED          8536 non-null    object
 31  BIRTH_DEFECT  383 non-null     object
 34  ALLERGIES     330630 non-null  object
dtypes: float64(5), int64(2), object(28)
memory usage: 1.3 GB
Here, we have information about the number of rows, along with the type and non-null count of each column. Finally, we can see that the DataFrame requires a whopping 1.3 GB; the memory_usage="deep" option ensures that the real size of object columns, not just their pointers, is counted.
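To see why deep matters, you can compare the shallow and deep estimates. This is a quick aside of mine, not part of the recipe: without deep, pandas only counts the 8-byte object pointers, not the strings they point to:

# A quick comparison, assuming vdata is loaded as above
shallow = vdata.memory_usage().sum()        # counts only pointers for object columns
deep = vdata.memory_usage(deep=True).sum()  # also counts the strings themselves
print(shallow // 1024 ** 2, "MB vs", deep // 1024 ** 2, "MB")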
Now, let's look at the memory cost of each individual column:

for name in vdata.columns:
    col_bytes = vdata[name].memory_usage(index=False, deep=True)
    col_type = vdata[name].dtype
    print(name, col_type, col_bytes // (1024 ** 2))
Here is an abridged version of the output:
VAERS_ID int64 4
STATE object 34
AGE_YRS float64 4
SEX object 36
RPT_DATE object 20
SYMPTOM_TEXT object 442
DIED object 20
ALLERGIES object 34
SYMPTOM_TEXT occupies 442 MB, which is around one-third of the entire table.
Now, let's look at the DIED column. Can we find a more efficient representation?

vdata.DIED.memory_usage(index=False, deep=True)
vdata.DIED.fillna(False).astype(bool).memory_usage(index=False, deep=True)
The original column takes 21,181,488 bytes, whereas our compact representation takes 656,986 bytes. That's 32 times less! Note that fillna(False) treats missing values as False, so we lose the distinction between "not reported" and "no".
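If you need to preserve that distinction, pandas also offers a nullable boolean dtype. Here is a minimal sketch of mine, not from the recipe, assuming vdata is loaded as above; since this dtype stores a values array plus a validity mask, expect roughly two bytes per row instead of one:

# A sketch: map "Y" to True and leave everything else missing (<NA>)
died_nullable = vdata.DIED.map({"Y": True}).astype("boolean")
print(died_nullable.memory_usage(index=False, deep=True))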
What about the STATE column? Can we do better?

vdata["STATE"] = vdata.STATE.str.upper()
states = list(vdata["STATE"].unique())
vdata["encoded_state"] = vdata.STATE.apply(lambda state: states.index(state))
vdata["encoded_state"] = vdata["encoded_state"].astype(np.uint8)
vdata["STATE"].memory_usage(index=False, deep=True)
vdata["encoded_state"].memory_usage(index=False, deep=True)
Here, we convert the STATE column, which is text, into encoded_state, which is a number. This number is the position of the state's name in the states list; to recover the original name, we simply look it up in that list. The original column takes around 36 MB, whereas the encoded column takes 0.6 MB.
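As a quick sanity check, which is my addition rather than part of the recipe, here is how you decode an encoded value back into a state name via the states list:

# A sketch, assuming vdata and states exist as defined above
first_code = vdata["encoded_state"].iloc[0]
print(first_code, states[first_code])  # the numeric code and its state name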
As an alternative to this approach, you can look at categorical variables in pandas. I prefer to use them as they have wider applications.
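Here is a minimal sketch of the categorical alternative (my addition, assuming vdata is loaded as above). Internally, pandas stores a category column as small integer codes plus a lookup table of unique values, much like our manual encoding:

# Convert STATE to the categorical dtype and check its footprint
state_cat = vdata["STATE"].astype("category")
print(state_cat.memory_usage(index=False, deep=True))
# The codes and the lookup table are accessible directly:
print(state_cat.cat.codes.dtype)      # small integer codes
print(state_cat.cat.categories[:5])   # the unique state names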
If we want to apply these optimizations while the data is being loaded, we first need the complete list of states. We can get it by reading only the STATE column:

states = list(pd.read_csv(
    "vdata_sample.csv.gz",
    converters={"STATE": lambda state: state.upper()},
    usecols=["STATE"]
)["STATE"].unique())
We have a converter that simply returns the uppercase version of the state. We only load the STATE column to save memory and processing time. Finally, we get the STATE column from the DataFrame (which has only that single column) and extract its unique values.
The biggest saving available is not loading SYMPTOM_TEXT at all, as it accounts for around one-third of the data. If we do not need it for our analysis, we can just skip it. Here is the final version:

vdata = pd.read_csv(
    "vdata_sample.csv.gz",
    index_col="VAERS_ID",
    converters={
        "DIED": lambda died: died == "Y",
        "STATE": lambda state: states.index(state.upper())
    },
    usecols=lambda name: name != "SYMPTOM_TEXT"
)
vdata["STATE"] = vdata["STATE"].astype(np.uint8)
vdata.info(memory_usage="deep")
We are now at 714 MB, which is a bit over half of the original. This could still be substantially reduced by applying the methods we used for STATE and DIED to all other columns, as sketched below.
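For example, here is a minimal sketch of how the same ideas could extend to a few more columns. This is my own illustration, with column choices based on the info() output shown earlier; the exact savings will depend on your data:

# A sketch, assuming vdata is loaded as in the final version above
# SEX has few distinct values, so a categorical is a good fit
vdata["SEX"] = vdata["SEX"].astype("category")
# AGE_YRS does not need double precision; float32 halves its size
vdata["AGE_YRS"] = vdata["AGE_YRS"].astype(np.float32)
# BIRTH_DEFECT is a mostly missing "Y" flag, just like DIED
vdata["BIRTH_DEFECT"] = (vdata["BIRTH_DEFECT"] == "Y")
vdata.info(memory_usage="deep")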