
When you are dealing with large amounts of data – for example, when analyzing whole-genome sequencing data – memory usage may become a limitation for your analysis. It turns out that naïve use of pandas is not very efficient from a memory perspective, and we can substantially reduce its consumption.
In this recipe, we are going to revisit our VAERS data and look at several ways to reduce pandas memory usage. The impact of these changes can be massive: in many cases, reducing memory consumption may mean the difference between being able to use pandas and requiring a more complex alternative, such as Dask or Spark.
We will be using the data from the first recipe. If you have run it, you are all set; if not, please follow the steps discussed there. You can find this code in Chapter02/Pandas_Memory.py.
Follow these steps:
Start by loading the data and inspecting the overall memory usage of the DataFrame:

import numpy as np
import pandas as pd

vdata = pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
vdata.info(memory_usage="deep")
Here is an abridged version of the output:
RangeIndex: 654986 entries, 0 to 654985
Data columns (total 35 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   VAERS_ID      654986 non-null  int64
 2   STATE         572236 non-null  object
 3   AGE_YRS       583424 non-null  float64
 6   SEX           654986 non-null  object
 8   SYMPTOM_TEXT  654828 non-null  object
 9   DIED          8536 non-null    object
 31  BIRTH_DEFECT  383 non-null     object
 34  ALLERGIES     330630 non-null  object
dtypes: float64(5), int64(2), object(28)
memory usage: 1.3 GB
Here, we have information about the number of rows, along with the type and non-null count of each column. Finally, we can see that the DataFrame requires a whopping 1.3 GB; the memory_usage="deep" option ensures that the real size of object columns, not just their pointers, is counted.
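To see why deep matters, you can compare the shallow and deep estimates. This is a quick aside of mine, not part of the recipe: without deep, pandas only counts the 8-byte object pointers, not the strings they point to:

# A quick comparison, assuming vdata is loaded as above
shallow = vdata.memory_usage().sum()        # counts only pointers for object columns
deep = vdata.memory_usage(deep=True).sum()  # also counts the strings themselves
print(shallow // 1024 ** 2, "MB vs", deep // 1024 ** 2, "MB")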
Now, let's look at the memory cost of each individual column:

for name in vdata.columns:
    col_bytes = vdata[name].memory_usage(index=False, deep=True)
    col_type = vdata[name].dtype
    print(name, col_type, col_bytes // (1024 ** 2))
Here is an abridged version of the output:
VAERS_ID int64 4
STATE object 34
AGE_YRS float64 4
SEX object 36
RPT_DATE object 20
SYMPTOM_TEXT object 442
DIED object 20
ALLERGIES object 34
SYMPTOM_TEXT occupies 442 MB, which is around one-third of the entire table.
Now, let's look at the DIED column. Can we find a more efficient representation?

vdata.DIED.memory_usage(index=False, deep=True)
vdata.DIED.fillna(False).astype(bool).memory_usage(index=False, deep=True)
The original column takes 21,181,488 bytes, whereas our compact representation takes 656,986 bytes. That's 32 times less! Note that fillna(False) treats missing values as False, so we lose the distinction between "not reported" and "no".
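If you need to preserve that distinction, pandas also offers a nullable boolean dtype. Here is a minimal sketch of mine, not from the recipe, assuming vdata is loaded as above; since this dtype stores a values array plus a validity mask, expect roughly two bytes per row instead of one:

# A sketch: map "Y" to True and leave everything else missing (<NA>)
died_nullable = vdata.DIED.map({"Y": True}).astype("boolean")
print(died_nullable.memory_usage(index=False, deep=True))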
What about the STATE column? Can we do better?

vdata["STATE"] = vdata.STATE.str.upper()
states = list(vdata["STATE"].unique())
vdata["encoded_state"] = vdata.STATE.apply(lambda state: states.index(state))
vdata["encoded_state"] = vdata["encoded_state"].astype(np.uint8)
vdata["STATE"].memory_usage(index=False, deep=True)
vdata["encoded_state"].memory_usage(index=False, deep=True)
Here, we convert the STATE column, which is text, into encoded_state, which is a number. This number is the position of the state's name in the states list; to recover the original name, we simply look it up in that list. The original column takes around 36 MB, whereas the encoded column takes 0.6 MB.
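As a quick sanity check, which is my addition rather than part of the recipe, here is how you decode an encoded value back into a state name via the states list:

# A sketch, assuming vdata and states exist as defined above
first_code = vdata["encoded_state"].iloc[0]
print(first_code, states[first_code])  # the numeric code and its state name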
As an alternative to this approach, you can look at categorical variables in pandas. I prefer to use them as they have wider applications.
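Here is a minimal sketch of the categorical alternative (my addition, assuming vdata is loaded as above). Internally, pandas stores a category column as small integer codes plus a lookup table of unique values, much like our manual encoding:

# Convert STATE to the categorical dtype and check its footprint
state_cat = vdata["STATE"].astype("category")
print(state_cat.memory_usage(index=False, deep=True))
# The codes and the lookup table are accessible directly:
print(state_cat.cat.codes.dtype)      # small integer codes
print(state_cat.cat.categories[:5])   # the unique state names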
If we want to apply these optimizations while the data is being loaded, we first need the complete list of states. We can get it by reading only the STATE column:

states = list(pd.read_csv(
    "vdata_sample.csv.gz",
    converters={"STATE": lambda state: state.upper()},
    usecols=["STATE"]
)["STATE"].unique())
We have a converter that simply returns the uppercase version of the state. We only load the STATE column to save memory and processing time. Finally, we get the STATE column from the DataFrame (which has only that single column) and extract its unique values.
The biggest saving available is not loading SYMPTOM_TEXT at all, as it accounts for around one-third of the data. If we do not need it for our analysis, we can just skip it. Here is the final version:

vdata = pd.read_csv(
    "vdata_sample.csv.gz",
    index_col="VAERS_ID",
    converters={
        "DIED": lambda died: died == "Y",
        "STATE": lambda state: states.index(state.upper())
    },
    usecols=lambda name: name != "SYMPTOM_TEXT"
)
vdata["STATE"] = vdata["STATE"].astype(np.uint8)
vdata.info(memory_usage="deep")
We are now at 714 MB, which is a bit over half of the original. This could still be substantially reduced by applying the methods we used for STATE and DIED to all other columns, as sketched below.
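For example, here is a minimal sketch of how the same ideas could extend to a few more columns. This is my own illustration, with column choices based on the info() output shown earlier; the exact savings will depend on your data:

# A sketch, assuming vdata is loaded as in the final version above
# SEX has few distinct values, so a categorical is a good fit
vdata["SEX"] = vdata["SEX"].astype("category")
# AGE_YRS does not need double precision; float32 halves its size
vdata["AGE_YRS"] = vdata["AGE_YRS"].astype(np.float32)
# BIRTH_DEFECT is a mostly missing "Y" flag, just like DIED
vdata["BIRTH_DEFECT"] = (vdata["BIRTH_DEFECT"] == "Y")
vdata.info(memory_usage="deep")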