
Bioinformatics with Python Cookbook
By :

Most of your analysis will make use of NumPy, even if you don’t use it explicitly. NumPy is an array manipulation library that is behind libraries such as pandas, Matplotlib, Biopython, and scikit-learn, among many others. While much of your bioinformatics work may not require explicit direct use of NumPy, you should be aware of its existence as it underpins almost everything you do, even if only indirectly via the other libraries.
In this recipe, we will use VAERS data to demonstrate how NumPy is behind many of the core libraries that we use. This is a very light introduction to the library so that you are aware that it exists and that it is behind almost everything. Our example will extract the number of cases from the five US states with more adverse effects, splitting them into age bins: 0 to 19 years, 20 to 39, up to 100 to 119.
Once again, we will be using the data from the first recipe, so make sure it’s available. The code for it can be found in Chapter02/NumPy.py
.
Follow these steps:
import numpy as np import pandas as pd import matplotlib.pyplot as plt vdata = pd.read_csv( "2021VAERSDATA.csv.gz", encoding="iso-8859-1") vdata["STATE"] = vdata["STATE"].str.upper() top_states = pd.DataFrame({ "size": vdata.groupby("STATE").size().sort_values(ascending=False).head(5)}).reset_index() top_states["rank"] = top_states.index top_states = top_states.set_index("STATE") top_vdata = vdata[vdata["STATE"].isin(top_states.index)] top_vdata["state_code"] = top_vdata["STATE"].apply( lambda state: top_states["rank"].at[state] ).astype(np.uint8) top_vdata = top_vdata[top_vdata["AGE_YRS"].notna()] top_vdata.loc[:,"AGE_YRS"] = top_vdata["AGE_YRS"].astype(int) top_states
The top states are as follows. This rank will be used later to construct a NumPy matrix:
Figure 2.2 – US states with largest numbers of adverse effects
age_state = top_vdata[["state_code", "AGE_YRS"]] age_state["state_code"] state_code_arr = age_state["state_code"].values type(state_code_arr), state_code_arr.shape, state_code_arr.dtype age_arr = age_state["AGE_YRS"].values type(age_arr), age_arr.shape, age_arr.dtype
Note that the data that underlies pandas is NumPy data (the values
call for both Series returns NumPy types). Also, you may recall that pandas has properties such as .shape
or .dtype
: these were inspired by NumPy and behave the same.
age_state_mat = np.zeros((5,6), dtype=np.uint64) for row in age_state.itertuples(): age_state_mat[row.state_code, row.AGE_YRS//20] += 1 age_state_mat
The array has five rows – one for each state – and six columns – one for each age group. All the cells in the array must have the same type.
We initialize the array with zeros. There are many ways to initialize arrays, but if you have a very large array, initializing it may take a lot of time. Sometimes, depending on your task, it might be OK that the array is empty at the beginning (meaning it was initialized with random trash). In that case, using np.empty
will be much faster. We use pandas iteration here: this is not the best way to do things from a pandas perspective, but we want to make the NumPy part very explicit.
cal = age_state_mat[0,:] kids = age_state_mat[:,0]
Note the syntax to extract a row or a column. It should be familiar to you, given that pandas copied the syntax from NumPy and we encountered it in previous recipes.
def compute_frac(arr_1d): return arr_1d / arr_1d.sum() frac_age_stat_mat = np.apply_along_axis(compute_frac, 1, age_state_mat)
The last line applies the compute_frac
function to all rows. compute_frac
takes a single row and returns a new row where all the elements are divided by the total sum.
perc_age_stat_mat = frac_age_stat_mat * 100 perc_age_stat_mat = perc_age_stat_mat.astype(np.uint8) perc_age_stat_mat
The first line simply multiplies all the elements of the 2D array by 100. Matplotlib is smart enough to traverse different array structures. That line will work if it’s presented with an array with any dimensions and would do exactly what is expected.
Here is the result:
Figure 2.3 – A matrix representing the distribution of vaccine-adverse effects in the five US states with the most cases
fig = plt.figure() ax = fig.add_subplot() ax.matshow(perc_age_stat_mat, cmap=plt.get_cmap("Greys")) ax.set_yticks(range(5)) ax.set_yticklabels(top_states.index) ax.set_xticks(range(6)) ax.set_xticklabels(["0-19", "20-39", "40-59", "60-79", "80-99", "100-119"]) fig.savefig("matrix.png")
Do not dwell too much on the Matplotlib code – we are going to discuss it in the next recipe. The fundamental point here is that you can pass NumPy data structures to Matplotlib. Matplotlib, like pandas, is based on NumPy.
The following is some extra information that may be useful: