
Bioinformatics with Python Cookbook
By :

If there is some functionality that you need and cannot find it in a Python library, your first port of call is to check whether it's implemented in R. For statistical methods, R is still the most complete framework; moreover, some bioinformatics functionalities are also only available in R, most probably offered as a package belonging to the Bioconductor project.
The rpy2 provides provides a declarative interface from Python to R. As you will see, you will be able to write very elegant Python code to perform the interfacing process.
In order to show the interface (and try out one of the most common R data structures, the data frame, and one of the most popular R libraries: ggplot2), we will download its metadata from the Human 1000 genomes project (http://www.1000genomes.org/). As this is not a book on R, we do want to provide any interesting and functional examples.
You will need to get the metadata file from the 1000 genomes sequence index. Please check https://github.com/tiagoantao/bioinf-python/blob/master/notebooks/Datasets.ipynb and download the sequence.index
file. If you are using notebooks, open the 00_Intro/Interfacing_R notebook.ipynb
and just execute the wget
command on top.
This file has information about all FASTQ files in the project (we will use data from the Human 1000 genomes project in the chapters to come). This includes the FASTQ file, the sample ID, and the population of origin and important statistical information per lane, such as the number of reads and number of DNA bases read.
Take a look at the following steps:
read_delim
R function:import rpy2.robjects as robjects read_delim = robjects.r('read.delim') seq_data = read_delim('sequence.index', header=True, stringsAsFactors=False) #In R: # seq.data <- read.delim('sequence.index', header=TRUE, # stringsAsFactors=FALSE)
read.delim
R function that allows you to read files.read_delim
.seq_data
object is a data frame. If you know basic R or the Python pandas library, you are probably aware of this type of data structure; if not, then this is essentially a table: a sequence of rows where each column has the same type. Let's perform a basic inspection of this data frame as follows:print('This dataframe has %d columns and %d rows' % (seq_data.ncol, seq_data.nrow)) print(seq_data.colnames) #In R: # print(colnames(seq.data)) # print(nrow(seq.data)) # print(ncol(seq.data))
my_cols = robjects.r.ncol(seq_data) print(my_cols)
ncol
if they do not have dots in their name; however, be careful. This will display an output, not 26
(the number of columns), but [26]
which is a vector composed of the element 26
. This is because by default, most operations in R return vectors. If you want the number of columns, you have to perform my_cols[0]
. Also, talking about pitfalls, note that R array indexing starts with 1
, whereas Python starts with 0
.as_integer = robjects.r('as.integer') match = robjects.r.match my_col = match('BASE_COUNT', seq_data.colnames)[0] print(seq_data[my_col - 1][:3]) seq_data[my_col - 1] = as_integer(seq_data[my_col - 1]) print(seq_data[my_col - 1][:3])
0
element. It's also 1-indexed
, so we subtract one when working on Python. The as_integer
function will convert a column to integers. The first print will show strings (values surrounded by "
), whereas the second print will show numbers.robjects.r.assign('seq.data', seq_data)
seq.data
with the content of the data frame from the Python namespace. Note that after this operation, both objects will be independent (if you change one, it will not be reflected on the other).While you can perform plotting on Python, R has default built-in plotting functionalities (which we will ignore here). It also has a library called ggplot2 that implements the Grammar of Graphics (a declarative language to specify statistical charts).
import rpy2.robjects.lib.ggplot2 as ggplot2
png()
function as follows:robjects.r.png('out.png')
from rpy2.robjects.functions import SignatureTranslatedFunction ggplot2.theme = SignatureTranslatedFunction(ggplot2.theme, init_prm_translate={'axis_text_x': 'axis.text.x'}) bar = ggplot2.ggplot(seq_data) + ggplot2.geom_bar() + ggplot2.aes_string(x='CENTER_NAME') + ggplot2.theme(axis_text_x=ggplot2.element_text(angle=90, hjust=1)) bar.plot() dev_off = robjects.r('dev.off') dev_off()
axis.text.x
R parameter name to the axis_text_x
Python name in the function theme. We monkey patch it (that is, we replace ggplot2.theme
with a patched version of itself).seq_data data frame
, then we will use a histogram bar plot called geom_bar
, followed by annotating the X
variable (CENTER_NAME
).from IPython.display import Image Image(filename='out.png')
Figure 1: The ggplot2-generated histogram of center names responsible for sequencing lanes of human genomic data of the 1000 genomes project
robjects.r('yri_ceu <- seq.data[seq.data$POPULATION %in% c("YRI", "CEU") & seq.data$BASE_COUNT < 2E9 & seq.data$READ_COUNT < 3E7, ]') robjects.r('yri_ceu$POPULATION <- as.factor(yri_ceu$POPULATION)') robjects.r('yri_ceu$ANALYSIS_GROUP <- as.factor(yri_ceu$ANALYSIS_GROUP)')
POPULATION
and ANALYSIS_GROUPS
to factors, a concept similar to categorical data.yri_ceu = robjects.r('yri_ceu') scatter = ggplot2.ggplot(yri_ceu) + ggplot2.geom_point() + \ ggplot2.aes_string(x='BASE_COUNT', y='READ_COUNT', shape='factor(POPULATION)', col='factor(ANALYSIS_GROUP)') scatter.plot()
geom_point
). Note how easy it is to express that the shape of each point depends on the POPULATION
variable and the color on the ANALYSIS_GROUP
:Figure 2: The ggplot2-generated scatter plot with base and read counts for all sequencing lanes read; the color and shape of each dot reflects categorical data (population and the type of data sequenced)
import pandas.rpy.common as pd_common pd_yri_ceu = pd_common.load_data('yri_ceu') del pd_yri_ceu['PAIRED_FASTQ'] no_paired = pd_common.convert_to_r_dataframe(pd_yri_ceu) robjects.r.assign('no.paired', no_paired) robjects.r("print(colnames(no.paired))")
yri_ceu
in the R namespace, not the one on the Python namespace). We delete the column that indicates the name of the paired FASTQ file on the pandas data frame and copy it back to the R namespace. If you print the column names of the new R data frame, you will see that PAIRED_FASTQ
is missing.pandas.rpy
module is being deprecated (although it's still available).In the interests of maintaining the momentum of the book, we will not delve into pandas programming (there are plenty of books on this), but I recommend that you take a look at it, not only in the context of interfacing with R, but also as a very good library for data management of complex datasets.
It's worth repeating that the advancements on the Python software ecology are occurring at a breakneck pace. This means that if a certain functionality is not available today, it might be released sometime in the near future. So, if you are developing a new project, be sure to check for the very latest developments on the Python front before using a functionality from an R package.
There are plenty of R packages for bioinformatics in the Bioconductor project (http://www.bioconductor.org/). This should probably be your first port of call in the R world for bioinformatics functionalities. However, note that there are many R bioinformatics packages that are not on Bioconductor, so be sure to search the wider R packages on CRAN (refer to the Comprehensive R Archive Network at http://cran.r-project.org/).
There are plenty of plotting libraries for Python. matplotlib is the most common library, but you also have a plethora of other choices. In the context of R, it's worth noting that there is a ggplot2-like implementation for Python based on the Grammar of Graphics description language for charts and this is called—surprise-surprise—ggplot! (http://ggplot.yhathq.com/).
Change the font size
Change margin width
Change background colour