
Bioinformatics with Python Cookbook

After running a genotype caller (for example, GATK or SAMtools), you will have a VCF file reporting on genomic variations, such as SNPs, insertions/deletions (INDELs), copy number variations (CNVs), and so on. In this recipe, we will discuss VCF processing with the cyvcf2 module.
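To make the file layout concrete before turning to cyvcf2, here is a minimal plain-Python sketch (it does not use cyvcf2) that splits one VCF data line into its eight fixed columns and roughly labels the variant as a SNP or an INDEL by comparing allele lengths. The sample record and the function names `parse_vcf_line` and `variant_type` are illustrative assumptions, not taken from the recipe's data.

```python
# Minimal sketch: parse one VCF data line with the standard library only.
# The example record is made up for illustration; it is not real data.

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Return the eight fixed VCF columns of a tab-separated data line as a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])        # VCF positions are 1-based
    record["ALT"] = record["ALT"].split(",")  # ALT may list several alleles
    return record


def variant_type(record):
    """Rough classification into 'snp', 'indel', or 'other'."""
    alleles = [record["REF"]] + record["ALT"]
    # Symbolic (<DEL>), missing (.), or breakend alleles are not plain bases.
    if any(not set(a.upper()) <= set("ACGTN") for a in alleles):
        return "other"
    if all(len(a) == 1 for a in alleles):
        return "snp"
    if any(len(a) != len(record["REF"]) for a in record["ALT"]):
        return "indel"
    return "other"  # for example, multi-nucleotide substitutions


example = "22\t16050075\t.\tA\tG\t100\tPASS\tDP=8012"
rec = parse_vcf_line(example)
print(rec["CHROM"], rec["POS"], variant_type(rec))
```

A dedicated parser handles the header, genotype columns, and compressed input as well; this sketch only covers the fixed columns of a single line.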
While NGS is all about big data, there is a limit to how much I can ask you to download as a dataset for this book. I believe that 2 to 20 GB of data for a tutorial is asking too much. While the 1000 Genomes VCF files with realistic annotations are in this order of magnitude, we will want to work with much less data here. Fortunately, the bioinformatics community has developed tools to allow for the partial download of data. As part of the SAMtools/htslib package (http://www.htslib.org/), you can download tabix and bgzip, which will take care of data management. On the command line, perform the following operation:
tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes...
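If you prefer to drive this step from Python, the call can be wrapped with `subprocess`. The sketch below is only an illustration: the helper names `build_tabix_command` and `fetch_region` are hypothetical, and the URL and region in the usage line are placeholders, because the command in the text is truncated. It assumes tabix is on your `PATH` and that the region is given in tabix's `chrom:start-end` form.

```python
import subprocess


def build_tabix_command(url, region):
    """Build the argument list for `tabix -fh URL REGION`.

    -h asks tabix to print the VCF header along with the records;
    -f is kept exactly as in the command shown in the text.
    """
    return ["tabix", "-fh", url, region]


def fetch_region(url, region, out_path):
    """Run tabix and write the header plus the selected records to out_path."""
    with open(out_path, "wb") as out:
        subprocess.run(build_tabix_command(url, region), stdout=out, check=True)


# Hypothetical placeholders, not the book's actual URL or region.
cmd = build_tabix_command("ftp://example.org/some.vcf.gz", "22:1-1000000")
print(" ".join(cmd))
```

Building the argument list separately from running it makes the command easy to inspect before any data is downloaded; `fetch_region` is not called here, so running the snippet touches neither the network nor the disk.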