OK OK, stop badgering! I’ll try Python…

Everybody goes on about Python these days. In Bioinformatics it’s one of the two “must” know languages (along with R), often praised in comments designed to lambast my beloved Perl. So, I thought i’d have a go on the flying circus. I previously wrote about a multi-threaded Perl application and thought that a fun, simple exercise would be to recreate that in Python.

The code performs a silly but intensive task – counting the number of CpG sites in the human genome. The “p” in CpG is biological shorthand indicating that the C and G bases are on the same strand of the genome, so we are literally looking for string matches to “CG” in a 3 billion character string. This is particularly amenable to parallelisation as those 3 billion characters are split across 24 files, one for each chromosome as well as the mitochondrial genome. The answer, as defined in the previous post, is 27,999,538 and the best Perl I could write comes to that conclusion in a little over 2 seconds (2.285s to be exact, which is a bit faster than that original post as some hardware updates have occurred since then).

My first Python attempt was to simply recreate the final multi-threaded Perl code as closely as I could, except it turns out that Python has some issues under the hood that prevent threaded applications working as you would expect. Having said that the code works, but takes a paltry 59.7 seconds to run.

import glob
import threading
from queue import Queue

def countcpg(filename):
  with open (filename, "r") as myfile:
    data=myfile.read().replace('\n', '')
  index = 0
  count = 0
  while index > len(data):
    index = data.find('CG', index)
    if index == -1: # means no match i.e. got to end of string
        index += 2
    count += 1
  with lock:
    global tot_cpg
    tot_cpg += count

dir = "/mnt/nas_omics/cpg"
files = glob.glob(dir+'/*.fa')
tot_cpg = 0
lock = threading.Lock()

def worker():
  while True:
    f = q.get()

q = Queue()
for i in range(24):
  t = threading.Thread(target=worker)
  t.daemon = True

for f in files:


Attempt number two, then, was to fall back on a more simple parallelised approach using map and the multiprocessing package which “offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads”. This roars in at 5.381 seconds, which is good enough for me. Maybe some Python pro’s will suggest improvements and point out obvious flaws in what i’ve done.

import glob
from multiprocessing import Pool

def countcpg(filename):
  with open (filename, "r") as myfile:
    data=myfile.read().replace('\n', '')
  index = 0
  count = 0
  while index < len(data):
    index = data.find('CG', index)
    if index == -1: # means no match i.e. got to end of string
    index += 2
    count += 1

# global vals:
dir = "/mnt/nas_omics/cpg"
files = glob.glob(dir+'/*.fa')
tot_cpg = 0

# using multiprocessing:
pool = Pool()
res = pool.map(countcpg, files)

However, the point of this was not a contest to see which language is faster, but rather to provide a learning opportunity. And it was useful – I get the gist of Python now. It’s pretty superficial, but I liked the cleanliness of the python code and that it’s more object oriented than base perl so quite intuitive. But, the indentation! I guess that takes some getting used to. I also keep hearing about iPython too, which i’ll have a go with one day.

Does this all add up to a compelling drive to switch over then? Maybe, but it seems to me that both Perl and Python have stiff competition from R as an everyday scripting and analysis language – the rise of the Hadleyverse and Shiny, coupled with BioConductor makes it a fabulous language to get things done in Bioinformatics. I wouldn’t want to put it up against either Python or Perl in a speed contest though…

Visualising stranded RNA-seq data with Gviz/Bioconductor

Update: Sep 2015 – in response to a reader’s question I have updated the code to a completely reproducible example based on public data. In addition, I hacked the import function so that it will plot data from stranded libraries in either orientation.

Gviz is a really great package for visualising genomics data in R. Recently I have been looking at stranded RNA-seq data, which provides the ability to differentiate sense and antisense expression from a genomic locus thanks to the way in which you generate the libraries (I won’t go into all that here). Most aligners are strand aware so retain this useful information, but there aren’t many (any?) well defined approaches for detecting antisense expression nor for visualising it (in R). So, here I set out to use Gviz for the visualisation of stranded RNA-seq data. The figure demonstrates the results with a particularly nice example from some rat data.

A quick google got me to this post on the BioC mailing list from the Gviz author in which he provides a function to separate the reads in a BAM based on strand. This is not the same thing as working with stranded data – stranded data still has reads aligning to both strands, it’s the orientation of the pairs that determines which strand of the genome was being expressed. But, the post was a useful starter.

To get Gviz to plot stranded data we have to define a new import function to pass to the DataTrack constructor as the default pays no heed to strand. The function requires a path to a BAM file (with index in same directory) and a GRanges object that provides the location in the BAM file we are interested in. We use Rsamtools to read and parse the BAM file for the reads, setting specific flags that assess the orientation of each read and separate them accordingly. For the library I have, the forward/top/5′-3′ strand has reads in the orientation F2R1 and the reverse/bottom/3′-5′ strand is F1R2. For the former, this means the first read of the pair is always on the bottom/reverse of the double stranded template that was sequenced (not the genome!). In TopHat parlance, this is “fr-secondstrand”. So to get all reads from RNA produced by the forward/top/5′-3′ (F2R1 orientation) we scan the BAM file twice with the following flags:

# For the F2:
scanBamFlag(isUnmappedQuery = FALSE, isProperPair = TRUE, isFirstMateRead = FALSE, isMinusStrand = FALSE)

# For the R1:
scanBamFlag(isUnmappedQuery = FALSE, isProperPair = TRUE, isFirstMateRead = TRUE, isMinusStrand = TRUE)

We then combine these such that all pairs of reads originating from the forward/top/5′-3′ strand are together in one GRanges. This is repeated for the reverse/bottom/3′-5′ strand with reads in the F1R2 orientation and we can then calculate coverage over the region of interest and return a GRanges ready for plotting as shown in the figure. Read on for a reproducible example.

Strand specific coverage can be plotted in a Gviz data track using a custom import function. The top track contains the per-base coverage with reads from the forward strand in blue and reads from the reverse strand in purple. The bottom track shows the genomic context of the reads with exons in thick lines, introns as thin lines and the direction of transcription indicated by the arrows on the thin lines. Here the two genes are on opposite strands and are transcribed toward each other, making for a nice example.

Strand specific coverage can be plotted in a Gviz data track using a custom import function. The top track contains the per-base coverage with reads from the forward strand in blue and reads from the reverse strand in purple. The bottom track shows the genomic context of the reads with exons in thick lines, introns as thin lines and the direction of transcription indicated by the arrows on the thin lines. Here the two genes are on opposite strands and are transcribed toward each other, making for a nice example.

To use this in Gviz we create a custom importFunction for the DataTrack constructor. The code of this import function, strandedBamImport, is in this GitHub gist. What follows should be reproducible, but first you will need to download and index a stranded BAM file. The following will grab data from cardiac fibroblasts from ENCODE, but beware it’s an 11G file:

wget https://www.encodeproject.org/files/ENCFF680CQU/@@download/ENCFF680CQU.bam -O ENCFF680CQU.bam
samtools index ENCFF680CQU.bam

Now source the import function and set up the Gviz tracks in R. You need to change the path to the bam file on your system and you need to set the global variable “libType” to one of “fr-firststrand” or “fr-secondstrand”. If you don’t know what orientation your library is in, you can use infer_experiment.py from RSeQC to find out.


# source the custom import function:

# the ENCODE data was aligned to hg19, so we connect to the appropriate ensembl biomart:
mart = useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")

# Specify coordinate of the locus we want to plot, in this case the TGFB1 locus:
myChr   = 19
myStart = 41807492
myEnd   = 41859816

# Now we create the track with the gene model:
biomTrack = BiomartGeneRegionTrack(genome="hg19", biomart=mart, chromosome=myChr, start=myStart, end=myEnd, showId=T, geneSymbols=T, rotate.title=TRUE, col.line=NULL, col="orange", fill="orange",filters=list(biotype="protein_coding"), collapseTranscripts=FALSE )

### EDIT ###
bamFile = "/path/to/ENCFF680CQU.bam" 

### EDIT ###
# Set to one of "fr-firststrand" or "fr-secondstrand"
libType = "fr-secondstrand"

# Next create the data track using the new import function:
dataTrack = DataTrack(bamFile, genome="hg19", chromosome=myChr, importFunction=strandedBamImport, stream=TRUE, legend=TRUE, col=c("cornflowerblue","purple"), groups=c("Forward","Reverse"))

# Finally, plot the tracks:
plotTracks(list(dataTrack,biomTrack), from = myStart, to = myEnd, type="hist", col.histogram=NA, cex.title=1, cex.axis=1, title.width=1.2)

The figure below is what you should generate with the above code.


Q: How many CpG’s in the genome? A: multi-threaded Perl

This started out as a serious question. Whilst analysing some RRBS data I needed to know what percentage of the CpG’s (that is loci where a C nucleotide is followed immediately by a G) in the human genome were covered by our data. The question then, is just how many CpG’s are there in the human genome? “A ha!” I thought – just the kind of question that Perl was built to answer.

To lay out the task for you, the genome as I have it is split across 24 fasta files – one for each chromosome and another for the mitochondrial DNA. These files can be obtained from Ensembl here* and have a combined size of ~3Gb. I need to flaunt my hardware specs for any comparison you might make – we’re running a 2*6 core 2GHz cpu (giving 24 threads) with 96Gb RAM. Obviously this is linux (Red Hat) and i’m using the latest version of Perl (v5.18.1) compiled with multi-threading support enabled. I’m also working off a NAS, so network bandwidth might hold me back.

My first effort was actually quite speedy, I thought, especially as at that point I wasn’t trying to break any speed records. In essence, I slurp each file in turn into memory, replace all the newlines \n (so am essentially making one dirty long string) and then use a regular expression $chr_seq =~ /CG/g; to search for the CpG’s and bung each occurrence into an array. I confess I initially tried $chr_seq =~ /C\n?G, but this was noticeably sluggish. I then count the number of matches in the array, add it to the running total and hey presto a final answer of 27,999,538 is arrived at in just 38 seconds.

use strict ;

my $dir = $ARGV[0] ;
my $tot_cpg = 0 ;

opendir(DIR, $dir) or die $!;
while (defined(my $chr = readdir(DIR))) {
  next unless $chr =~ /\.fa/ ;
  local $/=undef;
  open GENOME,"<$dir$chr" or die $! ;
  my $chr_seq = <GENOME> ;
  close GENOME ;
  $chr_seq =~ s/\n//g ;
  my @matches = $chr_seq =~ /CG/g;
  my $cpg = scalar @matches;
  $tot_cpg = $tot_cpg + $cpg ;

print $tot_cpg ;

Now, 38 seconds is pretty fast. If you consider the average rate of reading to be 250 words a minute and that there are 5 letters in the average word, then a person can read 1,250 characters a minute. There are roughly 3 billion characters in the human genome so it would take approximately 2,999,998,750 minutes (or 5,703 years) to read. So, 38 seconds is nothing. But, Perl can do it even faster; we haven’t even considered parallelising the process yet.

#### multi threaded – 18 secs
Perl, if compiled with the -Dusethreads option, can run multiple processes in parallel. This means we could go from sequential processing of the chromosome files (e.g. do chr 1, then chr 2 etc) to processing them all at the same time. Here we do the counting as before, but this time it’s wrapped in a subroutine count. As we read through all the files in the directory we create a new thread passing it a reference to the sub as well as the file that thread is to process: my $thr = threads->create(\&count,$dir.$chr) ;. We have to keep track of the threads we create so that we can wait for them to finish and harvest the data they return. Luckily I have 24 files to process and 24 threads available to use. We now arrive at our answer in just 18 seconds, a more than 2 fold increase in speed.

use strict ;
use threads ;

my $dir = $ARGV[0] ;
my $tot_cpg = 0 ;
my @threads ;

opendir(DIR, $dir) or die $! ;
while (defined(my $chr = readdir(DIR))) {
	next unless $chr =~ /\.fa/ ;
	my $thr = threads->create(\&count,$dir.$chr) ;
	push @threads, $thr ;

close DIR ;

foreach (@threads) {
	my $cpg = $_->join;
	$tot_cpg = $tot_cpg+$cpg ;

print $tot_cpg ;

sub count {
	open GENOME, "<$_[0]" or die $! ;
	my $chr_seq = <GENOME>;
	close GENOME;
	$chr_seq =~ s/\n//g ;
	my @matches = $chr_seq =~ /CG/g;
	return(scalar @matches) ;

#### remove regexs – 3 secs
18 seconds is pretty blazing, but now I’ve got the bug and wonder just what else I can optimise. The compilation and use of regexs, although highly optimised in Perl, is quite slow so they had to go. I tried all sorts of ways of using split or unpack but in the end settled on index which determines the position of a substring in string. I also switched from using a substitution to remove the newlines to using the translation operator tr. We now count the CpG’s in an astonishing 3 seconds!

sub count {
	open GENOME, "<$_[0]" or die $! ;
	my $chr_seq = <GENOME>;
	close GENOME;
    ### replace s///g with tr//d:
    $chr_seq =~ tr/\n//d ;

	### replace regex with index:
	my $pos = -1;
 	my $c = 0 ;
 	while (($pos = index($chr_seq,'CG',$pos)) > -1) {
 		$c++ ;
 		$pos++ ;
	return($c) ;

>time perl countGpGinGenome.pl ./data/

real 0m3.136s
user 0m27.759s
sys 0m14.491s

3 seconds! Now, i’m not a good enough programmer to go much quicker but I still think there are a few areas that could be optimised further. For example, getting rid of the newline replacement altogether would be sensible. But, it doesn’t really matter. This started out as a biological question and got a bit geeky (too much so if i’m honest) but I learnt so much through this exercise about threaded programming and perl in general that it was worth it. Also, the next time I do RRBS on a new species I won’t waste any time finding out how many CpG’s there are!

*Only use the files matching: Homo_sapiens.GRCh37.72.dna.chromosome.*.fa where * = 1..22|X|MT

iRefR – PPI data access from R

I use a lot of protein-protein interaction (PPI) data as biological networks represent the systems within which our genes and proteins of interest function. There are many sources of PPI data, including BioGRID and IntAct etc. A recent effort has emerged that attempts to pull all of these databases together to provide a single standardised access point to PPI data; iRefIndex. Currently they integrate data from 13 different PPI databases. Why are there 13 in the first place? Because each has their own biological area of interest or specific criteria for curation (manual vs text-mined etc). This post is an opportunity for me to have a play with the R package for iRefIndex (iRefR).

The following code downloads the current version of iRefIndex for human (other species are available):

> library("iRefR")
> library(stringr)
> iref = get_irefindex(tax_id="9606",iref_version="12.0",data_folder="/path/to/save/iRefIndex")

The resulting object is a dirty great 250Mb data frame in MITAB format, or PSI-MITAB2.6 to be exact, the format specs of which can be found here. This seems a little unwieldily to hold in memory to me, so eventually i’ll subset it to something more manageable;

> print(object.size(iref),units="Mb")
248.8 Mb

First though, some summary stats on the info contained within. The data contains a total of 533,551 interactions, 248,215 of which are unique;

> dim(iref)
[1] 533551 54

Duplications in the data arise when an interaction is taken from multiple source databases or papers etc. The irigrid column contains a unique identifier where any given pair of interactants will always have the same identifier.

> length(unique(iref$irigid))
[1] 248215

iRefIndex counts as human any interaction where just one member is a human protein; e.g. if a viral protein interacts with a human protein this is counted as human. Here i’m just interested in human-human interactions, so;

> human_human_list = data.frame(iref$taxa,iref$taxb)
> tmp = do.call(`paste`, c(unname(human_human_list), list(sep=".")))
> iref_human = iref[tmp == "taxid:9606(Homo sapiens).taxid:9606(Homo sapiens)" | tmp == "-.taxid:9606(Homo sapiens)",]

> dim(iref_human)
[1] 489796 54

> length(unique(iref_human$irigid))
[1] 220435

To subset the data frame to a more memory friendly set of data i’m going to keep just a two column data frame of protein names for each unique interaction. As I’m going to plot some graphs later I also want to keep the biologist friendly HUGO name for each protein, which involves a bit of fiddling as this is not a field in its own right within MITAB, enter stringr:

> mA = str_locate(iref_human$aliasA, perl("hgnc:.*?\\|"))
> hugoA = str_sub(iref_human$aliasA,mA[,1]+5,mA[,2]-1)
> mB = str_locate(iref_human$aliasB, perl("hgnc:.*?\\|"))
> hugoB = str_sub(iref_human$aliasB,mB[,1]+5,mB[,2]-1)
> x = data.frame(iref_human$X.uidA,iref_human$uidB,hugoA,hugoB,iref_human$irigid)
> colnames(x) = c("uidA","uidB","hugoA","hugoB","irigid")
> dim(x)
[1] 489796 4

> head(x)
uidA uidB hugoA hugoB irigid
1 uniprotkb:O75554 uniprotkb:Q07666 WBP4 KHDRBS1 1139443
2 uniprotkb:Q13425 uniprotkb:B7Z6M3 SNTB2 DGKZ 1650951
3 uniprotkb:O75554 uniprotkb:Q8N684-3 WBP4 CPSF7 668917
4 uniprotkb:P60468 uniprotkb:Q9H0F7 SEC61B ARL6 658508
5 uniprotkb:O75554 uniprotkb:Q15233-2 WBP4 NONO 1338471
6 uniprotkb:O95816 uniprotkb:Q66LE6 BAG2 PPP2R2D 1478060

> length(unique(x$irigid))
[1] 220435

> print(object.size(x),units="Mb")
15.5 Mb

Now I want to build a PPI network from this data using iGraph. First, we hack together a data frame of node annotations, in this case the hugo name associated with each unique protein identifier. We then combine this with the interaction data into an igraph object:

> v = unique(data.frame(c(as.character(x$uidA),as.character(x$uidB)),c(as.character(x$hugoA),as.character(x$hugoB))))
> colnames(v) = c("uid","hugo")
> ppi.graph = graph.data.frame(x[,c(1:2,5)],vertices=v,directed=F)
> ppi.graph
IGRAPH UN-- 31476 489796 --
+ attr: name (v/c), hugo (v/c), irigid (e/n)

Our graph (or network) has 31,476 nodes (proteins) and 489,796 edges (or interactions). Each node has two annotations; the uid in “name” and the hugo symbol. Each edge has one annotation, the irigid.

Now, say we do an experiment and identify a bunch of genes/proteins and we want to see if they interact with each other at the protein level. For example this might be the output of a gene expression experiment or all genes genetically associated with a disease. Here, for simplicity, i’ll just use a random selection of proteins from the network.

> myGenes = sample(as.character(v$uid),10)
> myGenes = myGenes[!is.na(myGenes)]

We can now extract a subgraph containing our experimentally identified proteins and those that connect them together. In the following code we first get all of the neighbours (adjoining nodes) of our genes from the main graph. We define order = 1 to specify that we will allow a distance of 1 interaction from our genes. We then subset the edges from the main graph to those where both nodes are in our index of neighbours. This gives us just the interactions between our genes and their immediate neighbours. We then build a graph in the same manner as above, using the edge list and a vertice meta-data data-frame.

> order = 1
> edges = get.edges(ppi.graph, 1:(ecount(ppi.graph)))
> neighbours.vid = unique(unlist(neighborhood(ppi.graph,order,which(V(ppi.graph)$name %in% myGenes))))
> rel.vid = edges[intersect(which(edges[,1] %in% neighbours.vid), which(edges[,2] %in% neighbours.vid)),]
> neighbour.names = data.frame(V(ppi.graph)[neighbours.vid]$name,V(ppi.graph)[neighbours.vid]$hugo, stringsAsFactors=F)
> names(neighbour.names) = c("name","hugo")
> rel = as.data.frame(cbind(V(ppi.graph)[rel.vid[,1]]$name, V(ppi.graph)[rel.vid[,2]]$name), stringsAsFactors=F)
> names(rel) = c("from","to")
> subgraph = graph.data.frame(rel, directed=F, vertices=neighbour.names)
> subgraph = simplify(subgraph)
> subgraph
IGRAPH UN-- 81 239 --
+ attr: name (v/c), hugo (v/c)

And finally, we plot the network with red nodes to indicate proteins from our experimentally identified list of genes and yellow for the rest.

ind = which(V(subgraph)$name %in% myGenes)
cols = rep("yellow",vcount(subgraph))
cols[ind] = "red"


It’s at this point that the fun really starts from a biology point of view – we can mine our network for key hubs, overlay our favourite experimental data or compare to it networks for other sets of input genes, etc. The possibilities are endless, and I for one feel that all of our experimental data should be interpreted within the context of the interaction network. Of course there are limitations; our networks will likely never be complete as we only have PPI data for proteins that have been studied. But, having a human interactome with nigh on 500,000 interactions is a good start…

The murky corners of my genome – 23andMe and the Ensembl VEP

I recently got my genotype data from 23andMe. The most exciting finding is that i’m slightly more than average Neanderthal (3% versus the 2.7% average). I always wondered how they calculated this percentage, presuming it was a measure of the SNPs common between myself and the Hairy Caveman. It turns out that it’s a little more complicated – they calculate how far you are from the “Neanderthal axis”, which is a line that links neanderthals and the average of 246 whole African genomes (whom have no neanderthal ancestry) on a PCA plot. But, I digress, see this paper for all the details.

This post was really inspired by this blog from Neil Saunders in which he describes how to run your 23andMe SNP data through Ensembl’s Variant Effect Predictor (VEP). I have largely followed the method he outlines in order to take a closer look at the SNPs in my genome and what they’re up to.

The first thing to enthuse upon is the VEP itself – what a fab tool. Running locally with default settings it took <15 mins to crunch through the 960,613 SNPs in the data. It produces a pretty nice HTML page with summary counts of variant consequence, chromosome distribution etc.

So what are my SNPs up to? Just over 50% of my SNPs are intronic, another 8% intergenic etc. These are quite hard to interpret, so i’ll ignore the non-coding mutations for now. The pie below summarises the consequences of the mutations within coding regions. Alarmingly, it seems I have 98 stop gain mutations and 15,041 missense mutations in 3,687 genes! These are mutations where the variant could have an effect on protein function, either through premature abortion of mRNA translation or by switching the amino acid coded for at that location. In the first case, the truncated protein is probably not produced due to nonsense mediated decay, but there will still be less protein than there should be. The missense mutations will not all have an impact – only those that change a key functional domain of the protein are likely to. Even then, the degree of impact will be mitigated by many things including functional redundancy with other proteins etc.


The most obvious thing to confirm is the functional impact they are likely to have. This can be done with SIFT and PolyPhen predcitions, both of which the VEP calculates for you. However, it quickly becomes apparent that the VEP default settings don’t get you very much useful information past a classification of the variant consequences. But, there are many options available to pass to the VEP in order to get it to calculate all sorts of information. The following flags turn on sift and polyphen predictions and the global MAF from 1000 genomes:

--sift p --polyphen p --gmaf --fork 10

Happily, the authors also provided a --everything flag which returns, well, everything, including sift and polyphen predictions of the variant’s functional impact and the MAF etc. As you can imagine this takes a lot longer to run! It’s sensible to undertake a quick bit of jiggery pokery to subset the original VCF file to just the variants that cause a stop gain or missense mutation.

Quickly browsing the VEP summary HTML it’s apparent that PolyPhen and SIFT think some of my stop gains/missense mutations are going to have a damaging effect on protein function:


The 621 variants for which both PolyPhen and Sift predict a deleterious/damaging conseqeunce are found in 270 genes. Add that to the 46 genes that have gained a premature stop codon and I’m short 316 fully functional genes! This is by no means abnormal however – the 1000 genomes project estimates that we all have putative loss of function variants in 250-300 genes.

And, it’s not all bad news – it seems that one of my mutations, rs497116, is a well known stop gain in caspase 12 (CASP12). The A allele (which I have) is dominant in European populations but less so in those of African descent. The variant leads to a truncated inactive form of Caspase 12, which is protective against sepsis – the full length protein renders the carrier susceptible to an over the top immune response to bacterial infection.

What’s more I haven’t explored my genotype at these locations – am I heterozygous or homozygous? If heterozygous then I have a “spare”, perfectly normal copy of the gene that will hopefully compensate for the damaged one (leaving aside compound heterozygosity). If homozygous, then I’m potentially the human version of a knock out mouse! I also want to know what the frequency of these mutations are in the general population, their Minor Allele Frequency (MAF). If, like the Caspase 12 example above, most of the mutations are highly penetrant in the general population the chances are they don’t have such drastic consequences. I think I’ll keep all of that to myself though…