Visualising stranded RNA-seq data with Gviz/Bioconductor

Update: Sep 2015 – in response to a reader’s question I have updated the code to a completely reproducible example based on public data. In addition, I hacked the import function so that it will plot data from stranded libraries in either orientation.

Gviz is a really great package for visualising genomics data in R. Recently I have been looking at stranded RNA-seq data, which provides the ability to differentiate sense and antisense expression from a genomic locus thanks to the way in which you generate the libraries (I won’t go into all that here). Most aligners are strand aware so retain this useful information, but there aren’t many (any?) well defined approaches for detecting antisense expression nor for visualising it (in R). So, here I set out to use Gviz for the visualisation of stranded RNA-seq data. The figure demonstrates the results with a particularly nice example from some rat data.

A quick google got me to this post on the BioC mailing list from the Gviz author in which he provides a function to separate the reads in a BAM based on strand. This is not the same thing as working with stranded data – stranded data still has reads aligning to both strands, it’s the orientation of the pairs that determines which strand of the genome was being expressed. But, the post was a useful starter.

To get Gviz to plot stranded data we have to define a new import function to pass to the DataTrack constructor as the default pays no heed to strand. The function requires a path to a BAM file (with index in same directory) and a GRanges object that provides the location in the BAM file we are interested in. We use Rsamtools to read and parse the BAM file for the reads, setting specific flags that assess the orientation of each read and separate them accordingly. For the library I have, the forward/top/5′-3′ strand has reads in the orientation F2R1 and the reverse/bottom/3′-5′ strand is F1R2. For the former, this means the first read of the pair is always on the bottom/reverse of the double stranded template that was sequenced (not the genome!). In TopHat parlance, this is “fr-secondstrand”. So to get all reads from RNA produced by the forward/top/5′-3′ (F2R1 orientation) we scan the BAM file twice with the following flags:

# For the F2:
scanBamFlag(isUnmappedQuery = FALSE, isProperPair = TRUE, isFirstMateRead = FALSE, isMinusStrand = FALSE)

# For the R1:
scanBamFlag(isUnmappedQuery = FALSE, isProperPair = TRUE, isFirstMateRead = TRUE, isMinusStrand = TRUE)

We then combine these such that all pairs of reads originating from the forward/top/5′-3′ strand are together in one GRanges. This is repeated for the reverse/bottom/3′-5′ strand with reads in the F1R2 orientation and we can then calculate coverage over the region of interest and return a GRanges ready for plotting as shown in the figure. Read on for a reproducible example.

Strand specific coverage can be plotted in a Gviz data track using a custom import function. The top track contains the per-base coverage with reads from the forward strand in blue and reads from the reverse strand in purple. The bottom track shows the genomic context of the reads with exons in thick lines, introns as thin lines and the direction of transcription indicated by the arrows on the thin lines. Here the two genes are on opposite strands and are transcribed toward each other, making for a nice example.

Strand specific coverage can be plotted in a Gviz data track using a custom import function. The top track contains the per-base coverage with reads from the forward strand in blue and reads from the reverse strand in purple. The bottom track shows the genomic context of the reads with exons in thick lines, introns as thin lines and the direction of transcription indicated by the arrows on the thin lines. Here the two genes are on opposite strands and are transcribed toward each other, making for a nice example.

To use this in Gviz we create a custom importFunction for the DataTrack constructor. The code of this import function, strandedBamImport, is in this GitHub gist. What follows should be reproducible, but first you will need to download and index a stranded BAM file. The following will grab data from cardiac fibroblasts from ENCODE, but beware it’s an 11G file:

wget -O ENCFF680CQU.bam
samtools index ENCFF680CQU.bam

Now source the import function and set up the Gviz tracks in R. You need to change the path to the bam file on your system and you need to set the global variable “libType” to one of “fr-firststrand” or “fr-secondstrand”. If you don’t know what orientation your library is in, you can use from RSeQC to find out.


# source the custom import function:

# the ENCODE data was aligned to hg19, so we connect to the appropriate ensembl biomart:
mart = useMart(biomart="ENSEMBL_MART_ENSEMBL", host="", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")

# Specify coordinate of the locus we want to plot, in this case the TGFB1 locus:
myChr   = 19
myStart = 41807492
myEnd   = 41859816

# Now we create the track with the gene model:
biomTrack = BiomartGeneRegionTrack(genome="hg19", biomart=mart, chromosome=myChr, start=myStart, end=myEnd, showId=T, geneSymbols=T, rotate.title=TRUE, col.line=NULL, col="orange", fill="orange",filters=list(biotype="protein_coding"), collapseTranscripts=FALSE )

### EDIT ###
bamFile = "/path/to/ENCFF680CQU.bam" 

### EDIT ###
# Set to one of "fr-firststrand" or "fr-secondstrand"
libType = "fr-secondstrand"

# Next create the data track using the new import function:
dataTrack = DataTrack(bamFile, genome="hg19", chromosome=myChr, importFunction=strandedBamImport, stream=TRUE, legend=TRUE, col=c("cornflowerblue","purple"), groups=c("Forward","Reverse"))

# Finally, plot the tracks:
plotTracks(list(dataTrack,biomTrack), from = myStart, to = myEnd, type="hist", col.histogram=NA, cex.title=1, cex.axis=1, title.width=1.2)

The figure below is what you should generate with the above code.



Analysing squash performance using fitbit data

I previously wrote about accessing step data in R using the fitbit API and at the time said I could think of little to do with the data that was not available via the fitbit website. Since then I’ve rejoined the local squash league and thought that I might be able to learn something useful from an integration of my step counts and performance at squash. The hope is that whatever I learn can be used to inform my game plan and so improve my performance; it certainly can’t hurt!

In that first post I used the fitbit API to access daily step totals, but that isn’t going to be too useful here – I need the data for the period of the squash game itself. It turns out that fitbit don’t let any old programmer access the “intra-day” data, presumably to stop you ripping off their premium account. However, they were willing to let me have access to the intra-day data via the API on the basis that this is a personal project. Kudos fitbit. I won’t bother showing any code here as it’s all in R and simply an extension of my previous post, but you can place an intra-day GET request which will either return the whole day’s data in 1 or 15 minute bins. You can also specify a particular time period that you want.

Luckily for this purpose I’m sad enough to track the date, time, result and score of every game of squash I play. I’ve been playing for 3 months now and have data for 11 matches; 6 wins and 5 losses. There were also a few games during which I didn’t wear my tracker (wins of course!) for various silly reasons but I have excluded them from the analysis.


Below I plot the per minute step count for each game (transparent lines) coloured green for a win and red for a loss. I then fit a loess curve to the data for each outcome (win or lose) which is shown by the bold dashed lines.


What does it tell me?

  • I seem to peak at around 100 steps per minute, which seems like a lot but I don’t have anything/body to compare this to. Would love to see what a professional game looks like.
  • There are four dramatic reductions in step counts; one at 8 minutes, two at about 28 minutes and then another at 36 minutes. The first I think represents a quick injury break (sprained ankle), the second two correspond to games I won quickly by 3-0 so the dip is whilst we shake hands and decide to play on for the rest of the court time. The third dip is a game I lost 1-3, so clearly I put up enough of a fight to make the game last a little bit longer than a 3-0 drubbing.
  • The fitted lines reveal that I maintain a higher intensity for longer in the games that I win. At 30 minutes in this is about an extra 10 steps per minute. If I’m doing 100 steps per minute then this is a 10% upping of my intensity level.
  • My intensity levels start higher when I win and drop faster when I’m losing.

There is more to be had form this data; for example the above does not take into account the winning/losing margin. To demonstrate, the plot below shows the total number of steps I make per match (regardless of the outcome) bucketed by the number of games (squash is first to 3) that my opponent won. It’s clear that when my opponent wins no games its a bit of a walkover and I put less effort in. Conversely when the game is close and my opponent wins one or two games I have to put in a bit of extra effort to win. And as we saw above, when I lose I put less effort in.

So, the lesson – fight harder and faster!


Accessing FitBit data in R

I caught a pretty amazing episode of Horizon (the BBC’s in depth science program in the UK) a while back called “The future of medicine is apps“. The programme explored the health benefits of giving people data about their body, health and lifestyle. The more extreme examples included the tracking of the England rugby team during training which allows the coaches to predict injury/flu before the player’s are aware of it, and the professor that monitored the level of every metabolite in his blood every day and was able to diagnose himself with Chron’s disease prior to any symptoms. At the more practical level were the people who simply track their activity levels each day. The theory goes that if you are aware via a direct data feed of what you are doing (or not doing I suppose) then you are able to make changes to your lifestyle for the better. Being a just a bit of a geek I was inspired to get myself an activity tracker and see what it was to collect some data on myself.

I settled on a FitBit Flex, which is essentially a pedometer that you wear on your wrist and which tracks activity (steps) as well as sleep patterns. I have to say it works really well and I am mightily addicted to trying to meet my activity goal each day – currently set to the default of 10,000 steps. FitBit provide a fairly slick website to display all the data you collect but, unfortunately, if you want to download the data and do any kind of analysis yourself you have to pay a pretty exorbitant subscription fee. Luckily, you can get at your data via their API if you have the know-how so I decided have a go in R.

First off you have to register an “app” with FitBit (mine is called StepTrack!) in order to get the credentials needed for authentication. I used the httr package for the OAuth authentication and data retrieval.


token_url = ""
access_url = ""
auth_url = ""
key = "my_key"
secret = "my_secret"

fbr = oauth_app('StepTrack',key,secret)
fitbit = oauth_endpoint(token_url,auth_url,access_url)
token = oauth1.0_token(fitbit,fbr)
sig = sign_oauth1.0(fbr, token=token$oauth_token, token_secret=token$oauth_token_secret)

# get all step data from my first day of use to the current date:
steps = GET("",sig)

The data is returned as json, which can then be plotted to your hearts content. In the plot you can see a five day gap – I went on holiday and forgot the charger!

> steps
Response []
Status: 200
Content-type: application/json;charset=UTF-8


Admittedly I am struggling to come up with ideas of what to do with the data that FitBit doesn’t provide already through their website. But, its the principal of the thing – I should be able to get at my data and now I can. For all of the data shown in the above I was on holiday and in general much more active than when i’m plonked in my desk at work. It will be interesting to see what my daily step count is on a normal working day and whether knowing this will push me on to go for a run at lunch time or take the very long route to the sandwich shop. Being a very competitive person, I suspect it will.

UPDATE: 4th April 2014 – @asrowe has made a nice comparison of two trackers (fitbit and jawbone) here.

iRefR – PPI data access from R

I use a lot of protein-protein interaction (PPI) data as biological networks represent the systems within which our genes and proteins of interest function. There are many sources of PPI data, including BioGRID and IntAct etc. A recent effort has emerged that attempts to pull all of these databases together to provide a single standardised access point to PPI data; iRefIndex. Currently they integrate data from 13 different PPI databases. Why are there 13 in the first place? Because each has their own biological area of interest or specific criteria for curation (manual vs text-mined etc). This post is an opportunity for me to have a play with the R package for iRefIndex (iRefR).

The following code downloads the current version of iRefIndex for human (other species are available):

> library("iRefR")
> library(stringr)
> iref = get_irefindex(tax_id="9606",iref_version="12.0",data_folder="/path/to/save/iRefIndex")

The resulting object is a dirty great 250Mb data frame in MITAB format, or PSI-MITAB2.6 to be exact, the format specs of which can be found here. This seems a little unwieldily to hold in memory to me, so eventually i’ll subset it to something more manageable;

> print(object.size(iref),units="Mb")
248.8 Mb

First though, some summary stats on the info contained within. The data contains a total of 533,551 interactions, 248,215 of which are unique;

> dim(iref)
[1] 533551 54

Duplications in the data arise when an interaction is taken from multiple source databases or papers etc. The irigrid column contains a unique identifier where any given pair of interactants will always have the same identifier.

> length(unique(iref$irigid))
[1] 248215

iRefIndex counts as human any interaction where just one member is a human protein; e.g. if a viral protein interacts with a human protein this is counted as human. Here i’m just interested in human-human interactions, so;

> human_human_list = data.frame(iref$taxa,iref$taxb)
> tmp =`paste`, c(unname(human_human_list), list(sep=".")))
> iref_human = iref[tmp == "taxid:9606(Homo sapiens).taxid:9606(Homo sapiens)" | tmp == "-.taxid:9606(Homo sapiens)",]

> dim(iref_human)
[1] 489796 54

> length(unique(iref_human$irigid))
[1] 220435

To subset the data frame to a more memory friendly set of data i’m going to keep just a two column data frame of protein names for each unique interaction. As I’m going to plot some graphs later I also want to keep the biologist friendly HUGO name for each protein, which involves a bit of fiddling as this is not a field in its own right within MITAB, enter stringr:

> mA = str_locate(iref_human$aliasA, perl("hgnc:.*?\\|"))
> hugoA = str_sub(iref_human$aliasA,mA[,1]+5,mA[,2]-1)
> mB = str_locate(iref_human$aliasB, perl("hgnc:.*?\\|"))
> hugoB = str_sub(iref_human$aliasB,mB[,1]+5,mB[,2]-1)
> x = data.frame(iref_human$X.uidA,iref_human$uidB,hugoA,hugoB,iref_human$irigid)
> colnames(x) = c("uidA","uidB","hugoA","hugoB","irigid")
> dim(x)
[1] 489796 4

> head(x)
uidA uidB hugoA hugoB irigid
1 uniprotkb:O75554 uniprotkb:Q07666 WBP4 KHDRBS1 1139443
2 uniprotkb:Q13425 uniprotkb:B7Z6M3 SNTB2 DGKZ 1650951
3 uniprotkb:O75554 uniprotkb:Q8N684-3 WBP4 CPSF7 668917
4 uniprotkb:P60468 uniprotkb:Q9H0F7 SEC61B ARL6 658508
5 uniprotkb:O75554 uniprotkb:Q15233-2 WBP4 NONO 1338471
6 uniprotkb:O95816 uniprotkb:Q66LE6 BAG2 PPP2R2D 1478060

> length(unique(x$irigid))
[1] 220435

> print(object.size(x),units="Mb")
15.5 Mb

Now I want to build a PPI network from this data using iGraph. First, we hack together a data frame of node annotations, in this case the hugo name associated with each unique protein identifier. We then combine this with the interaction data into an igraph object:

> v = unique(data.frame(c(as.character(x$uidA),as.character(x$uidB)),c(as.character(x$hugoA),as.character(x$hugoB))))
> colnames(v) = c("uid","hugo")
> ppi.graph =[,c(1:2,5)],vertices=v,directed=F)
> ppi.graph
IGRAPH UN-- 31476 489796 --
+ attr: name (v/c), hugo (v/c), irigid (e/n)

Our graph (or network) has 31,476 nodes (proteins) and 489,796 edges (or interactions). Each node has two annotations; the uid in “name” and the hugo symbol. Each edge has one annotation, the irigid.

Now, say we do an experiment and identify a bunch of genes/proteins and we want to see if they interact with each other at the protein level. For example this might be the output of a gene expression experiment or all genes genetically associated with a disease. Here, for simplicity, i’ll just use a random selection of proteins from the network.

> myGenes = sample(as.character(v$uid),10)
> myGenes = myGenes[!]

We can now extract a subgraph containing our experimentally identified proteins and those that connect them together. In the following code we first get all of the neighbours (adjoining nodes) of our genes from the main graph. We define order = 1 to specify that we will allow a distance of 1 interaction from our genes. We then subset the edges from the main graph to those where both nodes are in our index of neighbours. This gives us just the interactions between our genes and their immediate neighbours. We then build a graph in the same manner as above, using the edge list and a vertice meta-data data-frame.

> order = 1
> edges = get.edges(ppi.graph, 1:(ecount(ppi.graph)))
> neighbours.vid = unique(unlist(neighborhood(ppi.graph,order,which(V(ppi.graph)$name %in% myGenes))))
> rel.vid = edges[intersect(which(edges[,1] %in% neighbours.vid), which(edges[,2] %in% neighbours.vid)),]
> neighbour.names = data.frame(V(ppi.graph)[neighbours.vid]$name,V(ppi.graph)[neighbours.vid]$hugo, stringsAsFactors=F)
> names(neighbour.names) = c("name","hugo")
> rel =[rel.vid[,1]]$name, V(ppi.graph)[rel.vid[,2]]$name), stringsAsFactors=F)
> names(rel) = c("from","to")
> subgraph =, directed=F, vertices=neighbour.names)
> subgraph = simplify(subgraph)
> subgraph
IGRAPH UN-- 81 239 --
+ attr: name (v/c), hugo (v/c)

And finally, we plot the network with red nodes to indicate proteins from our experimentally identified list of genes and yellow for the rest.

ind = which(V(subgraph)$name %in% myGenes)
cols = rep("yellow",vcount(subgraph))
cols[ind] = "red"


It’s at this point that the fun really starts from a biology point of view – we can mine our network for key hubs, overlay our favourite experimental data or compare to it networks for other sets of input genes, etc. The possibilities are endless, and I for one feel that all of our experimental data should be interpreted within the context of the interaction network. Of course there are limitations; our networks will likely never be complete as we only have PPI data for proteins that have been studied. But, having a human interactome with nigh on 500,000 interactions is a good start…

A shiny app to display the human body map dataset

There was quite a lot of buzz around when the guys from Rstudio launched Shiny, a new web framework for R that promises to “make it super simple for R users like you to turn analyses into interactive web applications that anyone can use”.  Indeed, it looks really impressive.

So, in order to give Shiny a test I thought i’d analyse and then create a front end to the Illumina human body map data.  This should be quite some test for Shiny as R is slow and clunky for all but the smallest of data sets.   I wanted the application to allow the user to enter a gene and have returned 1) a gene level plot of the tissue distribution, 2) details of all the isoforms detected for that gene and, 3) the expression of each isoform in any given tissue.

For those who haven’t heard of this dataset; it’s RNA-seq data that has been generated by Illumina on a HiSeq2000 from 16 different, healthy, human tissues and freely downloadable from the above link.  The libraries were prepared using poly-A selected mRNA and sequenced as both 50bp paired-end or 75bp single end reads.  No replicates unfortunately.  Here I will use just the paired-end reads, of which there are 70-80 million pairs per sample.  The raw reads were aligned with TopHat and assembled with Cufflinks.

To save having to code up all of the visualisations I wanted from scratch I decided to use the cummeRbund package (from the TopHat/Cufflinks authors) which has some awesome ggplot2 based functions for generating track-like images from a cufflinks/cuffdiff based analysis.  The trade off here is that cummeRbund maintains a huge (19Gb for this dataset) SQLite database in the background and is S.L.O.W. The stats for the data loaded into R:

CuffSet instance with:
16 samples
105441 genes
335696 isoforms
190081 TSS

Right, Shiny. I won’t go into detail – you can read the tutorial/docs yourself.  But, suffice to say it’s dead simple with no CSS, javascript or HTML to worry about. The only downside of this is that the layout and style of the page is largely fixed (unless you want to get your hands really dirty).  Also key is that Shiny is reactive, i.e. if any of the input variables change, any functions that rely on those will automatically update themselves, as will functions that rely on those and so on.

The first task was to get the input form hooked up to the server code, which literally just requires you to specify a text input box and a submit button:

textInput("gene", label="Gene", value = ""),

Then in the server code, specify a function that waits for the gene variable to be defined by the user and does something.  In this case it gets the data out of the cuffdiff database for the input gene and plots the FPKM of that gene in each of the 16 tissues:

output$genePlot = reactivePlot(function() {
myGene = getGene(cuff,input$gene)
if (is.null(myGene))
x = data.frame(
tissue = fpkm(myGene)$sample_name,
print(ggplot(x,aes(fill=tissue,x=tissue,y=fpkm)) + geom_bar(position="dodge", stat="identity") +

I rolled my own barplot here as the cummeRbund version is quite clunky.  The result for PATE1 (Prostate And Testes Expressed 1) looks like this:


Next up, another function to create track-esque plots of the isoforms found for the input gene, which also includes some nice visual touches like the ideogram and the Ensembl annotated isoforms:


You’ll notice a drop down box has appeared under the input form. This is for the final hurrah – select a tissue and the expression of different isoforms in that tissue will be plotted:


OK OK, so it’s not very well laid out etc – but for a first pass I think its great, and the ggplot2 graphics make up for it a little bit. I should also point out that it is not just slow, but very slow! If I stopped calling out to Ensembl, or ditched cummeRbund all together this could be improved.

The gist is on GitHub ( but I don’t provide the expression data (it’s huge!). I’m pretty sure that it should work with only minor tweaking for any RNA-seq data set analysed with a TopHat -> Cufflinks -> CuffDiff pipeline.