Visualising stranded RNA-seq data with Gviz/Bioconductor

Update: Sep 2015 – in response to a reader’s question I have updated the code to a completely reproducible example based on public data. In addition, I hacked the import function so that it will plot data from stranded libraries in either orientation.

Gviz is a really great package for visualising genomics data in R. Recently I have been looking at stranded RNA-seq data, which provides the ability to differentiate sense and antisense expression from a genomic locus thanks to the way in which you generate the libraries (I won’t go into all that here). Most aligners are strand aware so retain this useful information, but there aren’t many (any?) well defined approaches for detecting antisense expression nor for visualising it (in R). So, here I set out to use Gviz for the visualisation of stranded RNA-seq data. The figure demonstrates the results with a particularly nice example from some rat data.

A quick google got me to this post on the BioC mailing list from the Gviz author in which he provides a function to separate the reads in a BAM based on strand. This is not the same thing as working with stranded data – stranded data still has reads aligning to both strands, it’s the orientation of the pairs that determines which strand of the genome was being expressed. But, the post was a useful starter.

To get Gviz to plot stranded data we have to define a new import function to pass to the DataTrack constructor as the default pays no heed to strand. The function requires a path to a BAM file (with index in same directory) and a GRanges object that provides the location in the BAM file we are interested in. We use Rsamtools to read and parse the BAM file for the reads, setting specific flags that assess the orientation of each read and separate them accordingly. For the library I have, the forward/top/5′-3′ strand has reads in the orientation F2R1 and the reverse/bottom/3′-5′ strand is F1R2. For the former, this means the first read of the pair is always on the bottom/reverse of the double stranded template that was sequenced (not the genome!). In TopHat parlance, this is “fr-secondstrand”. So to get all reads from RNA produced by the forward/top/5′-3′ (F2R1 orientation) we scan the BAM file twice with the following flags:

# For the F2:
scanBamFlag(isUnmappedQuery = FALSE, isProperPair = TRUE, isFirstMateRead = FALSE, isMinusStrand = FALSE)

# For the R1:
scanBamFlag(isUnmappedQuery = FALSE, isProperPair = TRUE, isFirstMateRead = TRUE, isMinusStrand = TRUE)

We then combine these such that all pairs of reads originating from the forward/top/5′-3′ strand are together in one GRanges. This is repeated for the reverse/bottom/3′-5′ strand with reads in the F1R2 orientation and we can then calculate coverage over the region of interest and return a GRanges ready for plotting as shown in the figure. Read on for a reproducible example.

Strand specific coverage can be plotted in a Gviz data track using a custom import function. The top track contains the per-base coverage with reads from the forward strand in blue and reads from the reverse strand in purple. The bottom track shows the genomic context of the reads with exons in thick lines, introns as thin lines and the direction of transcription indicated by the arrows on the thin lines. Here the two genes are on opposite strands and are transcribed toward each other, making for a nice example.

Strand specific coverage can be plotted in a Gviz data track using a custom import function. The top track contains the per-base coverage with reads from the forward strand in blue and reads from the reverse strand in purple. The bottom track shows the genomic context of the reads with exons in thick lines, introns as thin lines and the direction of transcription indicated by the arrows on the thin lines. Here the two genes are on opposite strands and are transcribed toward each other, making for a nice example.

To use this in Gviz we create a custom importFunction for the DataTrack constructor. The code of this import function, strandedBamImport, is in this GitHub gist. What follows should be reproducible, but first you will need to download and index a stranded BAM file. The following will grab data from cardiac fibroblasts from ENCODE, but beware it’s an 11G file:

wget https://www.encodeproject.org/files/ENCFF680CQU/@@download/ENCFF680CQU.bam -O ENCFF680CQU.bam
samtools index ENCFF680CQU.bam

Now source the import function and set up the Gviz tracks in R. You need to change the path to the bam file on your system and you need to set the global variable “libType” to one of “fr-firststrand” or “fr-secondstrand”. If you don’t know what orientation your library is in, you can use infer_experiment.py from RSeQC to find out.

library(Gviz)
library(biomaRt)

# source the custom import function:
source("https://gist.githubusercontent.com/sidderb/e485c634b386c115b2ef/raw/e4a7fba665246764bb6953d23ab4d95d56c6f450/strandedBamImport")

# the ENCODE data was aligned to hg19, so we connect to the appropriate ensembl biomart:
mart = useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")

# Specify coordinate of the locus we want to plot, in this case the TGFB1 locus:
myChr   = 19
myStart = 41807492
myEnd   = 41859816

# Now we create the track with the gene model:
biomTrack = BiomartGeneRegionTrack(genome="hg19", biomart=mart, chromosome=myChr, start=myStart, end=myEnd, showId=T, geneSymbols=T, rotate.title=TRUE, col.line=NULL, col="orange", fill="orange",filters=list(biotype="protein_coding"), collapseTranscripts=FALSE )

### EDIT ###
bamFile = "/path/to/ENCFF680CQU.bam" 

### EDIT ###
# Set to one of "fr-firststrand" or "fr-secondstrand"
libType = "fr-secondstrand"

# Next create the data track using the new import function:
dataTrack = DataTrack(bamFile, genome="hg19", chromosome=myChr, importFunction=strandedBamImport, stream=TRUE, legend=TRUE, col=c("cornflowerblue","purple"), groups=c("Forward","Reverse"))

# Finally, plot the tracks:
plotTracks(list(dataTrack,biomTrack), from = myStart, to = myEnd, type="hist", col.histogram=NA, cex.title=1, cex.axis=1, title.width=1.2)

The figure below is what you should generate with the above code.

Stranded_RNAseq_Human_TGFB1

Advertisements