23andMe and life insurance

I get lots of queries from blog readers (am always surprised people read this drivel), mostly about concerns regarding 23andMe and whether this will or could affect life insurance policies. I always answer in a non-commited, got to make your own mind up, kind of way. Prompted by the latest caller, I thought I’d ask the wisdom of the crowds so turned to Twitter. You can read the storify of the exchange but a precis follows:

Note: this does not constitute formal legal advice! I provide this information merely as a comment on the current situation within the UK.

It turns out that UK insurers (I have not investigated other jurisdictions) have a voluntary moratorium with the UK Government against the use of predictive genetic testing in insurance cases that is valid until 2019. Please read the full text of the moratorium, but I include two sections of particular relevance here:

21. Insurers agree to the following:

a. customers will not be asked, nor will they be put under pressure, to take a predictive genetic test to obtain insurance cover;

b. customers who have taken a predictive test before the date of this Concordat will be treated in the same way as customers taking tests under the terms of the

c. customers will not be required to disclose any of the following:

i. a predictive genetic test result from a test taken after the insurance cover has started, for as long as that cover is in force;
ii. the predictive test result of another person, such as a blood relative; or
iii. a predictive or diagnostic test result acquired as part of clinical research (for example, the 100,000 Genomes Project) To avoid doubt,  customers may be asked to disclose details of any symptoms, diagnosis or treatment received outside of the clinical research programme, even if those relate to a condition they found out about through the research programme.

d. customers making relevant insurance applications will be required to disclose a predictive genetic test result only if all of the following apply;

i. the customer is seeking insurance cover above the financial limits set out in the Moratorium;
ii. the test has been assessed by a panel of experts and approved by Government. To date, the only test that people are required to disclose
under the agreement is for Huntington’s Disease for life insurance where the insured sum is over £500,000. Any change to the list of approved
tests would be notified on the ABI and Department of Health websites

iii. the insurer asks the customer to disclose the information.


26. The terms of the Moratorium are as follows.

I. Customers will not be required to disclose the results of predictive genetic tests for policies up to £500,000 of life insurance, or £300,000 for critical illness insurance, or paying annual benefits of £30,000 for income protection insurance (the ‘financial limits’).

II. When the cumulative value of insurance exceeds the financial limits, insurers may seek information about, and customers must disclose, tests approved with the Government for use for a particular insurance product, subject to the restrictions in the Concordat.

III. The Government will announce and the ABI will publish on its website the date of the next review which will be three years before the expiry date of the current Moratorium.

Thanks to @ewanbirney and @TheABB for pointing me at this information.

OK OK, stop badgering! I’ll try Python…

Everybody goes on about Python these days. In Bioinformatics it’s one of the two “must” know languages (along with R), often praised in comments designed to lambast my beloved Perl. So, I thought i’d have a go on the flying circus. I previously wrote about a multi-threaded Perl application and thought that a fun, simple exercise would be to recreate that in Python.

The code performs a silly but intensive task – counting the number of CpG sites in the human genome. The “p” in CpG is biological shorthand indicating that the C and G bases are on the same strand of the genome, so we are literally looking for string matches to “CG” in a 3 billion character string. This is particularly amenable to parallelisation as those 3 billion characters are split across 24 files, one for each chromosome as well as the mitochondrial genome. The answer, as defined in the previous post, is 27,999,538 and the best Perl I could write comes to that conclusion in a little over 2 seconds (2.285s to be exact, which is a bit faster than that original post as some hardware updates have occurred since then).

My first Python attempt was to simply recreate the final multi-threaded Perl code as closely as I could, except it turns out that Python has some issues under the hood that prevent threaded applications working as you would expect. Having said that the code works, but takes a paltry 59.7 seconds to run.

import glob
import threading
from queue import Queue

def countcpg(filename):
  with open (filename, "r") as myfile:
    data=myfile.read().replace('\n', '')
  index = 0
  count = 0
  while index > len(data):
    index = data.find('CG', index)
    if index == -1: # means no match i.e. got to end of string
        index += 2
    count += 1
  with lock:
    global tot_cpg
    tot_cpg += count

dir = "/mnt/nas_omics/cpg"
files = glob.glob(dir+'/*.fa')
tot_cpg = 0
lock = threading.Lock()

def worker():
  while True:
    f = q.get()

q = Queue()
for i in range(24):
  t = threading.Thread(target=worker)
  t.daemon = True

for f in files:


Attempt number two, then, was to fall back on a more simple parallelised approach using map and the multiprocessing package which “offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads”. This roars in at 5.381 seconds, which is good enough for me. Maybe some Python pro’s will suggest improvements and point out obvious flaws in what i’ve done.

import glob
from multiprocessing import Pool

def countcpg(filename):
  with open (filename, "r") as myfile:
    data=myfile.read().replace('\n', '')
  index = 0
  count = 0
  while index < len(data):
    index = data.find('CG', index)
    if index == -1: # means no match i.e. got to end of string
    index += 2
    count += 1

# global vals:
dir = "/mnt/nas_omics/cpg"
files = glob.glob(dir+'/*.fa')
tot_cpg = 0

# using multiprocessing:
pool = Pool()
res = pool.map(countcpg, files)

However, the point of this was not a contest to see which language is faster, but rather to provide a learning opportunity. And it was useful – I get the gist of Python now. It’s pretty superficial, but I liked the cleanliness of the python code and that it’s more object oriented than base perl so quite intuitive. But, the indentation! I guess that takes some getting used to. I also keep hearing about iPython too, which i’ll have a go with one day.

Does this all add up to a compelling drive to switch over then? Maybe, but it seems to me that both Perl and Python have stiff competition from R as an everyday scripting and analysis language – the rise of the Hadleyverse and Shiny, coupled with BioConductor makes it a fabulous language to get things done in Bioinformatics. I wouldn’t want to put it up against either Python or Perl in a speed contest though…

The murky corners of my genome – 23andMe and the Ensembl VEP

I recently got my genotype data from 23andMe. The most exciting finding is that i’m slightly more than average Neanderthal (3% versus the 2.7% average). I always wondered how they calculated this percentage, presuming it was a measure of the SNPs common between myself and the Hairy Caveman. It turns out that it’s a little more complicated – they calculate how far you are from the “Neanderthal axis”, which is a line that links neanderthals and the average of 246 whole African genomes (whom have no neanderthal ancestry) on a PCA plot. But, I digress, see this paper for all the details.

This post was really inspired by this blog from Neil Saunders in which he describes how to run your 23andMe SNP data through Ensembl’s Variant Effect Predictor (VEP). I have largely followed the method he outlines in order to take a closer look at the SNPs in my genome and what they’re up to.

The first thing to enthuse upon is the VEP itself – what a fab tool. Running locally with default settings it took <15 mins to crunch through the 960,613 SNPs in the data. It produces a pretty nice HTML page with summary counts of variant consequence, chromosome distribution etc.

So what are my SNPs up to? Just over 50% of my SNPs are intronic, another 8% intergenic etc. These are quite hard to interpret, so i’ll ignore the non-coding mutations for now. The pie below summarises the consequences of the mutations within coding regions. Alarmingly, it seems I have 98 stop gain mutations and 15,041 missense mutations in 3,687 genes! These are mutations where the variant could have an effect on protein function, either through premature abortion of mRNA translation or by switching the amino acid coded for at that location. In the first case, the truncated protein is probably not produced due to nonsense mediated decay, but there will still be less protein than there should be. The missense mutations will not all have an impact – only those that change a key functional domain of the protein are likely to. Even then, the degree of impact will be mitigated by many things including functional redundancy with other proteins etc.


The most obvious thing to confirm is the functional impact they are likely to have. This can be done with SIFT and PolyPhen predcitions, both of which the VEP calculates for you. However, it quickly becomes apparent that the VEP default settings don’t get you very much useful information past a classification of the variant consequences. But, there are many options available to pass to the VEP in order to get it to calculate all sorts of information. The following flags turn on sift and polyphen predictions and the global MAF from 1000 genomes:

--sift p --polyphen p --gmaf --fork 10

Happily, the authors also provided a --everything flag which returns, well, everything, including sift and polyphen predictions of the variant’s functional impact and the MAF etc. As you can imagine this takes a lot longer to run! It’s sensible to undertake a quick bit of jiggery pokery to subset the original VCF file to just the variants that cause a stop gain or missense mutation.

Quickly browsing the VEP summary HTML it’s apparent that PolyPhen and SIFT think some of my stop gains/missense mutations are going to have a damaging effect on protein function:


The 621 variants for which both PolyPhen and Sift predict a deleterious/damaging conseqeunce are found in 270 genes. Add that to the 46 genes that have gained a premature stop codon and I’m short 316 fully functional genes! This is by no means abnormal however – the 1000 genomes project estimates that we all have putative loss of function variants in 250-300 genes.

And, it’s not all bad news – it seems that one of my mutations, rs497116, is a well known stop gain in caspase 12 (CASP12). The A allele (which I have) is dominant in European populations but less so in those of African descent. The variant leads to a truncated inactive form of Caspase 12, which is protective against sepsis – the full length protein renders the carrier susceptible to an over the top immune response to bacterial infection.

What’s more I haven’t explored my genotype at these locations – am I heterozygous or homozygous? If heterozygous then I have a “spare”, perfectly normal copy of the gene that will hopefully compensate for the damaged one (leaving aside compound heterozygosity). If homozygous, then I’m potentially the human version of a knock out mouse! I also want to know what the frequency of these mutations are in the general population, their Minor Allele Frequency (MAF). If, like the Caspase 12 example above, most of the mutations are highly penetrant in the general population the chances are they don’t have such drastic consequences. I think I’ll keep all of that to myself though…

A shiny app to display the human body map dataset

There was quite a lot of buzz around when the guys from Rstudio launched Shiny, a new web framework for R that promises to “make it super simple for R users like you to turn analyses into interactive web applications that anyone can use”.  Indeed, it looks really impressive.

So, in order to give Shiny a test I thought i’d analyse and then create a front end to the Illumina human body map data.  This should be quite some test for Shiny as R is slow and clunky for all but the smallest of data sets.   I wanted the application to allow the user to enter a gene and have returned 1) a gene level plot of the tissue distribution, 2) details of all the isoforms detected for that gene and, 3) the expression of each isoform in any given tissue.

For those who haven’t heard of this dataset; it’s RNA-seq data that has been generated by Illumina on a HiSeq2000 from 16 different, healthy, human tissues and freely downloadable from the above link.  The libraries were prepared using poly-A selected mRNA and sequenced as both 50bp paired-end or 75bp single end reads.  No replicates unfortunately.  Here I will use just the paired-end reads, of which there are 70-80 million pairs per sample.  The raw reads were aligned with TopHat and assembled with Cufflinks.

To save having to code up all of the visualisations I wanted from scratch I decided to use the cummeRbund package (from the TopHat/Cufflinks authors) which has some awesome ggplot2 based functions for generating track-like images from a cufflinks/cuffdiff based analysis.  The trade off here is that cummeRbund maintains a huge (19Gb for this dataset) SQLite database in the background and is S.L.O.W. The stats for the data loaded into R:

CuffSet instance with:
16 samples
105441 genes
335696 isoforms
190081 TSS

Right, Shiny. I won’t go into detail – you can read the tutorial/docs yourself.  But, suffice to say it’s dead simple with no CSS, javascript or HTML to worry about. The only downside of this is that the layout and style of the page is largely fixed (unless you want to get your hands really dirty).  Also key is that Shiny is reactive, i.e. if any of the input variables change, any functions that rely on those will automatically update themselves, as will functions that rely on those and so on.

The first task was to get the input form hooked up to the server code, which literally just requires you to specify a text input box and a submit button:

textInput("gene", label="Gene", value = ""),

Then in the server code, specify a function that waits for the gene variable to be defined by the user and does something.  In this case it gets the data out of the cuffdiff database for the input gene and plots the FPKM of that gene in each of the 16 tissues:

output$genePlot = reactivePlot(function() {
myGene = getGene(cuff,input$gene)
if (is.null(myGene))
x = data.frame(
tissue = fpkm(myGene)$sample_name,
print(ggplot(x,aes(fill=tissue,x=tissue,y=fpkm)) + geom_bar(position="dodge", stat="identity") +

I rolled my own barplot here as the cummeRbund version is quite clunky.  The result for PATE1 (Prostate And Testes Expressed 1) looks like this:


Next up, another function to create track-esque plots of the isoforms found for the input gene, which also includes some nice visual touches like the ideogram and the Ensembl annotated isoforms:


You’ll notice a drop down box has appeared under the input form. This is for the final hurrah – select a tissue and the expression of different isoforms in that tissue will be plotted:


OK OK, so it’s not very well laid out etc – but for a first pass I think its great, and the ggplot2 graphics make up for it a little bit. I should also point out that it is not just slow, but very slow! If I stopped calling out to Ensembl, or ditched cummeRbund all together this could be improved.

The gist is on GitHub (https://gist.github.com/4672051) but I don’t provide the expression data (it’s huge!). I’m pretty sure that it should work with only minor tweaking for any RNA-seq data set analysed with a TopHat -> Cufflinks -> CuffDiff pipeline.

Global quantification of mammalian gene expression control

I’ve chosen as my first topic this paper:

“Global quantification of mammalian gene expression control”  Nature 473, 337-342, 2011

I’ve chosen this paper for several reasons. One, it’s cool.  Two, it was the last thing I read and, three, it tackles a question I was concerned with during my PhD studies albeit on a much larger scale and in mammals rather than bacteria.  In prokaryotes the fundamental biological processes of transcription and translation are coupled.  There is no nuclear membrane to divide the two and so as soon as an mRNA transcript is produced (even as it’s being produced!) ribosomes bind and initiate translation.  Therefore there is a strong correlation between the amount of mRNA and the quantity of it’s resulting protein in prokaryotes.  Obviously the rate of decay for both the transcript and the protein leads to cases where this is not true, but as a general rule it seems to hold up ok.

In eukaryotes its a whole different ball game.  For one, the nucleus separates the physical process of transcription from translation.  Higher organisms also have much more complex processes to prepare mRNA for translation – the removal of introns for a start – which are not present in prokaryotes.  Because of this it has always been hard to categorically state that a gene’s transcript levels are truly reflective of it’s protein level.  This causes a problem.  For a variety of reasons, most of which are technological and financial, modern molecular biology is largely based on the measurement of mRNA levels from which the state of a cell is inferred. However, proteins are the real functional unit of a cell and if we can’t be sure that the mRNA levels actually reflect their concentration then we can’t be sure of the cellular state as a whole.

We’ve been waiting for a systematic comparison of mRNA and protein levels on a global scale to tease apart this relationship.  The technological limitations that held us back are now being overcome and this paper is the first (to my knowledge) to provide such a comprehensive comparison.  The authors have not only quantified the levels of both mRNA (with high throughput sequencing) and protein (liquid chromatography coupled with tandem mass spec) but have also been able to generate half lives (the rate of decay/turnover) for both as well.  To do this they grew their cells (murine fibroblasts and then later a human breast cancer cell line) in media that contained labeled amino acids and a nucleoside analogue that allowed the team to differentiate newly synthesised mRNA and protein from the pre-existing.  A ratio of the new and pre-existing concentrations compared to total RNA/protein allowed the group to calculate half lives.  In total they have data for 5,028 mRNA-protein pairs.  It is worth noting that they were able to collect mRNA and protein data in the same cells (literally the same cells, not just the same cell type) which means the data is entirely compatible. Further, the method does not use any destructive chemicals to inhibit transcription or translation in order to calculate half lives meaning the cell remains intact and functioning normally throughout the experiment.

I guess the headline result is that they show that the correlation between mRNA and protein levels is approx. 0.4.  Although this is not massive, it’s greater than anyone had predicted in the past and is good news for all of us mRNA observers out there.  Next the group were able to construct a mathematical model that allowed them to explore the contribution of the four main processes involved (the synthesis and degradation of mRNA and protein).  They discover that the rate of translational initiation by the ribosome is the most fundamental check on protein abundance and not the rate of mRNA transcription, another reminder not to focus solely on the rate of transcription.

They classify proteins based on their mRNA and protein stabilities and find that those which are stable as both mRNA and protein are enriched for some fundamental cellular processes such as translation and metabolism.  Those which are unstable both as mRNA and protein are involved in signalling and regulatory systems (including epigenetic mechanisms).  Those with unstable proteins but stable mRNAs are concerned with functions such as cellular defence where the protein needs to be produced rapidly hence the pre-existing pool of mRNA.  This is largely as expected and indicates that the regulation of protein production evolved in a resource contrained environment and has adapted to fit the needs of the cell in an energy efficient optima.  The authors explore numerous other aspects which I won’t go into here. 

It is important to note that these experiments were conducted on a large, non-synchronised population of cells and as such the results reflect the average over the cell cycle.  It will be the case that at the level of the individual cell a particular protein may have quite different synthesis/degradation characteristics.  Nevertheless such a resource will now be invaluable to scientists looking to create systems biology models of cellular pathways where quantities and synthesis/turnover rates are required for accurate computation.  It will be the case that the data shown here derived from mouse fibroblasts will not be applicable to many models but at least they will allow us to move on from our current uninformed guestimates.