The murky corners of my genome – 23andMe and the Ensembl VEP

I recently got my genotype data from 23andMe. The most exciting finding is that i’m slightly more than average Neanderthal (3% versus the 2.7% average). I always wondered how they calculated this percentage, presuming it was a measure of the SNPs common between myself and the Hairy Caveman. It turns out that it’s a little more complicated – they calculate how far you are from the “Neanderthal axis”, which is a line that links neanderthals and the average of 246 whole African genomes (whom have no neanderthal ancestry) on a PCA plot. But, I digress, see this paper for all the details.

This post was really inspired by this blog from Neil Saunders in which he describes how to run your 23andMe SNP data through Ensembl’s Variant Effect Predictor (VEP). I have largely followed the method he outlines in order to take a closer look at the SNPs in my genome and what they’re up to.

The first thing to enthuse upon is the VEP itself – what a fab tool. Running locally with default settings it took <15 mins to crunch through the 960,613 SNPs in the data. It produces a pretty nice HTML page with summary counts of variant consequence, chromosome distribution etc.

So what are my SNPs up to? Just over 50% of my SNPs are intronic, another 8% intergenic etc. These are quite hard to interpret, so i’ll ignore the non-coding mutations for now. The pie below summarises the consequences of the mutations within coding regions. Alarmingly, it seems I have 98 stop gain mutations and 15,041 missense mutations in 3,687 genes! These are mutations where the variant could have an effect on protein function, either through premature abortion of mRNA translation or by switching the amino acid coded for at that location. In the first case, the truncated protein is probably not produced due to nonsense mediated decay, but there will still be less protein than there should be. The missense mutations will not all have an impact – only those that change a key functional domain of the protein are likely to. Even then, the degree of impact will be mitigated by many things including functional redundancy with other proteins etc.


The most obvious thing to confirm is the functional impact they are likely to have. This can be done with SIFT and PolyPhen predcitions, both of which the VEP calculates for you. However, it quickly becomes apparent that the VEP default settings don’t get you very much useful information past a classification of the variant consequences. But, there are many options available to pass to the VEP in order to get it to calculate all sorts of information. The following flags turn on sift and polyphen predictions and the global MAF from 1000 genomes:

--sift p --polyphen p --gmaf --fork 10

Happily, the authors also provided a --everything flag which returns, well, everything, including sift and polyphen predictions of the variant’s functional impact and the MAF etc. As you can imagine this takes a lot longer to run! It’s sensible to undertake a quick bit of jiggery pokery to subset the original VCF file to just the variants that cause a stop gain or missense mutation.

Quickly browsing the VEP summary HTML it’s apparent that PolyPhen and SIFT think some of my stop gains/missense mutations are going to have a damaging effect on protein function:


The 621 variants for which both PolyPhen and Sift predict a deleterious/damaging conseqeunce are found in 270 genes. Add that to the 46 genes that have gained a premature stop codon and I’m short 316 fully functional genes! This is by no means abnormal however – the 1000 genomes project estimates that we all have putative loss of function variants in 250-300 genes.

And, it’s not all bad news – it seems that one of my mutations, rs497116, is a well known stop gain in caspase 12 (CASP12). The A allele (which I have) is dominant in European populations but less so in those of African descent. The variant leads to a truncated inactive form of Caspase 12, which is protective against sepsis – the full length protein renders the carrier susceptible to an over the top immune response to bacterial infection.

What’s more I haven’t explored my genotype at these locations – am I heterozygous or homozygous? If heterozygous then I have a “spare”, perfectly normal copy of the gene that will hopefully compensate for the damaged one (leaving aside compound heterozygosity). If homozygous, then I’m potentially the human version of a knock out mouse! I also want to know what the frequency of these mutations are in the general population, their Minor Allele Frequency (MAF). If, like the Caspase 12 example above, most of the mutations are highly penetrant in the general population the chances are they don’t have such drastic consequences. I think I’ll keep all of that to myself though…