23andMe and life insurance

I get lots of queries from blog readers (I am always surprised people read this drivel), mostly about concerns regarding 23andMe and whether it will or could affect life insurance policies. I always answer in a non-committal, you've-got-to-make-your-own-mind-up kind of way. Prompted by the latest caller, I thought I'd ask the wisdom of the crowds, so I turned to Twitter. You can read the Storify of the exchange, but a précis follows:

Note: this does not constitute formal legal advice! I provide this information merely as a comment on the current situation within the UK.

It turns out that UK insurers (I have not investigated other jurisdictions) have a voluntary moratorium agreed with the UK Government on the use of predictive genetic test results in insurance underwriting, valid until 2019. Please read the full text of the moratorium, but I include two sections of particular relevance here:

21. Insurers agree to the following:

a. customers will not be asked, nor will they be put under pressure, to take a predictive genetic test to obtain insurance cover;

b. customers who have taken a predictive test before the date of this Concordat will be treated in the same way as customers taking tests under the terms of the Concordat;

c. customers will not be required to disclose any of the following:

i. a predictive genetic test result from a test taken after the insurance cover has started, for as long as that cover is in force;
ii. the predictive test result of another person, such as a blood relative; or
iii. a predictive or diagnostic test result acquired as part of clinical research (for example, the 100,000 Genomes Project). To avoid doubt, customers may be asked to disclose details of any symptoms, diagnosis or treatment received outside of the clinical research programme, even if those relate to a condition they found out about through the research programme.

d. customers making relevant insurance applications will be required to disclose a predictive genetic test result only if all of the following apply:

i. the customer is seeking insurance cover above the financial limits set out in the Moratorium;
ii. the test has been assessed by a panel of experts and approved by Government. To date, the only test that people are required to disclose under the agreement is for Huntington's Disease for life insurance where the insured sum is over £500,000. Any change to the list of approved tests would be notified on the ABI and Department of Health websites.

iii. the insurer asks the customer to disclose the information.

And…

26. The terms of the Moratorium are as follows.

I. Customers will not be required to disclose the results of predictive genetic tests for policies up to £500,000 of life insurance, or £300,000 for critical illness insurance, or paying annual benefits of £30,000 for income protection insurance (the ‘financial limits’).

II. When the cumulative value of insurance exceeds the financial limits, insurers may seek information about, and customers must disclose, tests approved with the Government for use for a particular insurance product, subject to the restrictions in the Concordat.

III. The Government will announce and the ABI will publish on its website the date of the next review which will be three years before the expiry date of the current Moratorium.

Thanks to @ewanbirney and @TheABB for pointing me at this information.

OK OK, stop badgering! I’ll try Python…

Everybody goes on about Python these days. In Bioinformatics it's one of the two "must-know" languages (along with R), often praised in comments designed to lambast my beloved Perl. So, I thought I'd have a go on the flying circus. I previously wrote about a multi-threaded Perl application and thought that a fun, simple exercise would be to recreate it in Python.

The code performs a silly but intensive task – counting the number of CpG sites in the human genome. The "p" in CpG is biological shorthand indicating that the C and G bases are adjacent on the same strand of the genome, so we are literally looking for string matches to "CG" in a 3-billion-character string. This is particularly amenable to parallelisation because those 3 billion characters are split across 24 files, one for each chromosome as well as the mitochondrial genome. The answer, as established in the previous post, is 27,999,538, and the best Perl I could write comes to that conclusion in a little over 2 seconds (2.285s to be exact, a bit faster than in the original post thanks to some hardware upgrades since then).

My first Python attempt was simply to recreate the final multi-threaded Perl code as closely as I could, except it turns out that Python's Global Interpreter Lock (GIL) means only one thread executes Python bytecode at a time, so CPU-bound threaded applications don't behave as you would expect. Having said that, the code works, but takes a leisurely 59.7 seconds to run.

import glob
import threading
from queue import Queue

def countcpg(filename):
  # slurp the chromosome sequence and strip the newlines
  with open(filename, "r") as myfile:
    data = myfile.read().replace('\n', '')
  index = 0
  count = 0
  while index < len(data):
    index = data.find('CG', index)
    if index == -1: # no match, i.e. reached the end of the string
      break
    index += 2 # step past this match
    count += 1
  with lock: # safely add this file's count to the shared total
    global tot_cpg
    tot_cpg += count

dir = "/mnt/nas_omics/cpg" # directory of per-chromosome FASTA files
files = glob.glob(dir+'/*.fa')
tot_cpg = 0
lock = threading.Lock() # guards updates to tot_cpg

def worker():
  # pull filenames off the queue until the program exits
  while True:
    f = q.get()
    countcpg(f)
    q.task_done()

q = Queue()
for i in range(24): # one daemon thread per chromosome file
  t = threading.Thread(target=worker)
  t.daemon = True
  t.start()

for f in files:
  q.put(f)

q.join() # block until every queued file has been processed
print(tot_cpg)
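
For what it's worth, the threads above buy almost nothing: under the GIL only one thread runs Python bytecode at a time, so a plain sequential loop over the same files should clock in at roughly the same time. Here's a minimal sketch of that baseline for comparison – it reuses the counting logic, assumes the same directory of .fa files as above, and timings will obviously vary with hardware:

import glob
import time

# sequential baseline: identical counting logic, no threads, no lock
def countcpg_seq(filename):
  with open(filename, "r") as myfile:
    data = myfile.read().replace('\n', '')
  index = 0
  count = 0
  while index < len(data):
    index = data.find('CG', index)
    if index == -1:
      break
    index += 2
    count += 1
  return count

start = time.time()
total = sum(countcpg_seq(f) for f in glob.glob("/mnt/nas_omics/cpg/*.fa"))
print(total, "CpGs in", round(time.time() - start, 1), "s")

If the sequential and threaded versions come out neck and neck, that's the GIL at work rather than a slow disk.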

Attempt number two, then, was to fall back on a simpler parallelised approach using map and the multiprocessing package, which "offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads". This roars in at 5.381 seconds, which is good enough for me. Maybe some Python pros will suggest improvements and point out obvious flaws in what I've done.

import glob
from multiprocessing import Pool

def countcpg(filename):
  # slurp the chromosome sequence and strip the newlines
  with open(filename, "r") as myfile:
    data = myfile.read().replace('\n', '')
  index = 0
  count = 0
  while index < len(data):
    index = data.find('CG', index)
    if index == -1: # no match, i.e. reached the end of the string
      break
    index += 2 # step past this match
    count += 1
  return count

# global vals:
dir = "/mnt/nas_omics/cpg"
files = glob.glob(dir+'/*.fa')
tot_cpg = 0

# using multiprocessing: Pool() starts one worker process per CPU core
pool = Pool()
res = pool.map(countcpg, files) # farm the files out across the workers
print(sum(res))
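
One improvement I suspect a Python pro would offer straight away: because the two-character pattern "CG" can never overlap with itself, the built-in str.count method returns exactly the same number as the find loop, in one line and in optimised C. A sketch of countcpg rewritten that way (same assumed directory as above, and I haven't raced it against the versions here):

import glob
from multiprocessing import Pool

# str.count tallies non-overlapping matches, which for 'CG' is
# exactly what the find/while loop computes
def countcpg(filename):
  with open(filename, "r") as myfile:
    return myfile.read().replace('\n', '').count('CG')

pool = Pool()
print(sum(pool.map(countcpg, glob.glob("/mnt/nas_omics/cpg/*.fa"))))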

However, the point of this was not a contest to see which language is faster, but rather to provide a learning opportunity. And it was useful – I get the gist of Python now. My grasp is admittedly pretty superficial, but I liked the cleanliness of the Python code, and the fact that it's more object-oriented than base Perl makes it quite intuitive. But the indentation! I guess that takes some getting used to. I also keep hearing about IPython, which I'll have a go with one day.

Does this all add up to a compelling case to switch over, then? Maybe, but it seems to me that both Perl and Python face stiff competition from R as an everyday scripting and analysis language – the rise of the Hadleyverse and Shiny, coupled with Bioconductor, makes it a fabulous language for getting things done in Bioinformatics. I wouldn't want to put it up against either Python or Perl in a speed contest though…