Everybody goes on about Python these days. In bioinformatics it's one of the two "must-know" languages (along with R), often praised in comments designed to lambast my beloved Perl. So I thought I'd have a go on the flying circus. I previously wrote about a multi-threaded Perl application, and thought a fun, simple exercise would be to recreate it in Python.
The code performs a silly but intensive task – counting the number of CpG sites in the human genome. The “p” in CpG is biological shorthand indicating that the C and G bases are on the same strand of the genome, so we are literally looking for string matches to “CG” in a 3 billion character string. This is particularly amenable to parallelisation as those 3 billion characters are split across 24 files, one for each chromosome as well as the mitochondrial genome. The answer, as defined in the previous post, is 27,999,538 and the best Perl I could write comes to that conclusion in a little over 2 seconds (2.285s to be exact, which is a bit faster than that original post as some hardware updates have occurred since then).
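The core operation is simple enough to sketch on a toy string (this is just an illustration, not the code used in the timed runs): scan for non-overlapping "CG" matches with `str.find`, advancing past each hit.

```python
def count_cpg(seq: str) -> int:
    """Count non-overlapping 'CG' matches in a DNA string."""
    count = 0
    index = seq.find("CG")
    while index != -1:
        count += 1
        # Skip past the match; "CG" is two characters long.
        index = seq.find("CG", index + 2)
    return count

print(count_cpg("ACGTCGCGTA"))  # 3 matches, at positions 1, 4 and 6
```

Run over the real chromosome files, the counts from each file simply sum to the genome-wide total, which is what makes the per-file parallelisation below so natural.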
My first Python attempt was to simply recreate the final multi-threaded Perl code as closely as I could, except it turns out that Python has an issue under the hood (the Global Interpreter Lock, or GIL) that prevents CPU-bound threads from running in parallel as you would expect. Having said that, the code works, but takes a painful 59.7 seconds to run.
```python
import glob
import threading
from queue import Queue

def countcpg(filename):
    with open(filename, "r") as myfile:
        data = myfile.read().replace('\n', '')
    index = 0
    count = 0
    while index < len(data):
        index = data.find('CG', index)
        if index == -1:  # no match, i.e. reached the end of the string
            break
        index += 2
        count += 1
    with lock:
        global tot_cpg
        tot_cpg += count

dir = "/mnt/nas_omics/cpg"
files = glob.glob(dir + '/*.fa')
tot_cpg = 0
lock = threading.Lock()

def worker():
    while True:
        f = q.get()
        countcpg(f)
        q.task_done()

q = Queue()
for i in range(24):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

for f in files:
    q.put(f)
q.join()

print(tot_cpg)
```
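For what it's worth, the standard-library `concurrent.futures` module hides all of the Queue/worker/lock plumbing behind an executor. A minimal sketch on toy strings rather than genome files (`count_cg` here is a hypothetical stand-in for the per-file work); being threads, it is of course still subject to the same GIL limitation for CPU-bound code:

```python
from concurrent.futures import ThreadPoolExecutor

def count_cg(text: str) -> int:
    # Stand-in for the per-file work: count non-overlapping "CG" matches.
    return text.count("CG")

# Toy inputs instead of chromosome files.
chunks = ["ACGT" * 1000, "CGCG" * 1000, "TTTT" * 1000]

# executor.map replaces the hand-rolled worker loop; summing the returned
# counts also removes the need for a shared tot_cpg variable and a lock.
with ThreadPoolExecutor(max_workers=4) as executor:
    total = sum(executor.map(count_cg, chunks))
print(total)
```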
Attempt number two, then, was to fall back on a simpler parallelised approach using map and the multiprocessing package, which "offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads". This roars in at 5.381 seconds, which is good enough for me. Maybe some Python pros will suggest improvements and point out obvious flaws in what I've done.
```python
import glob
from multiprocessing import Pool

def countcpg(filename):
    with open(filename, "r") as myfile:
        data = myfile.read().replace('\n', '')
    index = 0
    count = 0
    while index < len(data):
        index = data.find('CG', index)
        if index == -1:  # no match, i.e. reached the end of the string
            break
        index += 2
        count += 1
    return count

# global values:
dir = "/mnt/nas_omics/cpg"
files = glob.glob(dir + '/*.fa')

# using multiprocessing:
pool = Pool()
res = pool.map(countcpg, files)
print(sum(res))
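Incidentally, because a two-character pattern like "CG" can never overlap with itself, the explicit `find` loop in `countcpg` is equivalent to the built-in `str.count`, which counts non-overlapping occurrences. A sketch of how the function could be simplified (untimed, so I make no claims about speed):

```python
def countcpg_simple(filename):
    # str.count returns the number of non-overlapping matches, which is
    # exactly what the explicit find loop computes for "CG".
    with open(filename) as fh:
        return fh.read().replace("\n", "").count("CG")
```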
However, the point of this was not a contest to see which language is faster, but rather a learning opportunity. And it was useful: I get the gist of Python now. My impression is pretty superficial, but I liked the cleanliness of the Python code, and it's more object-oriented than base Perl, so quite intuitive. But the indentation! I guess that takes some getting used to. I also keep hearing about IPython, which I'll have a go with one day.
Does this all add up to a compelling reason to switch, then? Maybe, but it seems to me that both Perl and Python face stiff competition from R as an everyday scripting and analysis language: the rise of the Hadleyverse and Shiny, coupled with Bioconductor, makes it a fabulous language for getting things done in bioinformatics. I wouldn't want to put it up against either Python or Perl in a speed contest, though…