OK OK, stop badgering! I’ll try Python…

Everybody goes on about Python these days. In Bioinformatics it’s one of the two must-know languages (along with R), often praised in comments designed to lambast my beloved Perl. So I thought I’d have a go on the flying circus. I previously wrote about a multi-threaded Perl application and thought that a fun, simple exercise would be to recreate it in Python.

The code performs a silly but intensive task – counting the number of CpG sites in the human genome. The “p” in CpG is biological shorthand indicating that the C and G bases are on the same strand of the genome, so we are literally looking for string matches to “CG” in a 3 billion character string. This is particularly amenable to parallelisation as those 3 billion characters are split across 24 files, one for each chromosome as well as the mitochondrial genome. The answer, as defined in the previous post, is 27,999,538 and the best Perl I could write comes to that conclusion in a little over 2 seconds (2.285s to be exact, which is a bit faster than that original post as some hardware updates have occurred since then).

My first Python attempt was to simply recreate the final multi-threaded Perl code as closely as I could, except it turns out that Python’s Global Interpreter Lock (GIL) prevents threads from running CPU-bound code in parallel, so threaded applications don’t behave as you might expect. Having said that, the code works, but takes a glacial 59.7 seconds to run.

import glob
import threading
from queue import Queue

def countcpg(filename):
  # Read the whole chromosome into one string, stripping newlines
  with open(filename, "r") as myfile:
    data = myfile.read().replace('\n', '')
  index = 0
  count = 0
  while index < len(data):
    index = data.find('CG', index)
    if index == -1: # no match, i.e. reached the end of the string
      break
    index += 2
    count += 1
  # The counter is shared between threads, so guard the update with the lock
  global tot_cpg
  with lock:
    tot_cpg += count

dir = "/mnt/nas_omics/cpg"
files = glob.glob(dir+'/*.fa')
tot_cpg = 0
lock = threading.Lock()

def worker():
  # Each thread pulls filenames off the queue until the program exits
  while True:
    f = q.get()
    countcpg(f)
    q.task_done()

q = Queue()
for i in range(24): # one daemon thread per chromosome file
  t = threading.Thread(target=worker)
  t.daemon = True
  t.start()

for f in files:
  q.put(f)

q.join()
print(tot_cpg)

Attempt number two, then, was to fall back on a simpler parallelised approach using map and the multiprocessing package, which “offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads”. This roars in at 5.381 seconds, which is good enough for me. Maybe some Python pros will suggest improvements and point out obvious flaws in what I’ve done.

import glob
from multiprocessing import Pool

def countcpg(filename):
  # Read the whole chromosome into one string, stripping newlines
  with open(filename, "r") as myfile:
    data = myfile.read().replace('\n', '')
  index = 0
  count = 0
  while index < len(data):
    index = data.find('CG', index)
    if index == -1: # no match, i.e. reached the end of the string
      break
    index += 2
    count += 1
  return count

# global vals:
dir = "/mnt/nas_omics/cpg"
files = glob.glob(dir+'/*.fa')

# using multiprocessing: Pool() defaults to one worker process per CPU core
pool = Pool()
res = pool.map(countcpg, files) # blocks until every file has been counted
print(sum(res))
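
As an aside, the manual find() loop can probably be dropped entirely: since “CG” cannot overlap with itself, the built-in str.count() returns the same total. A minimal sketch along those lines (untimed, and assuming the same file layout):

import glob
from multiprocessing import Pool

def countcpg(filename):
  # Read the whole chromosome into one string, stripping newlines
  with open(filename, "r") as myfile:
    data = myfile.read().replace('\n', '')
  # "CG" cannot overlap itself, so the non-overlapping count is exact
  return data.count('CG')

if __name__ == '__main__': # guard stops child processes re-running this block
  files = glob.glob("/mnt/nas_omics/cpg/*.fa")
  with Pool() as pool:
    print(sum(pool.map(countcpg, files)))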

However, the point of this was not a contest to see which language is faster, but rather to provide a learning opportunity. And it was useful – I get the gist of Python now. My understanding is pretty superficial, but I liked the cleanliness of the Python code, and the fact that it’s more object oriented than base Perl makes it quite intuitive. But the indentation! I guess that takes some getting used to. I keep hearing about IPython too, which I’ll have a go with one day.

Does this all add up to a compelling drive to switch over then? Maybe, but it seems to me that both Perl and Python have stiff competition from R as an everyday scripting and analysis language – the rise of the Hadleyverse and Shiny, coupled with BioConductor makes it a fabulous language to get things done in Bioinformatics. I wouldn’t want to put it up against either Python or Perl in a speed contest though…


4 thoughts on “OK OK, stop badgering! I’ll try Python…”

  1. on my system:

    PL1  | PL2  | PL3 | PY1  | PY2 | PY3
    41.2 | 58.5 | 2.6 | 79.3 | 4.1 | 3.2

    (times in seconds)
    PL1–PL3 – Perl scripts 1, 2 and 3
    PY1, PY2 – Python scripts 1 and 2 on this page
    PY3 – using regular expressions (re.compile and re.finditer)

    Also, PY1 should read “while index < len(data):”

    • Cool, thanks. Am surprised regex is faster in Python, given a) regex usually carries an overhead and b) everybody says Python’s regexes are particularly slow. Could you share code?

      • Python’s regex engine has improved a lot, and it lets a pattern be compiled once and reused several times. That’s what I did (as good practice – I did not compare the speeds).

        Also, I’ve used an iterator, which means the regex matches were never all loaded into memory at once.

        You can find the code here:
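
        For anyone reading along, a minimal sketch of what a compiled-regex version might look like (a guess at the approach – the linked code may differ):

        import glob
        import re
        from multiprocessing import Pool

        CPG = re.compile('CG') # compile the pattern once, reuse for every file

        def countcpg(filename):
          with open(filename, "r") as myfile:
            data = myfile.read().replace('\n', '')
          # finditer() yields matches lazily, so they are never all in memory at once
          return sum(1 for _ in CPG.finditer(data))

        if __name__ == '__main__':
          files = glob.glob("/mnt/nas_omics/cpg/*.fa")
          with Pool() as pool:
            print(sum(pool.map(countcpg, files)))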

  2. Even I use Python, Ben! Main benefit for me is that similarly code-illiterate people seem to use it, so the support forums/Stack Exchange discussions are vaguely understandable 🙂

    Good luck – Jez.
