Q: How many CpGs in the genome? A: multi-threaded Perl

This started out as a serious question. Whilst analysing some RRBS data I needed to know what percentage of the CpGs (that is, loci where a C nucleotide is immediately followed by a G) in the human genome were covered by our data. The question, then, is just how many CpGs are there in the human genome? “A ha!” I thought – just the kind of question that Perl was built to answer.

To lay out the task for you, the genome as I have it is split across 24 fasta files – one for each chromosome plus another for the mitochondrial DNA. These files can be obtained from Ensembl here* and have a combined size of ~3Gb. I should also declare my hardware specs for any comparison you might make – we’re running a pair of 6-core 2GHz CPUs (12 cores, giving 24 threads with hyper-threading) and 96Gb of RAM. Obviously this is Linux (Red Hat) and I’m using the latest version of Perl (v5.18.1), compiled with multi-threading support enabled. I’m also working off a NAS, so network bandwidth might hold me back.

My first effort was actually quite speedy, I thought, especially as at that point I wasn’t trying to break any speed records. In essence, I slurp each file in turn into memory, remove all the newlines \n (essentially making one long dirty string) and then use a regular expression, $chr_seq =~ /CG/g;, to search for the CpGs, bunging each occurrence into an array. I confess I initially tried $chr_seq =~ /C\n?G/g, matching across line breaks instead of removing them first, but this was noticeably sluggish. I then count the number of matches in the array, add it to the running total and, hey presto, a final answer of 27,999,538 is arrived at in just 38 seconds.

use strict ;
use warnings ;

my $dir = $ARGV[0] ;   # directory holding the fasta files (with trailing slash)
my $tot_cpg = 0 ;

opendir(DIR, $dir) or die $! ;
while (defined(my $chr = readdir(DIR))) {
  next unless $chr =~ /\.fa/ ;          # skip anything that isn't a fasta file
  local $/ = undef ;                    # slurp mode: read each file in one go
  open GENOME, "<$dir$chr" or die $! ;
  my $chr_seq = <GENOME> ;
  close GENOME ;
  $chr_seq =~ s/\n//g ;                 # strip newlines: one long dirty string
  my @matches = $chr_seq =~ /CG/g ;     # every CpG becomes an array element
  $tot_cpg += scalar @matches ;
}
closedir(DIR) ;

print "$tot_cpg\n" ;

Now, 38 seconds is pretty fast. If you consider the average rate of reading to be 250 words a minute and that there are 5 letters in the average word, then a person can read 1,250 characters a minute. There are roughly 3 billion characters in the human genome, so it would take approximately 2,400,000 minutes (or about four and a half years of non-stop reading) to get through it. So, 38 seconds is nothing. But Perl can do it even faster; we haven’t even considered parallelising the process yet.
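For the sceptical, the back-of-the-envelope sum is itself a one-liner (assuming non-stop reading, no sleep allowed):

>perl -e 'printf "%d minutes, or %.1f years\n", 3e9/1250, 3e9/1250/60/24/365'
2400000 minutes, or 4.6 years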

#### multi-threaded – 18 secs
Perl, if compiled with the -Dusethreads option, can run multiple threads in parallel. This means we can go from sequential processing of the chromosome files (e.g. do chr 1, then chr 2, etc.) to processing them all at the same time. Here we do the counting as before, but this time it’s wrapped in a subroutine, count. As we read through all the files in the directory we create a new thread for each, passing it a reference to the sub as well as the file that thread is to process: my $thr = threads->create(\&count,$dir.$chr) ;. We have to keep track of the threads we create so that we can wait for them to finish and harvest the data they return. Luckily I have 24 files to process and 24 hardware threads available to use. We now arrive at our answer in just 18 seconds, a more than 2-fold increase in speed.

use strict ;
use warnings ;
use threads ;

my $dir = $ARGV[0] ;
my $tot_cpg = 0 ;
my @threads ;

opendir(DIR, $dir) or die $! ;
while (defined(my $chr = readdir(DIR))) {
	next unless $chr =~ /\.fa/ ;
	# spawn one thread per chromosome file
	my $thr = threads->create(\&count, $dir.$chr) ;
	push @threads, $thr ;
}
closedir(DIR) ;

# wait for each thread to finish and harvest the count it returns
foreach (@threads) {
	my $cpg = $_->join ;
	$tot_cpg = $tot_cpg + $cpg ;
}

print "$tot_cpg\n" ;

sub count {
	local $/ = undef ;                  # slurp mode
	open GENOME, "<$_[0]" or die $! ;
	my $chr_seq = <GENOME> ;
	close GENOME ;
	$chr_seq =~ s/\n//g ;
	my @matches = $chr_seq =~ /CG/g ;
	return scalar @matches ;
}
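I got lucky in that I had exactly as many files as hardware threads. With more files than threads you’d want to cap how many run at once; here’s a minimal sketch of how that might look using the core Thread::Queue module and the same count sub (the worker count of 12 – one per physical core – is illustrative):

use threads ;
use Thread::Queue ;

# queue up every fasta file, then mark the queue as finished
my $q = Thread::Queue->new(glob("$dir*.fa")) ;
$q->end ;

# a fixed pool of workers, each draining files off the shared queue;
# dequeue returns undef once the queue is ended and empty
my @workers = map {
	threads->create(sub {
		my $sum = 0 ;
		while (defined(my $file = $q->dequeue)) {
			$sum += count($file) ;
		}
		return $sum ;
	}) ;
} 1 .. 12 ;

my $tot_cpg = 0 ;
$tot_cpg += $_->join for @workers ;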

#### remove regexes – 3 secs
18 seconds is pretty blazing, but now I’ve got the bug and wonder just what else I can optimise. The compilation and use of regexes, although highly optimised in Perl, is still relatively slow, so they had to go. I tried all sorts of ways of using split or unpack, but in the end settled on index, which finds the position of a substring within a string. I also switched from using a substitution to remove the newlines to the transliteration operator tr, which can simply delete the target characters. We now count the CpGs in an astonishing 3 seconds!

sub count {
	local $/ = undef ;                  # slurp mode
	open GENOME, "<$_[0]" or die $! ;
	my $chr_seq = <GENOME> ;
	close GENOME ;

	### replace s///g with tr///d:
	$chr_seq =~ tr/\n//d ;

	### replace the regex with index:
	my $pos = -1 ;
	my $c = 0 ;
	while (($pos = index($chr_seq, 'CG', $pos)) > -1) {
		$c++ ;     # found one
		$pos++ ;   # resume the search just past this match
	}
	return $c ;
}
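If you want to check the s///g versus tr///d claim on your own hardware, the core Benchmark module makes it a one-minute job. A rough sketch, benchmarking against a single slurped chromosome (the filename is just an example):

use strict ;
use warnings ;
use Benchmark qw(cmpthese) ;

# slurp one chromosome to test against
my $seq = do {
	local $/ ;
	open my $fh, '<', 'Homo_sapiens.GRCh37.72.dna.chromosome.21.fa' or die $! ;
	<$fh> ;
} ;

# each sub works on a fresh copy so neither gets a pre-stripped string
cmpthese(5, {
	's///g'  => sub { my $s = $seq ; $s =~ s/\n//g ; },
	'tr///d' => sub { my $s = $seq ; $s =~ tr/\n//d ; },
}) ;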


>time perl countGpGinGenome.pl ./data/
27999538

real 0m3.136s
user 0m27.759s
sys 0m14.491s

3 seconds! (Note that the user time is far higher than the real time because it’s summed across all the threads.) Now, I’m not a good enough programmer to go much quicker, but I still think there are a few areas that could be optimised further. For example, getting rid of the newline removal altogether would be sensible (see the sketch below). But it doesn’t really matter. This started out as a biological question and got a bit geeky (too much so, if I’m honest), but I learnt so much through this exercise about threaded programming and Perl in general that it was worth it. Also, the next time I do RRBS on a new species I won’t waste any time finding out how many CpGs there are!
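For completeness, here’s a sketch of how skipping the newline removal might look: read line by line and carry the final base of each line over, so a CpG straddling a line break still gets counted and the string never has to be rewritten. I haven’t timed it (and count_no_strip is my name for it, not part of the script above), so treat it as an idea rather than a benchmark:

sub count_no_strip {
	open my $genome, '<', $_[0] or die $! ;
	my ($c, $prev) = (0, '') ;
	while (my $line = <$genome>) {
		chomp $line ;
		if ($line =~ /^>/) {           # fasta header: reset the carry-over
			$prev = '' ;
			next ;
		}
		my $seq = $prev . $line ;      # prepend the previous line's last base
		my $pos = 0 ;
		while (($pos = index($seq, 'CG', $pos)) > -1) {
			$c++ ;
			$pos++ ;
		}
		$prev = length($seq) ? substr($seq, -1) : '' ;
	}
	close $genome ;
	return $c ;
}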

*Only use the files matching: Homo_sapiens.GRCh37.72.dna.chromosome.*.fa where * = 1..22|X|MT