Description
Apologies if this has been covered before, but I was unable to find an answer after extensive reading. I did find one relevant reference, discussed below.
Let's say I am reading chromosome 1 of the human genome, approximately 0.25 GB unzipped. The gzipped file is available here:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz
In this file the first line is a "header" that I don't care about; the rest is the sequence. I want the output to be one long string containing the sequence, i.e. all lines after the first concatenated together with their trailing newline characters removed. A simple-minded Python function that does this is shown below; on my machine it runs in under 700 ms.
def readfasta(filename):
    f = open(filename)
    l0 = f.readline()        # read and discard the header line
    l0 = f.readline()[0:-1]  # first sequence line, trailing newline dropped
    for l in f:
        l0 += l[0:-1]        # append each subsequent line without its newline
    return l0
A direct translation to Julia does not seem to finish in any reasonable time: it takes 11 seconds on a file 1/200th the size, and the time appears to grow more than linearly with file size.
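For concreteness, this is roughly the kind of direct translation I mean (a simplified sketch; the function name is just for illustration). It builds the result by repeated string concatenation, which allocates a new string on every iteration:

function readfasta_naive(filename)
    f = open(filename)
    readline(f)              # read and discard the header line
    seq = ""
    for l in eachline(f)
        seq *= chomp(l)      # each *= copies the whole accumulated string
    end
    close(f)
    return seq
end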
I found this post apparently addressing the matter: https://groups.google.com/forum/#!topic/julia-dev/UDllYRfm64w
The OP said that reading the file into an array of strings helped. I tried that, and the following runs in 20 seconds, still about 30 times slower than the Python version.
function readfasta(filename::String)
    f = open(filename)
    l0 = readline(f)  # read and discard the header line
    strarr = map(x::String->x[1:end-1], readlines(f))  # strip trailing newlines
    return join(strarr, "")
end
A comment on that post suggested using an IOBuffer, but I am not clear on how to use one to do this (join the lines after dropping their trailing newline characters) more efficiently than what is being done above. Any help would be very welcome.
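For reference, here is my best guess at what the IOBuffer approach would look like, written in current Julia syntax (a sketch only; I have not benchmarked it, and the function name is mine). The idea is to write each line into a growing in-memory buffer and convert it to a string once at the end, so that no intermediate strings are built:

function readfasta_iobuf(filename)
    buf = IOBuffer()
    open(filename) do f
        readline(f)              # read and discard the header line
        for l in eachline(f)
            print(buf, chomp(l)) # append the line without its trailing newline
        end
    end
    return String(take!(buf))    # materialize the full sequence in one step
end

Is something along these lines what the comment had in mind, and would it actually beat the map/join version above?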