Reading very large text files #19141

Closed
@rsidd

Apologies if this has been covered previously, but I could not find an answer after extensive reading. I did find one reference, discussed below.

Let's say I am reading chromosome 1 of the human genome, approximately 0.25 GB unzipped. The gzipped file is available here:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz

In this file the first line is a "header" which I don't care about, and the rest is the sequence. I want the output to be one long string representing the sequence, concatenating all lines after throwing away the first line and removing trailing newline characters. A simple-minded Python function that does this is as follows. On my machine it takes less than 700ms to run.

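def readfasta(filename):
    # Simple-minded version: build the result by repeated concatenation.
    with open(filename) as f:
        f.readline()                      # discard the header line
        seq = f.readline().rstrip("\n")   # first sequence line, newline removed
        for line in f:
            seq += line.rstrip("\n")      # append each line without its newline
    return seq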
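For comparison, here is a minimal sketch of the same task that strips each line and joins everything once at the end, rather than concatenating inside the loop. The name `readfasta_join` is made up for illustration, and the sketch assumes a plain-text FASTA file with a single header line.

```python
def readfasta_join(filename):
    """Return the sequence in a single-record FASTA file as one string."""
    with open(filename) as f:
        f.readline()  # discard the header line
        # Strip the trailing newline from each remaining line, then join once.
        return "".join(line.rstrip("\n") for line in f)
```

Calling `readfasta_join("chr1.fa")` on the unzipped file should return the same string as the function above. No timing is claimed for it here.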

A direct translation to Julia does not seem to finish in any reasonable time. On a file 1/200 the size it takes 11 seconds, and the time seems to grow faster than linearly with file size.

I found this post apparently addressing the matter: https://groups.google.com/forum/#!topic/julia-dev/UDllYRfm64w
The OP said reading the file into an array of strings helped. I tried that, and the following runs in 20 seconds, which is still about 30 times slower than Python.

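function readfasta(filename::String)
    f = open(filename)
    l0 = readline(f)    # read and discard the header line
    # drop the last character (the trailing newline) of each remaining line
    strarr = map(x::String->x[1:end-1], readlines(f))
    close(f)
    return join(strarr,"")
end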

A comment on that post suggested using IOBuffer, but I am not clear on how to use it to join the strings, after dropping their trailing newline characters, more efficiently than the code above does. Any help would be very welcome.
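As a rough illustration of the buffer idea, here is a sketch in Python, where `io.StringIO` stands in for an in-memory buffer. This is an analogy only, not Julia's IOBuffer API, and `readfasta_buffer` is a made-up name. Each line is written to the buffer without its newline, and the full string is extracted once at the end.

```python
import io

def readfasta_buffer(filename):
    """Accumulate the sequence lines in an in-memory buffer, then return it."""
    buf = io.StringIO()
    with open(filename) as f:
        f.readline()  # discard the header line
        for line in f:
            buf.write(line.rstrip("\n"))  # append the line without its newline
    return buf.getvalue()  # extract the accumulated text in one step
```

Whether the equivalent pattern in Julia is faster than the `join` version above is something I have not measured.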


Labels: io, needs more info, performance
