Reading very large text files

Apologies if this has been covered before; I was unable to find an answer after extensive reading. I did find one reference, discussed below.

Let's say I am reading human genome chromosome 1, approximately 0.25 GB unzipped. The zipped file is available here:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz

In this file the first line is a "header" which I don't care about, and the rest is the sequence.  I want the output to be one long string representing the sequence, concatenating all lines after throwing away the first line and removing trailing newline characters.  A simple-minded Python function that does this is as follows.  On my machine it takes less than 700ms to run.

```
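def readfasta(filename):
    with open(filename) as f:
        f.readline()                # discard the header line
        l0 = f.readline()[0:-1]     # first sequence line, minus its trailing newline
        for l in f:
            l0 += l[0:-1]           # append each remaining line without its newline
    return l0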
```

A direct translation to Julia does not finish in any reasonable time on the full file. On a file 1/200 the size it takes 11 seconds, and the time seems to increase more than linearly with file size.

I found this post apparently addressing the matter: https://groups.google.com/forum/#!topic/julia-dev/UDllYRfm64w 
The OP said reading the file into an array of strings helped. I tried that, and the following runs in 20 seconds, which is still about 30 times slower than Python.

```
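function readfasta(filename::String)
    f = open(filename)
    l0 = readline(f)   # read and discard the header line
    # drop the trailing newline character from each remaining line
    strarr = map(x::String->x[1:end-1], readlines(f))
    close(f)
    return join(strarr,"")
end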
```

A comment on that post suggested using IOBuffer, but I am not clear on how to do this (joining strings after dropping the last newline characters) more efficiently than what is being done above.  Any help would be very welcome. 
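To make concrete what I understand the buffer suggestion to mean, here is a minimal Python sketch of the same steps (discard the header, strip each trailing newline, accumulate) using `io.StringIO` in place of repeated string concatenation. The function name `readfasta_buffered` is hypothetical, and I am only assuming that the Julia analogue would write each stripped line into an `IOBuffer` and convert it to a string once at the end.

```python
import io

def readfasta_buffered(filename):
    """Return the sequence in a FASTA file as one string, skipping the header line."""
    buf = io.StringIO()                    # in-memory text buffer that grows as we write
    with open(filename) as f:
        f.readline()                       # discard the header line
        for line in f:
            buf.write(line.rstrip("\n"))   # append the line without its trailing newline
    return buf.getvalue()                  # build the final string once
```

The point of the buffer is that each line is appended to a growing buffer and the final string is built once, rather than creating a new, longer string on every iteration.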

