
Seems to be 9x slower than data.table on a 6 core machine #45

Open · xiaodaigh opened this issue Apr 30, 2019 · 5 comments

@xiaodaigh

I have uploaded the largest CSV I could find in the wild, and it takes 500 seconds to read with TableReader. data.table reads it in about 50 seconds, so TableReader is roughly 9~10 times slower; since my machine has only 6 cores, I was expecting data.table's parallelism to account for at most around a 6x difference. Hopefully this will be useful for tuning performance on large files.

download("https://github.com/xiaodaigh/testing/raw/master/Performance_2003Q3.zip", "ok.zip")
run(`unzip -o ok.zip`)
using TableReader
path = "Performance_2003Q3.txt"
@time a = readcsv(path, delim = '|', hasheader = false); # 500 seconds
@bicycle1885 (Owner) commented May 1, 2019

I quickly benchmarked the performance using that data. The throughput was approximately 1 million records per second up to 70 million records, so the expected time to load the whole file would be ~140 seconds, but the elapsed time grew superlinearly after that point. Unfortunately, my machine couldn't load the whole file due to its RAM limit (32 GB).
[plot: elapsed time vs. number of records]
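(For reference, a rough sketch of how throughput over record-count prefixes could be measured; the prefix-file approach below is only an assumption, not necessarily how the plot above was produced.)

```julia
using TableReader

# Time readcsv on growing prefixes of the file to see where throughput drops off.
function bench_prefixes(path; steps = (10_000_000, 20_000_000, 40_000_000))
    for n in steps
        tmp = tempname()
        open(tmp, "w") do out
            for (i, line) in enumerate(eachline(path))
                println(out, line)
                i >= n && break
            end
        end
        t = @elapsed readcsv(tmp, delim = '|', hasheader = false)
        println("$n records: $(round(t, digits = 1)) s ($(round(n / t / 1e6, digits = 2)) M records/s)")
        rm(tmp)
    end
end

bench_prefixes("Performance_2003Q3.txt")
```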

@xiaodaigh (Author)

The R data.table version seems to be fairly memory efficient as well, using only about 30 GB of RAM. But it runs through the data twice: once to count the rows and a second time to populate the columns.
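(For context, a minimal sketch of that two-pass idea, counting records first and then filling preallocated columns; the helper names and the single string column are made up for illustration and are not data.table's or TableReader's internals.)

```julia
# Pass 1 counts records so that column vectors can be allocated at their final size;
# pass 2 fills them in without any resizing.
function count_records(path::AbstractString)
    n = 0
    open(path) do io
        for _ in eachline(io)
            n += 1
        end
    end
    return n
end

function read_two_pass(path::AbstractString; delim::Char = '|')
    n = count_records(path)              # pass 1: count rows
    col1 = Vector{String}(undef, n)      # preallocate, no doubling needed
    open(path) do io
        for (i, line) in enumerate(eachline(io))
            fields = split(line, delim)  # pass 2: populate
            col1[i] = String(fields[1])
        end
    end
    return col1
end
```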

@bicycle1885 (Owner)

The memory usage difference may come from the difference in integer encoding: Julia uses 64-bit integers by default, but R uses 32-bit integers, doesn't it?
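(A quick back-of-the-envelope check of that point; the 70-million-record figure comes from the benchmark above, while the 20-column count is only an assumption for illustration.)

```julia
n = 70_000_000                 # records, per the benchmark above
sizeof(Int64) * n / 2^30       # ≈ 0.52 GiB per Int64 column in Julia
sizeof(Int32) * n / 2^30       # ≈ 0.26 GiB per 32-bit integer column in R
# For, say, 20 integer columns (hypothetical), that difference alone is ~5 GiB.
```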

@bicycle1885 (Owner) commented May 1, 2019

A benchmark result with finer granularity.

[plot: elapsed time and memory usage vs. number of records, finer granularity]

The jumps in elapsed time and memory usage happen when the number of records doubles, which is expected because Julia doubles the capacity of a vector when it needs more space.
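(A small illustration of that growth behaviour; this is just standard `push!`/`sizehint!` usage, not TableReader code.)

```julia
# Appending one element at a time makes Julia grow the buffer geometrically,
# so reallocations (and the copies they imply) occur at roughly doubling sizes.
v = Int[]
for i in 1:10_000_000
    push!(v, i)
end

# Reserving the capacity up front avoids those intermediate reallocations.
w = Int[]
sizehint!(w, 10_000_000)
for i in 1:10_000_000
    push!(w, i)
end
```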

@bicycle1885 (Owner)

Estimating the number of records during a first scan would reduce the required memory, since it makes it possible to avoid repeatedly expanding the column vectors.
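(One way such an estimate could work, sketched under my own assumptions rather than as a description of what TableReader would actually do: derive an approximate record count from the file size and the average length of a small sample of records, then `sizehint!` the column vectors.)

```julia
# Estimate the number of records from the file size and a small sample,
# then reserve capacity so the column vectors rarely need to be regrown.
function estimate_records(path::AbstractString; sample = 1000)
    total = filesize(path)
    nbytes = 0
    nlines = 0
    open(path) do io
        for line in eachline(io)
            nbytes += sizeof(line) + 1   # +1 for the newline
            nlines += 1
            nlines >= sample && break
        end
    end
    return nlines == 0 ? 0 : ceil(Int, total / (nbytes / nlines))
end

estimated = estimate_records("Performance_2003Q3.txt")
col = Float64[]
sizehint!(col, estimated)   # preallocate roughly the right capacity
```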
