
Seems to be 9x slower than data.table on a 6 core machine #45

Open · xiaodaigh opened this issue Apr 30, 2019 · 5 comments

@xiaodaigh

I have uploaded the largest CSV I could find in the wild, and it takes 500 seconds to read with TableReader. data.table reads it in about 50 seconds, so TableReader is roughly 9~10 times slower; since my machine has only 6 cores, I was expecting data.table's parallelism to account for at most around a 6x difference. Hopefully this will be useful for tuning performance on large files.

download("https://github.com/xiaodaigh/testing/raw/master/Performance_2003Q3.zip", "ok.zip")
run(`unzip -o ok.zip`)
using TableReader
path = "Performance_2003Q3.txt"
@time a = readcsv(path, delim = '|', hasheader = false); # 500 seconds
@bicycle1885 (Owner) commented May 1, 2019

I quickly benchmarked the performance using that data. The throughput was approximately 1 million records per second up to 70 million records, so the expected time to load the whole file would be ~140 seconds, but the elapsed time grew superlinearly after that point. Unfortunately, my machine couldn't load the whole file due to its RAM limit (32 GB).
[plot: elapsed time vs. number of records]
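(For reference, a rough sketch of how throughput over record-count prefixes could be measured; the prefix-file approach below is only an assumption, not necessarily how the plot above was produced.)

```julia
using TableReader

# Time readcsv on growing prefixes of the file to see where throughput drops off.
function bench_prefixes(path; steps = (10_000_000, 20_000_000, 40_000_000))
    for n in steps
        tmp = tempname()
        open(tmp, "w") do out
            for (i, line) in enumerate(eachline(path))
                println(out, line)
                i >= n && break
            end
        end
        t = @elapsed readcsv(tmp, delim = '|', hasheader = false)
        println("$n records: $(round(t, digits = 1)) s ($(round(n / t / 1e6, digits = 2)) M records/s)")
        rm(tmp)
    end
end

bench_prefixes("Performance_2003Q3.txt")
```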

@xiaodaigh (Author)

The R data.table version seems to be fairly memory efficient as well, using only about 30 GB of RAM. But it runs through the data twice: once to count the rows and a second time to populate the columns.
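(For context, a minimal sketch of that two-pass idea, counting records first and then filling preallocated columns; the helper names and the single string column are made up for illustration and are not data.table's or TableReader's internals.)

```julia
# Pass 1 counts records so that column vectors can be allocated at their final size;
# pass 2 fills them in without any resizing.
function count_records(path::AbstractString)
    n = 0
    open(path) do io
        for _ in eachline(io)
            n += 1
        end
    end
    return n
end

function read_two_pass(path::AbstractString; delim::Char = '|')
    n = count_records(path)              # pass 1: count rows
    col1 = Vector{String}(undef, n)      # preallocate, no doubling needed
    open(path) do io
        for (i, line) in enumerate(eachline(io))
            fields = split(line, delim)  # pass 2: populate
            col1[i] = String(fields[1])
        end
    end
    return col1
end
```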

@bicycle1885 (Owner)

The memory usage difference may come from the difference in integer encoding: Julia uses 64-bit integers by default, but R uses 32-bit integers, doesn't it?
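(A quick back-of-the-envelope check of that point; the 70-million-record figure comes from the benchmark above, while the 20-column count is only an assumption for illustration.)

```julia
n = 70_000_000                 # records, per the benchmark above
sizeof(Int64) * n / 2^30       # ≈ 0.52 GiB per Int64 column in Julia
sizeof(Int32) * n / 2^30       # ≈ 0.26 GiB per 32-bit integer column in R
# For, say, 20 integer columns (hypothetical), that difference alone is ~5 GiB.
```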

@bicycle1885 (Owner) commented May 1, 2019

A benchmark result with finer granularity.

[plot: elapsed time and memory usage vs. number of records, finer granularity]

The jumps in elapsed time and memory usage happen when the number of records doubles, which is expected because Julia doubles the capacity of a vector when it needs more space.
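(A small illustration of that growth behaviour; this is just standard `push!`/`sizehint!` usage, not TableReader code.)

```julia
# Appending one element at a time makes Julia grow the buffer geometrically,
# so reallocations (and the copies they imply) occur at roughly doubling sizes.
v = Int[]
for i in 1:10_000_000
    push!(v, i)
end

# Reserving the capacity up front avoids those intermediate reallocations.
w = Int[]
sizehint!(w, 10_000_000)
for i in 1:10_000_000
    push!(w, i)
end
```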

@bicycle1885 (Owner)

Estimating the number of records during a first scan would reduce the required memory, since it makes it possible to avoid repeatedly expanding the column vectors.
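(One way such an estimate could work, sketched under my own assumptions rather than as a description of what TableReader would actually do: derive an approximate record count from the file size and the average length of a small sample of records, then `sizehint!` the column vectors.)

```julia
# Estimate the number of records from the file size and a small sample,
# then reserve capacity so the column vectors rarely need to be regrown.
function estimate_records(path::AbstractString; sample = 1000)
    total = filesize(path)
    nbytes = 0
    nlines = 0
    open(path) do io
        for line in eachline(io)
            nbytes += sizeof(line) + 1   # +1 for the newline
            nlines += 1
            nlines >= sample && break
        end
    end
    return nlines == 0 ? 0 : ceil(Int, total / (nbytes / nlines))
end

estimated = estimate_records("Performance_2003Q3.txt")
col = Float64[]
sizehint!(col, estimated)   # preallocate roughly the right capacity
```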
