Basic benchmarking #141

It would be really useful to include some basic benchmarking. For large data, how does csvkit compare to awk, or to pulling all my data into R?
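As a starting point, a wall-clock comparison of the two command-line approaches could be as small as the sketch below. It is only an illustration, not part of csvkit: it assumes `csvcut` and `awk` are installed and on the PATH, and takes the CSV file to test as its first argument.

```python
# Minimal sketch of a wall-clock benchmark comparing csvcut and awk on the
# same first-column extraction. Assumes both tools are on the PATH and that
# the CSV file to test is passed as the first command-line argument.
import subprocess
import sys
import time

def time_command(cmd):
    """Run a shell command with output discarded and return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, shell=True, stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start

if __name__ == "__main__":
    path = sys.argv[1]  # e.g. ozone_1997.csv
    commands = {
        "csvcut": f"csvcut -c 1 {path}",
        "awk": f"awk -F ',' '{{print $1}}' {path}",
    }
    for name, cmd in commands.items():
        print(f"{name}: {time_command(cmd):.1f}s")
```

Timing whole commands like this captures start-up cost as well as parsing, which is usually what matters for one-off command-line use.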
I have a script which uses Perl and specifically the Text::CSV_XS module to parse delimited files and perform a few manipulations on each column. The script determines if the column is numeric, alpha, or date formatted, then produces a list of the top 10 distinct values by count, the sorted top 5, the sorted bottom 5, and a count of nulls in every column. I replaced the parsing section of the script with:

```perl
my @uniqarr=
```
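The Perl around that line did not survive the formatting above, but for readers more at home in Python, a rough stdlib equivalent of the per-column summary just described might look like this (a sketch only, omitting the numeric/alpha/date detection, and not the poster's actual script):

```python
# Sketch of the per-column summary described above: count nulls and distinct
# values per column, then report the top 10 values by count plus the sorted
# top and bottom 5. Type detection (numeric/alpha/date) is omitted.
import csv
import sys
from collections import Counter

def profile_columns(path):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        counters = [Counter() for _ in header]
        nulls = [0] * len(header)
        for row in reader:
            for i, value in enumerate(row[:len(header)]):
                if value == "":
                    nulls[i] += 1
                else:
                    counters[i][value] += 1
    for name, counter, null_count in zip(header, counters, nulls):
        ordered = sorted(counter)
        print(name)
        print("  top 10 by count:", counter.most_common(10))
        print("  sorted top 5:   ", ordered[-5:])
        print("  sorted bottom 5:", ordered[:5])
        print("  nulls:          ", null_count)

if __name__ == "__main__":
    profile_columns(sys.argv[1])
```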
That seems pretty slow - in R (with no optimisations), I can read in that file in 5 seconds and write out the complete file in another 5 seconds.
Are you sure you're testing it on the largest file in the zip file linked above? I get 21 seconds just to cut out the first column. Granted, I've got a pretty weak machine...
OK - I made a mistake on the Perl side above: I was using the regular Text::CSV module, not the compiled Text::CSV_XS module, which is much faster. Replacing that brings the above run time for ozone.csv to 54 seconds for the Perl code and 2 minutes 35 seconds for csvcut. Running the new code on a 200MB file (730,000 rows, 33 columns, ftp://ftp.epa.gov/castnet/data/model_output_2009.zip) yields times of 33 minutes for csvcut and 15 minutes for the Perl code. Backing up a bit and just comparing the time it takes to print out the first column of the 200MB file:

- `./csvcut -d, -c1 model_output_2009.csv > csv_out.txt` - 54 seconds
- `awk -F "," '{print $1}' model_output_2009.csv > awk_out.txt` - 1 second
- Perl code below - 32 seconds
Which looks like a clear win for awk, except that I don't think it has much ability to handle embedded line feeds, ugly quoting, etc. awk also gets slower when extracting a column further to the right: extracting the 25th column from the 200MB file takes 8 seconds with awk and 54 seconds with csvcut. I don't know much about Python, but is it possible to compile the code to make it run faster? That's the difference between the two Perl modules I was using: the compiled one runs much, much faster. Hopefully this all helps, sorry for rambling on....
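On the question of compiled code: in CPython the csv module is itself backed by a C extension, so the low-level parsing is already compiled, and any csvkit overhead would sit in the Python layers above it. A quick, hedged way to check how the bare csv module fares on the same first-column task (file name taken from the command line):

```python
# Rough sketch: time how long Python's built-in csv module takes to read a
# file and pull out the first column, analogous to `csvcut -c 1`.
import csv
import sys
import time

path = sys.argv[1]  # e.g. model_output_2009.csv
start = time.perf_counter()
with open(path, newline="") as f:
    first_column = [row[0] for row in csv.reader(f) if row]
elapsed = time.perf_counter() - start
print(f"rows: {len(first_column)}, elapsed: {elapsed:.1f}s")
```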
Those timings were definitely on the 43 meg file:

```r
> system.time(x <- read.csv("ozone_1997.csv"))
   user  system elapsed
  3.814   0.203   5.146
> system.time(write.csv(x[, 1], "ozone_1997.csv"))
   user  system elapsed
  0.882   0.061   2.516
```

I suspect high performance will require some C code to parse the csv file.
My R setup yields the following (just to keep the above measurements consistent):
Hey guys, sorry I've been slow in replying to this ticket. Taking this as an actual "issue," my short answer would be: I don't really care about performance that much. For csvkit the priorities of usability, general utility and maintainability always trump performance. In fact, I'm so certain that csvkit is slower than the alternatives that I don't see too much to be gained by measuring it.

Now, that being said, of course, I'd love for it to be fast, and I'm happy to merge/implement anyone's suggestions that increase the performance (as long as they don't sacrifice UX/features). Moreover, I would like to have a set of benchmarks, if only to prevent regressions, but I don't consider it a priority.

This may all sound like data blasphemy, but if I want things to be fast I'll always write optimized code, and I'll suggest that anyone else do the same.
One possible suggestion to improve performance without sacrificing usability would be to re-implement some of the csvkit tools as wrappers around standard unix tools like awk. That would create portability issues though.
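A hedged sketch of what such a wrapper might look like, handing the simple unquoted case to awk and falling back to Python's csv module otherwise (the function name and the quote-sniffing heuristic are illustrative, not anything in csvkit):

```python
# Illustrative sketch (not csvkit's actual implementation): hand simple
# column extraction to awk when no quoting is seen in a sample of the file,
# otherwise fall back to Python's csv module for correct parsing.
import csv
import subprocess

def cut_first_column(path, delimiter=","):
    with open(path, newline="") as f:
        sample = f.read(4096)
    if '"' not in sample:
        # Fast path: no quote characters in the sample, let awk stream it.
        result = subprocess.run(
            ["awk", "-F", delimiter, "{print $1}", path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.splitlines()
    # Slow path: quoting present, parse properly with the csv module.
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f, delimiter=delimiter) if row]
```

The design keeps correct parsing for awkward files while only shelling out when it looks safe, but as noted it would tie the tool to a Unix environment.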
This is by no means definitive, but a reading of https://davidlyness.com/post/the-functional-and-performance-differences-of-sed-awk-and-other-unix-parsing-utilities suggests CSV handling in Python is already fairly performant compared to awk.
I'm not seeing a clear issue to resolve here, so closing.