
Basic benchmarking #141

Closed · hadley opened this issue Jan 16, 2012 · 10 comments

@hadley commented Jan 16, 2012

It would be really useful to include some basic benchmarking. For large data, how does csvkit compare to awk, or to pulling all my data into R?

@glepore70 commented:

I have a script which uses Perl, specifically the Text::CSV_XS module, to parse delimited files and perform a few manipulations on each column. The script determines whether each column is numeric, alpha, or date formatted, then produces a list of the top 10 distinct values by count, the sorted top 5, the sorted bottom 5, and a count of nulls in every column. I replaced the parsing section of the script:

    while ( my $column = $csv->getline($FH2) ) {
        push @uniqarr, $column->[$t];
    }
    $csv->eof or $csv->error_diag();
    close $FH2;

with:

```perl
my @uniqarr = `./csvcut -d$delimiter -c$t -e LATIN $file`;
chomp @uniqarr;
```

I was stunned to find that, running the new code against the data at ftp://ftp.epa.gov/castnet/data/ozone_1997.zip (a 32MB comma-delimited file, 520,000 rows, 7 columns), the csvcut version ran roughly 4 times faster (2 minutes 35 seconds vs. 10 minutes 55 seconds). I may be making some fundamental mistake in my use of the Text::CSV_XS code, but I haven't found it.

I'm a little worried about csvkit's ability to handle all of the messy CSV files that Text::CSV_XS handles (see http://search.cpan.org/~hmbrand/Text-CSV_XS-0.85/CSV_XS.pm), but I'm thrilled at the increase in speed. Here is a bash shell equivalent to my Perl script (minus a bunch of prettying up):

```bash
#!/bin/bash
file=ozone.csv
delimiter=","
COUNTER=1
echo "##########"
echo "Verification results for $file"
date
columns=$(./csvcut -n -d$delimiter $file | wc -l)
echo "There are $columns columns in the file."
let columns=columns+1
while [ $COUNTER -lt $columns ]; do
    column=$(./csvcut -n -d$delimiter $file | sed -n ''$COUNTER'p')
    read -rd '' column <<< "$column"
    grep_column=$(echo "$column" | cut -c 6-)
    echo "###########"
    echo "Printing distinct values in column $column"
    ./csvcut -d$delimiter -c$COUNTER -e LATIN $file | sed 1d | sort -gr | uniq -c | sort -gr | head -10 | awk '{print $2, $1}'
    echo "Lowest 5 values in column $column"
    ./csvcut -d$delimiter -c$COUNTER -e LATIN $file | sed 1d | sort -g | uniq | head -5
    echo "Highest 5 values in column $column"
    ./csvcut -d$delimiter -c$COUNTER -e LATIN $file | sed 1d | sort -gr | uniq | head -5
    echo "Total number of NULLs in column $column"
    ./csvcut -d$delimiter -c$COUNTER -e LATIN $file | grep -c "\"\""
    let COUNTER=COUNTER+1
done
echo "Total number of rows processed"
date
wc -l $file
```

I'll try the code against some larger files to see whether the speed increase scales. I haven't tried an awk solution because of the messy-data issues that Text::CSV_XS handles well. I also found that R doesn't handle alphanumeric fields as well as it handles purely numeric data (and it's much more complicated to me than Perl).

@hadley (Author) commented Jan 20, 2012

That seems pretty slow - in R (with no optimisations), I can read in that file in 5 seconds and write out the complete file in another 5 seconds.

@glepore70 commented:

Are you sure you're testing it on the largest file in the zip file linked above? I get 21 seconds just to cut out the first column:

    time ./csvcut -d, -c1 ozone.csv > output.txt

    real    0m21.882s
    user    0m19.429s
    sys     0m0.076s

Granted, I've got a pretty weak machine...

@glepore70 commented:

OK, I made a mistake on the Perl side above: I was using the regular Text::CSV module, not the compiled Text::CSV_XS module, which is much faster. Replacing that brings the run time above for ozone.csv to 54 seconds for the Perl code and 2 minutes 35 seconds for csvcut.

Running the new code on a 200MB file (730,000 rows, 33 columns, ftp://ftp.epa.gov/castnet/data/model_output_2009.zip) yields times of 33 minutes for csvcut and 15 minutes for the Perl code.

Backing up a bit and just comparing the time it takes to print out the first column, for the 200MB file:

`./csvcut -d, -c1 model_output_2009.csv > csv_out.txt`: 54 seconds

`awk -F "," '{print $1}' model_output_2009.csv > awk_out.txt`: 1 second

Perl code below: 32 seconds

    use strict;
    use warnings;
    use Text::CSV_XS;
    my $file = 'model_output_2009.csv';
    my $csv = Text::CSV_XS->new();
    open (CSV, "<", $file) or die $!;
    while (<CSV>) {
        next if ($. == 1);    # skip the header row
        if ($csv->parse($_)) {
            my @columns = $csv->fields();
            print "$columns[0]\n";    # first column only
        } else {
            my $err = $csv->error_input;
            print "Failed to parse line: $err";
        }
    }
    close CSV;

That looks like a clear win for awk, except that I don't think it has much ability to handle embedded line feeds, ugly quoting, etc.
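
As a tiny illustration of the quoting problem (a sketch, not code from any of the tools above), this is what splitting on every comma, which is effectively what `awk -F","` does, makes of a quoted field, compared with a CSV-aware parser:

```python
import csv
import io

line = '"Smith, John",42\n'

# Splitting on every comma, the way awk -F"," does, breaks the quoted field:
print(line.rstrip("\n").split(","))         # ['"Smith', ' John"', '42']

# A CSV-aware parser keeps the quoted field intact:
print(next(csv.reader(io.StringIO(line))))  # ['Smith, John', '42']
```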

awk also gets slower when extracting a column further to the right. Extracting the 25th column from the 200MB file takes 8 seconds with awk and 54 seconds with csvcut.

I don't know much about Python, but is it possible to compile the code to make it run faster? That's the difference between the two Perl modules I was using; the compiled one runs much, much faster.
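
For reference, here is a bare-bones, standard-library-only sketch of the same first-column extraction in Python (this is not csvkit's actual code). CPython's csv module is itself backed by a C extension, so the parsing core is already "compiled" in the same sense as Text::CSV_XS; csvcut presumably adds encoding handling and other conveniences on top of this, which would account for some of the overhead.

```python
# Minimal sketch: print the first column of a CSV using only the standard library.
import csv
import sys

def print_first_column(path):
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if row:                 # skip completely blank lines
                print(row[0])

if __name__ == "__main__":
    # e.g. python first_column.py model_output_2009.csv
    print_first_column(sys.argv[1])
```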

Hopefully this all helps, sorry for rambling on....

@hadley (Author) commented Jan 21, 2012

Those timings were definitely on the 43 meg file.

    > system.time(x <- read.csv("ozone_1997.csv"))
       user  system elapsed 
      3.814   0.203   5.146 
    > system.time(write.csv(x[, 1], "ozone_1997.csv"))
       user  system elapsed 
      0.882   0.061   2.516

I suspect high performance will require some C code to parse the CSV file.

@glepore70 commented:

My R setup yields the following (just to keep the above measurements consistent):

    > system.time(x <- read.csv("ozone_1997.csv"))
       user  system elapsed 
      6.437   0.060   6.503 
    > system.time(write.csv(x[, 1], "ozone_1997.csv"))
       user  system elapsed 
      1.176   0.060   1.238 

@onyxfish (Collaborator) commented:

Hey guys, sorry I've been slow in replying to this ticket. Taking this as an actual "issue," my short answer would be: I don't really care about performance that much. For csvkit the priorities of usability, general utility and maintainability always trump performance. In fact, I'm so certain that csvkit is slower than the alternatives that I don't see too much to be gained by measuring it.

Now, that being said, of course, I'd love for it to be fast, and I'm happy to merge/implement anyone's suggestions that increase the performance (as long as they don't sacrifice UX/features). Moreover, I would like to have a set of benchmarks, if only to prevent regressions, but I don't consider it a priority.

This may all sound like data blasphemy, but if I want things to be fast I'll always write optimized code, and I'll suggest that anyone else do the same.

@camoles commented May 28, 2014

One possible suggestion to improve performance without sacrificing usability would be to re-implement some of the csvkit tools as wrappers around standard Unix tools like awk. That would create portability issues, though.
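
As a rough sketch of that idea (purely hypothetical, not how csvkit works), a tool could delegate the simple cases to awk via a subprocess; the portability catch is that awk has to be on the PATH, and this is only safe for files without quoted commas or embedded newlines:

```python
# Hypothetical sketch: hand plain column extraction off to awk.
import subprocess

def first_column_via_awk(path):
    """Return the first comma-separated field of every line in `path`."""
    result = subprocess.run(
        ["awk", "-F,", "{print $1}", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

# Usage: first_column_via_awk("ozone.csv")
```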

@gsf commented Apr 20, 2015

This is by no means definitive, but a reading of https://davidlyness.com/post/the-functional-and-performance-differences-of-sed-awk-and-other-unix-parsing-utilities suggests that CSV handling in Python is already fairly performant compared to awk.
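
For anyone who wants to sanity-check that on their own files, here is a rough, machine-dependent sketch (the file name is just a placeholder) comparing csv.reader against an awk-style naive comma split:

```python
# Rough timing sketch: pull the first column with csv.reader versus a naive split.
import csv
import time

def timed(label, fn, path):
    start = time.perf_counter()
    values = fn(path)
    print(f"{label}: {len(values)} values in {time.perf_counter() - start:.2f}s")

def first_column_csv(path):
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f) if row]

def first_column_split(path):
    with open(path) as f:
        return [line.split(",", 1)[0] for line in f]

timed("csv.reader", first_column_csv, "ozone_1997.csv")
timed("str.split", first_column_split, "ozone_1997.csv")
```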

@jpmckinney modified the milestone: Backlog (Feb 9, 2016)

@jpmckinney (Member) commented:

I'm not seeing a clear issue to resolve here, so closing.
