
Basic benchmarking #141

Closed · hadley opened this issue Jan 16, 2012 · 10 comments

@hadley commented Jan 16, 2012

It would be really useful to include some basic benchmarking. For large data, how does csvkit compare to awk, or to pulling all my data into R?

@glepore70 commented:

I have a script which uses Perl, specifically the Text::CSV_XS module, to parse delimited files and perform a few manipulations on each column. The script determines whether each column is numeric, alpha, or date formatted, then produces a list of the top 10 distinct values by count, the sorted top 5, the sorted bottom 5, and a count of nulls in every column. I replaced the parsing section of the script:

    while ( my $column = $csv->getline($FH2) ) {
        push @uniqarr, $column->[$t];
    }
    $csv->eof or $csv->error_diag();
    close $FH2;

with:

```perl
my @uniqarr = `./csvcut -d$delimiter -c$t -e LATIN $file`;
chomp @uniqarr;
```

I was stunned to find that, running the new code against the data at ftp://ftp.epa.gov/castnet/data/ozone_1997.zip (a 32MB comma-delimited file, 520,000 rows, 7 columns), the csvcut version ran roughly 4 times faster (2 minutes 35 seconds vs. 10 minutes 55 seconds). I may be making some fundamental mistake in my use of the Text::CSV_XS code, but I haven't found it.

I'm a little worried about csvkit's ability to handle all of the messy CSV files that Text::CSV_XS handles (see http://search.cpan.org/~hmbrand/Text-CSV_XS-0.85/CSV_XS.pm), but I'm thrilled at the increase in speed. Here is a bash shell equivalent to my Perl script (minus a bunch of prettying up):

```bash
#!/bin/bash
file=ozone.csv
delimiter=","
COUNTER=1
echo "##########"
echo "Verification results for $file"
date
columns=$(./csvcut -n -d$delimiter $file | wc -l)
echo "There are $columns columns in the file."
let columns=columns+1
while [ $COUNTER -lt $columns ]; do
    column=$(./csvcut -n -d$delimiter $file | sed -n ''$COUNTER'p')
    read -rd '' column <<< "$column"
    grep_column=$(echo "$column" | cut -c 6-)
    echo "###########"
    echo "Printing distinct values in column $column"
    ./csvcut -d$delimiter -c$COUNTER -e LATIN $file | sed 1d | sort -gr | uniq -c | sort -gr | head -10 | awk '{print $2, $1}'
    echo "Lowest 5 values in column $column"
    ./csvcut -d$delimiter -c$COUNTER -e LATIN $file | sed 1d | sort -g | uniq | head -5
    echo "Highest 5 values in column $column"
    ./csvcut -d$delimiter -c$COUNTER -e LATIN $file | sed 1d | sort -gr | uniq | head -5
    echo "Total number of NULLs in column $column"
    ./csvcut -d$delimiter -c$COUNTER -e LATIN $file | grep -c "\"\""
    let COUNTER=COUNTER+1
done
echo "Total number of rows processed"
date
wc -l $file
```

I'll try the code against some larger files to see whether the speed increase scales. I haven't tried an awk solution because of the messy-data issues that Text::CSV_XS handles well. I also found that R doesn't handle alphanumeric fields as well as it handles purely numeric data (and it's much more complicated to me than Perl).

@hadley (Author) commented Jan 20, 2012

That seems pretty slow - in R (with no optimisations), I can read in that file in 5 seconds and write out the complete file in another 5 seconds.

@glepore70 commented:

Are you sure you're testing it on the largest file in the zip file linked above? I get 21 seconds just to cut out the first column:

    time ./csvcut -d, -c1 ozone.csv > output.txt

    real    0m21.882s
    user    0m19.429s
    sys     0m0.076s

Granted, I've got a pretty weak machine...

@glepore70 commented:

OK, I made a mistake on the Perl side above: I was using the regular Text::CSV module, not the compiled Text::CSV_XS module, which is much faster. Replacing that brings the run time above for ozone.csv to 54 seconds for the Perl code and 2 minutes 35 seconds for csvcut.

Running the new code on a 200MB file (730,000 rows, 33 columns, ftp://ftp.epa.gov/castnet/data/model_output_2009.zip) yields times of 33 minutes for csvcut and 15 minutes for the Perl code.

Backing up a bit and just comparing the time it takes to print out the first column, for the 200MB file:

`./csvcut -d, -c1 model_output_2009.csv > csv_out.txt`: 54 seconds

`awk -F "," '{print $1}' model_output_2009.csv > awk_out.txt`: 1 second

Perl code below: 32 seconds

    use strict;
    use warnings;
    use Text::CSV_XS;
    my $file = 'model_output_2009.csv';
    my $csv = Text::CSV_XS->new();
    open (CSV, "<", $file) or die $!;
    while (<CSV>) {
        next if ($. == 1);    # skip the header row
        if ($csv->parse($_)) {
            my @columns = $csv->fields();
            print "$columns[0]\n";    # first column only
        } else {
            my $err = $csv->error_input;
            print "Failed to parse line: $err";
        }
    }
    close CSV;

That looks like a clear win for awk, except that I don't think it has much ability to handle embedded line feeds, ugly quoting, etc.
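
As a tiny illustration of the quoting problem (a sketch, not code from any of the tools above), this is what splitting on every comma, which is effectively what `awk -F","` does, makes of a quoted field, compared with a CSV-aware parser:

```python
import csv
import io

line = '"Smith, John",42\n'

# Splitting on every comma, the way awk -F"," does, breaks the quoted field:
print(line.rstrip("\n").split(","))         # ['"Smith', ' John"', '42']

# A CSV-aware parser keeps the quoted field intact:
print(next(csv.reader(io.StringIO(line))))  # ['Smith, John', '42']
```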

awk also gets slower when extracting a column further to the right. Extracting the 25th column from the 200MB file takes 8 seconds with awk and 54 seconds with csvcut.

I don't know much about Python, but is it possible to compile the code to make it run faster? That's the difference between the two Perl modules I was using; the compiled one runs much, much faster.
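
For reference, here is a bare-bones, standard-library-only sketch of the same first-column extraction in Python (this is not csvkit's actual code). CPython's csv module is itself backed by a C extension, so the parsing core is already "compiled" in the same sense as Text::CSV_XS; csvcut presumably adds encoding handling and other conveniences on top of this, which would account for some of the overhead.

```python
# Minimal sketch: print the first column of a CSV using only the standard library.
import csv
import sys

def print_first_column(path):
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if row:                 # skip completely blank lines
                print(row[0])

if __name__ == "__main__":
    # e.g. python first_column.py model_output_2009.csv
    print_first_column(sys.argv[1])
```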

Hopefully this all helps, sorry for rambling on....

@hadley (Author) commented Jan 21, 2012

Those timings were definitely on the 43 meg file.

    > system.time(x <- read.csv("ozone_1997.csv"))
       user  system elapsed 
      3.814   0.203   5.146 
    > system.time(write.csv(x[, 1], "ozone_1997.csv"))
       user  system elapsed 
      0.882   0.061   2.516

I suspect high performance will require some C code to parse the CSV file.

@glepore70 commented:

My R setup yields the following (just to keep the above measurements consistent):

    > system.time(x <- read.csv("ozone_1997.csv"))
       user  system elapsed 
      6.437   0.060   6.503 
    > system.time(write.csv(x[, 1], "ozone_1997.csv"))
       user  system elapsed 
      1.176   0.060   1.238 

@onyxfish (Collaborator) commented:

Hey guys, sorry I've been slow in replying to this ticket. Taking this as an actual "issue," my short answer would be: I don't really care about performance that much. For csvkit the priorities of usability, general utility and maintainability always trump performance. In fact, I'm so certain that csvkit is slower than the alternatives that I don't see too much to be gained by measuring it.

Now, that being said, of course, I'd love for it to be fast, and I'm happy to merge/implement anyone's suggestions that increase the performance (as long as they don't sacrifice UX/features). Moreover, I would like to have a set of benchmarks, if only to prevent regressions, but I don't consider it a priority.

This may all sound like data blasphemy, but if I want things to be fast I'll always write optimized code, and I'll suggest that anyone else do the same.

@camoles commented May 28, 2014

One possible suggestion to improve performance without sacrificing usability would be to re-implement some of the csvkit tools as wrappers around standard Unix tools like awk. That would create portability issues, though.
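
As a rough sketch of that idea (purely hypothetical, not how csvkit works), a tool could delegate the simple cases to awk via a subprocess; the portability catch is that awk has to be on the PATH, and this is only safe for files without quoted commas or embedded newlines:

```python
# Hypothetical sketch: hand plain column extraction off to awk.
import subprocess

def first_column_via_awk(path):
    """Return the first comma-separated field of every line in `path`."""
    result = subprocess.run(
        ["awk", "-F,", "{print $1}", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

# Usage: first_column_via_awk("ozone.csv")
```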

@gsf commented Apr 20, 2015

This is by no means definitive, but a reading of https://davidlyness.com/post/the-functional-and-performance-differences-of-sed-awk-and-other-unix-parsing-utilities suggests that CSV handling in Python is already fairly performant compared to awk.
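
For anyone who wants to sanity-check that on their own files, here is a rough, machine-dependent sketch (the file name is just a placeholder) comparing csv.reader against an awk-style naive comma split:

```python
# Rough timing sketch: pull the first column with csv.reader versus a naive split.
import csv
import time

def timed(label, fn, path):
    start = time.perf_counter()
    values = fn(path)
    print(f"{label}: {len(values)} values in {time.perf_counter() - start:.2f}s")

def first_column_csv(path):
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f) if row]

def first_column_split(path):
    with open(path) as f:
        return [line.split(",", 1)[0] for line in f]

timed("csv.reader", first_column_csv, "ozone_1997.csv")
timed("str.split", first_column_split, "ozone_1997.csv")
```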

@jpmckinney modified the milestone: Backlog (Feb 9, 2016)

@jpmckinney (Member) commented:

I'm not seeing a clear issue to resolve here, so closing.
