Add decoding benchmark plus benchmark for GZIP-compressed CSV files #15

Merged: 3 commits merged on May 2, 2020

Conversation

clue (Owner) commented May 2, 2020

clue added the "new feature" label (New feature or request) on May 2, 2020
clue added this to the v1.1.0 milestone on May 2, 2020
clue (Owner, Author) commented May 2, 2020

For reference, here's what the output looks like on my machine:

$ php examples/91-benchmark-count.php < IRAhandle_tweets_1.csv 
243891 records in 4.812s => 50683 records/s

$ php examples/92-benchmark-count-gzip.php < IRAhandle_tweets_1.csv.gz 
243891 records in 4.789s => 50929 records/s

Interestingly, this means that decompressing and parsing a large file was slightly faster than parsing an uncompressed CSV file on my system (I/O overhead vs CPU usage).
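
One way to explore the I/O-versus-CPU trade-off mentioned above is to time the same record count over plain and GZIP-compressed input. The following is a minimal Python sketch of that idea, not the repository's PHP benchmark script. It uses a small synthetic in-memory CSV (the `make_sample` helper and its row contents are hypothetical), so it only exposes the CPU cost of decompression. With real files on disk, reading fewer bytes may offset that cost, as the numbers above suggest.

```python
import csv
import gzip
import io
import time


def make_sample(rows=10_000):
    """Build a synthetic CSV as bytes (hypothetical data, not the tweets file)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for i in range(rows):
        writer.writerow([i, f"user{i}", "some, quoted text"])
    return buf.getvalue().encode("utf-8")


def count_records(text_stream):
    """Count CSV records in a text stream and return (count, seconds)."""
    start = time.perf_counter()
    count = sum(1 for _ in csv.reader(text_stream))
    return count, time.perf_counter() - start


def report(label, count, seconds):
    """Print a line shaped like the benchmark output quoted above."""
    rate = count / seconds if seconds > 0 else float("inf")
    print(f"{label}: {count} records in {seconds:.3f}s => {rate:.0f} records/s")


plain = make_sample()
packed = gzip.compress(plain)

# Plain input: decode bytes to text and parse.
plain_stream = io.TextIOWrapper(io.BytesIO(plain), encoding="utf-8", newline="")
plain_count, plain_time = count_records(plain_stream)

# GZIP input: decompress on the fly, then parse the same way.
gzip_stream = gzip.open(io.BytesIO(packed), "rt", encoding="utf-8", newline="")
gzip_count, gzip_time = count_records(gzip_stream)

report("plain", plain_count, plain_time)
report("gzip", gzip_count, gzip_time)
```

Timings from this sketch will differ from the figures in this thread, since both the parser and the input are different.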

It's worth noting that decoding CSV is quite fast, but decoding NDJSON is considerably faster in my benchmarks, at roughly 4.7 times the throughput (clue/reactphp-ndjson#16):

me@me-in:~/workspace/clue-reactphp-ndjson$ php examples/91-benchmark-count.php < IRAhandle_tweets_1.ndjson 
243891 records in 1.021s => 238835 records/s

clue changed the title from "Add decoding benchmark plus benchmarking for GZIP-compressed CSV files" to "Add decoding benchmark plus benchmark for GZIP-compressed CSV files" on May 2, 2020
clue merged commit 63dcdef into clue:master on May 2, 2020
clue deleted the benchmark branch on May 2, 2020 18:34
loilo (Contributor) commented Jan 25, 2021

Comment prompted by this tweet.

Running the above benchmark against this CSV file (careful if you're on a metered connection: the file is about 90 MB), I get a roughly 25% improvement on PHP 8.0 compared to PHP 7.4, with another ≈4% on top when enabling JIT.

The test machine is a late-2019 MacBook Pro with a 2.3 GHz 8-core Intel Core i9 CPU.

PHP 7.4

php -v

PHP 7.4.14 (cli) (built: Jan  8 2021 13:20:04) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
    with Zend OPcache v7.4.14, Copyright (c), by Zend Technologies

php examples/91-benchmark-count.php < IRAhandle_tweets_1.csv

243891 records in 2.685s => 90838 records/s

Records per second roughly settle between 86k and 92k.

PHP 8.0

php -v

PHP 8.0.1 (cli) (built: Jan  8 2021 12:43:54) ( NTS )
Copyright (c) The PHP Group
Zend Engine v4.0.1, Copyright (c) Zend Technologies
    with Zend OPcache v8.0.1, Copyright (c), by Zend Technologies

Without JIT

php examples/91-benchmark-count.php < IRAhandle_tweets_1.csv

243891 records in 2.146s => 113654 records/s

Records per second roughly settle between 105k and 120k (an increase of about 25% over PHP 7.4).

With JIT

php -d opcache.enable_cli=1 -d opcache.jit_buffer_size=100M examples/91-benchmark-count.php < IRAhandle_tweets_1.csv

243891 records in 2.051s => 118918 records/s

This is not much faster than without JIT, but it shows far less variance: records per second settle between 115k and 120k.
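
The relative gains quoted in this comment can be re-derived from the three sample runs. Below is a small Python sketch of that arithmetic, using only the records-per-second figures reported above (the `gain_percent` helper name is an assumption made for illustration). Since throughput varied between runs, the results are indicative only.

```python
def gain_percent(baseline, improved):
    """Relative throughput gain of `improved` over `baseline`, in percent."""
    return (improved / baseline - 1.0) * 100.0


# Records per second from the sample runs quoted in this comment.
php74 = 90838
php80 = 113654
php80_jit = 118918

print(f"PHP 8.0 vs PHP 7.4: +{gain_percent(php74, php80):.1f}%")          # +25.1%
print(f"PHP 8.0 JIT vs no JIT: +{gain_percent(php80, php80_jit):.1f}%")   # +4.6%
```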

clue (Owner, Author) commented Jan 26, 2021

@loilo Thank you very much for sharing your results here!

It's really interesting to see that PHP 8 improves this noticeably and that enabling JIT may yield even better results. I haven't toyed around with this yet, but I wonder whether different JIT settings (e.g. https://stitcher.io/blog/php-8-jit-setup) might improve this further 🤘

By the way, if you're into benchmarking: https://github.com/clue/reactphp-ndjson includes a very similar benchmark script that can be executed on the same input data set. It was ~5 times faster than this one last time I checked. https://github.com/clue/reactphp-tsv includes a similar benchmark script that was ~10 times faster than this one. 🔥

Successfully merging this pull request may close these issues.

Reading compressed CSV file (example.csv.gz)