Writing compressed output using JSON writer #17323

Merged: 10 commits, Nov 19, 2024

@shrshi shrshi commented Nov 14, 2024

Description

Depends on #17161 for implementations of compression and decompression functions (io/comp/comp.cu, io/comp/comp.hpp, io/comp/io_uncomp.hpp and io/comp/uncomp.cpp)

Adds support for writing GZIP- and SNAPPY-compressed JSON to the JSON writer.
Verifies correctness using a parameterized test in tests/io/json/json_writer.cpp
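
As an illustration of this kind of check, here is a minimal sketch of a round-trip test parameterized over compression types. It is Python using only the standard library, not the libcudf test in tests/io/json/json_writer.cpp. The names `COMPRESSORS`, `write_json_lines`, and `read_json_lines` are hypothetical, and SNAPPY is left out because the Python standard library has no Snappy codec.

```python
import gzip
import json

# Hypothetical stand-ins for the writer's compression options.
# Each entry maps a name to a (compress, decompress) pair.
COMPRESSORS = {
    "NONE": (lambda b: b, lambda b: b),
    "GZIP": (gzip.compress, gzip.decompress),
}


def write_json_lines(rows, compression="NONE"):
    """Serialize rows as JSON Lines, then compress the whole buffer."""
    compress, _ = COMPRESSORS[compression]
    text = "".join(json.dumps(row) + "\n" for row in rows)
    return compress(text.encode("utf-8"))


def read_json_lines(data, compression="NONE"):
    """Decompress the buffer, then parse one JSON object per line."""
    _, decompress = COMPRESSORS[compression]
    text = decompress(data).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


# Round-trip check: what is read back must equal what was written,
# for every compression type.
rows = [{"a": 1, "b": "x"}, {"a": 2, "b": None}]
for name in COMPRESSORS:
    assert read_json_lines(write_json_lines(rows, name), name) == rows
```

The point of parameterizing is that the same assertion runs unchanged for each compression type, so adding a codec only means adding an entry to the table.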

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf ("Affects libcudf (C++/CUDA) code.") and CMake ("CMake build issue") labels Nov 14, 2024
@shrshi shrshi added the cuIO ("cuIO issue"), improvement ("Improvement / enhancement to an existing function"), and non-breaking ("Non-breaking change") labels Nov 14, 2024
@shrshi shrshi marked this pull request as ready for review November 14, 2024 06:10
@shrshi shrshi requested review from a team as code owners November 14, 2024 06:10

@KyleFromNVIDIA KyleFromNVIDIA left a comment

Approving trivial CMake changes

cpp/src/io/comp/comp.cu (outdated, resolved)
cpp/tests/io/json/json_writer.cpp (outdated, resolved)
cpp/src/io/comp/comp.cpp (outdated, resolved)

shrshi commented Nov 15, 2024

[Plots: compression throughput; decompression throughput]

The plots show the throughput of SNAPPY and GZIP compression and decompression in libcudf for JSON inputs. The SNAPPY bars are missing where nvcomp SNAPPY compression failed, which happened for data sizes larger than $2^{22}$ bytes.
@vuule you're right, I think we should include a warning about SNAPPY compression for now, and then move to a host-side library in a follow-on PR.
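
To make the size limit concrete: $2^{22}$ bytes is 4,194,304 bytes (4 MiB). Below is a minimal sketch of what a size-based warning could look like. It is Python for illustration only and is not the check added in this PR; `SNAPPY_SIZE_THRESHOLD` and `snappy_size_ok` are hypothetical names, and the threshold is taken from the failures reported above.

```python
import warnings

# Assumption: the threshold comes from the failures reported in this thread,
# i.e. data sizes larger than 2^22 bytes (4,194,304 bytes, or 4 MiB).
SNAPPY_SIZE_THRESHOLD = 1 << 22


def snappy_size_ok(num_bytes):
    """Return True if the input is within the size range reported to work.

    Emits a RuntimeWarning and returns False for larger inputs.
    """
    if num_bytes > SNAPPY_SIZE_THRESHOLD:
        warnings.warn(
            f"SNAPPY compression may fail for inputs larger than "
            f"{SNAPPY_SIZE_THRESHOLD} bytes (got {num_bytes} bytes)",
            RuntimeWarning,
        )
        return False
    return True


# A 1 MiB input is within the range, so no warning is raised.
snappy_size_ok(1 << 20)
```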

Benchmark used to generate these plots: #17334
@GregoryKimball for viz.

@karthikeyann karthikeyann left a comment

This feature needs more tests, but it could be in another PR.
Looks good to me.

vuule commented Nov 16, 2024

Thank you for running these! I now feel more comfortable with moving forward with this PR; I expected even worse performance from device compression 😁

vuule commented Nov 16, 2024

> This feature needs more tests, but it could be in another PR.

Could we parametrize some of the existing tests to cover different compression formats?

@vuule vuule left a comment

Almost there, just a few details in the compress_snappy function to sort out.

cpp/src/io/json/write_json.cu (outdated, resolved)
cpp/src/io/comp/comp.cpp (outdated, resolved)
cpp/src/io/comp/comp.cpp (outdated, resolved)
cpp/src/io/comp/comp.cpp (outdated, resolved)
cpp/src/io/comp/comp.cpp (outdated, resolved)

shrshi commented Nov 18, 2024

> This feature needs more tests, but it could be in another PR.
>
> Could we parametrize some of the existing tests to cover different compression formats?

Yes, most of the tests in json_writer.cpp take GZIP, SNAPPY and NONE compression type parameters.

@shrshi shrshi requested a review from vuule November 18, 2024 12:16
@github-actions github-actions bot removed the CMake ("CMake build issue") label Nov 19, 2024
@vuule vuule left a comment

great stuff; huge increase in test coverage!

@shrshi shrshi added the 5 - Ready to Merge ("Testing and reviews complete, ready to merge") label Nov 19, 2024

shrshi commented Nov 19, 2024

/merge

@rapids-bot rapids-bot bot merged commit 384abae into rapidsai:branch-24.12 Nov 19, 2024
104 checks passed
rapids-bot bot pushed a commit that referenced this pull request Nov 20, 2024
Depends on #17161 for implementations of compression and decompression functions (`io/comp/comp.cu`, `io/comp/comp.hpp`, `io/comp/io_uncomp.hpp` and `io/comp/uncomp.cpp`).
Depends on #17323 for compressed JSON writer implementation.

Adds benchmark to measure performance of the JSON reader for compressed inputs.

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #17219
Labels
  • 5 - Ready to Merge: Testing and reviews complete, ready to merge
  • cuIO: cuIO issue
  • improvement: Improvement / enhancement to an existing function
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change