Benchmarking JSON reader for compressed inputs #17219

Open
wants to merge 38 commits into
base: branch-24.12


shrshi
Contributor

@shrshi shrshi commented Oct 31, 2024

Description

Depends on #17161 for implementations of compression and decompression functions (io/comp/comp.cu, io/comp/comp.hpp, io/comp/io_uncomp.hpp and io/comp/uncomp.cpp)
Depends on #17323 for compressed JSON writer implementation.

Adds a benchmark that measures the performance of the JSON reader on compressed inputs.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Oct 31, 2024
@shrshi shrshi added non-breaking Non-breaking change Performance Performance related issue labels Nov 5, 2024
@shrshi shrshi added the improvement Improvement / enhancement to an existing function label Nov 5, 2024
@shrshi shrshi requested a review from vuule November 6, 2024 15:07
cpp/src/io/comp/comp.cu (outdated review thread, resolved)
@shrshi shrshi requested a review from vuule November 7, 2024 14:44
rapids-bot bot pushed a commit that referenced this pull request Nov 18, 2024
Fixes #17068 
Fixes #12299

This PR introduces a new datasource for compressed inputs, which enables batching and byte-range reading of multi-source JSONL files using the reallocate-and-retry policy. Moreover, instead of using a 4:1 compression-ratio heuristic, the device buffer size is estimated accurately for the GZIP, ZIP, and SNAPPY compression types. For the remaining types, the files are first decompressed and then batched.
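
The description above does not say how the buffer size is estimated or how the retry is driven, so the following is only a host-side sketch under stated assumptions, not the libcudf implementation. It assumes that, for GZIP, the estimate can be read from the ISIZE trailer field defined in RFC 1952 (the uncompressed size modulo 2^32, stored little-endian in the last four bytes of a member), and that a reallocate-and-retry loop doubles the output buffer whenever a decompressor reports that it was too small. The names `gzip_uncompressed_size`, `decompress_fn`, and `decompress_with_retry` are hypothetical and do not appear in the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical helper: read ISIZE from the trailer of a single-member gzip buffer.
// Per RFC 1952, ISIZE is the uncompressed size modulo 2^32, little-endian,
// stored in the last four bytes of the member.
inline std::optional<std::size_t> gzip_uncompressed_size(std::vector<std::uint8_t> const& src)
{
  // A gzip member has at least a 10-byte header and an 8-byte trailer.
  if (src.size() < 18 || src[0] != 0x1f || src[1] != 0x8b) { return std::nullopt; }
  auto const* p = src.data() + src.size() - 4;
  std::uint32_t const isize = static_cast<std::uint32_t>(p[0]) |
                              (static_cast<std::uint32_t>(p[1]) << 8) |
                              (static_cast<std::uint32_t>(p[2]) << 16) |
                              (static_cast<std::uint32_t>(p[3]) << 24);
  return static_cast<std::size_t>(isize);
}

// Hypothetical decompressor signature: writes into `dst` and returns the number of
// bytes produced, or std::nullopt if `dst` is too small to hold the output.
using decompress_fn = std::function<std::optional<std::size_t>(
  std::vector<std::uint8_t> const& src, std::vector<std::uint8_t>& dst)>;

// Reallocate-and-retry: start from an initial size guess and double the output
// buffer each time the decompressor reports that it did not fit.
inline std::vector<std::uint8_t> decompress_with_retry(std::vector<std::uint8_t> const& src,
                                                       decompress_fn const& decompress,
                                                       std::size_t initial_size,
                                                       int max_attempts = 8)
{
  std::vector<std::uint8_t> dst(std::max<std::size_t>(initial_size, 1));
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    if (auto const written = decompress(src, dst)) {
      dst.resize(*written);  // shrink to the bytes actually produced
      return dst;
    }
    dst.resize(dst.size() * 2);  // too small: grow and try again
  }
  throw std::runtime_error("output buffer still too small after max_attempts");
}
```

When the trailer yields a usable size, passing it as `initial_size` should let the first attempt succeed; the doubling loop then only matters for formats without a stored size or for inputs where ISIZE has wrapped around.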

~~TODO: Reuse existing JSON tests but with an additional compression parameter to verify correctness.~~
~~Handled by #17219, which implements compressed JSON writer required for the above test.~~
Multi-source compressed input tests added!

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #17161
@github-actions github-actions bot removed the CMake CMake build issue label Nov 19, 2024
Contributor

@mythrocks mythrocks left a comment

👍

@shrshi shrshi added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Nov 20, 2024
Member

@mhaseeb123 mhaseeb123 left a comment

Looks good!

Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
Status: In Progress
Status: Burndown
5 participants