-
Notifications
You must be signed in to change notification settings - Fork 926
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce benchmark suite for JSON reader options #15124
Conversation
auto const view = tbl->view(); | ||
cudf::io::json_writer_options const write_opts = | ||
cudf::io::json_writer_options::builder(source_sink.make_sink_info(), view) | ||
.lines(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we are testing the options, I would recommend making lines
into an nvbench enum axis as well. Although since some of the options require lines to be true, maybe benchmarking lines true/false should be a separate benchmark.
nvbench::enum_type_list<row_selection::ALL, row_selection::BYTE_RANGE>,
lines=True
nvbench::enum_type_list<normalize_single_quotes::NO, normalize_single_quotes::YES>
line=True/False
nvbench::enum_type_list<mixed_types_as_string::NO, mixed_types_as_string::YES>
line=True/False
nvbench::enum_type_list<recovery_mode::RECOVER_WITH_NULL, recovery_mode::FAIL>))
lines=True
After thinking through the options, I don't think we need to test normalize_single_quotes
and mixed_types_as_string
with lines=false
. It still might be useful to add a lines true/false benchmark without any additional options. If others agree then that could be a follow-on PR.
Would you please post some early benchmark results? (similar to this example) |
Benchmarks were run on A100 80GB GPU, with all combinations of |
…n is enabled (#15236) Addresses #15196 by applying a patch from @karthikeyann to skip the `infer_column_type_kernel` by forcing the mixed types column to be a string. With this optimization, we see a significant improvement in performance. Please refer to the [comment](#15236 (comment)) for a visualization of the results before and after applying this optimization as obtained from the [JSON lines benchmarking exercise](#15124). Authors: - Shruti Shivakumar (https://github.com/shrshi) - Karthikeyan (https://github.com/karthikeyann) Approvers: - Karthikeyan (https://github.com/karthikeyann) - Vukasin Milovanovic (https://github.com/vuule) URL: #15236
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving CMake changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff!
Some minor comments/questions.
After updates to the mixed types as string handling and the byte range reader, here are the most recent benchmarks. Note that in view of the data generation bug, the results below may vary after that large last row issue is fixed.
|
These plots are great! Please check out #15185 for an example of where the byte range support is not working optimally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One non-blocking nit.
Very clean code, nice work!
Co-authored-by: Yunsong Wang <yunsongw@nvidia.com>
/ok to test |
template <row_selection RowSelection, | ||
normalize_single_quotes NormalizeSingleQuotes, | ||
normalize_whitespace NormalizeWhitespace, | ||
mixed_types_as_string MixedTypesAsString, | ||
recovery_mode RecoveryMode> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh no this is too many template params. Why not using run time parameter instead? That would reduce compile time a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh no this is too many template params. Why not using run time parameter instead?
This makes results more readable and the build time for benchmarks doesn't really matter IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's fine to me 👍
/merge |
The goal of this piece of work is to analyze the performance of the reader for JSON lines. This PR establishes a baseline for the performance of single quote normalization, white space normalization, mixed type as string parsing and recovery mode options when the input JSON is valid, and does not have any single quotes. Modifying the data generation to produce inputs with single quotes/mixed types/invalid lines will be the focus of follow-on PRs. Addresses rapidsai#15041 Authors: - Shruti Shivakumar (https://github.com/shrshi) - Nghia Truong (https://github.com/ttnghia) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Vukasin Milovanovic (https://github.com/vuule) - Yunsong Wang (https://github.com/PointKernel) - Nghia Truong (https://github.com/ttnghia) URL: rapidsai#15124
Description
The goal of this piece of work is to analyze the performance of the reader for JSON lines. This PR establishes a baseline for the performance of single quote normalization, white space normalization, mixed type as string parsing and recovery mode options when the input JSON is valid, and does not have any single quotes.
Modifying the data generation to produce inputs with single quotes/mixed types/invalid lines will be the focus of follow-on PRs.
Addresses #15041
Checklist