Fix JSON multi-source reading when total source size exceeds INT_MAX bytes #15930
Conversation
```cpp
for (std::size_t i = 0; i < log_repetitions; i++) {
  json_string = json_string + "\n" + json_string;
  numrows <<= 1;
}
```
how much time is this loop taking?
nit: the newline in `json_string + "\n" + json_string` may not be required because `json_string` starts with `\n`. This may open up an optimization to speed up this loop (reserve and `+=`, or `append`).
The loop took 2.5 seconds on my machine.
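A minimal sketch of the suggested reserve-and-append version (illustrative only, reusing the names from the snippet above; the up-front reserve guarantees no reallocation during the self-appends):

```cpp
// Reserve the final size once: the string doubles log_repetitions times.
json_string.reserve(json_string.size() << log_repetitions);
for (std::size_t i = 0; i < log_repetitions; i++) {
  // No extra "\n" needed: json_string already starts with one.
  json_string += json_string;
  numrows <<= 1;
}
```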
Thank you for the suggestion. After reserving memory for the output `json_string`, the loop runtime has dropped from 3.02 s to 660 ms on the ipp1-3304 machine.
Some more stats on the test runtime after some quick nsys profiling: `read_json` takes 20.94 s (~96.9% of total runtime), each `read_batch` call takes …, and `concatenate` takes 516 ms. The JSON test is the slowest among all large-strings tests now.
could we get the same coverage with a smaller input?
Good point. For the given single-source size, even two sources are sufficient to ensure batching.
EDIT: with two sources, the total runtime is 3.64 s.
Looks good, just a couple of small suggestions.
```cpp
outfile << json_string;

constexpr int num_sources = 10;
std::vector<std::string> filepaths(num_sources, filename);
```
Speaking of speeding up the test, we can save on the IO by creating a vector of `source_info`s from `{json_string.c_str(), json_string.size()}`.
Do you mean constructing the `source_info` object from a vector of `json_string` host buffers instead? I think the builder takes a single `source_info` object.
yeah, that :)
The only tricky part is that you need to explicitly create `host_buffer`s because a vector of strings is treated as a set of filepaths.
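A sketch of that approach, assuming the `host_buffer`-based `source_info` constructor discussed here (the reader options shown are illustrative, not necessarily the test's exact settings):

```cpp
#include <cudf/io/json.hpp>
#include <cudf/io/types.hpp>

// Build one source_info over num_sources copies of the in-memory JSON buffer,
// skipping the file IO entirely. host_buffers are created explicitly because
// a vector of std::string would be treated as a set of filepaths.
std::vector<cudf::io::host_buffer> buffers(
  num_sources, cudf::io::host_buffer{json_string.c_str(), json_string.size()});
auto const source  = cudf::io::source_info{buffers};
auto const options = cudf::io::json_reader_options::builder(source).lines(true).build();
auto const result  = cudf::io::read_json(options);
```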
Perfect, shaved off another ~4.5 seconds - current runtime is 16.645 s.
Thank you for iterating on the tests!
LGTM 👍
Great work!
/merge
Description
Fixes #15917. Fixes multi-source JSON reading when the total size of the sources exceeds `INT_MAX` bytes by reading the sources in batches, each of which stays within `INT_MAX` bytes. The remaining unsupported case, a single source exceeding `INT_MAX` bytes, will be handled with a chunked reader later.
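A rough sketch of the batching idea (illustrative only, not the PR's actual implementation): greedily group the sources so that each batch's cumulative size stays within `INT_MAX` bytes, then read each batch and concatenate the per-batch tables.

```cpp
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical helper: split per-source sizes into batches whose cumulative
// size never exceeds INT_MAX bytes. A single source that alone exceeds
// INT_MAX cannot be batched; that case is deferred to the chunked reader.
std::vector<std::vector<std::size_t>> make_batches(std::vector<std::size_t> const& source_sizes)
{
  std::vector<std::vector<std::size_t>> batches;
  std::vector<std::size_t> current;  // indices of sources in the current batch
  std::size_t batch_bytes = 0;
  for (std::size_t i = 0; i < source_sizes.size(); ++i) {
    if (!current.empty() && batch_bytes + source_sizes[i] > static_cast<std::size_t>(INT_MAX)) {
      batches.push_back(std::move(current));
      current.clear();
      batch_bytes = 0;
    }
    current.push_back(i);
    batch_bytes += source_sizes[i];
  }
  if (!current.empty()) { batches.push_back(std::move(current)); }
  return batches;
}
```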