-
Notifications
You must be signed in to change notification settings - Fork 928
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add is_integer API that can check for the validity of a string-to-integer conversion #7642
Add is_integer API that can check for the validity of a string-to-integer conversion #7642
Conversation
…dity of the conversion process
I don't agree with this change. We considered this in the initial design. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not add this extra kernel call to the conversion process.
Also, I would consider just adding a new is_integer
API that takes a data_type rather change the current one.
If not modifying the existing |
You should be able to use a combination of |
Thanks David. I just checked with the Spark team and was verified that this can be done (in a less efficient way) in Spark, so I've reversed back my changes in the |
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #7642 +/- ##
===============================================
+ Coverage 81.86% 82.10% +0.23%
===============================================
Files 101 101
Lines 16884 17054 +170
===============================================
+ Hits 13822 14002 +180
+ Misses 3062 3052 -10
Continue to review full report at Codecov.
|
Co-authored-by: David <45795991+davidwendt@users.noreply.github.com>
thrust::transform(rmm::exec_policy(stream), | ||
thrust::make_counting_iterator<size_type>(0), | ||
thrust::make_counting_iterator<size_type>(strings.size()), | ||
d_results, | ||
[d_column] __device__(size_type idx) { | ||
if (d_column.is_null(idx)) return false; | ||
return string::is_integer(d_column.element<string_view>(idx)); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thrust::transform(rmm::exec_policy(stream), | |
thrust::make_counting_iterator<size_type>(0), | |
thrust::make_counting_iterator<size_type>(strings.size()), | |
d_results, | |
[d_column] __device__(size_type idx) { | |
if (d_column.is_null(idx)) return false; | |
return string::is_integer(d_column.element<string_view>(idx)); | |
}); | |
thrust::transform(rmm::exec_policy(stream), | |
strings_column.pair_begin(), | |
strings_column.pair_end(), | |
d_results, | |
[d_column] __device__(auto p) { | |
return p.second ? is_integer(p.first) : false; | |
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This becomes even nicer when we replace pair_iterator
with optional_iterator
#6470
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I don't like this is because the pair iterator requires the has_nulls
at compile time and thus this doubles the compile time. This looks nice and all but I'd rather we waited until the pair iterator was changed to make it a runtime parameter instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is blocking that change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidwendt please request changes when something is a problem otherwise the problem will get merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main concern here is something like this:
if (strings.has_nulls()) {
thrust::transform(
rmm::exec_policy(stream),
d_column->pair_begin<string_view, true>(),
d_column->pair_end<string_view, true>(),
d_results,
[] __device__(auto const& p) { return p.second ? string::is_integer(p.first) : false; });
} else {
thrust::transform(
rmm::exec_policy(stream),
d_column->pair_begin<string_view, false>(),
d_column->pair_end<string_view, false>(),
d_results,
[] __device__(auto const& p) { return p.second ? string::is_integer(p.first) : false; });
}
So we will have 2 calls to thrust::transform
instead of just one. I can verify that this makes the compile time increases by around 1 second---not a problem at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before:
6:07 PM Build finished in 25 sec, 27 ms
After:
6:08 PM Build finished in 26 sec, 241 ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting changes until discussion about @davidwendt 's concerns on compile time is resolved.
I don't think its worth worrying about right now. This file is not a top offender for compile time. |
Compile time is affected by 1 second slower, so we don't need to worry about it yet: Before:
After:
|
@gpucibot merge |
This PR addresses #5110, #7080, and rework #7094. It adds the function
cudf::strings::is_integer
that can check if strings can be correctly converted into integer values. Underflow and overflow are also taken into account.Note that this
cudf::strings::is_integer
is different from the existingcudf::strings::string::is_integer
, which only checks for pattern and does not care about under/overflow.Examples: