Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add is_integer API that can check for the validity of a string-to-integer conversion #7642

Merged
merged 27 commits into from
Mar 24, 2021

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Mar 18, 2021

This PR addresses #5110, #7080, and rework #7094. It adds the function cudf::strings::is_integer that can check if strings can be correctly converted into integer values. Underflow and overflow are also taken into account.

Note that this cudf::strings::is_integer is different from the existing cudf::strings::string::is_integer, which only checks for pattern and does not care about under/overflow.

Examples:

s = { "eee", "-200", "-100", "127", "128", "1.5", NULL}

is_integer(s, INT8) = { 0, 0, 1, 1, 0, 0, NULL}
is_integer(s, INT32) = { 0, 1, 1, 1, 1, 0, NULL}

@ttnghia ttnghia added feature request New feature or request 3 - Ready for Review Ready for review by team tests Unit testing for project libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) non-breaking Non-breaking change labels Mar 18, 2021
@ttnghia ttnghia requested a review from a team as a code owner March 18, 2021 20:40
@ttnghia ttnghia requested review from trxcllnt and jrhemstad March 18, 2021 20:40
@ttnghia ttnghia changed the title Casting from string to integer returns null if the string is not a valid integer Add bounds check for is_integer and rework to_integer for string to integer conversion Mar 18, 2021
@davidwendt
Copy link
Contributor

I don't agree with this change. We considered this in the initial design.
Invalid data checks should not be added to the conversion routines. Do not add this extra kernel call to the conversion process.
@jrhemstad @harrism @kkraus14

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not add this extra kernel call to the conversion process.
Also, I would consider just adding a new is_integer API that takes a data_type rather change the current one.

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 18, 2021

Do not add this extra kernel call to the conversion process.
Also, I would consider just adding a new is_integer API that takes a data_type rather change the current one.

If not modifying the existing to_integer API then do you have any idea that we can have a new API for doing the job of my modified to_integer version? It would be great if we have it.

@davidwendt
Copy link
Contributor

Do not add this extra kernel call to the conversion process.
Also, I would consider just adding a new is_integer API that takes a data_type rather change the current one.

If not modifying the existing to_integer API then do you have any idea that we can have a new API for doing the job of my modified to_integer version? It would be great if we have it.

You should be able to use a combination of is_integer and to_integer to achieve your result, right?
But this function should not be part of libcudf.

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 18, 2021

You should be able to use a combination of is_integer and to_integer to achieve your result, right?
But this function should not be part of libcudf.

Thanks David. I just checked with the Spark team and was verified that this can be done (in a less efficient way) in Spark, so I've reversed back my changes in the to_integer function.

@ttnghia ttnghia changed the title Add bounds check for is_integer and rework to_integer for string to integer conversion Add bounds check for is_integer Mar 18, 2021
@codecov
Copy link

codecov bot commented Mar 19, 2021

Codecov Report

Merging #7642 (01aec79) into branch-0.19 (7871e7a) will increase coverage by 0.23%.
The diff coverage is n/a.

❗ Current head 01aec79 differs from pull request most recent head f2daef9. Consider uploading reports for the commit f2daef9 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7642      +/-   ##
===============================================
+ Coverage        81.86%   82.10%   +0.23%     
===============================================
  Files              101      101              
  Lines            16884    17054     +170     
===============================================
+ Hits             13822    14002     +180     
+ Misses            3062     3052      -10     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/categorical.py 91.62% <ø> (+0.23%) ⬆️
python/cudf/cudf/core/column/column.py 87.77% <ø> (+0.01%) ⬆️
python/cudf/cudf/core/column/datetime.py 89.09% <ø> (ø)
python/cudf/cudf/core/column/decimal.py 92.75% <ø> (-2.12%) ⬇️
python/cudf/cudf/core/column/lists.py 92.17% <ø> (+0.77%) ⬆️
python/cudf/cudf/core/column/numerical.py 94.83% <ø> (-0.20%) ⬇️
python/cudf/cudf/core/column/string.py 86.58% <ø> (+0.08%) ⬆️
python/cudf/cudf/core/column/timedelta.py 88.23% <ø> (ø)
python/cudf/cudf/core/column_accessor.py 95.87% <ø> (+0.55%) ⬆️
python/cudf/cudf/core/dataframe.py 90.78% <ø> (+0.31%) ⬆️
... and 27 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 30e493c...f2daef9. Read the comment docs.

cpp/tests/strings/integers_tests.cu Outdated Show resolved Hide resolved
cpp/tests/strings/integers_tests.cu Outdated Show resolved Hide resolved
cpp/tests/strings/integers_tests.cu Outdated Show resolved Hide resolved
cpp/src/strings/convert/convert_integers.cu Outdated Show resolved Hide resolved
Co-authored-by: David <45795991+davidwendt@users.noreply.github.com>
Comment on lines 138 to 145
thrust::transform(rmm::exec_policy(stream),
thrust::make_counting_iterator<size_type>(0),
thrust::make_counting_iterator<size_type>(strings.size()),
d_results,
[d_column] __device__(size_type idx) {
if (d_column.is_null(idx)) return false;
return string::is_integer(d_column.element<string_view>(idx));
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
thrust::transform(rmm::exec_policy(stream),
thrust::make_counting_iterator<size_type>(0),
thrust::make_counting_iterator<size_type>(strings.size()),
d_results,
[d_column] __device__(size_type idx) {
if (d_column.is_null(idx)) return false;
return string::is_integer(d_column.element<string_view>(idx));
});
thrust::transform(rmm::exec_policy(stream),
strings_column.pair_begin(),
strings_column.pair_end(),
d_results,
[d_column] __device__(auto p) {
return p.second ? is_integer(p.first) : false;
});

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This becomes even nicer when we replace pair_iterator with optional_iterator #6470

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I don't like this is because the pair iterator requires the has_nulls at compile time and thus this doubles the compile time. This looks nice and all but I'd rather we waited until the pair iterator was changed to make it a runtime parameter instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is blocking that change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidwendt please request changes when something is a problem otherwise the problem will get merged.

Copy link
Contributor Author

@ttnghia ttnghia Mar 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main concern here is something like this:

if (strings.has_nulls()) {
    thrust::transform(
      rmm::exec_policy(stream),
      d_column->pair_begin<string_view, true>(),
      d_column->pair_end<string_view, true>(),
      d_results,
      [] __device__(auto const& p) { return p.second ? string::is_integer(p.first) : false; });
  } else {
    thrust::transform(
      rmm::exec_policy(stream),
      d_column->pair_begin<string_view, false>(),
      d_column->pair_end<string_view, false>(),
      d_results,
      [] __device__(auto const& p) { return p.second ? string::is_integer(p.first) : false; });
  }

So we will have 2 calls to thrust::transform instead of just one. I can verify that this makes the compile time increases by around 1 second---not a problem at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before:

6:07 PM	Build finished in 25 sec, 27 ms

After:


6:08 PM	Build finished in 26 sec, 241 ms

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes until discussion about @davidwendt 's concerns on compile time is resolved.

@jrhemstad
Copy link
Contributor

Requesting changes until discussion about @davidwendt 's concerns on compile time is resolved.

I don't think its worth worrying about right now. This file is not a top offender for compile time.

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 24, 2021

Requesting changes until discussion about @davidwendt 's concerns on compile time is resolved.

I don't think its worth worrying about right now. This file is not a top offender for compile time.

Compile time is affected by 1 second slower, so we don't need to worry about it yet:

Before:

6:07 PM	Build finished in 25 sec, 27 ms

After:


6:08 PM	Build finished in 26 sec, 241 ms

@harrism
Copy link
Member

harrism commented Mar 24, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit de55832 into rapidsai:branch-0.19 Mar 24, 2021
@ttnghia ttnghia self-assigned this Apr 25, 2021
@ttnghia ttnghia deleted the convert_integer_bound_check branch May 3, 2021 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python) tests Unit testing for project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] add something like an is_valid_integer function
4 participants