-
Notifications
You must be signed in to change notification settings - Fork 3.9k
ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data #8294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really needed? Since there is only one type handled seems like you could omit all this and do something simpler in AnyInit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not. I've simplified it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use BooleanScalar here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to that here (and in other places)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use BooleanArray here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to that here (and in other places)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Collapse all this to
out->value = std::make_shared<BooleanScalar>(this->state.max)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These typedefs probably not needed, just use the Boolean* types within
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got rid of all these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use checked_cast<const BooleanArray&> here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - switched to plain bool variable in the new commit as you suggest below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe just bool here?
42b158a to
cda0edb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wesm for reviewing! I've addressed most of the comments. (There's one left - I'll ping when I'm done and ready for re-review.)
A question re: benchmarks - are they useful for this? I'll add some if yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to that here (and in other places)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to that here (and in other places)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got rid of all these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - switched to plain bool variable in the new commit as you suggest below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
|
Just a high level remark (didn't yet look at the code), but I think the example you gave: has a wrong output for the kleene version. With Kleene logic, also the second output would be True, as the array already contains a True, the missing value doesn't matter anymore. Using Kleene logic or not is not the same as the skip/emit null handling. By default, if nulls are skipped, then it doesn't matter if you use Kleene logic or not, since there are no nulls to behave in certain ways. So only when not skipping nulls, you get a different behaviour: See also our discussions in pandas about this (pandas-dev/pandas#29686; https://pandas.pydata.org/pandas-docs/stable/user_guide/boolean.html#kleene-logical-operations) |
8fd3c37 to
3fa7dd6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand why this is inheriting from MinMax. Does it help reduce the code size in any way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't. I've rewritten it so it inherits from ScalarAggregator directly
Also since it's no longer a template I moved it to aggregate_basic.cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This forces counting the set bits in the whole array, but we're only interested in the presence of a single set bit, so we should be able to shortcut much more aggressively.
You may try to use OptionalBinaryBitBlockCounter for that. Untested:
const auto& data = *batch[0].array();
OptionalBinaryBitBlockCounter counter(data.buffers[0], data.offset, data.buffers[1], data.offset, data.length);
int64_t position = 0;
while (position < data.length) {
const auto block = counter.NextAndBlock();
if (block.popcount > 0) {
this->state.max = true;
break;
}
position += block.length;
}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This worked pretty much straight away. Thanks @pitrou!!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also test with an empty array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
2c26fc4 to
575131a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the behaviour we want? (the "null values are taken to evaluate to false")
Any/all are of course not the most typical reductions (so I am also not fully sure about the desired behaviour), but, for other reductions we actually skip nulls. And skipping nulls is not the same as evaluating them to False
(somewhat related to my comment at #8294 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the desired behavior either (although if it was up to me I would want to shoot for consistency with existing kernels).
That said I may need to improve the phrasing in the docstring. I think he current kernel behavior is what you describe: we skip nulls and return whether we saw any True values so perhaps it's better to just say that. I think that as is treating null as false or skipping is the same in this case, since neither evaluate to true.
A bit off-topic, but for an all kernel (which could be nice to have) I think we'd want to have null handling options, so that users could switch between
any([true, null]) = true # skip nulls
any([true, null]) = false # null evaluates as false
any([true, null]) = false # kleene logic (I think?)
(PS Apologies for not replying to your comment directly - I have opened ARROW-10291 to track that discussion )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, you're correct that for this PR there is not yet a difference: you are only dealing with any, and it's only for all that there is a difference between "skipping nulls" and "treating nulls as False"
So I would indeed update the docstring to simply state that nulls are skipped.
For your example about all (I assume the code was meant to use all and not any ?), for the last line about Kleene logic I expect a return value of null (as the ǹull in the values can be both True or False, meaning that the result can be True of False, meaning it is unknown)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring updated.
Sorry yes. The examples were for all. Just to make sure I understand, with kleene logic, we emit null if there's a null anywhere in the input
any_kleene([true, null]) = null
any_kleene([false, null]) = null
all_kleene([true, null]) = null
all_kleene([false, null]) = null
I opened ARROW-10301 re: all kernel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry yes. The examples were for all. Just to make sure I understand, with kleene logic, we emit null if there's a null anywhere in the input
No, we only emit null if the presence of the null (the fact that it could be either True or False) would influence the result:
any_kleene([true, null]) = true
any_kleene([false, null]) = null
all_kleene([true, null]) = null
all_kleene([false, null]) = false
But, the above is only when not skipping nulls (because with the default of skipping, there is no difference with non-kleene logic)
See the links I mentioned in #8294 (comment), and also the Julia docs have a good explanation of three-valued (Kleene) logic: https://docs.julialang.org/en/v1/manual/missing/index.html#Logical-operators-1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, the result you show (a null as result in all cases) is what I expect for the non-kleene version when not skipping nulls (I would expect that nulls propagate in that case, and not necessarily be interpreted as false)
87369ad to
af613d9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe I addressed all the comments so this is ready for re-review (modulo CI turning something up)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't. I've rewritten it so it inherits from ScalarAggregator directly
Also since it's no longer a template I moved it to aggregate_basic.cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the desired behavior either (although if it was up to me I would want to shoot for consistency with existing kernels).
That said I may need to improve the phrasing in the docstring. I think he current kernel behavior is what you describe: we skip nulls and return whether we saw any True values so perhaps it's better to just say that. I think that as is treating null as false or skipping is the same in this case, since neither evaluate to true.
A bit off-topic, but for an all kernel (which could be nice to have) I think we'd want to have null handling options, so that users could switch between
any([true, null]) = true # skip nulls
any([true, null]) = false # null evaluates as false
any([true, null]) = false # kleene logic (I think?)
(PS Apologies for not replying to your comment directly - I have opened ARROW-10291 to track that discussion )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not. I've simplified it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This worked pretty much straight away. Thanks @pitrou!!!
304e4b7 to
0bfeccf
Compare
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. This looks good to me, thank you @arw2019 .
|
I've rebased now. |
|
thanks @pitrou @jorisvandenbossche @wesm for reviewing!!! |
As discussed on Jira this is a short-circuiting Max for booleans implemented on top of the existing
min_maxkernel.As is there are no options: null is always taken to evaluate to false.
If we want to include control over null handling I can either add
Optionsto the kernel or I can implement anany_kleenekernel by analogy with theand_kleeneandor_kleenelogical kernels that we have.In Python the two options would look like: