ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data #8294

arw2019 · 2020-09-29T05:09:53Z

As discussed on Jira, this is a short-circuiting Max for booleans, implemented on top of the existing min_max kernel.

As is, there are no options: null is always taken to evaluate to false.

If we want to include control over null handling, I can either add Options to the kernel or implement an any_kleene kernel by analogy with the and_kleene and or_kleene logical kernels that we have.

In Python the two options would look like:

In []: a = pa.array([True, None], type='bool')
    ...:
    ...: # option 1
    ...: pc.any(a).as_py() is True
    ...: pc.any_kleene(a).as_py() is None
    ...:
    ...: # option 2
    ...: pc.any(a, null_handling='skip').as_py() is True
    ...: pc.any(a, null_handling='emit_null').as_py() is None

github-actions · 2020-09-29T05:38:18Z

https://issues.apache.org/jira/browse/ARROW-1846

wesm · 2020-09-29T16:25:53Z

cpp/src/arrow/compute/kernels/aggregate_basic_internal.h

Is this really needed? Since there is only one type handled seems like you could omit all this and do something simpler in AnyInit

It's not. I've simplified it

wesm · 2020-09-29T16:26:37Z

cpp/src/arrow/compute/kernels/aggregate_basic_internal.h

Just use BooleanScalar here?

Switched to that here (and in other places)

wesm · 2020-09-29T16:26:48Z

cpp/src/arrow/compute/kernels/aggregate_basic_internal.h

Just use BooleanArray here

Switched to that here (and in other places)

wesm · 2020-09-29T16:26:59Z

cpp/src/arrow/compute/kernels/aggregate_basic_internal.h

wesm · 2020-09-29T16:27:59Z

cpp/src/arrow/compute/kernels/aggregate_basic_internal.h

Collapse all this to

out->value = std::make_shared<BooleanScalar>(this->state.max)

wesm · 2020-09-29T16:28:42Z

cpp/src/arrow/compute/kernels/aggregate_test.cc

These typedefs probably not needed, just use the Boolean* types within

Got rid of all these

wesm · 2020-09-29T16:29:36Z

cpp/src/arrow/compute/kernels/aggregate_test.cc

use checked_cast<const BooleanArray&> here

wesm · 2020-09-29T16:29:42Z

cpp/src/arrow/compute/kernels/aggregate_test.cc

Is this needed?

No - switched to plain bool variable in the new commit as you suggest below

wesm · 2020-09-29T16:29:49Z

cpp/src/arrow/compute/kernels/aggregate_test.cc

maybe just bool here?

cpp/src/arrow/compute/kernels/aggregate_test.cc

arw2019

Thanks @wesm for reviewing! I've addressed most of the comments. (There's one left - I'll ping when I'm done and ready for re-review.)

A question re: benchmarks - are they useful for this? I'll add some if yes.

jorisvandenbossche · 2020-09-30T12:27:02Z

Just a high level remark (didn't yet look at the code), but I think the example you gave:

In []: a = pa.array([True, None], type='bool')
    ...:
    ...: # option 1
    ...: pc.any(a).as_py() is True
    ...: pc.any_kleene(a).as_py() is None
    ...:
    ...: # option 2
    ...: pc.any(a, null_handling='skip').as_py() is True
    ...: pc.any(a, null_handling='emit_null').as_py() is None

has a wrong output for the Kleene version. With Kleene logic, the second output would also be True: the array already contains a True, so the missing value no longer matters.

Using Kleene logic or not is not the same as the skip/emit null handling. By default, if nulls are skipped, it doesn't matter whether you use Kleene logic, since there are no nulls left to behave one way or the other. Only when nulls are not skipped do you get different behaviour: any([True, None], skipna=False) and any_kleene([True, None], skipna=False) would still both give True, since there is a True. But e.g. any([False, None], skipna=False) would give False (the missing value being treated as False), whereas any_kleene([False, None], skipna=False) would give null.

See also our discussions in pandas about this (pandas-dev/pandas#29686; https://pandas.pydata.org/pandas-docs/stable/user_guide/boolean.html#kleene-logical-operations)
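The distinction above can be sketched in plain Python. This is an illustrative model only, not the Arrow kernel: `any_reduce` and its `skip_nulls` and `kleene` parameters are hypothetical names, `None` stands in for null, and the non-Kleene, non-skipping branch treats null as False, as assumed in the comment above.

```python
from typing import Iterable, Optional


def any_reduce(values: Iterable[Optional[bool]],
               skip_nulls: bool = True,
               kleene: bool = False) -> Optional[bool]:
    """Hypothetical model of an "any" reduction over True/False/None."""
    saw_null = False
    for v in values:
        if v is None:
            saw_null = True
        elif v:
            # A single True decides the result in every mode.
            return True
    if saw_null and kleene and not skip_nulls:
        # Kleene logic: the null might have been True, so the result is unknown.
        return None
    # Nulls were skipped, or treated as False (assumption for the non-Kleene case).
    return False


print(any_reduce([True, None], skip_nulls=False))                 # True
print(any_reduce([True, None], skip_nulls=False, kleene=True))    # True
print(any_reduce([False, None], skip_nulls=False))                # False
print(any_reduce([False, None], skip_nulls=False, kleene=True))   # None
```

With the default `skip_nulls=True`, the `kleene` flag has no effect in this model, which is the point being made: the two settings are independent, and Kleene logic only matters once nulls are kept.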

pitrou · 2020-10-01T15:44:23Z

cpp/src/arrow/compute/kernels/aggregate_basic_internal.h

I don't really understand why this is inheriting from MinMax. Does it help reduce the code size in any way?

It doesn't. I've rewritten it so it inherits from ScalarAggregator directly

Also since it's no longer a template I moved it to aggregate_basic.cc

pitrou · 2020-10-01T15:50:46Z

cpp/src/arrow/compute/kernels/aggregate_basic_internal.h

This forces counting the set bits in the whole array, but we're only interested in the presence of a single set bit, so we should be able to shortcut much more aggressively.

You may try to use OptionalBinaryBitBlockCounter for that. Untested:

const auto& data = *batch[0].array();
OptionalBinaryBitBlockCounter counter(data.buffers[0], data.offset, data.buffers[1],
                                      data.offset, data.length);
int64_t position = 0;
while (position < data.length) {
  const auto block = counter.NextAndBlock();
  if (block.popcount > 0) {
    this->state.max = true;
    break;
  }
  position += block.length;
}

This worked pretty much straight away. Thanks @pitrou!!!
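For readers unfamiliar with the block-wise approach, here is a rough pure-Python sketch of the same idea: AND the validity bitmap with the values bitmap one block at a time and stop at the first non-zero block. It is not the Arrow implementation. The function name `any_valid_true` and the `block_bytes` parameter are made up for illustration, the array offset is assumed to be zero, and bits are read least-significant-first within each byte, as in Arrow bitmaps.

```python
from typing import Optional


def any_valid_true(validity: Optional[bytes], values: bytes, length: int,
                   block_bytes: int = 8) -> bool:
    """Return True if any of the first `length` slots is both valid and true.

    Hypothetical helper: a missing validity bitmap means "all slots valid".
    """
    if validity is None:
        validity = b"\xff" * len(values)
    full_bytes, tail_bits = divmod(length, 8)
    pos = 0
    while pos < full_bytes:
        end = min(pos + block_bytes, full_bytes)
        valid_block = int.from_bytes(validity[pos:end], "little")
        value_block = int.from_bytes(values[pos:end], "little")
        if valid_block & value_block:
            # At least one slot in this block is valid and true: stop early.
            return True
        pos = end
    if tail_bits:
        # Only the low `tail_bits` bits of the last byte belong to the array.
        mask = (1 << tail_bits) - 1
        return bool(validity[full_bytes] & values[full_bytes] & mask)
    return False
```

The early `return True` plays the role of the `break` in the suggestion above: no bits beyond the first block containing a valid true value are examined.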

pitrou · 2020-10-01T15:51:57Z

cpp/src/arrow/compute/kernels/aggregate_test.cc

Please also test with an empty array.

jorisvandenbossche · 2020-10-13T12:56:26Z

cpp/src/arrow/compute/api_aggregate.h

Is this the behaviour we want? (the "null values are taken to evaluate to false")

Any/all are of course not the most typical reductions (so I am also not fully sure about the desired behaviour), but, for other reductions we actually skip nulls. And skipping nulls is not the same as evaluating them to False

(somewhat related to my comment at #8294 (comment))

I'm not sure about the desired behavior either (although if it was up to me I would want to shoot for consistency with existing kernels).

That said, I may need to improve the phrasing in the docstring. I think the current kernel behavior is what you describe: we skip nulls and return whether we saw any True values, so perhaps it's better to just say that. As is, treating null as false and skipping nulls give the same result in this case, since neither evaluates to true.

A bit off-topic, but for an all kernel (which could be nice to have) I think we'd want to have null handling options, so that users could switch between

any([true, null]) = true   # skip nulls
any([true, null]) = false  # null evaluates as false
any([true, null]) = false  # kleene logic (I think?)

(PS Apologies for not replying to your comment directly - I have opened ARROW-10291 to track that discussion )

Ah, yes, you're correct that for this PR there is not yet a difference: you are only dealing with any, and it's only for all that there is a difference between "skipping nulls" and "treating nulls as False"

So I would indeed update the docstring to simply state that nulls are skipped.

For your example about all (I assume the code was meant to use all and not any?), for the last line about Kleene logic I expect a return value of null (the null in the values can be either True or False, so the result can be True or False, meaning it is unknown).
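The point that "skipping nulls" and "treating nulls as False" coincide for any but diverge for all can be checked with a small pure-Python model. The four helper names below are hypothetical and `None` stands in for null; nothing here is Arrow API.

```python
from typing import Optional, Sequence

Values = Sequence[Optional[bool]]


def any_skip_nulls(values: Values) -> bool:
    # Drop nulls, then reduce the remaining values.
    return any(v for v in values if v is not None)


def any_nulls_as_false(values: Values) -> bool:
    # Keep nulls, but count them as False.
    return any(v is True for v in values)


def all_skip_nulls(values: Values) -> bool:
    return all(v for v in values if v is not None)


def all_nulls_as_false(values: Values) -> bool:
    return all(v is True for v in values)


data = [True, None]
print(any_skip_nulls(data), any_nulls_as_false(data))  # True True
print(all_skip_nulls(data), all_nulls_as_false(data))  # True False
```

For any, a null can never turn the result True under either rule, so the two helpers always agree. For all, a null counted as False forces the result to False, while a skipped null leaves it untouched.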

Docstring updated.

Sorry yes. The examples were for all. Just to make sure I understand, with kleene logic, we emit null if there's a null anywhere in the input

any_kleene([true, null]) = null
any_kleene([false, null]) = null
all_kleene([true, null]) = null
all_kleene([false, null]) = null

I opened ARROW-10301 re: all kernel

Sorry yes. The examples were for all. Just to make sure I understand, with kleene logic, we emit null if there's a null anywhere in the input

No, we only emit null if the presence of the null (the fact that it could be either True or False) would influence the result:

any_kleene([true, null]) = true
any_kleene([false, null]) = null
all_kleene([true, null]) = null
all_kleene([false, null]) = false

But, the above is only when not skipping nulls (because with the default of skipping, there is no difference with non-kleene logic)

See the links I mentioned in #8294 (comment), and also the Julia docs have a good explanation of three-valued (Kleene) logic: https://docs.julialang.org/en/v1/manual/missing/index.html#Logical-operators-1

BTW, the result you show (a null as result in all cases) is what I expect for the non-kleene version when not skipping nulls (I would expect that nulls propagate in that case, and not necessarily be interpreted as false)
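One way to see where the four Kleene results above come from is to treat the reduction as a fold of the binary three-valued operators. The sketch below is plain Python, not the Arrow kernels: `or_kleene`, `and_kleene`, `any_kleene` and `all_kleene` are pure-Python stand-ins that borrow the names used in this thread, `None` stands in for null, and nulls are not skipped.

```python
from functools import reduce
from typing import Optional, Sequence

Tri = Optional[bool]  # True, False, or None (unknown)


def or_kleene(a: Tri, b: Tri) -> Tri:
    if a is True or b is True:
        return True   # one True is enough, whatever the other value is
    if a is None or b is None:
        return None   # no True seen, and an unknown could still be True
    return False


def and_kleene(a: Tri, b: Tri) -> Tri:
    if a is False or b is False:
        return False  # one False is enough, whatever the other value is
    if a is None or b is None:
        return None   # no False seen, and an unknown could still be False
    return True


def any_kleene(values: Sequence[Tri]) -> Tri:
    return reduce(or_kleene, values, False)


def all_kleene(values: Sequence[Tri]) -> Tri:
    return reduce(and_kleene, values, True)


print(any_kleene([True, None]))   # True
print(any_kleene([False, None]))  # None
print(all_kleene([True, None]))   # None
print(all_kleene([False, None]))  # False
```

This version favours clarity over speed: unlike a real kernel, it does not short-circuit once the result is decided.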

arw2019

I believe I addressed all the comments so this is ready for re-review (modulo CI turning something up)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

pitrou

+1. This looks good to me, thank you @arw2019 .

pitrou · 2020-11-24T16:02:24Z

I've rebased now.

arw2019 · 2020-11-24T22:06:59Z

thanks @pitrou @jorisvandenbossche @wesm for reviewing!!!

arw2019 force-pushed the ARROW-1846 branch from af2ea90 to c20b07e Compare September 29, 2020 05:16

arw2019 force-pushed the ARROW-1846 branch from c20b07e to df1ae22 Compare September 29, 2020 13:41

wesm reviewed Sep 29, 2020

arw2019 force-pushed the ARROW-1846 branch 2 times, most recently from 42b158a to cda0edb Compare September 30, 2020 03:58

arw2019 commented Sep 30, 2020

arw2019 force-pushed the ARROW-1846 branch 2 times, most recently from 8fd3c37 to 3fa7dd6 Compare October 1, 2020 05:35

pitrou reviewed Oct 1, 2020

arw2019 force-pushed the ARROW-1846 branch 2 times, most recently from 2c26fc4 to 575131a Compare October 13, 2020 01:01

jorisvandenbossche requested changes Oct 13, 2020

arw2019 force-pushed the ARROW-1846 branch 8 times, most recently from 87369ad to af613d9 Compare October 13, 2020 16:34

arw2019 commented Oct 13, 2020

arw2019 force-pushed the ARROW-1846 branch 3 times, most recently from 304e4b7 to 0bfeccf Compare October 14, 2020 05:52

kszucs force-pushed the master branch from 953009f to 04660f8 Compare October 19, 2020 18:00

kszucs force-pushed the ARROW-1846 branch from 0bfeccf to 61105ac Compare October 19, 2020 22:32

arw2019 force-pushed the ARROW-1846 branch from 61105ac to 4ce1f65 Compare October 26, 2020 04:18

arw2019 force-pushed the ARROW-1846 branch from 4ce1f65 to dad5157 Compare November 2, 2020 18:25

jorisvandenbossche requested review from pitrou and removed request for pitrou November 19, 2020 14:54

arw2019 and others added 18 commits November 24, 2020 17:00

ARROW-1846: [C++] Implement "any" reduction kernel for boolean data

cdfb054

feedback

1926af3

add test for empty case

f7f394e

feedback: aggressive short-circuiting

f01f079

feedback: simplify AnyImpl

5952443

update docstring

4cd692a

update Python api reference

c55d20f

some fixes

9f099c6

in BooleanAnyImpl rename local var max->any

83c653b

BooleanAnyImpl: use |= instead of += for bool to fix Windows build

bed57a6

add testcase

90f911e

DOC: add python doc in aggregate_basic.cc

2b22096

resolve merge conflict

70a9d7f

fix merge conflict

21c1ef6

linting

579a96d

Update version in C++ docstring

a66661d

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

fix rebase error

8a58bdf

fix rebase error

656bd82

pitrou force-pushed the ARROW-1846 branch from 349960a to 656bd82 Compare November 24, 2020 16:01

pitrou approved these changes Nov 24, 2020

pitrou closed this in cb04686 Nov 24, 2020

arw2019 deleted the ARROW-1846 branch November 24, 2020 18:12

This was referenced May 25, 2021

[C++] Implement "any" reduction kernel for boolean data #17840

Closed

[C++] Add "all" boolean reducing kernel #26292

Closed

[C++] Add all_kleene boolean reducing kernel #26284

Open
