Skip to content

Conversation

@arw2019
Copy link
Contributor

@arw2019 arw2019 commented Sep 29, 2020

As discussed on Jira this is a short-circuiting Max for booleans implemented on top of the existing min_max kernel.

As is there are no options: null is always taken to evaluate to false.

If we want to include control over null handling I can either add Options to the kernel or I can implement an any_kleene kernel by analogy with the and_kleene and or_kleene logical kernels that we have.

In Python the two options would look like:

In []: a = pa.array([True, None], type='bool') 
    ...:  
    ...: # option 1 
    ...: pc.any(a).as_py() is True 
    ...: pc.any_kleene(a).as_py() is None 
    ...:  
    ...: # option 2 
    ...: pc.any(null_handling='skip') is True 
    ...: pc.any(null_handling='emit_null') is None                                                     

@github-actions
Copy link

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really needed? Since there is only one type handled seems like you could omit all this and do something simpler in AnyInit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not. I've simplified it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use BooleanScalar here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to that here (and in other places)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use BooleanArray here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to that here (and in other places)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collapse all this to

out->value = std::make_shared<BooleanScalar>(this->state.max)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These typedefs probably not needed, just use the Boolean* types within

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got rid of all these

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use checked_cast<const BooleanArray&> here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - switched to plain bool variable in the new commit as you suggest below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just bool here?

@arw2019 arw2019 force-pushed the ARROW-1846 branch 2 times, most recently from 42b158a to cda0edb Compare September 30, 2020 03:58
Copy link
Contributor Author

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wesm for reviewing! I've addressed most of the comments. (There's one left - I'll ping when I'm done and ready for re-review.)

A question re: benchmarks - are they useful for this? I'll add some if yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to that here (and in other places)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to that here (and in other places)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got rid of all these

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - switched to plain bool variable in the new commit as you suggest below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@jorisvandenbossche
Copy link
Member

Just a high level remark (didn't yet look at the code), but I think the example you gave:

In []: a = pa.array([True, None], type='bool') 
    ...:  
    ...: # option 1 
    ...: pc.any(a).as_py() is True 
    ...: pc.any_kleene(a).as_py() is None 
    ...:  
    ...: # option 2 
    ...: pc.any(null_handling='skip') is True 
    ...: pc.any(null_handling='emit_null') is None                                                     

has a wrong output for the kleene version. With Kleene logic, also the second output would be True, as the array already contains a True, the missing value doesn't matter anymore.

Using Kleene logic or not is not the same as the skip/emit null handling. By default, if nulls are skipped, then it doesn't matter if you use Kleene logic or not, since there are no nulls to behave in certain ways. So only when not skipping nulls, you get a different behaviour: any([True, None], skipna=False) or any_kleene([True, None], skipna=False) would still both give True as result, since there is any True. But eg any([False, None], skipna=False) woud give False (the missing being False) vs any_kleene([False, None], skipna=False) giving null as result.

See also our discussions in pandas about this (pandas-dev/pandas#29686; https://pandas.pydata.org/pandas-docs/stable/user_guide/boolean.html#kleene-logical-operations)

@arw2019 arw2019 force-pushed the ARROW-1846 branch 2 times, most recently from 8fd3c37 to 3fa7dd6 Compare October 1, 2020 05:35
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand why this is inheriting from MinMax. Does it help reduce the code size in any way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't. I've rewritten it so it inherits from ScalarAggregator directly

Also since it's no longer a template I moved it to aggregate_basic.cc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This forces counting the set bits in the whole array, but we're only interested in the presence of a single set bit, so we should be able to shortcut much more aggressively.

You may try to use OptionalBinaryBitBlockCounter for that. Untested:

const auto& data = *batch[0].array();
OptionalBinaryBitBlockCounter counter(data.buffers[0], data.offset, data.buffers[1], data.offset, data.length);
int64_t position = 0;
while (position < data.length) {
  const auto block = counter.NextAndBlock();
  if (block.popcount > 0) {
    this->state.max = true;
    break;
  }
  position += block.length;
}

}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This worked pretty much straight away. Thanks @pitrou!!!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also test with an empty array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@arw2019 arw2019 force-pushed the ARROW-1846 branch 2 times, most recently from 2c26fc4 to 575131a Compare October 13, 2020 01:01
Copy link
Member

@jorisvandenbossche jorisvandenbossche Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the behaviour we want? (the "null values are taken to evaluate to false")

Any/all are of course not the most typical reductions (so I am also not fully sure about the desired behaviour), but, for other reductions we actually skip nulls. And skipping nulls is not the same as evaluating them to False

(somewhat related to my comment at #8294 (comment))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the desired behavior either (although if it was up to me I would want to shoot for consistency with existing kernels).

That said I may need to improve the phrasing in the docstring. I think he current kernel behavior is what you describe: we skip nulls and return whether we saw any True values so perhaps it's better to just say that. I think that as is treating null as false or skipping is the same in this case, since neither evaluate to true.

A bit off-topic, but for an all kernel (which could be nice to have) I think we'd want to have null handling options, so that users could switch between

any([true, null]) = true # skip nulls
any([true, null]) = false # null evaluates as false
any([true, null]) = false # kleene logic (I think?)

(PS Apologies for not replying to your comment directly - I have opened ARROW-10291 to track that discussion )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, you're correct that for this PR there is not yet a difference: you are only dealing with any, and it's only for all that there is a difference between "skipping nulls" and "treating nulls as False"

So I would indeed update the docstring to simply state that nulls are skipped.

For your example about all (I assume the code was meant to use all and not any ?), for the last line about Kleene logic I expect a return value of null (as the ǹull in the values can be both True or False, meaning that the result can be True of False, meaning it is unknown)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring updated.

Sorry yes. The examples were for all. Just to make sure I understand, with kleene logic, we emit null if there's a null anywhere in the input

any_kleene([true, null]) = null
any_kleene([false, null]) = null
all_kleene([true, null]) = null
all_kleene([false, null]) = null

I opened ARROW-10301 re: all kernel

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry yes. The examples were for all. Just to make sure I understand, with kleene logic, we emit null if there's a null anywhere in the input

No, we only emit null if the presence of the null (the fact that it could be either True or False) would influence the result:

any_kleene([true, null]) = true
any_kleene([false, null]) = null
all_kleene([true, null]) = null
all_kleene([false, null]) = false

But, the above is only when not skipping nulls (because with the default of skipping, there is no difference with non-kleene logic)

See the links I mentioned in #8294 (comment), and also the Julia docs have a good explanation of three-valued (Kleene) logic: https://docs.julialang.org/en/v1/manual/missing/index.html#Logical-operators-1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the result you show (a null as result in all cases) is what I expect for the non-kleene version when not skipping nulls (I would expect that nulls propagate in that case, and not necessarily be interpreted as false)

@arw2019 arw2019 force-pushed the ARROW-1846 branch 8 times, most recently from 87369ad to af613d9 Compare October 13, 2020 16:34
Copy link
Contributor Author

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I addressed all the comments so this is ready for re-review (modulo CI turning something up)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't. I've rewritten it so it inherits from ScalarAggregator directly

Also since it's no longer a template I moved it to aggregate_basic.cc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the desired behavior either (although if it was up to me I would want to shoot for consistency with existing kernels).

That said I may need to improve the phrasing in the docstring. I think he current kernel behavior is what you describe: we skip nulls and return whether we saw any True values so perhaps it's better to just say that. I think that as is treating null as false or skipping is the same in this case, since neither evaluate to true.

A bit off-topic, but for an all kernel (which could be nice to have) I think we'd want to have null handling options, so that users could switch between

any([true, null]) = true # skip nulls
any([true, null]) = false # null evaluates as false
any([true, null]) = false # kleene logic (I think?)

(PS Apologies for not replying to your comment directly - I have opened ARROW-10291 to track that discussion )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not. I've simplified it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This worked pretty much straight away. Thanks @pitrou!!!

@jorisvandenbossche jorisvandenbossche requested review from pitrou and removed request for pitrou November 19, 2020 14:54
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. This looks good to me, thank you @arw2019 .

@pitrou
Copy link
Member

pitrou commented Nov 24, 2020

I've rebased now.

@pitrou pitrou closed this in cb04686 Nov 24, 2020
@arw2019 arw2019 deleted the ARROW-1846 branch November 24, 2020 18:12
@arw2019
Copy link
Contributor Author

arw2019 commented Nov 24, 2020

thanks @pitrou @jorisvandenbossche @wesm for reviewing!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants