API: add isNaN and notNaN predicates #1747
Do we also need to update the equality predicate to catch `NaN` and rewrite it to `isNaN`?
I originally thought to update `SparkFilters` to do the rewrite, but this is a much better place. Thanks for the suggestion!

Edit: what do you think about rewriting `eq` within `UnboundPredicate`? And for rewriting `in`, I was thinking to let `Expressions.in` do the rewrite logic of `or(isNaN, in)` / `and(notNaN, notIn)`, but that means it will return `Expression` instead of `Predicate`; does that align with your thinking?
I do not fully understand what you mean by "rewrite logic of `or(isNaN, in)` / `and(notNaN, notIn)`" when you talk about rewriting `in`. Can you give some examples of what predicates you are trying to support?
Since we now want to handle NaN in the `in` predicate: for the query `in(1, 2, NaN)`, to avoid checking for NaN during `in` evaluation all the time, we can transform it to `in(1, 2) or isNaN`, and `notIn(1, 2, NaN)` to `notIn(1, 2) and notNaN`. The problem is where to do that: `in` and `notIn` are both predicates, and if we extend them we are transforming a predicate (simpler form) into an expression (more complex form). I think there's no such case in the current code base, and it would touch a lot of existing test cases.
Okay, so it's what I thought, just a bit confused by the notation.

For `eq`, what is the benefit of doing it in `UnboundPredicate` versus just rewriting it in `Expressions`?

For `in`, I think it is a more complex question. We need to figure out:
1. Should `in(1, 2, NaN)` be supported, given it can be written as `is_nan or in(1, 2)` on the client side?
2. `Expressions.in` should return `Expression` as you said, which looks fine to me because the only caller, `SparkFilters.convert`, also returns an `Expression` in the end.
Thanks for the quick response! Yeah, I think the amount of change to method return types/tests is not a concern now. I just wasn't entirely sure if rewriting `eq` to `isNaN` in `Expressions` will help with catching problems early (compared to rewriting in `UnboundPredicate`), since it seems to me that the related code will not have a chance to throw any exception until `bind()` is called?
Yeah, it isn't much earlier in that case. Maybe that actually exposes a problem with rewriting, too. `Expressions.equal("c", Double.NaN)` where `c` is not a floating point column would result in `isNaN`, which should be rejected while binding expressions. You could argue that it should rewrite to `alwaysFalse` instead, following the same logic as `Expressions.equal("intCol", Long.MAX_VALUE)`: it can't be true. I think it would be better to be strict and reject binding in that case, because something is clearly wrong. A lot of the time, that kind of error would happen when columns are misaligned or predicates are incorrectly converted.

If the result of those errors is just to fail in expression binding, then why rewrite at all? Maybe we should just reject NaN in any predicate and force people to explicitly use `isNaN` and `notNaN`. That way we do throw an exception much earlier in all cases. Plus, we wouldn't have to worry about confusion over whether `NaN` is equal to itself: in Java, a `Double` that holds NaN is equal to itself, but a primitive is not. 😕
Thanks, those are some good points! To make sure I understand correctly and know how to move forward, I have some questions:
1. So the rewriting should happen in `SparkFilters` (or in general, the integration point with engines during the query-to-expression translation), or maybe even earlier than that, to let engines support the syntax of `is NaN`?
2. To support `isNaN`, there has to be some place that ensures the type is either double or float, and in the Iceberg code base we will only know this during binding; are we able to rely on the engine to do this check before translating the query to `Expression`?
3. Should we also reject NaN in `eq`, as we decided to do input validation on the other `lt/lteq/gt/gteq` and `in` anyway?
4. If we reject passing `NaN` to `eq`, that may sound backward incompatible until the engine starts to rewrite NaN?

I guess the conversation is starting to get too detailed; if you wouldn't mind, I'll try to follow up on Slack tomorrow and then post the conclusion here.
1. Yes. If the engine generally uses `d = NaN`, then we can convert that to `isNaN`. But that would be engine-dependent, and the Iceberg expression API would not support equals with NaN.
2. I think so. Most engines will optimize the SQL expressions and handle this already. If not, then it would result in an exception from Iceberg to the user. I think that's okay, too, because as I said above, we want to fail if a NaN is used in an expression with a non-floating-point column, not rewrite to false.
3. Yes. This makes all of the handling in `Expressions` consistent: always reject NaN values.
4. I'm not convinced either way. You could argue that `d = NaN` is ambiguous and that rejecting it is now fixing a bug. That's certainly the case with `d > NaN`, which is not defined. On the other hand, there was some behavior before that will now no longer work. So I'd be up for fixing this in Flink and Spark conversions as soon as we can.

Feel free to ping me on Slack!
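The engine-side conversion in point 1 could look roughly like the sketch below. The filter type here is a stand-in, not Spark's `org.apache.spark.sql.sources.EqualTo` or Iceberg's real converter; the point is only that `d = NaN` gets rewritten to `isNaN(d)` at the engine boundary, so the Iceberg expression API itself never sees equals-with-NaN.

```java
public class FilterConversionSketch {
    // Stand-in for an engine pushdown filter like Spark's EqualTo(attribute, value)
    record EqualTo(String attribute, Object value) {}

    // Hypothetical conversion: route NaN equality to isNaN, everything else to equal
    static String convert(EqualTo filter) {
        if (filter.value() instanceof Double d && d.isNaN()) {
            return "isNaN(" + filter.attribute() + ")";
        }
        return "equal(" + filter.attribute() + ", " + filter.value() + ")";
    }

    public static void main(String[] args) {
        System.out.println(convert(new EqualTo("d", Double.NaN))); // isNaN(d)
        System.out.println(convert(new EqualTo("d", 1.0)));        // equal(d, 1.0)
    }
}
```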
Thank you for the explanation! I think now I understand the full picture. I think I've addressed everything except for rewriting in `SparkFilters` and other engines; this PR is already too big, so I'll submit a separate PR for that (likely next week).
Should we define a similar `containsNaNsOnly` method to use in `notNaN`, and for a similar use in `isNull`?
I didn't define it for `notNaN` originally, as I could directly return `ROWS_CANNOT_MATCH` when both `nanCounts` and `valueCounts` contain this column but the numbers don't match, without going into the next block of logic (checking upper == lower == NaN and null count == 0); but this advantage no longer exists since that block needs to be removed.

But I wasn't sure if we need it for `isNull`: currently in `isNull()` we are checking if `nullCounts == 0` to return `ROWS_CANNOT_MATCH`, and I guess the only case where we would rely on `containsNaNsOnly` to return `ROWS_CANNOT_MATCH` is when `nullCounts` for this column doesn't exist but `nanCounts` does. I personally feel the chance of this happening would be small; do you think we will run into this case often?
I agree that the `containsNaNsOnly` logic will not be very useful, as Yan said, but I think it is also valuable to have that private method just for readability.

Then the question reduces to: do we need to consider the case where null value metrics do not exist but NaN metrics do? For now I think the answer is no, because in all metrics modes the NaN and null counters either both exist or both don't.
Sounds good, I'll create a `containsNaNsOnly` for readability. Ryan, do you have comments on the other point?
I agree with the reasoning. If we have NaN counts, then we should have null counts. No need to over-complicate the null logic with a check for when we don't have null counts but do have NaN counts. Good catch!
To be safe, I think this should validate that `containsNull` is true.
You mean check that both `containsNull` and `stats.get(pos).lowerBound() == null` are true? When would `lowerBound` be null while the column doesn't contain null? I guess I'll also need to update `notNull` for this too (since I copied the logic from there)?
Seems like #1803 is missing `PartitionFieldSummary.containsNaN()`, or is it in some other PR?
That should be doable, although I originally considered the scope of the NaN support to be only at the manifest entry level; I wasn't sure if we want to extend it beyond that?
I'm not sure of a case where it would happen, but `containsNull` is the source of truth for whether there are null values, not a missing bound value.
Sounds good, I'll add `containsNull` to both here and `notNull`. And it looks like we do want to update `PartitionFieldSummary`; I'll do that in a separate PR.