API: add isNaN and notNaN predicates #1747
@@ -26,6 +26,7 @@
import org.apache.iceberg.transforms.Transform;
import org.apache.iceberg.transforms.Transforms;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.NaNUtil;

/**
 * Factory methods for creating {@link Expression expressions}.

@@ -140,50 +141,62 @@ public static <T> UnboundPredicate<T> notNaN(UnboundTerm<T> expr) {
  }

  public static <T> UnboundPredicate<T> lessThan(String name, T value) {
    validateInput("lessThan", value);
Review comment: An easier way to do this is to add the check in […]. It also ensures that we don't add factory methods later and forget to add the check to them.
Reply: Thank you! I didn't notice […]. Thank you so much for your time reviewing this long PR!
    return new UnboundPredicate<>(Expression.Operation.LT, ref(name), value);
  }

  public static <T> UnboundPredicate<T> lessThan(UnboundTerm<T> expr, T value) {
    validateInput("lessThan", value);
    return new UnboundPredicate<>(Expression.Operation.LT, expr, value);
  }

  public static <T> UnboundPredicate<T> lessThanOrEqual(String name, T value) {
    validateInput("lessThanOrEqual", value);
    return new UnboundPredicate<>(Expression.Operation.LT_EQ, ref(name), value);
  }

  public static <T> UnboundPredicate<T> lessThanOrEqual(UnboundTerm<T> expr, T value) {
    validateInput("lessThanOrEqual", value);
    return new UnboundPredicate<>(Expression.Operation.LT_EQ, expr, value);
  }

  public static <T> UnboundPredicate<T> greaterThan(String name, T value) {
    validateInput("greaterThan", value);
    return new UnboundPredicate<>(Expression.Operation.GT, ref(name), value);
  }

  public static <T> UnboundPredicate<T> greaterThan(UnboundTerm<T> expr, T value) {
    validateInput("greaterThan", value);
    return new UnboundPredicate<>(Expression.Operation.GT, expr, value);
  }

  public static <T> UnboundPredicate<T> greaterThanOrEqual(String name, T value) {
    validateInput("greaterThanOrEqual", value);
    return new UnboundPredicate<>(Expression.Operation.GT_EQ, ref(name), value);
  }

  public static <T> UnboundPredicate<T> greaterThanOrEqual(UnboundTerm<T> expr, T value) {
    validateInput("greaterThanOrEqual", value);
    return new UnboundPredicate<>(Expression.Operation.GT_EQ, expr, value);
  }

  public static <T> UnboundPredicate<T> equal(String name, T value) {
    validateInput("equal", value);
    return new UnboundPredicate<>(Expression.Operation.EQ, ref(name), value);
  }

  public static <T> UnboundPredicate<T> equal(UnboundTerm<T> expr, T value) {
    validateInput("equal", value);
    return new UnboundPredicate<>(Expression.Operation.EQ, expr, value);
  }

  public static <T> UnboundPredicate<T> notEqual(String name, T value) {
    validateInput("notEqual", value);
    return new UnboundPredicate<>(Expression.Operation.NOT_EQ, ref(name), value);
  }

  public static <T> UnboundPredicate<T> notEqual(UnboundTerm<T> expr, T value) {
    validateInput("notEqual", value);
    return new UnboundPredicate<>(Expression.Operation.NOT_EQ, expr, value);
  }

@@ -232,6 +245,7 @@ public static <T> UnboundPredicate<T> notIn(UnboundTerm<T> expr, Iterable<T> val
  }

  public static <T> UnboundPredicate<T> predicate(Operation op, String name, T value) {
    validateInput(op.toString(), value);
    return predicate(op, name, Literals.from(value));
  }

@@ -243,6 +257,7 @@ public static <T> UnboundPredicate<T> predicate(Operation op, String name, Liter
  }

  public static <T> UnboundPredicate<T> predicate(Operation op, String name, Iterable<T> values) {
    validateInput(op.toString(), values);
    return predicate(op, ref(name), values);
  }

@@ -254,9 +269,19 @@ public static <T> UnboundPredicate<T> predicate(Operation op, String name) {
  }

  private static <T> UnboundPredicate<T> predicate(Operation op, UnboundTerm<T> expr, Iterable<T> values) {
    validateInput(op.toString(), values);
    return new UnboundPredicate<>(op, expr, values);
  }

  private static <T> void validateInput(String op, T value) {
    Preconditions.checkArgument(!NaNUtil.isNaN(value), String.format("Cannot create %s predicate with NaN", op));
  }

  private static <T> void validateInput(String op, Iterable<T> values) {
    Preconditions.checkArgument(Lists.newArrayList(values).stream().noneMatch(NaNUtil::isNaN),
        String.format("Cannot create %s predicate with NaN", op));
  }

  public static True alwaysTrue() {
    return True.INSTANCE;
  }

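For readers without the Iceberg sources at hand, the validation added in this diff can be approximated with the JDK alone. This is only a sketch: `NaNValidationSketch`, `validateValue`, and `validateValues` are hypothetical names, and the local `isNaN` helper is an assumed stand-in for `NaNUtil.isNaN`, whose actual implementation is not shown in this diff.

```java
import java.util.Arrays;

final class NaNValidationSketch {
  // Assumed stand-in for NaNUtil.isNaN: true only for boxed Float/Double NaN values.
  static boolean isNaN(Object value) {
    if (value instanceof Double) {
      return ((Double) value).isNaN();
    }
    if (value instanceof Float) {
      return ((Float) value).isNaN();
    }
    return false;
  }

  // Rejects a single NaN literal, using the same message format as the diff.
  static void validateValue(String op, Object value) {
    if (isNaN(value)) {
      throw new IllegalArgumentException(String.format("Cannot create %s predicate with NaN", op));
    }
  }

  // Rejects a value list as soon as any element is NaN, without copying the list.
  static void validateValues(String op, Iterable<?> values) {
    for (Object value : values) {
      validateValue(op, value);
    }
  }

  public static void main(String[] args) {
    validateValue("lessThan", 34.0d);
    validateValues("in", Arrays.asList(1.0f, 2.0f));
    try {
      validateValues("in", Arrays.asList(1.0d, Double.NaN));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Iterating the `Iterable` directly avoids the intermediate list that `Lists.newArrayList(values).stream()` creates; whether that matters in practice is a judgment call for the PR author.
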
Do we also need to update the equality predicate to catch `NaN` and rewrite to `isNaN`?

I originally thought to update `SparkFilters` to do the rewrite, but this is a much better place. Thanks for the suggestion!

Edit: what do you think about rewriting `eq` within `UnboundPredicate`? And for rewriting `in`, I was thinking of letting `Expressions.in` do the rewrite to `or(isNaN, in)` / `and(notNaN, notIn)`, but that means it would return `Expression` instead of `Predicate`; does that align with your thinking?

I do not fully understand what you mean by "rewrite logic of `or(isNaN, in)` / `and(notNaN, notIn)`" when you talk about rewriting `in`. Can you give some examples of which predicates you are trying to support?

Since we now want to handle NaN in the `in` predicate: for the query `in(1,2,NaN)`, to avoid checking for NaN during `in` evaluation every time, we can transform it to `in(1,2) or isNaN`, and `notIn(1,2,NaN)` to `notIn(1,2) and notNaN`. The problem is where to do that: `in` and `notIn` are both predicates, so extending them means transforming a predicate (simpler form) into an expression (complex form). I think there is no such case in the current code base, and it would touch a lot of existing test cases.

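The rewrite described in this comment can be sketched as follows. Everything here is hypothetical and independent of the Iceberg API: `InRewriteSketch`, `rewriteIn`, and `rewriteNotIn` are illustrative names, and expressions are rendered as plain strings only to show the shape of the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class InRewriteSketch {
  // True only for boxed Float/Double NaN values.
  static boolean isNaN(Object value) {
    return (value instanceof Double && ((Double) value).isNaN())
        || (value instanceof Float && ((Float) value).isNaN());
  }

  static boolean containsNaN(List<?> values) {
    for (Object value : values) {
      if (isNaN(value)) {
        return true;
      }
    }
    return false;
  }

  // Copies the value list, dropping every NaN entry.
  static List<Object> withoutNaN(List<?> values) {
    List<Object> rest = new ArrayList<>();
    for (Object value : values) {
      if (!isNaN(value)) {
        rest.add(value);
      }
    }
    return rest;
  }

  // in(1, 2, NaN) becomes or(isNaN, in(1, 2)): the NaN check moves out of the value set.
  static String rewriteIn(String column, List<?> values) {
    String in = "in(" + column + ", " + withoutNaN(values) + ")";
    return containsNaN(values) ? "or(isNaN(" + column + "), " + in + ")" : in;
  }

  // notIn(1, 2, NaN) becomes and(notNaN, notIn(1, 2)).
  static String rewriteNotIn(String column, List<?> values) {
    String notIn = "notIn(" + column + ", " + withoutNaN(values) + ")";
    return containsNaN(values) ? "and(notNaN(" + column + "), " + notIn + ")" : notIn;
  }

  public static void main(String[] args) {
    System.out.println(rewriteIn("d", Arrays.asList(1.0, 2.0, Double.NaN)));
    System.out.println(rewriteNotIn("d", Arrays.asList(1.0, 2.0, Double.NaN)));
  }
}
```

The sketch makes the return-type concern visible: with a NaN in the list, the result is a compound `or`/`and` rather than a single predicate.
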
Okay, so it's what I thought; I was just a bit confused by the notation.

So for `eq`, what is the benefit of doing it in `UnboundPredicate` versus just rewriting it in `Expressions`?

For `in`, I think it is a more complex question. We need to figure out:
1. whether `in(1,2,NaN)` should be supported, given it can be written as `is_nan or in(1,2)` on the client side
2. `Expressions.in` should return `Expression` as you said, which looks fine to me because the only caller, `SparkFilters.convert`, also returns an `Expression` in the end.

Thanks for the quick response! Yeah, I think the amount of change to method return types and tests is not a concern now. I just wasn't entirely sure whether rewriting `eq` to `isNaN` in `Expressions` would help catch problems early (compared to rewriting in `UnboundPredicate`), since it seems to me that the related code will not have a chance to throw any exception until `bind()` is called?

Yeah, it isn't much earlier in that case. Maybe that actually exposes a problem with rewriting, too. `Expressions.equal("c", Double.NaN)`, if `c` is not a floating point column, would result in `isNaN`, which should be rejected while binding expressions. You could argue that it should rewrite to `alwaysFalse` instead, following the same logic as `Expressions.equal("intCol", Long.MAX_VALUE)` -- it can't be true.

I think that it would be better to be strict and reject binding in that case because something is clearly wrong. I think a lot of the time, that kind of error would happen when columns are misaligned or predicates are incorrectly converted.

If the result of those errors is just to fail in expression binding, then why rewrite at all? Maybe we should just reject NaN in any predicate and force people to explicitly use `isNaN` and `notNaN`. That way we do throw an exception much earlier in all cases. Plus, we wouldn't have to worry about confusion over whether `NaN` is equal to itself: in Java, a `Double` that holds NaN is equal to itself, but a primitive is not. 😕

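The boxed-versus-primitive difference mentioned at the end of this comment is standard Java behavior and easy to check. The class and method names below are illustrative only.

```java
final class NaNEqualityDemo {
  // Primitive comparison follows IEEE 754: NaN is never equal to anything, including itself.
  static boolean primitiveEquals(double a, double b) {
    return a == b;
  }

  // Double.equals compares bit patterns via doubleToLongBits, so NaN equals NaN.
  static boolean boxedEquals(Double a, Double b) {
    return a.equals(b);
  }

  public static void main(String[] args) {
    System.out.println(primitiveEquals(Double.NaN, Double.NaN)); // false
    System.out.println(boxedEquals(Double.NaN, Double.NaN));     // true
    System.out.println(Double.compare(Double.NaN, Double.NaN));  // 0
  }
}
```
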
Thanks, those are some good points! To make sure I understand correctly and know how to move forward, I have some questions:
1. […] `SparkFilters` (or in general, the integration point with engines during the query-to-expression translation); or maybe even earlier than that, to let engines support the syntax of `is NaN`?
2. […] `isNaN` there has to be some place that ensures the type is either double or float, and in the Iceberg code base we only know this during binding; are we able to rely on the engine to do this check before translating the query to an `Expression`?
3. […] `eq`, as we decided to do input validation on the other `lt/lteq/gt/gteq` and `in` anyway?
4. […] `NaN` to `eq`, that may sound backward incompatible until the engine starts to rewrite NaN?

I guess the conversation is starting to get too detailed. If you wouldn't mind, I'll try to follow up on Slack tomorrow and then post the conclusion here?

Yes. If the engine generally uses `d = NaN` then we can convert that to `isNaN`. But that would be engine-dependent and the Iceberg expression API would not support equals with NaN.

I think so. Most engines will optimize the SQL expressions and handle this already. If not, then it would result in an exception from Iceberg to the user. I think that's okay, too, because as I said above, we want to fail if a NaN is used in an expression with a non-floating-point column, not rewrite to false.

Yes. This makes all of the handling in `Expressions` consistent: always reject NaN values.

I'm not convinced either way. You could argue that `d = NaN` is ambiguous and that rejecting it is now fixing a bug. That's certainly the case with `d > NaN`, which is not defined. On the other hand, there was some behavior before that will now no longer work. So I'd be up for fixing this in Flink and Spark conversions as soon as we can.

Feel free to ping me on Slack!

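A rough sketch of what an engine-side conversion along these lines might look like. It is not taken from `SparkFilters` or any Flink code: `EngineFilterSketch`, `Op`, and `convert` are hypothetical, results are rendered as strings, and mapping `!=` with NaN to `notNaN` is an assumption that goes beyond what the thread states.

```java
final class EngineFilterSketch {
  enum Op { EQ, NOT_EQ, LT, LT_EQ, GT, GT_EQ }

  // Translates an engine comparison into a string form of the target expression.
  static String convert(Op op, String column, double value) {
    if (!Double.isNaN(value)) {
      // Ordinary literal: pass the comparison through unchanged.
      return op + "(" + column + ", " + value + ")";
    }
    switch (op) {
      case EQ:
        // d = NaN is rewritten to the explicit predicate.
        return "isNaN(" + column + ")";
      case NOT_EQ:
        // Assumption: d != NaN is treated as the negated explicit predicate.
        return "notNaN(" + column + ")";
      default:
        // Ordering comparisons with NaN are not defined, so fail early.
        throw new IllegalArgumentException("Cannot convert " + op + " comparison with NaN");
    }
  }

  public static void main(String[] args) {
    System.out.println(convert(Op.EQ, "d", Double.NaN));
    System.out.println(convert(Op.GT, "d", 1.5));
  }
}
```
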
Thank you for the explanation! I think I now understand the full picture. I think I've addressed everything except for rewriting in `SparkFilters` and other engines; since this PR is already too big, I'll submit a separate PR for it (likely next week).