-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support for AnyExpression #2549
Conversation
I added the |
@alamb @andygrove is it required to add a new expression for Ballista? I did that in 218a917, but I don't use it. Thanks |
I assume you had to add that to fix compilation issues because ballista has exhaustive matching of all expressions? The changes look reasonable to me. Note that we also have a |
I plan to review this carefully today -- thanks @ovr |
fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> { | ||
let value = match self.value.evaluate(batch)? { | ||
ColumnarValue::Array(array) => array, | ||
ColumnarValue::Scalar(scalar) => scalar.to_array(), | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ovr -- this is looking quite cool.
I haven't reviewed this code super carefully but I wonder what you think about reusing more of the code for InList
?
It seems like the implementations of x IN (<expr>, <expr>)
and x ANY (<expr>)
are almost exactly the same except for the fact that the the x = ANY(<expr>)
has the list as a single <expr>
?
The implementation of get_set
needs to be different, but otherwise the rest of the code could probably be called directly.
I mention this as it is likely the IN
/ NOT IN
are more optimized (use a hashset for testing, for example) as well as already written.
Let me know what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't reviewed this code super carefully but I wonder what you think about reusing more of the code for InList?
Probably, the nearest operator will be ALL
, but it's not implemented right now. I tried to reuse the code of InList
, but come to the decision that it will be easier/correct to combine it with ANY in the future.
It seems like the implementations of x IN (, ) and x ANY () are almost exactly the same except for the fact that the the x = ANY() has the list as a single ?
Nope, because it supports more operators: <
, <=
, >
, >=
(it's not implemented in this PR, but anyway).
I mention this as it is likely the IN / NOT IN are more optimized (use a hashset for testing, for example) as well as already written.
Yes, but IN supports only the vector of scalars in DF. It's not the same as ANY, because it supports List (real column) or scalar.
For example SELECT left, values FROM table
:
left | right LIst(Int64) - another logic is required
1. | [1,2,3] - ArrayRef = PrimitiveArray<Int64> - another logic is requied
2. |. [2,3,4]
3. |. [4,5,6]
4. | [4,5,6]
5. |. [4,5,6]
In the case of using the whole column, it is required to use another handling of interacting on values ☝️ . Probably here, will be more correct to go by steps:
- Support more operators
- Introduce
ALL
operator + extract the same logic withANY
.
Thanks
Thanks -- I'll try and find some more time to review this PR.
…On Wed, May 18, 2022 at 5:54 PM Dmitry Patsura ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In datafusion/physical-expr/src/expressions/any.rs
<#2549 (comment)>
:
> + fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
+ let value = match self.value.evaluate(batch)? {
+ ColumnarValue::Array(array) => array,
+ ColumnarValue::Scalar(scalar) => scalar.to_array(),
+ };
I haven't reviewed this code super carefully but I wonder what you think
about reusing more of the code for InList?
Probably, the nearest operator will be ALL, but it's not implemented
right now. I tried to reuse the code of InList, but come to the decision
that it will be easier/correct to combine it with ANY in the future.
It seems like the implementations of x IN (, ) and x ANY () are almost
exactly the same except for the fact that the the x = ANY() has the list as
a single ?
Nope, because it supports more operators: <, <=, >, >= (it's not
implemented in this PR, but anyway).
I mention this as it is likely the IN / NOT IN are more optimized (use a
hashset for testing, for example) as well as already written.
Yes, but IN supports only the vector of scalars in DF. It's not the same
as ANY, because it supports List (real column) or scalar.
For example SELECT left, values FROM table:
left | right LIst(Int64) - another logic is required
1. | [1,2,3] - ArrayRef = PrimitiveArray<Int64> - another logic is requied
2. |. [2,3,4]
3. |. [4,5,6]
4. | [4,5,6]
5. |. [4,5,6]
In the case of using the whole column, it is required to use another
handling of interacting on values ☝️ . Probably here, will be more
correct to go by steps:
- Support more operators
- Introduce ALL operator + extract the same logic with ANY.
Thanks
—
Reply to this email directly, view it on GitHub
<#2549 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADXZMLLPQHWVVSZ2KIMKULVKVRKVANCNFSM5WCC6J2Q>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Codecov Report
@@ Coverage Diff @@
## master #2549 +/- ##
==========================================
- Coverage 84.64% 84.54% -0.10%
==========================================
Files 267 268 +1
Lines 46926 47239 +313
==========================================
+ Hits 39719 39937 +218
- Misses 7207 7302 +95
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm missing something fundamental, but I'm struggling to follow why this is handling ListArrays at all... This makes it quite hard for me to review 😅
@@ -955,6 +955,30 @@ async fn test_extract_date_part() -> Result<()> { | |||
Ok(()) | |||
} | |||
|
|||
#[tokio::test] | |||
async fn test_binary_any() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth testing what happens if NULL
is in the list?
/// The comparison operator | ||
op: Operator, | ||
/// Right-hand side of the expression | ||
right: Box<Expr>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing supporting an arbitrary subquery here is a future extension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Expr
here could be an Expr::ScalarSubquery
to cover the subquery case (assuming that we only need to support subqueries that select a single column?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The definition of AnyExpr
here is identical to the definition of BinaryExpr
. I wonder if it would be better to use BinaryExpr
for consistency and model Any as AnyExpr(Expr)
to just model the right-hand side?
ColumnarValue::Array(array) => (array, false), | ||
ColumnarValue::Scalar(scalar) => (scalar.to_array(), true), | ||
}; | ||
let as_list = list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably missing something fundamental, but I'm confused as to why this is a ListArray, i.e. 2-dimensional.
In all the examples given the expression is [1,2,3]
which is 1 dimensional?
} | ||
}; | ||
|
||
match value.data_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on if this needs to be using ListArray, this could possibly use the dyn comparison kernels in arrow to reduce the amount of code.
let col_a = col("a", &schema)?; | ||
|
||
let values_builder = Int64Builder::new(3 * 3); | ||
let mut builder = ListBuilder::new(values_builder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using ListArray::from_iter_primitive
can be easier to read than the builders.
I reviewed the logical plan, SQL parsing, and optimizer changes and LGTM. I did not review the execution part. It might make reviews easier for future work like this to have separate PRs for logical versus physical parts. I would have been more comfortable reviewing and approving the logical plan parts of this. |
Hi @ovr What do you think about breaking this up into smaller PRs and starting with the logical plan changes and SQL parsing? I think this might make it easier to start getting parts of this merged in. |
This PR is more than 6 month old, so closing it down for now to clean up the PR list. Please reopen if this is a mistake and you plan to work on it more |
Which issue does this PR close?
#2548
Rationale for this change
This PR implements partial support for
ANY
operator, only for=
&<>
.Are there any user-facing changes?
This PR doesn't introduce any breaking changes, only new functionality