-
Couldn't load subscription status.
- Fork 3.9k
ARROW-11790: [Rust][DataFusion] RFC: Change builder signatures to take Vec<Expr> rather than &[Expr] #9692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
rust/datafusion/src/dataframe.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the API change -- most of the rest of this PR is updating all the call sites. Note that DataFrame::filter takes in an owned Expr so taking an owned Vec is not that large of a departure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is where the Expr::clone occurs and no longer does after the changes contemplated in this PR
|
I would love your thoughts on this approach @jorgecarleitao @andygrove @houqp @Dandandan @ilya-biryukov @seddonm1 -- I wonder if this approach would be worth bringing up on the mailing list |
|
Since it aligns with our previous discussion at #9347 (review), I am all for this change! I think we should adopt this pattern across the entire code base. Clone should be managed at callsites whenever possible. |
|
I am all for this change. I had started out with the pattern of passing |
I also remember seeing that recommendation before and adopted it for awhile. I wonder how wide-spread this is in the Rust community 🤔 |
|
I think this makes sense. This is a partial revert from a PR from me (I believe). But I think it is better for "builder" kind of functions where it needs "owned" data. In general, |
Yeah, I think it's always better to use In case of |
Exactly, this is my current understanding too. It is best to evaluate on a case by case basis 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks a lot, @alamb !
I agree with what was said: when the function needs ownership, it is better to leave the caller to make the decision on whether they want to clone it and keep the object or pass the ownership to the function.
I do think that in some cases we could simplify some structs by having a lifetime parameter and &'a [] instead of Vec, but since this is not high performant code, I do not think we need to worry much.
|
@Dandandan and @houqp What do you think about something like this (which is even more Idiomatic (TM) I think), though harder for beginner Rust programmers to grok? This would allow Sadly, I couldn't give the same treatment to the |
|
Yes, I think if the function only requires sequential access, iterator would have been even better. Although I would prefer to write it as a generic function rather than using Trait object. |
|
Another option is to use @houqp , AFAIK I am not sure the |
|
@jorgecarleitao yeah, you are right, sorry I got it mixed up with I think |
|
@houqp , you are right, Thank you for that piece of knowledge! :) |
…e Vec<Expr> rather than &[Expr]
140dfe4 to
deb15be
Compare
|
Closing in favor of #9703 |
…l Interator<Item=Expr> rather than &[Expr] # NOTE: Since is a fairly major backwards incompatible change (many callsites need to be updated, though mostly mechanically); I gathered some feedback on this approach in #9692 and this is the PR I propose for merge. I'll leave this open for several days and also send a note to the mailing lists for additional comment It is part of my overall plan to make the DataFusion optimizer more idiomatic and do much less copying [ARROW-11689](https://issues.apache.org/jira/browse/ARROW-11689) # Rationale: All callsites currently need an owned `Vec` (or equivalent) so they can pass in `&[Expr]` and then Datafusion copies all the `Expr`s. Many times the original `Vec<Expr>` is discarded immediately after use (I'll point out where this happens in a few places below). Thus I it would better (more idiomatic and often less copy/faster) to take something that could produce an iterator over Expr # Changes 1. Change `Dataframe` so it takes `Vec<Expr>` rather than `&[Expr]` 2. Change `LogicalPlanBuilder` so it takes `impl Iterator<Item=Expr>` rather than `&[Expr]` I couldn't figure out how to allow the `Dataframe` API (which is a Trait) to take an `impl Iterator<Item=Expr>` Closes #9703 from alamb/alamb/less_copy_in_plan_builder_final Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
…l Interator<Item=Expr> rather than &[Expr] # NOTE: Since is a fairly major backwards incompatible change (many callsites need to be updated, though mostly mechanically); I gathered some feedback on this approach in apache/arrow#9692 and this is the PR I propose for merge. I'll leave this open for several days and also send a note to the mailing lists for additional comment It is part of my overall plan to make the DataFusion optimizer more idiomatic and do much less copying [ARROW-11689](https://issues.apache.org/jira/browse/ARROW-11689) # Rationale: All callsites currently need an owned `Vec` (or equivalent) so they can pass in `&[Expr]` and then Datafusion copies all the `Expr`s. Many times the original `Vec<Expr>` is discarded immediately after use (I'll point out where this happens in a few places below). Thus I it would better (more idiomatic and often less copy/faster) to take something that could produce an iterator over Expr # Changes 1. Change `Dataframe` so it takes `Vec<Expr>` rather than `&[Expr]` 2. Change `LogicalPlanBuilder` so it takes `impl Iterator<Item=Expr>` rather than `&[Expr]` I couldn't figure out how to allow the `Dataframe` API (which is a Trait) to take an `impl Iterator<Item=Expr>` Closes #9703 from alamb/alamb/less_copy_in_plan_builder_final Authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
NOTE:
Since is a fairly major backwards incompatible change (many callsites need to be updated, though mostly mechanically), I have opened this PR with an example of how it would work for a few callsites, and wanted to gauge community interest before getting the whole thing working.
It is part of my overall plan to make the DataFusion optimizer more idiomatic and do much less copying ARROW-11689
Rationale:
All callsites currently need an owned
Vec(or equivalent) so they can pass in&[Expr]and then Datafusion copies all theExprs. Many times the originalVec<Expr>is discarded immediately after use (I'll point out where this happens in a few places below). Thus I it would better (more idiomatic and often less copy/faster) to take something that could produce an iterator over ExprChanges
Change
Dataframe/LogicalPlanBuilderso it takesVec<Expr>rather than&[Expr]