Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Mar 13, 2021

NOTE:

Since is a fairly major backwards incompatible change (many callsites need to be updated, though mostly mechanically), I have opened this PR with an example of how it would work for a few callsites, and wanted to gauge community interest before getting the whole thing working.

It is part of my overall plan to make the DataFusion optimizer more idiomatic and do much less copying ARROW-11689

Rationale:

All callsites currently need an owned Vec (or equivalent) so they can pass in &[Expr] and then Datafusion copies all the Exprs. Many times the original Vec<Expr> is discarded immediately after use (I'll point out where this happens in a few places below). Thus I it would better (more idiomatic and often less copy/faster) to take something that could produce an iterator over Expr

Changes

Change Dataframe / LogicalPlanBuilder so it takes Vec<Expr> rather than &[Expr]

@github-actions
Copy link

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the API change -- most of the rest of this PR is updating all the call sites. Note that DataFrame::filter takes in an owned Expr so taking an owned Vec is not that large of a departure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is where the Expr::clone occurs and no longer does after the changes contemplated in this PR

@alamb
Copy link
Contributor Author

alamb commented Mar 13, 2021

I would love your thoughts on this approach @jorgecarleitao @andygrove @houqp @Dandandan @ilya-biryukov @seddonm1 -- I wonder if this approach would be worth bringing up on the mailing list

@alamb alamb marked this pull request as draft March 13, 2021 12:41
@houqp
Copy link
Member

houqp commented Mar 13, 2021

Since it aligns with our previous discussion at #9347 (review), I am all for this change! I think we should adopt this pattern across the entire code base. Clone should be managed at callsites whenever possible.

@andygrove
Copy link
Member

I am all for this change. I had started out with the pattern of passing Vec but at the time I got feedback that this wasn't idiomatic Rust code and that we should always pass slices, so had switched to that pattern.

@houqp
Copy link
Member

houqp commented Mar 13, 2021

this wasn't idiomatic Rust code and that we should always pass slices

I also remember seeing that recommendation before and adopted it for awhile. I wonder how wide-spread this is in the Rust community 🤔

@Dandandan
Copy link
Contributor

Dandandan commented Mar 13, 2021

I think this makes sense. This is a partial revert from a PR from me (I believe). But I think it is better for "builder" kind of functions where it needs "owned" data.

In general, &Vec and &String are anti patterns though, and some functions taking Vec and String still could/should be using references instead.

@houqp
Copy link
Member

houqp commented Mar 13, 2021

In general, &Vec and &String are anti patterns though, and some functions taking Vec and String still could/should be using references instead.

Yeah, I think it's always better to use &[] instead of &Vec and &str instead of &String.

In case of Vec v.s. &[], I think it largely depends on the clone overhead. For example, if the function doesn't do any clone at all internally, then we should definitely pass in &[] instead of an owned Vec. If there is a decent amount of clone overhead, then it's better to manage the clone on callsite.

@Dandandan
Copy link
Contributor

In general, &Vec and &String are anti patterns though, and some functions taking Vec and String still could/should be using references instead.

Yeah, I think it's always better to use &[] instead of &Vec and &str instead of &String.

In case of Vec v.s. &[], I think it largely depends on the clone overhead. For example, if the function doesn't do any clone at all internally, then we should definitely pass in &[] instead of an owned Vec. If there is a decent amount of clone overhead, then it's better to manage the clone on callsite.

Exactly, this is my current understanding too. It is best to evaluate on a case by case basis 👍

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks a lot, @alamb !

I agree with what was said: when the function needs ownership, it is better to leave the caller to make the decision on whether they want to clone it and keep the object or pass the ownership to the function.

I do think that in some cases we could simplify some structs by having a lifetime parameter and &'a [] instead of Vec, but since this is not high performant code, I do not think we need to worry much.

@alamb
Copy link
Contributor Author

alamb commented Mar 14, 2021

@Dandandan and @houqp

What do you think about something like this (which is even more Idiomatic (TM) I think), though harder for beginner Rust programmers to grok?

    pub fn project(&self, expr: impl IntoIterator<Item=Expr>) -> Result<Self> {

This would allow PlanBuilder::project to be called with a Vec<Expr> as well as things like BTreeSet<Expr>...

Sadly, I couldn't give the same treatment to the DataFrame trait as when I tried, rustc started complaining to me about trait objects

dataframe::DataFrame` cannot be made into an object
   |
   = help: consider moving `select` to another trait
note: for a trait to be "object safe" it needs to allow building a vtable to allow the call to be resolvable dynamically; for more information visit <https://doc.rust-lang.org/reference/items/traits.html#object-safety>

@houqp
Copy link
Member

houqp commented Mar 14, 2021

Yes, I think if the function only requires sequential access, iterator would have been even better. Although I would prefer to write it as a generic function rather than using Trait object.

@jorgecarleitao
Copy link
Member

jorgecarleitao commented Mar 14, 2021

Another option is to use T: AsRef<[Expr]>, which works for any type that can be represented as a reference to an array (e.g. vec and &[]`)

@houqp , AFAIK impl is just syntactic sugar for a generic; a trait object would be &dyn Expr.

I am not sure the IntoIterator or AsRef would help, though: we would still need to collect it into a vector, which afaik results in a clone.

@houqp
Copy link
Member

houqp commented Mar 15, 2021

@jorgecarleitao yeah, you are right, sorry I got it mixed up with dyn :P

I think IntoIterator<Item = Expr> won't require copy or clone because the iterator owns those values. AsRef with slice on the other side would require copy/clone due to use of slice, which behaves closer to IntoIterator<Item=&Expr>.

@jorgecarleitao
Copy link
Member

@houqp , you are right, AsRef does not work. 👍 playground.

Thank you for that piece of knowledge! :)

@alamb alamb force-pushed the alamb/less_copy_in_plan_builder branch from 140dfe4 to deb15be Compare March 16, 2021 21:14
@alamb
Copy link
Contributor Author

alamb commented Mar 23, 2021

Closing in favor of #9703

@alamb alamb closed this Mar 23, 2021
alamb added a commit that referenced this pull request Mar 25, 2021
…l Interator<Item=Expr> rather than &[Expr]

# NOTE:
Since is a fairly major backwards incompatible change (many callsites need to be updated, though mostly mechanically); I gathered some feedback on this approach in #9692 and this is the PR I propose for merge.

I'll leave this open for several days and also send a note to the mailing lists for additional comment

It is part of my overall plan to make the DataFusion optimizer more idiomatic and do much less copying [ARROW-11689](https://issues.apache.org/jira/browse/ARROW-11689)

# Rationale:
All callsites currently need an owned `Vec` (or equivalent) so they can pass in `&[Expr]` and then Datafusion copies all the `Expr`s. Many times the original `Vec<Expr>` is discarded immediately after use (I'll point out where this happens in a few places below). Thus I it would better (more idiomatic and often less copy/faster) to take something that could produce an iterator over Expr

# Changes
1. Change `Dataframe` so it takes `Vec<Expr>` rather than `&[Expr]`
2. Change `LogicalPlanBuilder` so it takes `impl Iterator<Item=Expr>` rather than `&[Expr]`

I couldn't figure out how to allow the `Dataframe` API (which is a Trait) to take an `impl Iterator<Item=Expr>`

Closes #9703 from alamb/alamb/less_copy_in_plan_builder_final

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
…l Interator<Item=Expr> rather than &[Expr]

# NOTE:
Since is a fairly major backwards incompatible change (many callsites need to be updated, though mostly mechanically); I gathered some feedback on this approach in apache/arrow#9692 and this is the PR I propose for merge.

I'll leave this open for several days and also send a note to the mailing lists for additional comment

It is part of my overall plan to make the DataFusion optimizer more idiomatic and do much less copying [ARROW-11689](https://issues.apache.org/jira/browse/ARROW-11689)

# Rationale:
All callsites currently need an owned `Vec` (or equivalent) so they can pass in `&[Expr]` and then Datafusion copies all the `Expr`s. Many times the original `Vec<Expr>` is discarded immediately after use (I'll point out where this happens in a few places below). Thus I it would better (more idiomatic and often less copy/faster) to take something that could produce an iterator over Expr

# Changes
1. Change `Dataframe` so it takes `Vec<Expr>` rather than `&[Expr]`
2. Change `LogicalPlanBuilder` so it takes `impl Iterator<Item=Expr>` rather than `&[Expr]`

I couldn't figure out how to allow the `Dataframe` API (which is a Trait) to take an `impl Iterator<Item=Expr>`

Closes #9703 from alamb/alamb/less_copy_in_plan_builder_final

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants