Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Mar 14, 2021

NOTE:

Since is a fairly major backwards incompatible change (many callsites need to be updated, though mostly mechanically); I gathered some feedback on this approach in #9692 and this is the PR I propose for merge.

I'll leave this open for several days and also send a note to the mailing lists for additional comment

It is part of my overall plan to make the DataFusion optimizer more idiomatic and do much less copying ARROW-11689

Rationale:

All callsites currently need an owned Vec (or equivalent) so they can pass in &[Expr] and then Datafusion copies all the Exprs. Many times the original Vec<Expr> is discarded immediately after use (I'll point out where this happens in a few places below). Thus I it would better (more idiomatic and often less copy/faster) to take something that could produce an iterator over Expr

Changes

  1. Change Dataframe so it takes Vec<Expr> rather than &[Expr]
  2. Change LogicalPlanBuilder so it takes impl Iterator<Item=Expr> rather than &[Expr]

I couldn't figure out how to allow the Dataframe API (which is a Trait) to take an impl Iterator<Item=Expr>

@github-actions
Copy link

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a pretty good example -- the answer_schema was created and immediately dropped after calling df.select(). Now the exprs don't need to be copied

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I initially found the impl IntoIterator<Item .... syntax hard to grok, I figured I would add an example to LogicalPlanBuilder showing that it lets you use vec![] and the trait business can be somewhat ignored

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was even worse here -- two copies of each input expr were made -- one copy to pass to validate_unique_names and once to put on the LogicalPlanAggregate.

Admittedly, removing the second copy doesn't require changing the signature of this function

@returnString
Copy link
Contributor

I couldn't figure out how to allow the Dataframe API (which is a Trait) to take an impl Iterator<Item=Expr>

Yeah, it's a pain but impl trait (in both return and argument position) isn't supported in traits right now. Afaik there were some historical discussions about opening up the system to support it but I can't find anything recent.

Related: is there a reason that DataFrame is a trait with a single impl? I'm wondering if it could be a concrete struct with the actual execution logic handling more exotic setups (custom TableProvider impls etc), which would free it from all the trait-imposed restrictions. Are we expecting to support multiple DF implementations?

@alamb
Copy link
Contributor Author

alamb commented Mar 15, 2021

Related: is there a reason that DataFrame is a trait with a single impl? I'm wondering if it could be a concrete struct with the actual execution logic handling more exotic setups (custom TableProvider impls etc), which would free it from all the trait-imposed restrictions. Are we expecting to support multiple DF implementations?

@jorgecarleitao might know the answer to that as he introduced the DataFrame trait initially. @ritchie46 has a dataframe library polars which maybe someday could be integrated: https://www.ritchievink.com/blog/2021/02/28/i-wrote-one-of-the-fastest-dataframe-libraries/. It is interesting to me that many of the polars features and implementation seems very similar to the features in DataFusion

@jorgecarleitao
Copy link
Member

@alamb , I do not know the reason. I did not use it on datafusion-python, mostly because the trait is not really used anywhere: we have no generics or dyn DataFrame. I am fine dropping it until we see a need for it, but @andygrove likely had some plans here :)

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks a lot, @alamb .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be even sweater once IntoIter is supported for arrays: [col()] vs vec![col()], see rust-lang/rust#65819

@andygrove
Copy link
Member

@alamb , I do not know the reason. I did not use it on datafusion-python, mostly because the trait is not really used anywhere: we have no generics or dyn DataFrame. I am fine dropping it until we see a need for it, but @andygrove likely had some plans here :)

I can explain :-)

My motivation was extensibility. I wanted users to be able to use the DataFusion DataFrame for local (in-process) execution but then be able to run distributed by having a mechanism to tell the context which DataFrame impl to use (configuration change rather than code change).

I have a much better idea now though of the requirements so plan on writing up a design once Ballista is in this repo.

@andygrove
Copy link
Member

And just to be clear, I am fine with removing the trait.

@alamb
Copy link
Contributor Author

alamb commented Mar 23, 2021

I am going to rebase this PR and merge it once CI passes

@alamb
Copy link
Contributor Author

alamb commented Mar 23, 2021

And just to be clear, I am fine with removing the trait.

Thanks @andygrove -- I have no particular need to remove the DataFrame trait, but if it gets in the way we can definitely revisit 👍

@andygrove
Copy link
Member

I wrote up some very brief notes in https://issues.apache.org/jira/browse/ARROW-12064

@jorgecarleitao
Copy link
Member

Got. Then fwiw, I think that we can build that with this trait, we just need a bit more work / investigation:

I am not 100% sure, but I think that the path for libraries that want to expose a different DataFrame (datafusion-python needs this) and use datafusion's ExecutionContext is to make:

struct ExecutionContext<D: DataFrame>

For the case of Ballista, I think we need the opposite: Ballista has an ExecutionContext, and DataFrame should just have a way of absorving that implementation, e.g. via DataFrame<C: ExecutionContext> (we need a trait for ExecutionContext for this).

(This is a hand-waving idea)

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice refactor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants