-
Notifications
You must be signed in to change notification settings - Fork 0
Proposed changes for more flexible user defined Aggregate and window functions #12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposed changes for more flexible user defined Aggregate and window functions #12
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @mustafasrepo, I really like this PR and its direction 👍 I have one open question but otherwise I think this is awesome
Open question
How does include_rank (link) fit into this model?
Suggested next step:
- Make a PR to apache/arrow-datafusion with the changes to PartitionEvaluator that are in this PR (I believe you plan to do that
I will open the PR that unify evaluate and evaluate_stateful fields on the main repo once it is ready.) - I will work on various tests / examples for WindowUDF on apache#6617 (which I will port to use the new API when it is ready)
Comments / Responses
First of all current state is a bit complex for end user to handle. After examining the PartitionEvaluator trait we have decided that evaluate_stateful and evaluate_inside_range can be combined. Its new name is evaluate with the following API ..
I also had this observation and I think your solution is very elegant 👍
With the new API we have following options for the end user?
I really like the tabular format of this analysis and it makes sense to me. Adding that table to the comments of PartitionEvaluator would really help people understand it.
In short, I think with the current approach in apache#6617. We are in very good shape (I will simplify evaluate logic with another PR).
I agree.
Thank you so much for all your help!
| /// implement [`PartitionEvaluator::evaluate`] | ||
| fn supports_bounded_execution(&self) -> bool { | ||
| false | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the idea that the special case forinclude_rank would also be an option here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When include_rank flag is true, evaluate_with_rank_all will be called. This method is basically same with evaluate_all in the spirit. (It takes all the data and produces all the output in single pass). However, since evaluate_with_rank_all requires additional arguments (such as rank boundaries). We do not unify their API, to not recalculate rank boundaries each time (even if we do not use them).
Certainly, we can move this trait to PartitionEvaluator also. However, I thought this would be confusing. Hence didn't move it. (I will think about how to combine evaluate_with_rank_all and evaluate_all without calculating rank boundaries unnecessarily).
Maybe we can present to the user just a subset of the PartitionEvaluator methods. They wouldn't see evaluate_with_rank_all either (Just like your suggestion in option 2).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For anyone following along, I think we went with the evaluate_with_rank_all approach: https://github.com/apache/arrow-datafusion/blob/c3d5d77e447e51c2cca814a67706e5ab3e050ced/datafusion/physical-expr/src/window/partition_evaluator.rs#L209-L217
| fn uses_bounded_memory(&self) -> bool { | ||
| self.aggregate.supports_bounded_execution() | ||
| && !self.window_frame.end_bound.is_unbounded() | ||
| let supports_bounded_execution = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can make this check explicit by adding the API described by @stuartcarnie here: apache#6611 (so that the accumulator can report on its capabilities)
| } | ||
|
|
||
| // TODO show how to use other evaluate methods | ||
| /// These different evaluation methods are called depending on the various settings of WindowUDF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I have opened the PR for stage 1. It can be found in the #6655 |
* Add dialect param to use CHAR instead of TEXT for Utf8 unparsing for MySQL (#12) * Configurable data type instead of flag for Utf8 unparsing * Fix type in comment
Which issue does this PR close?
Closes #.
Rationale for this change
In #6617. We have discussed how to make user defined aggregate and window functions more flexible.
First of all current state is a bit complex for end user to handle. After examining the
PartitionEvaluatortrait we have decided thatevaluate_statefulandevaluate_inside_rangecan be combined. Its new name isevaluatewith the following APIfn evaluate(&mut self,_values: &[ArrayRef],_range: &Range<usize>,) -> Result<ScalarValue>(Existingevaluateis renamed withevaluate_allto reflect better what function does).It returns a single
ScalarValuefor the given input which is the result of window function (If functionuses_window_frameresult calculated according to given range).Existing
fn evaluate(&self, _values: &[ArrayRef], _num_rows: usize) -> Result<ArrayRef>is replaced byfn evaluate_all(&mut self, values: &[ArrayRef], num_rows: usize) -> Result<ArrayRef>.This function receives whole of the data as single batch and produces all of the window result as bulk, which maybe more optimal for some use cases.
With the new API we have following options for the end user?
evaluate_all(if we were to implementPERCENT_RANKit would end up in this quadrant, we cannot produce any result without seeing whole data)evaluate(optionally can also implementevaluate_allfor more optimized implementation. However, there will be default implementation that is suboptimal) . If we were to implementROW_NUMBERit will end up in this quadrant. ExampleOddRowNumbershowcases this use caseevaluate(I think as long asuses_window_frameistrue. There is no way forsupports_bounded_executionto be false). I couldn't come up with any example for this quadrantevaluate. If we were to implementFIRST_VALUE, it would end up in this quadrantTo support end user to set flag
uses_window_frameandsupports_bounded_execution. I have moved these methods fromBuiltInWindowFunctionExprtoPartitionEvaluator. However, in the following commit @alamb could find another way to add this support (I think his version is better. However, since this is showcase PR for new API, I didn't bother with retracting changes.).In short, I think with the current approach in #6617. We are in very good shape (I will simplify evaluate logic with another PR). Hopefully, after these changes end user, by setting
uses_window_frameandsupports_bounded_executionproperly. Then implementing corresponding evaluator (evaluateorevaluate_all) will be able to accomplish desired behavior for most of the use casesWhat changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?