Add reproducer for consecutive RepartitionExec#18343
Conversation
|
@gene-bordegaray : can you help review this? |
There was a problem hiding this comment.
This makes sense. Just want to ask clarifying question to make sure I am understanding:
- the Round Robin Repartition into the Aggregate is useful because we can disperse work across partitions and then accumulate their results. Using the aggregated results we can use the Hash Repartition to hand off work with the same key (such as env = 'prod') to workers, thus is more efficient
- the parquet query is not working this way as the Reparititons are not separated by the Aggregate. The Aggregate does all this work on a single partition then does Repartitioning too late.
Yes -- you can read more about the two phase aggregation in general here https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.Accumulator.html#tymethod.state |
Correct. And the ticket of this reproducer #18341 proposes 3 different ways to improve it |
Ok awesome, I will take some time to understand these approaches and give some of my thoughts |
|
Thanks @NGA-TRAN @gene-bordegaray and @Dandandan |
Reproducer for #18341