[Bug Fix]: Deem hash repartition unnecessary when input and output has 1 partition#10095
Conversation
|
cc @echai58, this fix should solve the problem in the issue. Feel free to review, if you have time. |
|
Thanks @mustafasrepo and @ozankabak |
…s 1 partition (apache#10095) * Add input partition number check * Minor changes
|
@mustafasrepo good catch on UNION -- I previously suggested that this kind of plan (with partitions > config.target_partitions) is "illegal" in DF. |
I agree, this behaviour is a bit counter intuitive. However, with current implementation of the |
| 11)------------ProjectionExec: expr=[1 as c, 3 as d] | ||
| 12)--------------PlaceholderRowExec | ||
|
|
||
| query IIII |
There was a problem hiding this comment.
I think this query is non determisitic and fails sometimes on CI as it doesn't have an ORDER BY and isn't annotated with rowsort. Here is a PR to fix that: #10120
Probably better solution would be planning union inputs execution according to total available partitions -- e.g select l_linenumber as f
from lineitem
union all
select l_orderkey as f
from lineitemwith target_partitions = 4, could plan 2 threads for each ParquetExec (ideally we could also use byte/row statistics and plan according to them -- not only 2-2, but probably 1-3 if there is significant data skew across inputs/files). Currently, with target_partitions = 4, it's planned as 4 threads per ParquetExec, and 8 output partitions for UNION. And on top of it, when target_partitions is less then number of UNION inputs (e.g. UNION has 10 inputs, target_partitions = 4, and we need at least 1 thread for each input) there could be RepartitionExec. |
That might work. However, this approach cannot solve all cases I guess. For the following query select * from table
union all
select * from table
union all
select * from tablewhen |
|
Hi guys, not familiar with datafusion's release process - is there a estimate of when this will be released in a new datafusion version? |
Which issue does this PR close?
Closes #9928.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?