[DISCUSSION]: Unified approach for joins to output batches close to `batch_size`

              `BatchCoalescer` is not used in joins yet, since CoalesceBatchesExec appears after the joins having filter, in case of the output batches might have a  lower row count than target batch size. So, why cannot we follow the same pattern in SMJ? If collecting batches in the join itself is more performant, then we should also refactor the other joins as well?

On the other hand, `BatchSplitter` is used in other joins, and SMJ could (should) have it too, as there is no other way of splitting the batches according to target batch size.


I've thought about this, and I believe the most optimal solution is to make all join operators capable of performing both coalescing and splitting in a built-in manner. This is because the output of a join can either be smaller or larger than the target batch size. Ideally, there should be no need (or only minimal need) for CoalesceBatchesExec.

To achieve this built-in coalescing and splitting, we can leverage existing tools like BatchSplitter and BatchCoalescer (although there are no current examples of BatchCoalescer being used in joins). My suggestion is to generalize these tools so they can be utilized by any operator and applied wherever this mechanism is needed. As this pattern becomes more common, it will be easier to expand its usage and simplify its application.

_Originally posted by @berkaysynnada in https://github.com/apache/datafusion/issues/14160#issuecomment-2601723615_
            

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[DISCUSSION]: Unified approach for joins to output batches close to `batch_size` #14238

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[DISCUSSION]: Unified approach for joins to output batches close to batch_size #14238

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[DISCUSSION]: Unified approach for joins to output batches close to `batch_size` #14238