Implement Sort-Merge Join

*Note*: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11094

The current hash join works well when one side of the join can be loaded into memory, but it cannot scale beyond the available RAM.

The advantage of implementing a sort-merge join (SMJ) is that we can sort the left and right partitions, write the intermediate results to disk, and then stream both sides of the join by merging the sorted partitions, without loading one side into memory. At most, we need to hold in memory the batches from both sides that contain the current join key values.
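To make the merge phase concrete, here is a minimal Python sketch of an inner join over two inputs that are already sorted by key. It is an illustration only, not the proposed implementation: the function name `sort_merge_join`, the tuple row layout, and the choice to buffer only the right-hand rows of the current key (while streaming the left-hand rows) are all assumptions made for this example.

```python
from itertools import groupby
from operator import itemgetter


def sort_merge_join(left, right, key=itemgetter(0)):
    """Inner-join two iterables of tuple rows that are already sorted by `key`.

    Hypothetical helper for illustration. Only the right-hand rows that share
    the current join key are buffered; everything else is streamed.
    """
    left_groups = groupby(left, key=key)
    right_groups = groupby(right, key=key)
    end = (None, None)  # sentinel returned when an input is exhausted

    lkey, lrows = next(left_groups, end)
    rkey, rrows = next(right_groups, end)
    while lrows is not None and rrows is not None:
        if lkey < rkey:
            # Left key has no match yet: advance the left input.
            lkey, lrows = next(left_groups, end)
        elif lkey > rkey:
            # Right key has no match yet: advance the right input.
            rkey, rrows = next(right_groups, end)
        else:
            # Keys match: buffer this key's right rows, stream the left rows.
            right_buf = list(rrows)
            for lrow in lrows:
                for rrow in right_buf:
                    yield lrow + rrow
            lkey, lrows = next(left_groups, end)
            rkey, rrows = next(right_groups, end)
```

Because both inputs are consumed as iterators, they could equally be generators reading sorted runs back from disk, which is what keeps memory use bounded by the size of a single key group.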

To reduce memory pressure, we will want to limit the concurrency of these sort operations.
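One way to picture this is the rough Python sketch below, which sorts each partition, spills the sorted run to a temporary file, and caps how many sorts run at once with a fixed-size worker pool. The helper names (`sort_and_spill`, `read_run`, `sorted_stream`) and the `max_concurrent_sorts` parameter are hypothetical, and pickled rows in temporary files merely stand in for whatever spill format a real implementation would use.

```python
import heapq
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor


def sort_and_spill(partition):
    """Sort one partition in memory and write the sorted run to a temp file."""
    spill = tempfile.TemporaryFile()
    for row in sorted(partition):
        pickle.dump(row, spill)
    spill.seek(0)  # rewind so the run can be read back from the start
    return spill


def read_run(spill):
    """Stream rows back from a spilled run, one row at a time."""
    while True:
        try:
            yield pickle.load(spill)
        except EOFError:
            spill.close()
            return


def sorted_stream(partitions, max_concurrent_sorts=2):
    """Sort partitions with bounded concurrency, then merge the sorted runs.

    At most `max_concurrent_sorts` partitions are being sorted at any moment.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent_sorts) as pool:
        runs = list(pool.map(sort_and_spill, partitions))
    # heapq.merge lazily merges the already-sorted runs into one sorted stream.
    return heapq.merge(*(read_run(run) for run in runs))
```

For example, `list(sorted_stream([[3, 1], [2, 5], [4, 0]]))` yields the rows in sorted order while only ever sorting two partitions at a time. Lowering `max_concurrent_sorts` trades sort throughput for a smaller peak memory footprint.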

We would still want to default to hash join when we know that the build side can fit into memory, since it is more efficient than a sort-merge join.
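As a rough illustration of that selection rule, the Python sketch below picks a join strategy from an estimated build-side size. The function name, its parameters, and the idea of a single byte-based memory budget are assumptions made for this example and do not describe an existing planner API.

```python
def choose_join_strategy(build_side_bytes, memory_budget_bytes):
    """Pick a join strategy from an estimated build-side size.

    Hypothetical helper: `build_side_bytes` is None when no size estimate
    is available, in which case we fall back to the scalable option.
    """
    if build_side_bytes is not None and build_side_bytes <= memory_budget_bytes:
        # Build side is known to fit in memory: hash join is the faster choice.
        return "hash"
    # Unknown or too large: sort-merge join can spill and stream instead.
    return "sort_merge"
```

Treating a missing estimate as "too large" is a conservative choice for this sketch; a real planner might weigh other statistics as well.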

[https://en.wikipedia.org/wiki/Sort-merge_join]

Implement Sort-Merge Join #141
