-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Support Joining "arbitrarily" large inputs (e.g. when one or both of the inputs don't fit in the available RAM)
This ticket concerns the memory used the JoinExec operator -- it doesn't cover other potential targets (e.g. externalized sort or grouping). That will be covered by other tasks tracked by #587
Describe the solution you'd like
- Allow DataFusion users to specify a RAM budget (aka via the config introduced in Initial MemoryManager and DiskManager APIs for query execution + External Sort implementation #1526) and have their queries complete running without the join by exceeding the budget allocated to it via the memory manager.
There are many potential ways to Limit the memory used while joining. The classic way is "sort-merge-join" where the input data on both sides is sorted according to the equality predicates (using externalized sort, such as described in #1568 ) and then the two join inputs are streamed through and the output computed, depending on the type of Join required (INNER, LEFT, RIGHT, SEMI, etc)
I personally think the following would be the ideal behavior for DataFusion Joins:
- A single Join operator that behaved like like the following:
- Hashed one input or the other, if the memory limit was not exhausted, behave like the existing JoinExec
- If memory was exhausted, switch to a merge join strategy: sort the one or both sides that didn't fit in memory using externalized sort on the equality predicates, then stream them back through the Join
The rationale for a runtime switch is that then the optimizer (which always has limited information) can't make the "wrong" choice related to join order
In case anyone wants some "light reading" this stuff is nicely described by Goetz Graffe in "Query evaluation techniques for large databases": https://scholar.google.com/citations?view_op=view_citation&hl=en&user=pdDeRScAAAAJ&citation_for_view=pdDeRScAAAAJ:u5HHmVD_uO8C
Online link: http://infolab.stanford.edu/~hyunjung/cs346/graefe.pdf
Describe alternatives you've considered
Have the Optimizer (aka the HashBuildProbeOrder ) pick both the order and the algorithm to use based on statistics or heuristics
Context
This is follow on work from the great PR from @yjshen in #1526 and part of the story of limiting memory used by DataFusion #587
Task Tracking
- Implement sort-merge join #2242
- Date32/Date64 as join keys for merge join #2314
- Rename SortMergeJoinExec and SortMergeJoinStream to MergeJoin and MergeJoinStream #2315
- Consolidate MergeJoin with HashJoin to adaptive join relations according to runtime resources and table sizes #2316
- Research on using row format for fast join comparison