Description
What I see (what problem we are trying to solve)
DataFusion's current join implementations are fairly basic. They are functional enough to run TPCH and TPC-DS, but lack other features such as larger-than-memory processing, ASOF joins, complete subquery support and more.
There seems to be a non trivial desire in the community to improve this.
Some examples of issues / tickets related to enhanced join support / features:
Subqueries (which are implemented as joins)
- [EPIC] More Subquery support #5483
- General framework to decorrelate the subqueries #5492
- Decorrelate scalar subqueries with more complex filter expressions #14554
Join Features
- Proposal: Hook to better support
CollectLeft
joins in distributed execution #12454 - Remove nulls in joins during hash table lookup #15784
- Add tests for filtered SortMergeJoin output size #14239
- [DISCUSSION]: Unified approach for joins to output batches close to
batch_size
#14238 - Improve join performance for h2o queries #13765
- Add MemoryReservation to batch splitting in joins #13003
- Add spilling support for HashJoin #12952
- Feature request: Support for lateral joins #10048
- Implement nested join optimization #3843
Specialized Joins
- [Epic] Remove Sort Merge Join Experimental status #9846
- ASOF join support / Specialize Range Joins #318
- SortMergeJoin: Add RightSemi join support #13471
- Eliminate more outer joins by supporting more expressions #13232
- EPIC: Implement/investigate other join types #13181
- Implement RightMark join #13138
Performance
- Support zero copy hash repartitioning for Hash Join #15382
- Push Dynamic Join Predicates into Scan ("Sideways Information Passing", etc) #7955
- Support Self Join Eliminate #14758
- Potentially improve join performance by implementing a version of the take kernel that accepts an iterator of indices #13620
What is blocking significant forward progress
In my mind, the major challenge is that "improving" JOIN
s can get arbitrarily complicated. There are dozens of academic paper each year on various aspects of join implemnetations, and designing / implementing join capabilities is a substantial engineering effort.
I spent 6 years of my life doing joins at Vertica where they accounted for around 50% of the optimizer's complexity, to give some sense
I don't think the issue is that any particular feature is super complicated to understand, but defining the overall goal, the framework that will accomodate the goal, and then breaking it down into implementable pieces itself I think will require both specialized knowledge and substantial time.
What I suggest
I suggest that people with the relevant skills and time to invest gather together to drive this process worward
- plan out a "join roadmap" (aka prioritize what join features they will push forward)
- Figure out what, if any, new structures are in place
- Start breaking it down into smaller tickets
I can't personally lead such an effort, but I am filing this ticket to try and help connect the relevant people in the community that can.
Some potential people that could help (sorry if I didn't list you)
- @duongcongtoai -- the discussion on Decorrelate scalar subqueries with more complex filter expressions #14554 (comment)
- @xudong963 who has experience in this area
- @Dandandan @comphead and @korowa who contributed substantially to the existing joins
- @mingmwang and @jackwener who contributed significantly to the original subquery implementation
- @liukun4515 who likewise helped significantly
- @suibianwanwank
Related content:
Related blogs (join ordering section in part 2): https://www.influxdata.com/blog/optimizing-sql-dataframes-part-two/