Description
This ticket tracks progress on a 2025 Google Summer of Code (GSoC) sponsored project on Correlated Subquery Support.
Project Documentation
Is your feature request related to a problem or challenge?
DataFusion currently has limited support for correlated subqueries. This project aims to implement comprehensive support for correlated subqueries in Apache DataFusion by applying Hyper's 'Unnesting Arbitrary Queries' framework.
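For readers new to the term, the sketch below shows what a correlated subquery is and what a decorrelated form of it can look like. It uses Python's built-in sqlite3 module rather than DataFusion, and the `orders` table, its columns, and the `run_queries` helper are all hypothetical names made up for illustration. The join-based rewrite is written by hand for this one query; it is not the output of DataFusion's optimizer, nor the general algorithm from the 'Unnesting Arbitrary Queries' paper.

```python
import sqlite3


def run_queries():
    """Run a correlated subquery and a hand-written join rewrite of it."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical sample data: (id, customer_id, amount)
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
        INSERT INTO orders VALUES
            (1, 1, 10.0), (2, 1, 30.0),
            (3, 2, 5.0), (4, 2, 7.0), (5, 2, 9.0);
        """
    )

    # Correlated form: the inner query references o.customer_id from the
    # outer query, so conceptually it is re-evaluated for every outer row.
    correlated = """
        SELECT o.id
        FROM orders AS o
        WHERE o.amount > (
            SELECT AVG(i.amount)
            FROM orders AS i
            WHERE i.customer_id = o.customer_id
        )
        ORDER BY o.id
    """

    # Decorrelated form (manual rewrite): compute the per-customer average
    # once, then join it back to the outer table.
    decorrelated = """
        SELECT o.id
        FROM orders AS o
        JOIN (
            SELECT customer_id, AVG(amount) AS avg_amount
            FROM orders
            GROUP BY customer_id
        ) AS a ON a.customer_id = o.customer_id
        WHERE o.amount > a.avg_amount
        ORDER BY o.id
    """

    correlated_ids = [row[0] for row in conn.execute(correlated)]
    decorrelated_ids = [row[0] for row in conn.execute(decorrelated)]
    conn.close()
    return correlated_ids, decorrelated_ids


if __name__ == "__main__":
    correlated_ids, decorrelated_ids = run_queries()
    print(correlated_ids, decorrelated_ids)
```

Both queries select the orders whose amount exceeds that customer's average, so they return the same rows. The point of a general unnesting framework is to derive such join-based plans automatically, including for nested cases where a manual rewrite is not obvious.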
Timeline:
Excerpt from the official GSoC timeline:
- May 8 - June 1: Community Bonding Period | GSoC contributors get to know mentors, read documentation, get up to speed to begin working on their projects
- June 2: Coding officially begins!
- July 18: Mid term evaluation
- August 25: Final week
- Sep 8: Final evaluations due / wrap up
Work
Epics tracking technical work:
- Support for deeper correlated subqueries: Nested correlated subquery error with a depth exceeding 1 #15558
- Combine the optimization rules decorrelate, decorrelate_lateral_join, and decorrelate_predicate_subquery into one. #16073
- Blog about DataFusion correlated subquery support #16084
- [Epic] Transform Correlated Subquery Into Dependent Join #16173
Other potential future work
- Focus on practical use cases (such as TPC-DS ([EPIC] Support TPC-DS benchmarks #4763), the DuckDB subquery tests...) in order to identify and list unsupported cases.
- Further break down tasks to address unsupported cases...
Related work:
- General framework to decorrelate the subqueries #5492
- [DISCUSSION] JOIN "task force" / project team #15885
Related documentation
- Improving Unnesting of Complex Queries
- 'Unnesting Arbitrary Queries'
- Optimizing SQL (and DataFrames) in DataFusion: Part 2 (discusses DataFusion join optimization)
- Query Optimization Technology for Correlated Subqueries (Alibaba Cloud)
Newer research that might be interesting