Description
Is your feature request related to a problem or challenge?
To support queries that exceed available memory, DataFusion must spill intermediate results to disk. As a continuation of the community effort on external query execution, this epic aims to improve the robustness of spilling execution and explore further performance optimizations.
This includes tracking which queries fail under specific memory limits, fixing bugs in external query execution, and addressing inefficiencies in the current implementation. An additional goal is to explore the feasibility of applying experimental optimizations proposed in academic papers, such as adaptive compression.
Describe the solution you'd like
1. Stabilize Larger-Than-Memory Queries
User Experience & Testing
- enable
TrackConsumersPool
by default indatafusion-cli
- improve err msg to guide how to adjust parameters
- Enable sort query fuzzing with limited memory #15517
- Update tpch, clickbench, sort_tpch to mark failed queries #16160
- migrate tests to
insta
Sort
Related tracking issues: #16131, #16132
- Track peak memory usage in
SortExec
#16042 - More accurate memory accounting in external sort #14748
- A complete solution for stable and safe sort with spill #14692
Aggregate
- Integrate ExternalSorter
Join
2. Optimize Spill File Format
3. Docs & Blog
Describe alternatives you've considered
While spilling for window functions and CTEs is currently not a focus, they remain potential areas for improvement.
Additional context
Related work: