[Epic]: Google Summer of Code 2025 Improving Spilling Execution #16065

Open
@ding-young

Is your feature request related to a problem or challenge?

To support queries that exceed available memory, DataFusion must spill intermediate results to disk. As a continuation of the community effort on external query execution, this epic aims to improve the robustness of spilling execution and explore further performance optimizations.

This includes tracking which queries fail under specific memory limits, fixing bugs in external query execution, and addressing inefficiencies in the current implementation. An additional goal is to explore the feasibility of applying experimental optimizations proposed in academic papers, such as adaptive compression.
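For readers new to the term, the idea of spilling in a sort can be sketched as below. This is a standalone illustration of a generic external merge sort, not DataFusion's implementation: the function name `external_sort` and the `max_in_memory` parameter are hypothetical, and the memory limit is counted in items rather than bytes for simplicity.

```python
import heapq
import pickle
import tempfile


def external_sort(values, max_in_memory):
    """Yield `values` in sorted order, buffering at most `max_in_memory` items.

    Hypothetical sketch: the limit is an item count, not a byte budget.
    During the merge phase one extra item per spilled run is also held.
    """
    if max_in_memory < 1:
        raise ValueError("max_in_memory must be at least 1")

    runs = []    # one temporary file per spilled sorted run
    buffer = []  # in-memory batch that has not been spilled yet

    def spill():
        # Sort the buffered items and write them to disk as one sorted run.
        buffer.sort()
        run = tempfile.TemporaryFile()
        for item in buffer:
            pickle.dump(item, run)
        run.seek(0)
        runs.append(run)
        buffer.clear()

    for value in values:
        buffer.append(value)
        if len(buffer) >= max_in_memory:
            spill()
    if buffer:
        spill()

    def read_run(run):
        # Stream a spilled run back one item at a time.
        while True:
            try:
                yield pickle.load(run)
            except EOFError:
                return

    try:
        # k-way merge over the sorted runs
        yield from heapq.merge(*(read_run(run) for run in runs))
    finally:
        for run in runs:
            run.close()
```

A call such as `list(external_sort([5, 3, 9, 1], max_in_memory=2))` writes two sorted runs to temporary files and merges them back. A real engine would track bytes against a memory pool and spill columnar batches, which is where the file format and compression questions in this epic come in.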

Describe the solution you'd like

1. Stabilize Larger-Than-Memory Queries

User Experience & Testing

Sort

Related tracking issues: #16131, #16132

Aggregate

  • Integrate ExternalSorter

Join

2. Optimize Spill File Format

3. Docs & Blog

Describe alternatives you've considered

Spilling for window functions and CTEs is not currently a focus, but both remain potential areas for improvement.

Additional context

Related work:
