Skip to content

[EPIC] Memory Limited Sort (Externalized / Spill) #1568

@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Support Sorting "arbitrarily" large inputs (that don't fit in RAM or within some user defined budget)

This ticket concerns the memory potentially used the Sort operator -- it doesn't cover other potential targets (e.g. sort based aggregation, for example). That will be covered by other tasks tracked by #587

Describe the solution you'd like

  1. Allow DataFusion users to specify a RAM budget (aka via the config introduced in Initial MemoryManager and DiskManager APIs for query execution + External Sort implementation #1526) and have their queries complete running without the sort exceeding that budget.
  2. Make sure there is a single production ready implementation of N-way Merge of Sorted Streams to build on top of

For the Sort operator, I think the best behavior would be:

  1. Sort all the input in RAM (as is done today in SortExec ), if the memory budget allows
  2. If all the input does not fit in the RAM budget, the in memory portion is sorted and written to temporary disk files
  3. Temporary disk files are read / merged to produce the final results

@yjshen 's PR #1526 has most of the individual pieces of this logic, but they aren't hooked up yet (that is if you run SELECT .. FROM foo ORDER BY x) it will still exhaust memory.

Some ideas of how to break this task down

Describe alternatives you've considered
Use two implementations of Sort (one in memory and one that can spill). This would ensure no performance regressions for the non-spilling case but would require users to decide which one to use.

Context
This is follow on work from the great PR from @yjshen in #1526 and part of the story of limiting memory used by DataFusion #587

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions