Skip to content

Dynamic pruning filters from TopK state (optimize ORDER BY LIMIT queries) #15037

Open
@adriangb

Description

@adriangb

Is your feature request related to a problem or challenge?

From discussion with @alamb yesterday the idea came up of optimizing queries like select * from data order by timestamp desc limit 10 for the case where the data is not perfectly sorted by timestamp but mostly follows a sorted pattern.

You can imagine this data gets created if multiple sources with clock skews, network delays, etc. are writing data and you don't do anything fancy to guarantee perfect sorting by timestamp (i.e. you naively write out the data to Parquet, maybe do some compaction, etc.). The point is that 99% of yesterday's files have a timestamp smaller than 99% of today's files but there may be a couple seconds of overlap between files. To be concrete, let's say this is our data:

file min max
1 1 10
2 9 19
3 20 31
4 30 35

Currently DataFusion will exhaustively open each file, read the timestamp column and feed it into a TopK.
I think we can do a lot better if we:

  • Use file stats to decide which files to work on first. In this case it makes sense to start with file 4 and 3 (assuming we have parallelism of 2).
  • Let's say that between those two we have 10 rows, so we've already filled up our TopK. The only way more things would get added to our TopK is if they are greater than the smallest item already seen (let's say that's 20, the smallest value in file 3).
  • Now we know just from statistics that we can skip files 2 and 1 because neither of them can have any timestamp > 20.

Extrapolating this to scenarios where you have years worth / TBs of data and want a limit 5 would yield orders of magnitude improvement I think.

@alamb mentioned this sounds similar to Dynamic Filters, I assume this must be a known technique (or my analysis may be completely wrong 😆 ) but I don't know what it would be called.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions