Is your feature request related to a problem or challenge?
Currently sort pushdown reorders files by min/max statistics to achieve sort elimination. But within each file, row groups are read in their original order (or reversed via reverse_row_groups). If row groups within a file have non-overlapping ranges, reordering them by statistics could further optimize scan order.
Describe the solution you'd like
Pass the desired sort order into the FileOpener and have it re-sort row groups based on their min/max statistics to match the scan's desired ordering.
For example, a file with 4 row groups:
RG1: min=100, max=200
RG2: min=1, max=50
RG3: min=300, max=400
RG4: min=51, max=99
For ORDER BY col ASC LIMIT 10, reordering to [RG2, RG4, RG1, RG3] lets TopK find the smallest values first, which provides two levels of optimization:
-
Row-level filtering (immediate benefit): TopK sets a tight dynamic filter threshold after reading RG2. For subsequent row groups, the dynamic filter acts as a row-level filter — the parquet reader uses page index to skip non-matching pages and avoids decoding non-sort columns for filtered rows. This significantly reduces I/O for wide tables.
-
Row-group-level skipping (follow-up): Currently row group selection is done upfront before any data is read, so the dynamic filter from TopK cannot prune already-selected row groups. A follow-up optimization could re-evaluate the dynamic filter between row groups, skipping entire row groups when their min statistics exceed the TopK threshold. This would be especially effective with morselized scans where TopK could terminate after a single row group.
Additional context
This was suggested by @adriangb in #21182 (comment):
I do think there's one more trick we could have up our sleeves: instead of only reversing row group orders we could pass the desired sort order into the opener and have it re-sort the row groups based on stats to try to match the scan's desired ordering. This might be especially effective once we have morselized scans since we could terminate after a single row group for TopK queries.
Related code: ParquetSource::try_pushdown_sort — currently only supports reverse scan (reverse_row_groups=true), could be extended to do statistics-based row group reordering.
Parent issue: #17348
Is your feature request related to a problem or challenge?
Currently sort pushdown reorders files by min/max statistics to achieve sort elimination. But within each file, row groups are read in their original order (or reversed via
reverse_row_groups). If row groups within a file have non-overlapping ranges, reordering them by statistics could further optimize scan order.Describe the solution you'd like
Pass the desired sort order into the
FileOpenerand have it re-sort row groups based on their min/max statistics to match the scan's desired ordering.For example, a file with 4 row groups:
For
ORDER BY col ASC LIMIT 10, reordering to[RG2, RG4, RG1, RG3]lets TopK find the smallest values first, which provides two levels of optimization:Row-level filtering (immediate benefit): TopK sets a tight dynamic filter threshold after reading RG2. For subsequent row groups, the dynamic filter acts as a row-level filter — the parquet reader uses page index to skip non-matching pages and avoids decoding non-sort columns for filtered rows. This significantly reduces I/O for wide tables.
Row-group-level skipping (follow-up): Currently row group selection is done upfront before any data is read, so the dynamic filter from TopK cannot prune already-selected row groups. A follow-up optimization could re-evaluate the dynamic filter between row groups, skipping entire row groups when their min statistics exceed the TopK threshold. This would be especially effective with morselized scans where TopK could terminate after a single row group.
Additional context
This was suggested by @adriangb in #21182 (comment):
Related code:
ParquetSource::try_pushdown_sort— currently only supports reverse scan (reverse_row_groups=true), could be extended to do statistics-based row group reordering.Parent issue: #17348