Sort pushdown: reorder row groups by statistics within each file

**Is your feature request related to a problem or challenge?**

Currently sort pushdown reorders **files** by min/max statistics to achieve sort elimination. But within each file, row groups are read in their original order (or reversed via `reverse_row_groups`). If row groups within a file have non-overlapping ranges, reordering them by statistics could further optimize scan order.

**Describe the solution you'd like**

Pass the desired sort order into the `FileOpener` and have it re-sort row groups based on their min/max statistics to match the scan's desired ordering.

For example, a file with 4 row groups:
```
RG1: min=100, max=200
RG2: min=1,   max=50
RG3: min=300, max=400
RG4: min=51,  max=99
```

For `ORDER BY col ASC LIMIT 10`, reordering to `[RG2, RG4, RG1, RG3]` lets TopK find the smallest values first, which provides two levels of optimization:

1. **Row-level filtering (immediate benefit):** TopK sets a tight dynamic filter threshold after reading RG2. For subsequent row groups, the dynamic filter acts as a row-level filter — the parquet reader uses page index to skip non-matching pages and avoids decoding non-sort columns for filtered rows. This significantly reduces I/O for wide tables.

2. **Row-group-level skipping (follow-up):** Currently row group selection is done upfront before any data is read, so the dynamic filter from TopK cannot prune already-selected row groups. A follow-up optimization could re-evaluate the dynamic filter between row groups, skipping entire row groups when their min statistics exceed the TopK threshold. This would be especially effective with morselized scans where TopK could terminate after a single row group.

**Additional context**

This was suggested by @adriangb in https://github.com/apache/datafusion/pull/21182#discussion_r3000108065:

> I do think there's one more trick we could have up our sleeves: instead of only reversing row group orders we could pass the desired sort order into the opener and have it re-sort the row groups based on stats to try to match the scan's desired ordering. This might be especially effective once we have morselized scans since we could terminate after a single row group for TopK queries.

Related code: [`ParquetSource::try_pushdown_sort`](https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/source.rs) — currently only supports reverse scan (`reverse_row_groups=true`), could be extended to do statistics-based row group reordering.

Parent issue: https://github.com/apache/datafusion/issues/17348

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sort pushdown: reorder row groups by statistics within each file #21317

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Sort pushdown: reorder row groups by statistics within each file #21317

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions