Skip to content

Disable prune_by_limit when output ordering is required#48

Merged
zhuqi-lucas merged 1 commit intobranch-52from
fix-prune-by-limit-ordering
Apr 14, 2026
Merged

Disable prune_by_limit when output ordering is required#48
zhuqi-lucas merged 1 commit intobranch-52from
fix-prune-by-limit-ordering

Conversation

@zhuqi-lucas
Copy link
Copy Markdown
Collaborator

@zhuqi-lucas zhuqi-lucas commented Apr 14, 2026

Summary

Fix incorrect query results when ORDER BY + LIMIT is combined with partially-matched row groups.

prune_by_limit discards partially-matched row groups when enough fully-matched groups can satisfy the limit. This is only safe for unordered scans — with output ordering, a partially-matched group may contain rows that sort before any fully-matched group.

Root cause

ParquetOpener had no awareness of output ordering. It called prune_by_limit unconditionally whenever a limit was set.

Fix

Add preserve_order: bool to ParquetOpener:

  • Set to true when output_ordering is non-empty (in ParquetSource::create_opener)
  • Guard prune_by_limit: if let (Some(limit), false) = (limit, preserve_order)

Example

Parquet sorted by [a ASC], 3 rows per RG:

  • RG0: a=[1,2,3] — partially matched for a > 2
  • RG1: a=[4,5,6] — fully matched

Query: WHERE a > 2 ORDER BY a ASC LIMIT 1

Result Correct?
Before (no guard) a=4 from RG1 ❌ skipped RG0
After (preserve_order=true) a=3 from RG0

Test

test_preserve_order_prevents_limit_pruning — creates a real parquet with 2 RGs, verifies:

  • preserve_order=false: prune_by_limit skips RG0, returns a=4
  • preserve_order=true: RG0 is kept, returns a=3

Copilot AI review requested due to automatic review settings April 14, 2026 01:39
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes incorrect results for Parquet scans where ORDER BY + LIMIT is planned as an order-preserving scan and row-group pruning uses prune_by_limit, which can incorrectly drop partially-matched row groups that may contain earlier-sorting rows.

Changes:

  • Add a preserve_order flag to ParquetOpener and set it based on FileScanConfig.output_ordering in ParquetSource::create_file_opener.
  • Guard prune_by_limit so it is only applied when preserve_order is false.
  • Add a regression test ensuring preserve_order=true prevents limit-based pruning from skipping partially-matched row groups.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
datafusion/datasource-parquet/src/source.rs Propagates ordering requirements from FileScanConfig into the Parquet file opener via preserve_order.
datafusion/datasource-parquet/src/opener.rs Disables limit pruning when output ordering must be preserved; adds a regression test covering the correctness bug.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

When a parquet scan has output ordering constraints (e.g. ORDER BY + LIMIT),
prune_by_limit must not discard partially-matched row groups. A partially-
matched group may contain rows that sort before any fully-matched group,
so skipping it returns incorrect results.

This matches the fix in upstream DataFusion (apache#21190) where
prune_by_limit is guarded by preserve_order.

Add preserve_order: bool to ParquetOpener. Set to true when output_ordering
is non-empty. Guard prune_by_limit with !preserve_order.
@zhuqi-lucas zhuqi-lucas force-pushed the fix-prune-by-limit-ordering branch from 4fe6b27 to fa4144d Compare April 14, 2026 01:45
@zhuqi-lucas zhuqi-lucas merged commit 16e0a5c into branch-52 Apr 14, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants