Skip to content

Conversation

@Samyak2
Copy link
Contributor

@Samyak2 Samyak2 commented Nov 26, 2025

Which issue does this PR close?

Rationale for this change

When any partitioned file had ranges selected (even if the range contained the whole file), repartitioning was being skipped.

There was no good reason to disallow this.

What changes are included in this PR?

  • I have now allowed this and change the implementation to respect file ranges.
  • Also did a very small refactor to make the code easier to read: replace state tuple and state.0/state.1 with proper variable names using destructing

Are these changes tested?

Added a couple of unit tests. Let me know if more are needed.

Are there any user-facing changes?

@github-actions github-actions bot added the datasource Changes to the datasource crate label Nov 26, 2025
@Samyak2 Samyak2 force-pushed the repartitioned-on-range-files branch from 93dd62c to 143f685 Compare November 26, 2025 18:38
@Samyak2 Samyak2 force-pushed the repartitioned-on-range-files branch 3 times, most recently from 3e49087 to 691254c Compare November 27, 2025 14:04
@Samyak2
Copy link
Contributor Author

Samyak2 commented Nov 27, 2025

I rebased on main. Not sure why cargo fmt is failing on a file that I did not touch!

 + cargo fmt --all -- --check
Diff in /__w/datafusion/datafusion/datafusion/execution/src/config.rs:114:

 /// A type map for storing extensions.
-/// 
+///
 /// Extensions are indexed by their type `T`. If multiple values of the same type are provided, only the last one
 /// will be kept.
-/// 
+///
 /// Extensions are opaque objects that are unknown to DataFusion itself but can be downcast by optimizer rules,
 /// execution plans, or other components that have access to the session config.
 /// They provide a flexible way to attach extra data or behavior to the session config.

@Jefffrey
Copy link
Contributor

I rebased on main. Not sure why cargo fmt is failing on a file that I did not touch!

 + cargo fmt --all -- --check
Diff in /__w/datafusion/datafusion/datafusion/execution/src/config.rs:114:

 /// A type map for storing extensions.
-/// 
+///
 /// Extensions are indexed by their type `T`. If multiple values of the same type are provided, only the last one
 /// will be kept.
-/// 
+///
 /// Extensions are opaque objects that are unknown to DataFusion itself but can be downcast by optimizer rules,
 /// execution plans, or other components that have access to the session config.
 /// They provide a flexible way to attach extra data or behavior to the session config.

Unrelated error, fixed on main now so I've merged up from main

@Samyak2
Copy link
Contributor Author

Samyak2 commented Nov 28, 2025

Thank you! Do I need to tag more people to review this PR? or let it be for a few days?

@Jefffrey
Copy link
Contributor

I plan to review this PR soon (hopefully today or tomorrow)

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me. Only thing to comment on is perhaps adjust the test cases so the ranges don't span the entire file; for example, pfile("a", 123).with_range(1, 122) just as a quick sanity check that we respect ranges that don't essentially alias an entire file.

Fixes apache#18940

When any partitioned file had ranges selected (even if the range
contained the whole file), repartitioning was being skipped.

There was no good reason to disallow this. I have now allowed this and
change the implementation to respect file ranges.

Added a couple of unit tests too.
Added support for repartitioning files with ranges when ordering needs
to be preserved.

Introduced two new methods on PartitionedFile to do this:
- `effective_size()`: size of the file to be scanned. Takes into account
  range (if present)
- `range()`: effective range of the file to be scanned
When the order needs to be preserved, the end ranges were not being set
correctly. Leading to empty file ranges like (20, 20) instead of (20,
40).

Found this bug while adding tests with less than complete ranges. Thank
you Jeffrey Vo for this suggestion!
@Samyak2 Samyak2 force-pushed the repartitioned-on-range-files branch from 255c509 to 5ffe118 Compare November 29, 2025 17:53
@Samyak2
Copy link
Contributor Author

Samyak2 commented Nov 29, 2025

Thanks for pointing that out @Jefffrey! There was a bug in the order-preserved case that lead to some partitioned files being empty when the start of the range was non-zero. I have fixed the bug and added tests in the latest commit.

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave this PR open for a few days before merging in case anyone else is interested in reviewing it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ExecutionPlan::repartitioned doesn't work when file groups have ranges

2 participants