Skip to content

Conversation

@adriangb
Copy link
Contributor

Summary

This PR consolidates the separate ArrowFileSource and ArrowStreamFileSource implementations into a unified ArrowSource with an ArrowFormat enum.

This is part of the larger projection refactoring effort tracked in #18627.

Key Changes

  • Removed separate structs: Eliminated duplicate ArrowFileSource and ArrowStreamFileSource implementations
  • Added ArrowFormat enum: Simple enum with File and Stream variants to distinguish between Arrow IPC formats
  • Unified ArrowSource struct: Single struct that uses ArrowFormat to dispatch to appropriate opener
  • Kept separate openers: ArrowFileOpener and ArrowStreamFileOpener remain distinct as their implementations differ significantly
  • Format-specific behavior: repartitioned() method returns None for Stream format (doesn't support parallel reading) and delegates to default logic for File format

Benefits

  • Reduced code duplication: ~144 net lines removed
  • Clearer architecture: Single source of truth for Arrow file handling
  • Maintained separation: Format-specific logic remains in separate openers
  • No behavior changes: All existing tests pass without modification

Testing

  • All existing tests pass
  • No changes to test files needed
  • Both file and stream formats work correctly

Related Work

This PR is independent and can be merged before or after:

  • PR 1: Move Statistics Handling (if created)
  • PR 3: Enhance Physical-Expr Projection Handling (if created)

Part of #18627

🤖 Generated with Claude Code

@github-actions github-actions bot added the datasource Changes to the datasource crate label Nov 14, 2025
@adriangb adriangb force-pushed the projection-source-arrow branch from 05411eb to c76595f Compare November 15, 2025 00:48
@adriangb adriangb marked this pull request as draft November 15, 2025 00:50
This commit consolidates the separate ArrowFileSource and ArrowStreamFileSource
implementations into a unified ArrowSource with an ArrowFormat enum.

Key changes:
- Removed ArrowFileSource and ArrowStreamFileSource structs
- Added ArrowFormat enum (File, Stream) to distinguish between formats
- Created unified ArrowSource struct that uses ArrowFormat to dispatch
- Kept separate ArrowFileOpener and ArrowStreamFileOpener implementations
- Consolidated all FileSource trait implementations in ArrowSource
- Format-specific behavior in repartitioned() method (Stream returns None)

This consolidation reduces code duplication while maintaining clear separation
of concerns between the file and stream format handling.

Part of apache#18627

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@adriangb adriangb force-pushed the projection-source-arrow branch from c76595f to 3990945 Compare November 15, 2025 07:04
@adriangb adriangb marked this pull request as ready for review November 15, 2025 07:16
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @adriangb nice clean up!

@adriangb adriangb added this pull request to the merge queue Nov 16, 2025
Merged via the queue into apache:main with commit 97af468 Nov 16, 2025
28 checks passed
@adriangb
Copy link
Contributor Author

Thank you @comphead ! I think you'll find #18721 also a nice cleanup and easy to review if you have a few minutes 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants