move projection handling into FileSource #18627

adriangb · 2025-11-11T14:58:01Z

This moves ownership of projections from FileScanConfig into FileSource.
Notably we do not do anything special with this in Parquet just yet: I leave it for a followup to actually use the projection expressions instead of column indices to e.g. generate the Parquet ProjectionMask directly from expressions (in particular to select leaves instead of roots for struct and variant access).

AdamGS · 2025-11-11T16:05:42Z

datafusion-examples/examples/custom_data_source/csv_json_opener.rs

+    let scan_config =
+        FileScanConfigBuilder::new(ObjectStoreUrl::local_filesystem(), source)


adriangb · 2025-11-11T16:18:13Z

This is the tracer bullet I've been using to guide this change:

use arrow::util::pretty::pretty_format_batches;
use arrow_schema::DataType;
use datafusion::{
    common::Result,
    prelude::{ParquetReadOptions, SessionContext},
};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    let df = ctx
        .read_parquet(
            "data/",
            ParquetReadOptions::default()
                .table_partition_cols(vec![("b".to_string(), DataType::UInt32)]),
        )
        .await?;
    let res = df
        .select_exprs(&["b", "a", "a = b", "a * 2"])?
        .collect()
        .await?;

    println!("Result:\n{}", pretty_format_batches(&res)?);

    Ok(())
}

I plan to do a review and cleanup, handle the protobuf stuff (I know that needs to be changed) and then fix all the rest of the tests. It looks like we have 36 failing tests, not too bad, and a lot of them look related.

adriangb · 2025-11-12T07:09:15Z

Down to 4 failing tests! All proto related.

adriangb · 2025-11-12T22:50:16Z

datafusion/datasource-json/src/source.rs

+            // No projection - use the full file schema
+            Arc::clone(self.table_schema.file_schema())


TODO: do we need to dynamically generate a "full" projection that includes table partition columns? Or should that be the default that each FileSource initializes itself with (then we don't even need to check here!)?

adriangb · 2025-11-12T23:30:02Z

I'm marking this as ready for review. I expect more changes / cleanup is needed, I already have some ideas, but I'd like some initial feedback. cc @AdamGS @XiangpengHao @waynexia since you've all expressed interest.

This PR adds trait implementations, a project_batch() method, and fixes a bug in update_expr() for literal expressions. Also adds comprehensive tests. Part of apache#18627

This commit consolidates the separate ArrowFileSource and ArrowStreamFileSource implementations into a unified ArrowSource with an ArrowFormat enum. Key changes: - Removed ArrowFileSource and ArrowStreamFileSource structs - Added ArrowFormat enum (File, Stream) to distinguish between formats - Created unified ArrowSource struct that uses ArrowFormat to dispatch - Kept separate ArrowFileOpener and ArrowStreamFileOpener implementations - Consolidated all FileSource trait implementations in ArrowSource - Format-specific behavior in repartitioned() method (Stream returns None) This consolidation reduces code duplication while maintaining clear separation of concerns between the file and stream format handling. Part of apache#18627 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

This commit moves statistics handling from individual FileSource implementations into FileScanConfig, simplifying the FileSource trait. Changes: - Remove statistics() and with_statistics() methods from FileSource trait - Remove with_projection() method from FileSource trait (statistics PR only) - Add statistics field to FileScanConfig struct - Add statistics() method to FileScanConfig to retrieve statistics - Update FileScanConfigBuilder to properly handle statistics - Remove projected_statistics field from all FileSource implementations: - ParquetSource - CsvSource - JsonSource - AvroSource - ArrowFileSource and ArrowStreamFileSource - MockSource (test utility) - Update test utilities and assertions to use config.statistics() instead of file_source.statistics() - Update proto serialization to use config.statistics() Part of apache#18627 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

This PR adds trait implementations, a project_batch() method, and fixes a bug in update_expr() for literal expressions. Also adds comprehensive tests. Part of apache#18627

This commit consolidates the separate ArrowFileSource and ArrowStreamFileSource implementations into a unified ArrowSource with an ArrowFormat enum. Key changes: - Removed ArrowFileSource and ArrowStreamFileSource structs - Added ArrowFormat enum (File, Stream) to distinguish between formats - Created unified ArrowSource struct that uses ArrowFormat to dispatch - Kept separate ArrowFileOpener and ArrowStreamFileOpener implementations - Consolidated all FileSource trait implementations in ArrowSource - Format-specific behavior in repartitioned() method (Stream returns None) This consolidation reduces code duplication while maintaining clear separation of concerns between the file and stream format handling. Part of apache#18627 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

This commit moves statistics handling from individual FileSource implementations into FileScanConfig, simplifying the FileSource trait. Changes: - Remove statistics() and with_statistics() methods from FileSource trait - Remove with_projection() method from FileSource trait (statistics PR only) - Add statistics field to FileScanConfig struct - Add statistics() method to FileScanConfig to retrieve statistics - Update FileScanConfigBuilder to properly handle statistics - Remove projected_statistics field from all FileSource implementations: - ParquetSource - CsvSource - JsonSource - AvroSource - ArrowFileSource and ArrowStreamFileSource - MockSource (test utility) - Update test utilities and assertions to use config.statistics() instead of file_source.statistics() - Update proto serialization to use config.statistics() Part of apache#18627 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

This PR adds trait implementations, a project_batch() method, and fixes a bug in update_expr() for literal expressions. Also adds comprehensive tests. Part of apache#18627

This commit consolidates the separate ArrowFileSource and ArrowStreamFileSource implementations into a unified ArrowSource with an ArrowFormat enum. Key changes: - Removed ArrowFileSource and ArrowStreamFileSource structs - Added ArrowFormat enum (File, Stream) to distinguish between formats - Created unified ArrowSource struct that uses ArrowFormat to dispatch - Kept separate ArrowFileOpener and ArrowStreamFileOpener implementations - Consolidated all FileSource trait implementations in ArrowSource - Format-specific behavior in repartitioned() method (Stream returns None) This consolidation reduces code duplication while maintaining clear separation of concerns between the file and stream format handling. Part of apache#18627 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

This commit moves statistics handling from individual FileSource implementations into FileScanConfig, simplifying the FileSource trait. Changes: - Remove statistics() and with_statistics() methods from FileSource trait - Remove with_projection() method from FileSource trait (statistics PR only) - Add statistics field to FileScanConfig struct - Add statistics() method to FileScanConfig to retrieve statistics - Update FileScanConfigBuilder to properly handle statistics - Remove projected_statistics field from all FileSource implementations: - ParquetSource - CsvSource - JsonSource - AvroSource - ArrowFileSource and ArrowStreamFileSource - MockSource (test utility) - Update test utilities and assertions to use config.statistics() instead of file_source.statistics() - Update proto serialization to use config.statistics() Part of apache#18627 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

adriangb · 2025-11-15T07:17:31Z

To make life easier for reviewers I've opened a couple smaller hopefully "no brainer" PRs to break work out of this one:

## Summary This PR consolidates the separate `ArrowFileSource` and `ArrowStreamFileSource` implementations into a unified `ArrowSource` with an `ArrowFormat` enum. This is part of the larger projection refactoring effort tracked in #18627. ## Key Changes - **Removed separate structs**: Eliminated duplicate `ArrowFileSource` and `ArrowStreamFileSource` implementations - **Added `ArrowFormat` enum**: Simple enum with `File` and `Stream` variants to distinguish between Arrow IPC formats - **Unified `ArrowSource` struct**: Single struct that uses `ArrowFormat` to dispatch to appropriate opener - **Kept separate openers**: `ArrowFileOpener` and `ArrowStreamFileOpener` remain distinct as their implementations differ significantly - **Format-specific behavior**: `repartitioned()` method returns `None` for Stream format (doesn't support parallel reading) and delegates to default logic for File format ## Benefits - **Reduced code duplication**: ~144 net lines removed - **Clearer architecture**: Single source of truth for Arrow file handling - **Maintained separation**: Format-specific logic remains in separate openers - **No behavior changes**: All existing tests pass without modification ## Testing - All existing tests pass - No changes to test files needed - Both file and stream formats work correctly ## Related Work This PR is independent and can be merged before or after: - PR 1: Move Statistics Handling (if created) - PR 3: Enhance Physical-Expr Projection Handling (if created) Part of #18627 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude <noreply@anthropic.com>

github-actions bot added physical-expr Changes to the physical-expr crates core Core DataFusion crate substrait Changes to the substrait crate proto Related to proto crate datasource Changes to the datasource crate labels Nov 11, 2025

adriangb mentioned this pull request Nov 11, 2025

Support Push down expression evaluation in TableProviders #14993

Open

AdamGS reviewed Nov 11, 2025

View reviewed changes

alamb mentioned this pull request Nov 12, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-11-03 #18486

Closed

53 tasks

adriangb commented Nov 12, 2025

View reviewed changes

adriangb force-pushed the projection-source branch from d485ef7 to 493036f Compare November 12, 2025 23:27

adriangb marked this pull request as ready for review November 12, 2025 23:27

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 13, 2025

alamb mentioned this pull request Nov 14, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-11-17 #18711

Open

34 tasks

adriangb force-pushed the projection-source branch from bc3d8f4 to 5627591 Compare November 14, 2025 22:37

adriangb mentioned this pull request Nov 14, 2025

Misc improvements to ProjectionExprs #18719

Open

adriangb mentioned this pull request Nov 14, 2025

Consolidate ArrowFileSource and ArrowStreamFileSource #18720

Merged

adriangb mentioned this pull request Nov 14, 2025

Move statistics handling into FileScanConfig #18721

Open

3 tasks

adriangb force-pushed the projection-source branch from a3e37bf to 3997572 Compare November 15, 2025 07:02

adriangb added 3 commits November 17, 2025 20:38

move projection handling into FileSource

937aa62

lint

167c8f0

fix slt

b005dfd

adriangb force-pushed the projection-source branch from 3997572 to b005dfd Compare November 17, 2025 12:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

move projection handling into FileSource #18627

move projection handling into FileSource #18627

Uh oh!

adriangb commented Nov 11, 2025 •

edited

Loading

Uh oh!

AdamGS Nov 11, 2025

Uh oh!

adriangb commented Nov 11, 2025 •

edited

Loading

Uh oh!

adriangb commented Nov 12, 2025

Uh oh!

adriangb Nov 12, 2025

Uh oh!

adriangb commented Nov 12, 2025 •

edited

Loading

Uh oh!

adriangb commented Nov 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		let scan_config =
		FileScanConfigBuilder::new(ObjectStoreUrl::local_filesystem(), source)

		// No projection - use the full file schema
		Arc::clone(self.table_schema.file_schema())

move projection handling into FileSource #18627

Are you sure you want to change the base?

move projection handling into FileSource #18627

Uh oh!

Conversation

adriangb commented Nov 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AdamGS Nov 11, 2025

Choose a reason for hiding this comment

Uh oh!

adriangb commented Nov 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adriangb commented Nov 12, 2025

Uh oh!

adriangb Nov 12, 2025

Choose a reason for hiding this comment

Uh oh!

adriangb commented Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adriangb commented Nov 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

adriangb commented Nov 11, 2025 •

edited

Loading

adriangb commented Nov 11, 2025 •

edited

Loading

adriangb commented Nov 12, 2025 •

edited

Loading