Skip to content

refactor: Add structure for dispatching iceberg to native scans #22405

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 30, 2025

Conversation

nameexhaustion
Copy link
Collaborator

@nameexhaustion nameexhaustion commented Apr 25, 2025

Introduces code structure that allows us to dispatch iceberg to native parquet scans. There is also a transparent fallback to using python for scanning as we do not support some datasets at the moment due to deletion vectors / schema evolution.

Note that the native scans are disabled by default as it can fail on datasets with type changes.

Review notes:

  • PyIceberg can accept row-limit when resolving files. To take advantage of this, we avoid eagerly performing file resolution in DSL->IR conversion. Instead, we add an ExpandDatasets optimization pass that calls the file resolution after slice/predicate pushdown.
    • The same technique can be applied to path expansion in the future if needed

Doc: Native/python dispatch override

The dispatch can be manually overridden with the following controls:

  • Newly added reader_overrde: 'native' | 'pyiceberg' parameter to scan_iceberg()
    • Note that this is an unstable API
  • Setting POLARS_ICEBERG_READER_OVERRIDE in the environment ('native' | 'pyiceberg'):
    • This must be set on the machine that performs the collection of the query (i.e. it is not serialized)
    • This is not used if reader_overrde was passed to scan_iceberg()

There will be log lines in POLARS_VERBOSE to verify this:

IcebergDataset: to_dataset_scan(): fallback to python scan: forced force_scan_dispatch='python'
ComputeError: iceberg force_scan_dispatch='native' failed: unimplemented: dataset contained delete files
IcebergDataset: to_dataset_scan(): fallback to python scan: native scans disabled by default
IcebergDataset: to_dataset_scan(): native scan_parquet() (2 sources)

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Apr 25, 2025
@nameexhaustion nameexhaustion changed the title feat: Defer reading data in scan_iceberg until collect() time _ Apr 25, 2025
@nameexhaustion nameexhaustion changed the title _ refactor: Add structure for dispatching iceberg to native scans Apr 25, 2025
@github-actions github-actions bot added internal An internal refactor or improvement and removed title needs formatting labels Apr 25, 2025
@@ -241,56 +216,25 @@ impl SlicePushDown {
mut unified_scan_args,
predicate,
scan_type,
}, Some(state)) if predicate.is_none() && matches!(&*scan_type, FileScan::NDJson {.. }) => {
unified_scan_args.pre_slice = Some(state.to_slice_enum());
}, Some(state)) if predicate.is_none() && match &*scan_type {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive-by de-duplicate the match blocks

@@ -56,7 +56,7 @@ def test_scan_iceberg_snapshot_id(self, iceberg_path: str) -> None:

def test_scan_iceberg_snapshot_id_not_found(self, iceberg_path: str) -> None:
with pytest.raises(ValueError, match="Snapshot ID not found"):
pl.scan_iceberg(iceberg_path, snapshot_id=1234567890)
pl.scan_iceberg(iceberg_path, snapshot_id=1234567890).collect()
Copy link
Collaborator Author

@nameexhaustion nameexhaustion Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently eagerly query the dataset for the snapshot ID, but this PR defers all IO operations until collection time.

Note I had to change the exception type because we are going through the Rust, I plan to change it back after #22410.

@ritchie46
Copy link
Member

A rebase away

@nameexhaustion nameexhaustion marked this pull request as draft April 28, 2025 22:41
Copy link

codecov bot commented Apr 28, 2025

Codecov Report

Attention: Patch coverage is 62.95547% with 183 lines in your changes missing coverage. Please review.

Project coverage is 81.07%. Comparing base (fe6bc80) to head (38e0567).
Report is 14 commits behind head on main.

Files with missing lines Patch % Lines
...polars-plan/src/plans/optimizer/expand_datasets.rs 39.34% 111 Missing ⚠️
py-polars/polars/io/iceberg/dataset.py 52.94% 38 Missing and 10 partials ⚠️
...olars-python/src/dataset/dataset_provider_funcs.rs 81.81% 14 Missing ⚠️
crates/polars-schema/src/schema.rs 80.00% 4 Missing ⚠️
py-polars/polars/io/iceberg/functions.py 75.00% 2 Missing and 1 partial ⚠️
crates/polars-plan/src/plans/functions/count.rs 0.00% 1 Missing ⚠️
...rates/polars-python/src/lazyframe/visitor/nodes.rs 0.00% 1 Missing ⚠️
...lars-stream/src/physical_plan/io/python_dataset.rs 96.96% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #22405      +/-   ##
==========================================
+ Coverage   81.05%   81.07%   +0.01%     
==========================================
  Files        1643     1650       +7     
  Lines      231847   232316     +469     
  Branches     2720     2738      +18     
==========================================
+ Hits       187935   188342     +407     
- Misses      43269    43319      +50     
- Partials      643      655      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nameexhaustion
Copy link
Collaborator Author

nameexhaustion commented Apr 28, 2025

Updated:

  • Renamed to reader_override / POLARS_ICEBERG_READER_OVERRIDE.
    • reader_override is now exposed as an unstable API parameter in scan_iceberg()

@nameexhaustion nameexhaustion marked this pull request as ready for review April 29, 2025 02:43
@ritchie46 ritchie46 merged commit a36f3bd into pola-rs:main Apr 30, 2025
27 checks passed
ritchie46 pushed a commit to polars-inc/polars that referenced this pull request Apr 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants