refactor: Add structure for dispatching iceberg to native scans #22405

nameexhaustion · 2025-04-25T07:30:42Z

Introduces the code structure that lets us dispatch Iceberg scans to native Parquet scans. There is also a transparent fallback to scanning through Python, because the native path does not yet support some datasets (deletion vectors / schema evolution).

Note that native scans are disabled by default, as they can fail on datasets with type changes.

Review notes:

PyIceberg can accept a row limit when resolving files. To take advantage of this, we avoid eagerly resolving files during DSL->IR conversion. Instead, we add an ExpandDatasets optimization pass that performs file resolution after slice/predicate pushdown.
- The same technique can be applied to path expansion in the future if needed
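To illustrate the ordering, here is a minimal Python sketch. It is not the Polars implementation: `LazyDataset`, `resolve_files`, `push_down_slice`, and `expand_datasets` are hypothetical names, and the catalog data is made up. The only point is that file resolution runs after the slice has been pushed down, so the row limit can be forwarded to the resolver.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class LazyDataset:
    """Hypothetical stand-in for an unresolved dataset node in the plan."""

    # (path, row_count) pairs a catalog might return; illustrative data only.
    catalog: List[Tuple[str, int]]
    # Filled in by slice pushdown before the dataset is expanded.
    limit: Optional[int] = None
    # Records every limit the resolver was called with.
    resolve_calls: List[Optional[int]] = field(default_factory=list)

    def resolve_files(self, limit: Optional[int]) -> List[str]:
        """Return only as many files as are needed to cover `limit` rows."""
        self.resolve_calls.append(limit)
        if limit is None:
            return [path for path, _ in self.catalog]
        files: List[str] = []
        rows = 0
        for path, n_rows in self.catalog:
            if rows >= limit:
                break
            files.append(path)
            rows += n_rows
        return files


def push_down_slice(ds: LazyDataset, limit: int) -> None:
    # Pushdown only records the limit; it performs no IO.
    ds.limit = limit


def expand_datasets(ds: LazyDataset) -> List[str]:
    # Runs after pushdown, so the resolver sees the limit.
    return ds.resolve_files(ds.limit)


ds = LazyDataset(
    catalog=[("a.parquet", 100), ("b.parquet", 100), ("c.parquet", 100)]
)
push_down_slice(ds, limit=150)
files = expand_datasets(ds)
print(files)
```

Had resolution happened during plan construction, the resolver would have been called with no limit and returned all three files.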

Doc: Native/python dispatch override

The dispatch can be manually overridden with the following controls:

- Newly added reader_override: 'native' | 'pyiceberg' parameter to scan_iceberg()
  - Note that this is an unstable API
- Setting POLARS_ICEBERG_READER_OVERRIDE in the environment ('native' | 'pyiceberg'):
  - This must be set on the machine that performs the collection of the query (i.e. it is not serialized)
  - This is not used if reader_override was passed to scan_iceberg()
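The precedence between the two controls can be sketched as follows. `resolve_reader_override` is a hypothetical helper written for illustration, not a Polars function, and the validation of unknown values is an assumption; only the parameter name, the environment variable name, and the accepted values come from the description above.

```python
import os
from typing import Mapping, Optional

VALID_OVERRIDES = ("native", "pyiceberg")


def resolve_reader_override(
    reader_override: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Hypothetical helper: the parameter wins, then the environment variable.

    Returns None when neither is set, meaning "no override".
    """
    if reader_override is not None:
        value: Optional[str] = reader_override
    else:
        value = environ.get("POLARS_ICEBERG_READER_OVERRIDE")
    # Assumption: unknown values are rejected rather than silently ignored.
    if value is not None and value not in VALID_OVERRIDES:
        raise ValueError(f"unknown reader override: {value!r}")
    return value


env = {"POLARS_ICEBERG_READER_OVERRIDE": "pyiceberg"}
from_env = resolve_reader_override(None, env)
from_param = resolve_reader_override("native", env)
unset = resolve_reader_override(None, {})
print(from_env, from_param, unset)
```

Passing the environment as a mapping keeps the sketch testable without mutating the real process environment.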

With POLARS_VERBOSE enabled, output such as the following can be used to verify this:

IcebergDataset: to_dataset_scan(): fallback to python scan: forced force_scan_dispatch='python'
ComputeError: iceberg force_scan_dispatch='native' failed: unimplemented: dataset contained delete files
IcebergDataset: to_dataset_scan(): fallback to python scan: native scans disabled by default
IcebergDataset: to_dataset_scan(): native scan_parquet() (2 sources)
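The four outcomes above suggest a decision flow along these lines. This is a rough sketch under assumptions: `choose_reader`, `try_native`, and `NativeScanUnsupported` are hypothetical names, and delete files are used as the only example of an unsupported dataset. It mirrors the behaviour described in this PR (forced choices are honoured, native is off by default, automatic mode falls back to Python) rather than the actual Rust code.

```python
from typing import Optional


class NativeScanUnsupported(Exception):
    """Hypothetical: raised when the native path cannot handle the dataset."""


def try_native(has_delete_files: bool) -> str:
    # Stand-in for building a native scan_parquet() over the resolved files.
    if has_delete_files:
        raise NativeScanUnsupported("dataset contained delete files")
    return "native"


def choose_reader(
    override: Optional[str],
    has_delete_files: bool,
    native_enabled_by_default: bool = False,
) -> str:
    if override == "pyiceberg":
        # Forced Python scan: never attempt the native path.
        return "pyiceberg"
    if override == "native":
        # Forced native scan: failures propagate instead of falling back.
        return try_native(has_delete_files)
    if not native_enabled_by_default:
        # Native scans are disabled by default.
        return "pyiceberg"
    try:
        return try_native(has_delete_files)
    except NativeScanUnsupported:
        # Transparent fallback to the Python scan.
        return "pyiceberg"


default_choice = choose_reader(None, has_delete_files=False)
forced_native = choose_reader("native", has_delete_files=False)
fallback = choose_reader(None, has_delete_files=True, native_enabled_by_default=True)
try:
    choose_reader("native", has_delete_files=True)
    forced_failure = None
except NativeScanUnsupported as exc:
    forced_failure = str(exc)
print(default_choice, forced_native, fallback, forced_failure)
```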

nameexhaustion · 2025-04-25T09:45:27Z

crates/polars-plan/src/plans/optimizer/slice_pushdown_lp.rs

@@ -241,56 +216,25 @@ impl SlicePushDown {
                mut unified_scan_args,
                predicate,
                scan_type,
-            }, Some(state)) if predicate.is_none() && matches!(&*scan_type, FileScan::NDJson {.. }) =>  {
-                unified_scan_args.pre_slice = Some(state.to_slice_enum());
+            }, Some(state)) if predicate.is_none() && match &*scan_type {


Drive-by: de-duplicate the match blocks.

nameexhaustion · 2025-04-25T10:09:27Z

py-polars/tests/unit/io/test_iceberg.py

@@ -56,7 +56,7 @@ def test_scan_iceberg_snapshot_id(self, iceberg_path: str) -> None:

    def test_scan_iceberg_snapshot_id_not_found(self, iceberg_path: str) -> None:
        with pytest.raises(ValueError, match="Snapshot ID not found"):
-            pl.scan_iceberg(iceberg_path, snapshot_id=1234567890)
+            pl.scan_iceberg(iceberg_path, snapshot_id=1234567890).collect()


We currently eagerly query the dataset for the snapshot ID, but this PR defers all IO operations until collection time.

Note that I had to change the exception type because the call now goes through Rust; I plan to change it back after #22410.
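The behavioural change the test adapts to can be shown with a toy model. `LazyIcebergScan` is a hypothetical class, not part of Polars, and the exception type is illustrative only; the sketch just shows that constructing the scan no longer touches the dataset, so an unknown snapshot ID surfaces at collect() time.

```python
class LazyIcebergScan:
    """Hypothetical stand-in: construction does no IO; collect() does."""

    def __init__(self, known_snapshot_ids, snapshot_id):
        # Only record the arguments; nothing is looked up here.
        self._known = set(known_snapshot_ids)
        self._snapshot_id = snapshot_id

    def collect(self):
        # The snapshot lookup (IO in a real system) happens only here.
        if self._snapshot_id not in self._known:
            raise ValueError("Snapshot ID not found")
        return f"data@{self._snapshot_id}"


# Constructing the scan with an unknown snapshot ID does not raise.
scan = LazyIcebergScan(known_snapshot_ids=[1, 2], snapshot_id=1234567890)

try:
    scan.collect()
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```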

ritchie46 · 2025-04-25T18:35:04Z

A rebase away

codecov · 2025-04-28T23:09:36Z

Codecov Report

Attention: Patch coverage is 62.95547% with 183 lines in your changes missing coverage. Please review.

Project coverage is 81.07%. Comparing base (fe6bc80) to head (38e0567).
Report is 14 commits behind head on main.

Files with missing lines	Patch %	Lines
...polars-plan/src/plans/optimizer/expand_datasets.rs	39.34%	111 Missing ⚠️
py-polars/polars/io/iceberg/dataset.py	52.94%	38 Missing and 10 partials ⚠️
...olars-python/src/dataset/dataset_provider_funcs.rs	81.81%	14 Missing ⚠️
crates/polars-schema/src/schema.rs	80.00%	4 Missing ⚠️
py-polars/polars/io/iceberg/functions.py	75.00%	2 Missing and 1 partial ⚠️
crates/polars-plan/src/plans/functions/count.rs	0.00%	1 Missing ⚠️
...rates/polars-python/src/lazyframe/visitor/nodes.rs	0.00%	1 Missing ⚠️
...lars-stream/src/physical_plan/io/python_dataset.rs	96.96%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #22405      +/-   ##
==========================================
+ Coverage   81.05%   81.07%   +0.01%     
==========================================
  Files        1643     1650       +7     
  Lines      231847   232316     +469     
  Branches     2720     2738      +18     
==========================================
+ Hits       187935   188342     +407     
- Misses      43269    43319      +50     
- Partials      643      655      +12

nameexhaustion · 2025-04-28T23:37:07Z

Updated:

Renamed to reader_override / POLARS_ICEBERG_READER_OVERRIDE.
- reader_override is now exposed as an unstable API parameter in scan_iceberg()

github-actions bot added the enhancement, python, and rust labels Apr 25, 2025

nameexhaustion changed the title ~~feat: Defer reading data in scan_iceberg until collect() time~~ _ Apr 25, 2025

github-actions bot added the title needs formatting label Apr 25, 2025

nameexhaustion changed the title _ refactor: Add structure for dispatching iceberg to native scans Apr 25, 2025

github-actions bot added the internal label and removed the title needs formatting label Apr 25, 2025

nameexhaustion force-pushed the iceberg branch from 62b5e14 to 50c7400 Compare April 25, 2025 11:00

nameexhaustion marked this pull request as ready for review April 25, 2025 11:17

nameexhaustion requested review from ritchie46, c-peters, alexander-beedie, MarcoGorelli, reswqa, wence- and orlp as code owners April 25, 2025 11:17

nameexhaustion force-pushed the iceberg branch from 50c7400 to dd46ea3 Compare April 26, 2025 06:56

nameexhaustion mentioned this pull request Apr 28, 2025

Tracking for Iceberg cloud enablement #22450

Open

16 tasks

rebase

d61adb0

nameexhaustion marked this pull request as draft April 28, 2025 22:41

rename to reader_override

0acd405

nameexhaustion force-pushed the iceberg branch from dd46ea3 to 0acd405 Compare April 28, 2025 22:56

nameexhaustion added 2 commits April 29, 2025 09:21

c

8206614

c

783400b

nameexhaustion added 3 commits April 29, 2025 11:01

c

299cc28

c

3b8b80e

c

38e0567

nameexhaustion marked this pull request as ready for review April 29, 2025 02:43

ritchie46 merged commit a36f3bd into pola-rs:main Apr 30, 2025
27 checks passed

ritchie46 pushed a commit to polars-inc/polars that referenced this pull request Apr 30, 2025

refactor: Add structure for dispatching iceberg to native scans (pola…

45592c7

…-rs#22405)

nameexhaustion mentioned this pull request May 28, 2025

Partition filtering on Iceberg no longer working #22978

Open
