Open
Labels
- A-io-cloud (Area: reading/writing to cloud storage)
- A-io-iceberg (Related to Apache Iceberg tables)
- P-high (Priority: high)
- accepted (Ready for implementation)
Description
Roadmap for enabling distributed Iceberg scans in Polars Cloud.
MVP
- Support native dispatch to `scan_parquet()` with transparent fallback - refactor: Add structure for dispatching iceberg to native scans #22405
  - Users can force fallback scans by passing `reader_override='pyiceberg'`
- Support schema-evolved datasets via type-casting in multi-scan post-apply pipeline
  - Expose cast options parameter to `scan_parquet`
  - Expose `extra_columns` parameter to `scan_parquet`
- Native support for deletion files in multi-scan post-apply pipeline
- Set appropriate `cast_options` / `extra_columns` etc. parameters when calling native `scan_parquet()`
  - feat: Enable default set of `ScanCastOptions` for native `scan_iceberg()` #23416
  - (Related) Enable use of `ScanCastOptions` in Delta scans by default (ES) - feat: Support reading nanosecond/Int96 timestamps and schema evolved datasets in `scan_delta()` #23398
- Parquet row-group skipping with type-casting
- Column-mapping support - Incorrect results on native Iceberg scans when columns have been renamed #23428
- Parquet row-group skipping with column-mapping
After completing the above, we should be safe to switch `scan_iceberg()` to use the native Parquet scanner by default.
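The dispatch behaviour described in the MVP can be illustrated with a minimal sketch. It is not the Polars implementation: `dispatch_scan`, `NativeScanUnsupported`, and the two scan callables are hypothetical stand-ins, and treating `'native'` as a second accepted `reader_override` value is an assumption; only `reader_override='pyiceberg'` is named in this issue.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class NativeScanUnsupported(Exception):
    """Hypothetical signal that the native path cannot handle a table."""


def dispatch_scan(
    native_scan: Callable[[], T],
    fallback_scan: Callable[[], T],
    reader_override: Optional[str] = None,
) -> T:
    """Pick a scan path, falling back transparently unless one is forced."""
    if reader_override == "pyiceberg":
        # The user forced the fallback reader; the native path is never tried.
        return fallback_scan()
    if reader_override == "native":
        # Assumed option: force the native path and let its errors propagate.
        return native_scan()
    if reader_override is not None:
        raise ValueError(f"unknown reader_override: {reader_override!r}")
    try:
        return native_scan()
    except NativeScanUnsupported:
        # Transparent fallback: the caller still gets a result.
        return fallback_scan()


def native_ok() -> str:
    return "native"


def native_unsupported() -> str:
    raise NativeScanUnsupported("table uses a feature the native path lacks")


def fallback() -> str:
    return "pyiceberg"


default_result = dispatch_scan(native_ok, fallback)
fallback_result = dispatch_scan(native_unsupported, fallback)
forced_result = dispatch_scan(native_ok, fallback, reader_override="pyiceberg")
```

In this sketch the default of `None` tries the native path first and only falls back when that path reports an unsupported feature, so callers need no changes when more features become natively supported.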
Further work
- Filtering on Iceberg statistics
- Filtering on Iceberg partition fields
- Fast-count (physical and deleted row counts are available in the Iceberg Python objects)
- Parquet pre-filtering with type-casting
- Parquet pre-filtering with deletion files
- Parquet pre-filtering with column mapping
(ES) - Related to enterprise support work
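
As a rough illustration of the "Filtering on Iceberg statistics" item, the sketch below prunes data files using per-file column bounds before any Parquet data would be read. `FileStats` and `prune_greater_than` are hypothetical names; real Iceberg manifests store bounds in serialized form keyed by field ID, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file bounds for a single integer column."""

    path: str
    lower: Optional[int]  # None means the bound is not recorded
    upper: Optional[int]


def prune_greater_than(files: List[FileStats], threshold: int) -> List[FileStats]:
    """Keep only files that may contain rows where column > threshold."""
    kept = []
    for f in files:
        # Without an upper bound nothing can be proven, so the file is kept.
        if f.upper is None or f.upper > threshold:
            kept.append(f)
    return kept


files = [
    FileStats("a.parquet", 0, 10),
    FileStats("b.parquet", 11, 20),
    FileStats("c.parquet", None, None),
]
survivors = prune_greater_than(files, threshold=10)
```

Pruning of this kind is conservative: a file is dropped only when its bounds prove that no row can match, so files with missing statistics are always scanned.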