Full DML support for tables with evolved partition specs
Summary
UPDATE and DELETE operations currently fail on tables that have undergone partition evolution. The DataFusion integration assumes all files use the current default partition spec, which isn't true for tables where the partitioning scheme has changed over time.
Background
Iceberg supports changing a table's partition scheme without rewriting existing data (partition evolution). Each data file tracks which partition spec was in effect when it was written via partition_spec_id. A table might look like:
Spec 0: PARTITION BY (date) → wrote files A, B, C
Spec 1: PARTITION BY (date, region) → wrote files D, E
Current default: Spec 1
When you run an UPDATE that touches both old and new files, the current code tries to serialize all partition data using Spec 1's schema. Files from Spec 0 don't have a region field, so serialization fails or produces garbage.
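The mismatch can be pictured with a small self-contained sketch. The structs below are illustrative stand-ins, not the iceberg-rust types: serializing a Spec 0 file's partition tuple against Spec 1's field list finds no value for region, so it must fail rather than emit a partial tuple.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for Iceberg's PartitionSpec and DataFile;
// the names mirror the concepts, not the iceberg-rust API.
#[derive(Clone)]
struct PartitionSpec {
    spec_id: i32,
    fields: Vec<&'static str>, // partition field names in this spec
}

struct DataFile {
    path: &'static str,
    partition_spec_id: i32,
    // partition values keyed by field name
    partition: HashMap<&'static str, String>,
}

// Serialize a file's partition tuple against a given spec's field list.
// Errors when the file lacks a field the spec requires -- exactly what
// happens when Spec 1 is assumed for a file written under Spec 0.
fn serialize_partition(file: &DataFile, spec: &PartitionSpec) -> Result<Vec<String>, String> {
    spec.fields
        .iter()
        .map(|f| {
            file.partition
                .get(f)
                .cloned()
                .ok_or_else(|| {
                    format!("{}: missing partition field '{}' for spec {}",
                            file.path, f, spec.spec_id)
                })
        })
        .collect()
}
```

A Spec 0 file serializes cleanly under Spec 0 but errors under Spec 1, which is the failure mode the guard currently papers over.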
Current Behavior
There's a guard in physical_plan/update.rs that returns FeatureNotSupported when it encounters files with non-default spec IDs. This prevents corruption but blocks legitimate use cases.
Proposed Changes
- Expose spec ID on `DataFile`: make `partition_spec_id()` public so downstream code can access it
- Add a helper on `Table`: something like `partition_type_for_spec(spec_id)` to look up the correct partition schema for any spec in the table's history
- Thread the spec ID through the pipeline: carry the original `DataFile` (or at least its spec ID) through the scan → transform → commit stages instead of just the partition values
- Per-file serialization: when serializing partition data for delete files and commits, look up the correct spec for each file rather than assuming the default
- Remove the evolution guard: once correctness is guaranteed, remove the `FeatureNotSupported` error
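The proposed lookup helper might look roughly like the following. This is a sketch with hypothetical types, not the iceberg-rust API; the point is the shape of the contract: resolve each file's spec by its recorded ID, and fail loudly on an unknown ID instead of falling back to the default.

```rust
use std::collections::HashMap;

// Illustrative table metadata: every spec in the table's history is kept,
// keyed by spec_id, alongside the current default.
struct Table {
    default_spec_id: i32,
    specs: HashMap<i32, Vec<&'static str>>, // spec_id -> partition field names
}

impl Table {
    // Analogue of the proposed `partition_type_for_spec(spec_id)` helper:
    // an unknown spec ID is a hard error, never a silent default.
    fn partition_type_for_spec(&self, spec_id: i32) -> Result<&Vec<&'static str>, String> {
        self.specs
            .get(&spec_id)
            .ok_or_else(|| format!("unknown partition spec id {}", spec_id))
    }
}
```

Keeping the lookup on the table (rather than on each call site) means scan, transform, and commit stages all resolve specs the same way.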
Key invariants to maintain
- Delete files must use the same spec as their source data file
- New data files from UPDATE should use the current default spec
- Missing/invalid spec IDs should fail with a clear error, not silent corruption
- Tables written by iceberg-rust should remain readable by Spark/Trino/etc
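The first two invariants reduce to a single spec-selection rule, sketched below with hypothetical names: delete files inherit the spec of the data file they reference, while new data files written by UPDATE take the table's current default.

```rust
// Which output is being written during an UPDATE/DELETE?
enum OutputKind {
    // A delete file must carry the same spec as its source data file.
    DeleteFile { source_spec_id: i32 },
    // Rewritten rows become new data files under the current default spec.
    NewDataFile,
}

// Pick the spec ID an output file should be serialized with.
fn spec_id_for_output(default_spec_id: i32, out: &OutputKind) -> i32 {
    match out {
        OutputKind::DeleteFile { source_spec_id } => *source_spec_id,
        OutputKind::NewDataFile => default_spec_id,
    }
}
```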
Test coverage needed
- UPDATE touching files from multiple specs
- DELETE across evolved partitions
- Round-trip tests: serialize with spec N, deserialize with spec N, verify unchanged
- Cross-engine compatibility (Spark can read what we write)
- Error cases: invalid spec ID references
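The round-trip case above can be sketched as a pair of inverse helpers (illustrative, not the real serialization code): a partition tuple serialized under spec N must deserialize under the same spec to identical values.

```rust
use std::collections::HashMap;

// Serialize a partition tuple in the field order of one spec.
fn serialize(fields: &[&'static str], values: &HashMap<&'static str, String>) -> Vec<String> {
    fields.iter().map(|f| values[f].clone()).collect()
}

// Deserialize a row back into named values using the same spec's field order.
fn deserialize(fields: &[&'static str], row: &[String]) -> HashMap<&'static str, String> {
    fields.iter().cloned().zip(row.iter().cloned()).collect()
}
```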
Related
- RowDelta action (prerequisite, provides atomic commit mechanism)
- Compaction (EPIC: Rust Based Compaction #624) - will also need this for compacting across specs
- Delete support (Iceberg-rust Delete support #735)