feat(datafusion): Full DML support for tables with evolved partition specs #1923

@ethan-tyler

Full DML support for tables with evolved partition specs

Summary

UPDATE and DELETE operations currently fail on tables that have undergone partition evolution. The DataFusion integration assumes every data file was written under the table's current default partition spec, which does not hold once the partitioning scheme has changed.

Background

Iceberg supports changing a table's partition scheme without rewriting existing data (partition evolution). Each data file tracks which partition spec was in effect when it was written via partition_spec_id. A table might look like:

Spec 0: PARTITION BY (date)           → wrote files A, B, C
Spec 1: PARTITION BY (date, region)   → wrote files D, E
Current default: Spec 1

When an UPDATE touches both old and new files, the current code tries to serialize all partition data using Spec 1's schema. Files from Spec 0 have no region field, so serialization either fails or produces partition data that does not match the file.
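To make the failure mode concrete, here is a minimal sketch using stand-in types. `SpecSketch`, `FileSketch`, and `serialize_partition` are hypothetical names for illustration only, not iceberg-rust APIs, and the real serializer works on typed partition structs rather than strings.

```rust
/// Hypothetical stand-in for a partition spec: an ID plus ordered field names.
#[derive(Debug, Clone)]
struct SpecSketch {
    spec_id: i32,
    fields: Vec<&'static str>,
}

/// Hypothetical stand-in for a data file: the spec it was written under
/// and one partition value per field of that spec.
#[derive(Debug, Clone)]
struct FileSketch {
    path: &'static str,
    partition_spec_id: i32,
    partition_values: Vec<String>,
}

/// Pairs each partition value with a field name from `spec`.
/// Returns an error when the tuple does not have one value per field,
/// which is what happens when a Spec 0 file meets Spec 1's schema.
fn serialize_partition(
    spec: &SpecSketch,
    file: &FileSketch,
) -> Result<Vec<(String, String)>, String> {
    if spec.fields.len() != file.partition_values.len() {
        return Err(format!(
            "file {} (spec {}) has {} partition values, but spec {} expects {}",
            file.path,
            file.partition_spec_id,
            file.partition_values.len(),
            spec.spec_id,
            spec.fields.len()
        ));
    }
    Ok(spec
        .fields
        .iter()
        .zip(&file.partition_values)
        .map(|(field, value)| (field.to_string(), value.clone()))
        .collect())
}
```

In this sketch, serializing file A (written under Spec 0 with a single date value) against Spec 1 returns the error, while serializing it against Spec 0 succeeds.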

Current Behavior

There's a guard in physical_plan/update.rs that returns FeatureNotSupported when it encounters files with non-default spec IDs. This prevents corruption but blocks legitimate UPDATE and DELETE operations on evolved tables.

Proposed Changes

  1. Expose spec ID on DataFile - Make partition_spec_id() public so downstream code can access it

  2. Add helper on Table - Something like partition_type_for_spec(spec_id) to look up the correct partition schema for any spec in the table's history

  3. Thread spec ID through the pipeline - Carry the original DataFile (or at least its spec ID) through scan → transform → commit stages instead of just the partition values

  4. Per-file serialization - When serializing partition data for delete files and commits, look up the correct spec for each file rather than assuming the default spec

  5. Remove the evolution guard - Once correctness is guaranteed, remove the FeatureNotSupported error
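A rough sketch of how steps 1, 2, and 4 could fit together, using stand-in types. Only the names partition_spec_id() and partition_type_for_spec come from this proposal; `TableSketch`, `DataFileSketch`, `SpecError`, and `serialize_for_file` are hypothetical, and the signatures are assumptions rather than the final API.

```rust
use std::collections::HashMap;

/// Hypothetical error type for spec lookups.
#[derive(Debug, PartialEq)]
enum SpecError {
    UnknownSpec(i32),
    ArityMismatch { spec_id: i32, expected: usize, actual: usize },
}

/// Hypothetical stand-in for a table: every spec in its history,
/// keyed by spec ID, each reduced to an ordered list of field names.
struct TableSketch {
    specs: HashMap<i32, Vec<String>>,
}

impl TableSketch {
    /// Step 2: look up the partition schema for any spec in the history.
    fn partition_type_for_spec(&self, spec_id: i32) -> Result<&[String], SpecError> {
        self.specs
            .get(&spec_id)
            .map(|fields| fields.as_slice())
            .ok_or(SpecError::UnknownSpec(spec_id))
    }
}

/// Hypothetical stand-in for a data file.
struct DataFileSketch {
    partition_spec_id: i32,
    partition_values: Vec<String>,
}

impl DataFileSketch {
    /// Step 1: the spec ID is readable by downstream code.
    fn partition_spec_id(&self) -> i32 {
        self.partition_spec_id
    }
}

/// Step 4: serialize with the spec the file was written under,
/// not the table's default spec.
fn serialize_for_file(
    table: &TableSketch,
    file: &DataFileSketch,
) -> Result<Vec<(String, String)>, SpecError> {
    let spec_id = file.partition_spec_id();
    let fields = table.partition_type_for_spec(spec_id)?;
    if fields.len() != file.partition_values.len() {
        return Err(SpecError::ArityMismatch {
            spec_id,
            expected: fields.len(),
            actual: file.partition_values.len(),
        });
    }
    Ok(fields
        .iter()
        .cloned()
        .zip(file.partition_values.iter().cloned())
        .collect())
}
```

Returning a Result from the lookup means an unknown spec ID surfaces as an explicit error at the point of use instead of falling back to the default spec.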

Key invariants to maintain

  • Delete files must use the same spec as their source data file
  • New data files from UPDATE should use the current default spec
  • Missing/invalid spec IDs should fail with a clear error rather than cause silent corruption
  • Tables written by iceberg-rust should remain readable by Spark, Trino, and other engines
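The first three invariants can be expressed as a single spec-selection rule. The sketch below is an assumption about how that rule might look; `OutputKind` and `spec_id_for_output` are hypothetical names, not existing iceberg-rust items.

```rust
use std::collections::HashSet;

/// Hypothetical classification of files produced by a DML operation.
enum OutputKind {
    /// A delete file that references an existing data file.
    DeleteFile,
    /// A new data file written by UPDATE.
    NewDataFile,
}

/// Picks the spec ID for an output file:
/// delete files inherit the spec of their source data file,
/// new data files use the table's current default spec,
/// and a spec ID that is not in the table's history is an error.
fn spec_id_for_output(
    kind: OutputKind,
    source_spec_id: i32,
    default_spec_id: i32,
    known_spec_ids: &HashSet<i32>,
) -> Result<i32, String> {
    let chosen = match kind {
        OutputKind::DeleteFile => source_spec_id,
        OutputKind::NewDataFile => default_spec_id,
    };
    if known_spec_ids.contains(&chosen) {
        Ok(chosen)
    } else {
        Err(format!("partition spec id {} not found in table metadata", chosen))
    }
}
```

For the example table above, a delete file targeting file A would get Spec 0, while the rewritten rows from an UPDATE would be written under Spec 1.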

Test coverage needed

  • UPDATE touching files from multiple specs
  • DELETE across evolved partitions
  • Round-trip tests: serialize with spec N, deserialize with spec N, verify unchanged
  • Cross-engine compatibility (Spark can read what we write)
  • Error cases: invalid spec ID references
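As a starting point for the round-trip tests, here is a small sketch of the property to check. It uses an illustrative name=value path encoding, not iceberg-rust's actual serialization format; `encode_partition` and `decode_partition` are hypothetical helpers, and values are assumed not to contain '/'.

```rust
/// Encodes a partition tuple as "field=value/field=value" (illustrative format only).
fn encode_partition(fields: &[&str], values: &[&str]) -> Result<String, String> {
    if fields.len() != values.len() {
        return Err(format!(
            "expected {} partition values, found {}",
            fields.len(),
            values.len()
        ));
    }
    Ok(fields
        .iter()
        .zip(values)
        .map(|(field, value)| format!("{}={}", field, value))
        .collect::<Vec<_>>()
        .join("/"))
}

/// Decodes a string produced by `encode_partition`, checking that it
/// matches the field list of the spec it is decoded with.
fn decode_partition(fields: &[&str], encoded: &str) -> Result<Vec<String>, String> {
    // An unpartitioned spec encodes to the empty string.
    if fields.is_empty() && encoded.is_empty() {
        return Ok(Vec::new());
    }
    let parts: Vec<&str> = encoded.split('/').collect();
    if parts.len() != fields.len() {
        return Err(format!(
            "expected {} partition fields, found {}",
            fields.len(),
            parts.len()
        ));
    }
    let mut values = Vec::with_capacity(parts.len());
    for (field, part) in fields.iter().zip(parts) {
        let (name, value) = part
            .split_once('=')
            .ok_or_else(|| format!("malformed segment '{}'", part))?;
        if name != *field {
            return Err(format!("expected field '{}', found '{}'", field, name));
        }
        values.push(value.to_string());
    }
    Ok(values)
}
```

A round-trip test would then assert that decoding with spec N returns the values that were encoded with spec N, and that decoding the same bytes with a different spec is rejected rather than silently accepted.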
