
Unexpected behavior when executing CREATE OR REPLACE TABLE AS SELECT with REST Catalog #12738

@baotran306

Description


Apache Iceberg version

1.7.2

Query engine

Spark

Please describe the bug 🐞


When I run the CREATE OR REPLACE TABLE AS SELECT command, it works well in several cases, including (sketches of these shapes follow the list):

  • Replacing a table with its structure only (using LIMIT 0 in the SELECT statement)
  • Replacing a non-partitioned table with a partitioned table
  • Replacing a table partitioned on one identity column with a table partitioned on a different identity column
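
For reference, the statement shapes that worked look roughly like this (a minimal sketch with hypothetical tables src and tgt in the same namespace; the column names category and vendor_id are placeholders, not my real tables):

# 1. Replace with structure only (LIMIT 0 keeps the schema, drops the rows)
spark.sql("""
    CREATE OR REPLACE TABLE lakekeeper.risk_mon_out.tgt
    USING iceberg
    AS SELECT * FROM lakekeeper.risk_mon_out.src LIMIT 0
""")

# 2. Replace a non-partitioned table with a partitioned one
spark.sql("""
    CREATE OR REPLACE TABLE lakekeeper.risk_mon_out.tgt
    USING iceberg
    PARTITIONED BY (category)
    AS SELECT * FROM lakekeeper.risk_mon_out.src
""")

# 3. Switch from one identity partition column to another
spark.sql("""
    CREATE OR REPLACE TABLE lakekeeper.risk_mon_out.tgt
    USING iceberg
    PARTITIONED BY (vendor_id)
    AS SELECT * FROM lakekeeper.risk_mon_out.src
""")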

However, one case I tested was unsuccessful: replacing a table that uses a transform partition (bucket) with a non-partitioned table.

Steps to reproduce

First, I created an empty non-partitioned table with the CREATE TABLE command and inserted some test data. Next, I created a partitioned table with a different structure in the same way.


Then, I tried to replace the existing partitioned table using the CREATE OR REPLACE TABLE AS SELECT command, selecting from my non-partitioned table. My expectation was that the table would be completely replaced, both schema and data, by the non-partitioned result. However, it failed with:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 24.0 failed 1 times, most recent failure: Lost task 0.0 in stage 24.0 (TID 50) (10.16.14.4 executor driver): org.apache.iceberg.exceptions.ValidationException: Cannot find source column for partition field: 1000: id_bucket: bucket[16](1)


Okay, I tried one more time. Before replacing the table, I dropped all partition fields using ALTER TABLE ... DROP PARTITION FIELD. However, it still failed with the same error. I don't understand why it is still checking for partition fields when I had already dropped them before replacing the table (I have already checked the metadata file, and the current partition spec has no fields after the drop).
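
For reference, the kind of check I mean is roughly this (a sketch; it assumes the Iceberg metadata tables are reachable through the same catalog):

# Table description should no longer list any partition fields
spark.sql("DESCRIBE TABLE lakekeeper.risk_mon_out.data_sample_test").show(truncate=False)

# The Iceberg partitions metadata table should reflect the dropped fields as well
spark.sql("SELECT * FROM lakekeeper.risk_mon_out.data_sample_test.partitions").show(truncate=False)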


My code: Spark + REST Catalog (catalog name: 'lakekeeper')

# Create non-partition table
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakekeeper.risk_mon_out.my_data_non_partitioned(
        vendor_id int,
        product string,
        price double,
        event_time timestamp      
    )
""")
spark.sql("""
    INSERT INTO lakekeeper.risk_mon_out.my_data_non_partitioned VALUES
          (1, "Car", 2.34, TIMESTAMP '2025-03-01 11:03:01'),
          (2, "Bike", 5.43, TIMESTAMP '2025-03-01 11:19:21'),
          (3, "House", 100.4, TIMESTAMP '2025-03-01 12:43:17'),
          (4, "Plane", 124.4, TIMESTAMP '2025-03-01 13:05:29')
""")
spark.sql("SELECT * FROM lakekeeper.risk_mon_out.my_data_non_partitioned").show()

# Create partition table with transformation
spark.sql("""
    CREATE TABLE lakekeeper.risk_mon_out.data_sample_test(
        id int,
        category string,
        event_time timestamp      
    )
    PARTITIONED BY (
        bucket(16, id),
        category
    )
""")
spark.sql("""
    INSERT INTO lakekeeper.risk_mon_out.data_sample_test VALUES
          (5, "HOVal", TIMESTAMP '2025-03-01 11:03:09'),
          (6, "NextMi", TIMESTAMP '2025-03-01 12:19:29'),
          (7, "OGel", TIMESTAMP '2025-01-09 06:43:17'),
          (8, "AAAA", TIMESTAMP '2025-01-09 06:05:29')
""")
spark.sql("SELECT * FROM lakekeeper.risk_mon_out.data_sample_test").show()

# Replace partition table to non-partition table
try:
    spark.sql("""
        CREATE OR REPLACE TABLE lakekeeper.risk_mon_out.data_sample_test
        USING iceberg
        AS SELECT * FROM lakekeeper.risk_mon_out.my_data_non_partitioned
    """)
except Exception as e:
    print(f"Error 111: {e}")

try:
    spark.sql("ALTER TABLE lakekeeper.risk_mon_out.data_sample_test DROP PARTITION FIELD id_bucket")
    spark.sql("ALTER TABLE lakekeeper.risk_mon_out.data_sample_test DROP PARTITION FIELD category")
    print("Drop done")
    spark.sql("""
        CREATE OR REPLACE TABLE lakekeeper.risk_mon_out.data_sample_test
        USING iceberg
        AS SELECT * FROM lakekeeper.risk_mon_out.my_data_non_partitioned
    """)
except Exception as e:
    print("=====")
    print(f"Error 222: {e}")

Expected Behavior

The table should be replaced entirely, both schema and data, by the result of the SELECT.
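
In other words, after the replace I would expect something like the following to succeed, with the table showing the schema of my_data_non_partitioned and no partition fields (a sketch of the verification, not output I was able to produce):

spark.sql("""
    CREATE OR REPLACE TABLE lakekeeper.risk_mon_out.data_sample_test
    USING iceberg
    AS SELECT * FROM lakekeeper.risk_mon_out.my_data_non_partitioned
""")

# Expected: columns vendor_id, product, price, event_time and no partitioning
spark.sql("DESCRIBE TABLE lakekeeper.risk_mon_out.data_sample_test").show(truncate=False)
spark.sql("SELECT * FROM lakekeeper.risk_mon_out.data_sample_test").show()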

Environment

Apache Iceberg Version: 1.7.2
Execution Engines Tested: Spark 3.5.2
Catalog Type: REST catalog
Storage: S3-compatible

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
