Closed
Description
Apache Iceberg version
1.2.1 (latest release)
Query engine
Spark
Please describe the bug 🐞
Environment
Spark 3.3.2
Iceberg 1.2.1
Tabular's docker environment: https://github.com/tabular-io/docker-spark-iceberg/blob/main/docker-compose.yml
Repro
1. Setup
CREATE TABLE demo.nyc.contacts2
(
id bigint NOT NULL COMMENT 'unique id',
first_name string,
last_name string,
neighborhood string
)
TBLPROPERTIES(
'format-version'='2',
'write.delete.mode'='merge-on-read',
'write.update.mode'='merge-on-read',
'write.merge.mode'='merge-on-read'
);
ALTER TABLE demo.nyc.contacts2 SET IDENTIFIER FIELDS id;
INSERT INTO demo.nyc.contacts2
SELECT /*+ COALESCE(1) */ *
FROM VALUES (1, 'Adam', 'Smith', 'SoHo'),
(2, 'Virginia', 'Smith', 'SoHo'),
(3, 'Thomas', 'Lao', 'Midtown'),
(4, 'John', 'Books', 'Williamsburg'),
(5, 'Anna', 'Frank', 'Midtown');
Validate setup, expectations:
- 1 data file
- 5 rows
SELECT * FROM demo.nyc.contacts2;
SELECT * FROM demo.nyc.contacts2.files;
2. Setup branch
ALTER TABLE demo.nyc.contacts2 CREATE BRANCH ds20230501 RETAIN 730 DAYS;
INSERT INTO demo.nyc.contacts2.branch_ds20230501
SELECT /*+ COALESCE(1) */ *
FROM VALUES (6, 'Peter', 'Smith', 'Chelsea'), (7, 'John', 'Connor', 'Greenwich Village');
Validate branch setup, expectations:
- 7 rows
- 2 total data files (1 in main branch, 1 in branch
ds20230501
)
SELECT * FROM demo.nyc.contacts2.branch_ds20230501;
SELECT * FROM demo.nyc.contacts2.all_files;
3. Test merge-on-read delete on branch for 1 row within data file
DELETE FROM demo.nyc.contacts2.branch_ds20230501 WHERE id=7;
Previous command fails with:
spark-sql> DELETE FROM demo.nyc.contacts2.branch_ds20230501 WHERE id=7;
23/05/17 20:04:45 ERROR SparkSQLDriver: Failed in [DELETE FROM demo.nyc.contacts2.branch_ds20230501 WHERE id=7]
org.apache.iceberg.exceptions.ValidationException: Cannot delete file where some, but not all, rows match filter ref(name="id") == 7: s3://warehouse/nyc/contacts2/data/00000-1051-74b41dde-49d4-4a85-844c-0ef79e1257f6-00001.parquet
at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:49)
at org.apache.iceberg.ManifestFilterManager.manifestHasDeletedFiles(ManifestFilterManager.java:377)
at org.apache.iceberg.ManifestFilterManager.filterManifest(ManifestFilterManager.java:307)
at org.apache.iceberg.ManifestFilterManager.lambda$filterManifests$0(ManifestFilterManager.java:189)
at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
at org.apache.iceberg.util.Tasks$Builder.access$300(Tasks.java:69)
at org.apache.iceberg.util.Tasks$Builder$1.run(Tasks.java:315)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
org.apache.iceberg.exceptions.ValidationException: Cannot delete file where some, but not all, rows match filter ref(name="id") == 7: s3://warehouse/nyc/contacts2/data/00000-1051-74b41dde-49d4-4a85-844c-0ef79e1257f6-00001.parquet
at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:49)
at org.apache.iceberg.ManifestFilterManager.manifestHasDeletedFiles(ManifestFilterManager.java:377)
at org.apache.iceberg.ManifestFilterManager.filterManifest(ManifestFilterManager.java:307)
at org.apache.iceberg.ManifestFilterManager.lambda$filterManifests$0(ManifestFilterManager.java:189)
at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
at org.apache.iceberg.util.Tasks$Builder.access$300(Tasks.java:69)
at org.apache.iceberg.util.Tasks$Builder$1.run(Tasks.java:315)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Expectation:
- 6 rows
- 2 total data files (1 in main branch, 1 in branch
ds20230501
) - 1 positional delete files in branch
ds20230501
4. Delete from main branch also fails
This error is different, but fails consistently with 400
(bad request) so I wonder if there's some incorrect handling here. Possibly related to the environment as well as using Tabular's docker setup.
DELETE FROM demo.nyc.contacts2 WHERE id=3;
Caused by: software.amazon.awssdk.services.s3.model.S3Exception: null (Service: S3, Status Code: 400, Request ID: 17600716A659BCC7, Extended Request ID: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855)
at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:156)
at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108)
at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:85)
at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:43)
at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:95)
at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$7(BaseClientHandler.java:270)
at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:40)
at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:30)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:73)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:50)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:36)
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:48)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:31)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:193)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:171)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:82)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:179)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:76)
at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:56)
at software.amazon.awssdk.services.s3.DefaultS3Client.putObject(DefaultS3Client.java:9321)
at org.apache.iceberg.aws.s3.S3OutputStream.completeUploads(S3OutputStream.java:435)
at org.apache.iceberg.aws.s3.S3OutputStream.close(S3OutputStream.java:269)
at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingPositionOutputStream.close(DelegatingPositionOutputStream.java:38)
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:1197)
at org.apache.iceberg.parquet.ParquetWriter.close(ParquetWriter.java:255)
at org.apache.iceberg.deletes.PositionDeleteWriter.close(PositionDeleteWriter.java:75)
at org.apache.iceberg.io.RollingFileWriter.closeCurrentWriter(RollingFileWriter.java:122)
at org.apache.iceberg.io.RollingFileWriter.close(RollingFileWriter.java:147)
at org.apache.iceberg.io.RollingPositionDeleteWriter.close(RollingPositionDeleteWriter.java:35)
at org.apache.iceberg.io.ClusteredWriter.closeCurrentWriter(ClusteredWriter.java:118)
at org.apache.iceberg.io.ClusteredWriter.close(ClusteredWriter.java:110)
at org.apache.iceberg.io.ClusteredPositionDeleteWriter.close(ClusteredPositionDeleteWriter.java:34)
at org.apache.iceberg.spark.source.SparkPositionDeltaWrite$DeleteOnlyDeltaWriter.close(SparkPositionDeltaWrite.java:477)
at org.apache.iceberg.spark.source.SparkPositionDeltaWrite$DeleteOnlyDeltaWriter.commit(SparkPositionDeltaWrite.java:460)
at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.$anonfun$run$1(WriteDeltaExec.scala:176)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run(WriteDeltaExec.scala:203)
at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run$(WriteDeltaExec.scala:142)
at org.apache.spark.sql.execution.datasources.v2.DeltaWithMetadataWritingSparkTask.run(WriteDeltaExec.scala:208)
at org.apache.spark.sql.execution.datasources.v2.ExtendedV2ExistingTableWriteExec.$anonfun$writeWithV2$2(WriteDeltaExec.scala:101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Metadata
Metadata
Assignees
Labels
No labels
Activity