Description
Apache Iceberg version
1.0.0
Query engine
Spark
Please describe the bug 🐞
We run into a weird Aws S3 exceptions "Unable to execute HTTP request: The target server failed to respond" during table merge, which causes lots of task failures.
Here is our case in detail. We have one table with a certain # of partitions. Initially, each partition has 1 DataFile with ~2GB.
It is a v2 table format and we use Merge-on-read to do daily table update(table merge), using Sparks' HashShuffleJoin with Storage Partition Join enabled. All data is at aws s3.
For daily update, there are some updates and some new inserts, no deletes.
It starts with a DataFile0
After day 1's table update (using table merge), it creates 1 DeleteFIle-d1, 1 DataFile-d1-update (for update) and 1 DataFile-d1-insert (for new inserts).
After day 2's table update (using table merge), it creates 1 DeleteFIle-d2, 1 DataFile-d2-update (for update) and 1 DataFile-d2-insert (for new inserts).
After day 3's table update (using table merge), it creates 1 DeleteFIle-d3, 1 DataFile-d3-update (for update) and 1 DataFile-d3-insert (for new inserts).
All went ok though there are a very small number of the aws s3 exceptions mentioned above.
For day 4's table update (using table merge), For each executor, after the first set of tasks finished and it picks up a new set of tasks, for these new tasks, we observed a huge number of the following aws s3 exceptions which caused task to fail. We suspected it is related with number of DataFiles and DeleteFiles as there are 7 DataFiles and 1 DeleteFile for one partition. But we could not figure out the root cause, not sure if this is caused by s3 backend or aws sdk client side.
Attached is the exception stack.
23/07/27 17:30:05 ERROR BaseReader: Error reading file(s): s3a://****
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: The target server failed to respond
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:43) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:205) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:66) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:34) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:48) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:31) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:193) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:133) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:159) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:112) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:167) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:94) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:55) ~[aws-core-2.15.40.jar:?]
at software.amazon.awssdk.services.s3.DefaultS3Client.headObject(DefaultS3Client.java:5011) ~[s3-2.15.40.jar:?]
at org.apache.iceberg.aws.s3.BaseS3File.getObjectMetadata(BaseS3File.java:85) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.aws.s3.S3InputFile.getLength(S3InputFile.java:75) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.parquet.ParquetIO$ParquetInputFile.getLength(ParquetIO.java:179) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:534) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:777) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:658) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.parquet.ReadConf.newReader(ReadConf.java:245) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:81) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.parquet.ParquetReader.init(ParquetReader.java:71) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:91) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:37) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.util.Filter.lambda$filter$0(Filter.java:34) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.io.CloseableIterable$2.iterator(CloseableIterable.java:72) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.io.CloseableIterable$7$1.<init>(CloseableIterable.java:188) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.io.CloseableIterable$7.iterator(CloseableIterable.java:187) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.hasNext(CloseableIterable.java:257) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at java.lang.Iterable.forEach(Iterable.java:74) ~[?:?]
at org.apache.iceberg.deletes.Deletes.toPositionIndex(Deletes.java:138) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.deletes.Deletes.toPositionIndex(Deletes.java:132) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.data.DeleteFilter.applyPosDeletes(DeleteFilter.java:250) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.data.DeleteFilter.filter(DeleteFilter.java:154) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.spark.source.RowDataReader.open(RowDataReader.java:92) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.spark.source.RowDataReader.open(RowDataReader.java:42) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:135) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119) ~[spark-sql_2.12-3.3.0.48.jar]
at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156) ~[spark-sql_2.12-3.3.0.48.jar]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0.48.jar]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0.48.jar]
at scala.Option.exists(Option.scala:376) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0.48.jar]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:97) ~[spark-sql_2.12-3.3.0.48.jar]
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0.48.jar]
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.12-3.3.0.48.jar]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) ~[spark-sql_2.12-3.3.0.48.jar]
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) ~[spark-sql_2.12-3.3.0.48.jar]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) ~[scala-library-2.12.15.jar:?]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225) ~[spark-sql_2.12-3.3.0.48.jar]
at org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119) ~[spark-sql_2.12-3.3.0.48.jar]
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) ~[spark-core_2.12-3.3.0.48.jar]
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) ~[spark-core_2.12-3.3.0.48.jar]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.3.0.48.jar]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.0.48.jar]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.0.48.jar]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[spark-core_2.12-3.3.0.48.jar]
at org.apache.spark.scheduler.Task.run(Task.scala:136) ~[spark-core_2.12-3.3.0.48.jar]
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.0.48.jar]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1513) ~[spark-core_2.12-3.3.0.48.jar]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.0.48.jar]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.http.NoHttpResponseException: The target server failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141) ~[httpclient-4.5.13.jar:4.5.13]
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.13.jar:4.5.13]
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.14.jar:4.4.14]
at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.14.jar:4.4.14]
at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[httpclient-4.5.13.jar:4.5.13]
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.14.jar:4.4.14]
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.14.jar:4.4.14]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.13.jar:4.5.13]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar:4.5.13]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar:4.5.13]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar:4.5.13]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.13.jar:4.5.13]
at software.amazon.awssdk.http.apache.internal.impl.ApacheSdkHttpClient.execute(ApacheSdkHttpClient.java:72) ~[apache-client-2.15.40.jar:?]
at software.amazon.awssdk.http.apache.ApacheHttpClient.execute(ApacheHttpClient.java:253) ~[apache-client-2.15.40.jar:?]
at software.amazon.awssdk.http.apache.ApacheHttpClient.access$500(ApacheHttpClient.java:106) ~[apache-client-2.15.40.jar:?]
at software.amazon.awssdk.http.apache.ApacheHttpClient$1.call(ApacheHttpClient.java:232) ~[apache-client-2.15.40.jar:?]
at software.amazon.awssdk.http.apache.ApacheHttpClient$1.call(ApacheHttpClient.java:229) ~[apache-client-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.util.MetricUtils.measureDurationUnsafe(MetricUtils.java:64) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.executeHttpRequest(MakeHttpRequestStage.java:76) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:55) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:39) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:73) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:77) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:39) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:50) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:36) ~[sdk-core-2.15.40.jar:?]
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:64) ~[sdk-core-2.15.40.jar:?]
... 78 more