Huge number of AWS S3 exceptions "Unable to execute HTTP request: The target server failed to respond" during Iceberg v2 table merge with some DeleteFiles + DataFiles in a partition #8218

Closed as not planned
Description

@huaxiangsun

Apache Iceberg version

1.0.0

Query engine

Spark

Please describe the bug 🐞

We ran into weird AWS S3 exceptions, "Unable to execute HTTP request: The target server failed to respond", during table merge, which caused a lot of task failures.

Here is our case in detail. We have one table with a certain number of partitions. Initially, each partition has one DataFile of ~2 GB.
It is a v2 format table and we use merge-on-read for the daily table update (table merge), using Spark's shuffled hash join with Storage Partitioned Join enabled. All data is in AWS S3.
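
For context, here is a minimal sketch of the kind of table and session configuration described above, assuming a Spark shell with a `spark` SparkSession and the Iceberg runtime on the classpath; the catalog, table and column names are made up for illustration:

```scala
// Hypothetical reconstruction of the setup (names are placeholders):
// a partitioned Iceberg v2 table with merge-on-read write modes,
// plus Storage Partitioned Join enabled on the Spark 3.3 side.
spark.sql("""
  CREATE TABLE my_catalog.db.events (
    id BIGINT,
    payload STRING,
    part INT)
  USING iceberg
  PARTITIONED BY (part)
  TBLPROPERTIES (
    'format-version' = '2',
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode'  = 'merge-on-read')
""")

// Enable Storage Partitioned Join planning (other SPJ-related Spark settings omitted here).
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
```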

For the daily update there are some updates and some new inserts, but no deletes.
A partition starts with a DataFile0.
After day 1's table update (using table merge), it creates 1 DeleteFile-d1, 1 DataFile-d1-update (for updates) and 1 DataFile-d1-insert (for new inserts).
After day 2's table update (using table merge), it creates 1 DeleteFile-d2, 1 DataFile-d2-update (for updates) and 1 DataFile-d2-insert (for new inserts).
After day 3's table update (using table merge), it creates 1 DeleteFile-d3, 1 DataFile-d3-update (for updates) and 1 DataFile-d3-insert (for new inserts).
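
For reference, a hedged sketch of what such a daily merge looks like, reusing the made-up table name from the sketch above; the real statement differs, but the shape (matched rows updated, unmatched rows inserted, nothing deleted) is the same:

```scala
// Hypothetical daily merge (names are placeholders). Under merge-on-read,
// the matched branch produces a positional DeleteFile plus an "update" DataFile,
// and the not-matched branch produces an "insert" DataFile per partition.
spark.sql("""
  MERGE INTO my_catalog.db.events t
  USING daily_changes s
  ON t.id = s.id AND t.part = s.part
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```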

Everything went OK, though there were a very small number of the AWS S3 exceptions mentioned above.

During day 4's table update (using table merge), after each executor finished its first set of tasks and picked up a new set, we observed a huge number of the following AWS S3 exceptions for those new tasks, which caused the tasks to fail. We suspect it is related to the number of DataFiles and DeleteFiles, since there are 7 DataFiles and 1 DeleteFile for one partition, but we could not figure out the root cause and are not sure whether it is caused by the S3 backend or by the AWS SDK client side.
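
To help tell the S3 backend apart from the SDK client side, one idea is a standalone probe that issues the same kind of request the reader is making when it fails (a HEAD via `headObject`, see the stack trace below), outside of Spark. This is only a sketch with placeholder bucket/key values, using the same AWS SDK v2 client:

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.HeadObjectRequest

// Repeat the HEAD call that the Iceberg reader issues
// (S3InputFile.getLength -> BaseS3File.getObjectMetadata -> DefaultS3Client.headObject)
// against one of the affected files, to see whether
// "The target server failed to respond" also reproduces with the bare SDK.
object HeadObjectProbe {
  def main(args: Array[String]): Unit = {
    val s3 = S3Client.builder().build()              // same region/credentials as the Spark job
    val bucket = "my-bucket"                         // placeholder
    val key = "path/to/affected/data-file.parquet"   // placeholder
    (1 to 1000).foreach { i =>
      try {
        val resp = s3.headObject(
          HeadObjectRequest.builder().bucket(bucket).key(key).build())
        if (i % 100 == 0) println(s"$i ok, contentLength=${resp.contentLength()}")
      } catch {
        case e: Exception => println(s"$i failed: ${e.getMessage}")
      }
    }
    s3.close()
  }
}
```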

Attached is the exception stack trace.

```
23/07/27 17:30:05 ERROR BaseReader: Error reading file(s): s3a://****
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: The target server failed to respond
        at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:43) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:205) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:66) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:34) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:48) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:31) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:193) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:133) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:159) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:112) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:167) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:94) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:55) ~[aws-core-2.15.40.jar:?]
        at software.amazon.awssdk.services.s3.DefaultS3Client.headObject(DefaultS3Client.java:5011) ~[s3-2.15.40.jar:?]
        at org.apache.iceberg.aws.s3.BaseS3File.getObjectMetadata(BaseS3File.java:85) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.aws.s3.S3InputFile.getLength(S3InputFile.java:75) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.parquet.ParquetIO$ParquetInputFile.getLength(ParquetIO.java:179) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:534) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:777) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:658) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.parquet.ReadConf.newReader(ReadConf.java:245) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:81) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.parquet.ParquetReader.init(ParquetReader.java:71) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:91) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:37) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.util.Filter.lambda$filter$0(Filter.java:34) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.io.CloseableIterable$2.iterator(CloseableIterable.java:72) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.io.CloseableIterable$7$1.<init>(CloseableIterable.java:188) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.io.CloseableIterable$7.iterator(CloseableIterable.java:187) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.hasNext(CloseableIterable.java:257) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at java.lang.Iterable.forEach(Iterable.java:74) ~[?:?]
        at org.apache.iceberg.deletes.Deletes.toPositionIndex(Deletes.java:138) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.deletes.Deletes.toPositionIndex(Deletes.java:132) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.data.DeleteFilter.applyPosDeletes(DeleteFilter.java:250) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.data.DeleteFilter.filter(DeleteFilter.java:154) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.spark.source.RowDataReader.open(RowDataReader.java:92) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.spark.source.RowDataReader.open(RowDataReader.java:42) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:135) ~[iceberg-spark-runtime-3.3_2.12-1.0.15.jar]
        at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119) ~[spark-sql_2.12-3.3.0.48.jar]
        at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156) ~[spark-sql_2.12-3.3.0.48.jar]
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0.48.jar]
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0.48.jar]
        at scala.Option.exists(Option.scala:376) ~[scala-library-2.12.15.jar:?]
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0.48.jar]
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:97) ~[spark-sql_2.12-3.3.0.48.jar]
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63) ~[spark-sql_2.12-3.3.0.48.jar]
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.12-3.3.0.48.jar]
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) ~[spark-sql_2.12-3.3.0.48.jar]
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) ~[spark-sql_2.12-3.3.0.48.jar]
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) ~[scala-library-2.12.15.jar:?]
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[scala-library-2.12.15.jar:?]
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513) ~[scala-library-2.12.15.jar:?]
        at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225) ~[spark-sql_2.12-3.3.0.48.jar]
        at org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119) ~[spark-sql_2.12-3.3.0.48.jar]
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) ~[spark-core_2.12-3.3.0.48.jar]
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) ~[spark-core_2.12-3.3.0.48.jar]
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.3.0.48.jar]
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.0.48.jar]
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.0.48.jar]
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[spark-core_2.12-3.3.0.48.jar]
        at org.apache.spark.scheduler.Task.run(Task.scala:136) ~[spark-core_2.12-3.3.0.48.jar]
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.0.48.jar]
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1513) ~[spark-core_2.12-3.3.0.48.jar]
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.0.48.jar]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]

Caused by: org.apache.http.NoHttpResponseException: The target server failed to respond
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141) ~[httpclient-4.5.13.jar:4.5.13]
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.13.jar:4.5.13]
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.14.jar:4.4.14]
        at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.14.jar:4.4.14]
        at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157) ~[httpclient-4.5.13.jar:4.5.13]
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.14.jar:4.4.14]
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.14.jar:4.4.14]
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.13.jar:4.5.13]
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar:4.5.13]
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar:4.5.13]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar:4.5.13]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.13.jar:4.5.13]
        at software.amazon.awssdk.http.apache.internal.impl.ApacheSdkHttpClient.execute(ApacheSdkHttpClient.java:72) ~[apache-client-2.15.40.jar:?]
        at software.amazon.awssdk.http.apache.ApacheHttpClient.execute(ApacheHttpClient.java:253) ~[apache-client-2.15.40.jar:?]
        at software.amazon.awssdk.http.apache.ApacheHttpClient.access$500(ApacheHttpClient.java:106) ~[apache-client-2.15.40.jar:?]
        at software.amazon.awssdk.http.apache.ApacheHttpClient$1.call(ApacheHttpClient.java:232) ~[apache-client-2.15.40.jar:?]
        at software.amazon.awssdk.http.apache.ApacheHttpClient$1.call(ApacheHttpClient.java:229) ~[apache-client-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.util.MetricUtils.measureDurationUnsafe(MetricUtils.java:64) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.executeHttpRequest(MakeHttpRequestStage.java:76) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:55) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:39) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:73) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:77) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:39) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:50) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:36) ~[sdk-core-2.15.40.jar:?]
        at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:64) ~[sdk-core-2.15.40.jar:?]
        ... 78 more
```


