
[SPARK-45579][CORE] Catch errors for FallbackStorage.copy #43409


Closed
wants to merge 4 commits

Conversation

@ukby1234 (Contributor):

What changes were proposed in this pull request?

As documented in the JIRA ticket, FallbackStorage.copy can throw FileNotFoundException even though we check that the file exists beforehand. This causes the BlockManagerDecommissioner to get stuck in an endless loop and prevents executors from exiting.
We should ignore FileNotFoundException in this case, and set keepRunning to false for all other exceptions so that they are retried.
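
For illustration, a minimal sketch of the proposed handling; the names migrateOnce and copyBlock are hypothetical stand-ins for the real FallbackStorage.copy call inside the BlockManagerDecommissioner migration loop:

import java.io.FileNotFoundException
import scala.util.control.NonFatal

// Hypothetical sketch: copyBlock stands in for FallbackStorage.copy and the
// return value stands in for the keepRunning flag in the migration loop.
object FallbackCopySketch {
  def migrateOnce(copyBlock: () => Unit, shuffleBlockInfo: String): Boolean = {
    var keepRunning = true
    try {
      copyBlock()
    } catch {
      case _: FileNotFoundException =>
        // The file can disappear between the exists() check and the copy;
        // treat the block as migrated instead of retrying forever.
        println(s"Ignoring missing shuffle file for $shuffleBlockInfo")
      case NonFatal(e) =>
        // Any other failure stops this migration thread so the block can be retried.
        println(s"Fallback storage for $shuffleBlockInfo failed: $e")
        keepRunning = false
    }
    keepRunning
  }
}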

Why are the changes needed?

Fix a bug documented in the JIRA ticket

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tests weren't added due to the difficulty of replicating the race condition.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Oct 17, 2023
@dongjoon-hyun (Member):

Thank you for making a PR, @ukby1234.

@dongjoon-hyun (Member) left a comment:

Do you think we can have test coverage here?

class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext {

@ukby1234 (Contributor, Author):

Do you think we can have test coverage here?

class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext {

Added unit test coverage.

case NonFatal(e) =>
  logError(s"Fallback storage for $shuffleBlockInfo failed", e)
  keepRunning = false
}
Contributor:

Drop this? The existing NonFatal block at the end already does this.

Contributor Author:

This is different from the existing NonFatal block: this one lets the failed blocks be retried, whereas the existing one is really a catch-all and leaves some blocks without a retry.

@mridulm (Contributor), Oct 18, 2023:

It was not clear from the PR description that this behavior change was being made.
+CC @dongjoon-hyun, as you know this part better.

Contributor Author:

There isn't a behavior change. If we remove the added NonFatal block, this section won't get executed. That means some shuffle blocks never trigger numMigratedShuffles.incrementAndGet(), and the decommissioner loops forever because numMigratedShuffles never catches up to migratingShuffles.
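
As an illustrative model of that stall (the object name and block names below are assumptions; the real bookkeeping lives in BlockManagerDecommissioner):

import java.util.concurrent.atomic.AtomicInteger

// Toy model of the condition described above, not the real decommissioner code.
object MigrationStallSketch {
  val migratingShuffles = Set("shuffle_0_0", "shuffle_0_1")
  val numMigratedShuffles = new AtomicInteger(0)

  // The decommissioner only reports completion once every enqueued shuffle
  // has been counted as migrated.
  def allShufflesMigrated: Boolean =
    numMigratedShuffles.get() >= migratingShuffles.size

  // If a failed block never triggers numMigratedShuffles.incrementAndGet()
  // and is never retried, allShufflesMigrated stays false and the
  // decommissioner keeps polling forever.
}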

Member:

Is this true?

If we remove the added NonFatal block, this section won't get executed.

We have line 166, don't we?

case e: Exception =>
  logError(s"Error occurred during migrating $shuffleBlockInfo", e)
  keepRunning = false

Do you think you can provide a test case as evidence for your claim, @ukby1234?

Contributor Author:

Well, this exception is thrown in this catch block, so line 166 won't get executed.
I also updated the test "SPARK-45579: abort for other errors" to show this situation.

…missioner.scala

Co-authored-by: Mridul Muralidharan <1591700+mridulm@users.noreply.github.com>
@ukby1234 (Contributor, Author):

Hmm, looks like the SQL test just timed out, and I've already retried a couple of times. cc @dongjoon-hyun

@steveloughran (Contributor):

  1. Does this happen with any fs client other than the s3a one?
  2. Does anyone know why it happens?
  3. There's a PR up to turn off use of the AWS SDK for these uploads, which will switch back to the classic sequential block read/upload algorithm used everywhere else. Reviews encouraged: HADOOP-18925. S3A: option "fs.s3a.optimized.copy.from.local.enabled" to control CopyFromLocalOperation (hadoop#6163)
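
For reference, assuming the HADOOP-18925 option ships under the name quoted above, a job could presumably toggle it through Spark's spark.hadoop.* pass-through; this is a sketch, not a confirmed configuration:

import org.apache.spark.SparkConf

// Assumption: HADOOP-18925 ships "fs.s3a.optimized.copy.from.local.enabled".
// Spark forwards spark.hadoop.* keys into the Hadoop Configuration.
object S3aCopyFromLocalToggle {
  val conf = new SparkConf()
    .setAppName("fallback-storage-example")
    .set("spark.hadoop.fs.s3a.optimized.copy.from.local.enabled", "false")
}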

@ukby1234 (Contributor, Author):

I think I can answer 2). It seems shuffle blocks are deleted between the fs.exists and fs.copyFromLocal calls. From the stack trace linked in the JIRA ticket, it fails inside org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.checkSource.
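
For illustration, a hypothetical simplification of the check-then-copy window being described (the real FallbackStorage code differs in its exact calls):

import java.io.{File, FileNotFoundException}
import org.apache.hadoop.fs.{FileSystem, Path}

// The existence check and the copy are two separate steps, so the local
// shuffle file can be deleted in between; the S3A client then fails inside
// CopyFromLocalOperation.checkSource with FileNotFoundException.
object CheckThenCopySketch {
  def copyIfPresent(fallbackFs: FileSystem, localFile: File, remote: Path): Unit = {
    if (localFile.exists()) {                                                      // check
      try {
        fallbackFs.copyFromLocalFile(new Path(localFile.getAbsolutePath), remote)  // act
      } catch {
        case _: FileNotFoundException =>
          // The file vanished after the exists() check; skip rather than fail.
      }
    }
  }
}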

@steveloughran (Contributor):

@ukby1234 thanks

@ukby1234 (Contributor, Author):

@dongjoon-hyun friendly bump

@ukby1234 ukby1234 requested a review from mridulm February 9, 2024 19:55

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 20, 2024
@github-actions github-actions bot closed this May 21, 2024