
[SPARK-45579][CORE] Catch errors for FallbackStorage.copy #43409


Closed
wants to merge 4 commits

Conversation

@ukby1234 (Contributor):

What changes were proposed in this pull request?

As documented in the JIRA ticket, FallbackStorage.copy can throw FileNotFoundException even though we check that the file exists beforehand. This causes the BlockManagerDecommissioner to get stuck in an endless loop and prevents executors from exiting.
We should ignore FileNotFoundException in this case, and set keepRunning to false for all other exceptions so that they are retried.
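
For illustration, a minimal sketch of the proposed handling; the names migrateOnce and copyBlock are hypothetical stand-ins for the real FallbackStorage.copy call inside the BlockManagerDecommissioner migration loop:

import java.io.FileNotFoundException
import scala.util.control.NonFatal

// Hypothetical sketch: copyBlock stands in for FallbackStorage.copy and the
// return value stands in for the keepRunning flag in the migration loop.
object FallbackCopySketch {
  def migrateOnce(copyBlock: () => Unit, shuffleBlockInfo: String): Boolean = {
    var keepRunning = true
    try {
      copyBlock()
    } catch {
      case _: FileNotFoundException =>
        // The file can disappear between the exists() check and the copy;
        // treat the block as migrated instead of retrying forever.
        println(s"Ignoring missing shuffle file for $shuffleBlockInfo")
      case NonFatal(e) =>
        // Any other failure stops this migration thread so the block can be retried.
        println(s"Fallback storage for $shuffleBlockInfo failed: $e")
        keepRunning = false
    }
    keepRunning
  }
}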

Why are the changes needed?

Fix a bug documented in the JIRA ticket

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tests weren't added due to the difficulty of replicating the race condition.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Oct 17, 2023
@dongjoon-hyun (Member):

Thank you for making a PR, @ukby1234.

@dongjoon-hyun (Member) left a comment:

Do you think we can have test coverage here?

class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext {

@ukby1234 (Contributor, Author):

Do you think we can have test coverage here?

class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext {

Added unit test coverage.

case NonFatal(e) =>
  logError(s"Fallback storage for $shuffleBlockInfo failed", e)
  keepRunning = false
}
Contributor:

Drop this? The existing NonFatal block at the end already does this.

Contributor Author:

This is different from the existing NonFatal block: this one lets the failed blocks be retried, whereas the existing one is really a catch-all and leaves some blocks without a retry.

@mridulm (Contributor), Oct 18, 2023:

It was not clear from the PR description that this behavior change was being made.
+CC @dongjoon-hyun, as you know this part better.

Contributor Author:

There isn't a behavior change. If we remove the added NonFatal block, this section won't get executed. That means some shuffle blocks never trigger numMigratedShuffles.incrementAndGet(), and the decommissioner loops forever because numMigratedShuffles never catches up to migratingShuffles.
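
As an illustrative model of that stall (the object name and block names below are assumptions; the real bookkeeping lives in BlockManagerDecommissioner):

import java.util.concurrent.atomic.AtomicInteger

// Toy model of the condition described above, not the real decommissioner code.
object MigrationStallSketch {
  val migratingShuffles = Set("shuffle_0_0", "shuffle_0_1")
  val numMigratedShuffles = new AtomicInteger(0)

  // The decommissioner only reports completion once every enqueued shuffle
  // has been counted as migrated.
  def allShufflesMigrated: Boolean =
    numMigratedShuffles.get() >= migratingShuffles.size

  // If a failed block never triggers numMigratedShuffles.incrementAndGet()
  // and is never retried, allShufflesMigrated stays false and the
  // decommissioner keeps polling forever.
}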

Member:

Is this true?

If we remove the added NonFatal block, this section won't get executed.

We have line 166, don't we?

case e: Exception =>
  logError(s"Error occurred during migrating $shuffleBlockInfo", e)
  keepRunning = false

Do you think you can provide a test case as evidence for your claim, @ukby1234?

Contributor Author:

Well, this exception is thrown in this catch block, so line 166 won't get executed.
I also updated the test "SPARK-45579: abort for other errors" to show this situation.

…missioner.scala

Co-authored-by: Mridul Muralidharan <1591700+mridulm@users.noreply.github.com>
@ukby1234 (Contributor, Author):

Hmm, looks like the SQL test just timed out, and I've already retried a couple of times. cc @dongjoon-hyun

@steveloughran (Contributor):

  1. Does this happen with any fs client other than the s3a one?
  2. Does anyone know why it happens?
  3. There's a PR up to turn off use of the AWS SDK for these uploads, which will switch back to the classic sequential block read/upload algorithm used everywhere else. Reviews encouraged: HADOOP-18925. S3A: option "fs.s3a.optimized.copy.from.local.enabled" to control CopyFromLocalOperation (hadoop#6163)
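
For reference, assuming the HADOOP-18925 option ships under the name quoted above, a job could presumably toggle it through Spark's spark.hadoop.* pass-through; this is a sketch, not a confirmed configuration:

import org.apache.spark.SparkConf

// Assumption: HADOOP-18925 ships "fs.s3a.optimized.copy.from.local.enabled".
// Spark forwards spark.hadoop.* keys into the Hadoop Configuration.
object S3aCopyFromLocalToggle {
  val conf = new SparkConf()
    .setAppName("fallback-storage-example")
    .set("spark.hadoop.fs.s3a.optimized.copy.from.local.enabled", "false")
}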

@ukby1234 (Contributor, Author):

I think I can answer 2). It seems shuffle blocks are deleted between the fs.exists and fs.copyFromLocal calls. From the stack trace linked in the JIRA ticket, it fails inside org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.checkSource.
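
For illustration, a hypothetical simplification of the check-then-copy window being described (the real FallbackStorage code differs in its exact calls):

import java.io.{File, FileNotFoundException}
import org.apache.hadoop.fs.{FileSystem, Path}

// The existence check and the copy are two separate steps, so the local
// shuffle file can be deleted in between; the S3A client then fails inside
// CopyFromLocalOperation.checkSource with FileNotFoundException.
object CheckThenCopySketch {
  def copyIfPresent(fallbackFs: FileSystem, localFile: File, remote: Path): Unit = {
    if (localFile.exists()) {                                                      // check
      try {
        fallbackFs.copyFromLocalFile(new Path(localFile.getAbsolutePath), remote)  // act
      } catch {
        case _: FileNotFoundException =>
          // The file vanished after the exists() check; skip rather than fail.
      }
    }
  }
}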

@steveloughran (Contributor):

@ukby1234 thanks

@ukby1234 (Contributor, Author):

@dongjoon-hyun friendly bump

@ukby1234 ukby1234 requested a review from mridulm February 9, 2024 19:55

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 20, 2024
@github-actions github-actions bot closed this May 21, 2024