[SPARK-45579][CORE] Catch errors for FallbackStorage.copy #43409
Conversation
Thank you for making a PR, @ukby1234.
Do you think we can add test coverage here?
```scala
class FallbackStorageSuite extends SparkFunSuite with LocalSparkContext {
```
Added unit test coverage.
```scala
case NonFatal(e) =>
  logError(s"Fallback storage for $shuffleBlockInfo failed", e)
  keepRunning = false
}
```
Drop this? The existing `NonFatal` block at the end does this currently.
This is different from the existing `NonFatal` block because it will retry the failed blocks, whereas the existing one is really a catch-all that leaves some blocks unretried.
It was not clear from the PR description that this behavior change was being made.
+CC @dongjoon-hyun as you know this part better.
There isn't a behavior change. If we remove the added `NonFatal` block, this section won't get executed. That means some shuffle blocks never trigger `numMigratedShuffles.incrementAndGet()`, and the decommissioner loops forever because `numMigratedShuffles` is always less than `migratingShuffles`.
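To make the stuck condition concrete, here is a minimal, self-contained sketch. The names mirror the fields mentioned above, but this is a simplification of the decommissioner's bookkeeping, not the actual Spark code:

```scala
import java.util.concurrent.atomic.AtomicInteger

object StuckDecommissionSketch extends App {
  // Simplified stand-ins for the decommissioner's bookkeeping.
  val migratingShuffles = Set("migrate_shuffle_0_0") // blocks queued for migration
  val numMigratedShuffles = new AtomicInteger(0)     // blocks actually accounted for

  // The decommissioner keeps waiting until every queued shuffle is counted.
  def allShufflesMigrated: Boolean =
    numMigratedShuffles.get() >= migratingShuffles.size

  // If FallbackStorage.copy throws and nothing handles the block,
  // incrementAndGet() is never called and this stays false forever.
  assert(!allShufflesMigrated)

  // Once the failed block is handled (retried or counted), the wait can end.
  numMigratedShuffles.incrementAndGet()
  assert(allShufflesMigrated)
}
```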
Is this true?

> If we remove the added NonFatal block, this section won't get executed.

We have line 166, don't we?
spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala
Lines 166 to 168 in 2ab7aa8
```scala
case e: Exception =>
  logError(s"Error occurred during migrating $shuffleBlockInfo", e)
  keepRunning = false
```
Do you think you can provide a test case as evidence for your claim, @ukby1234?
Well, this exception is thrown inside this catch block, so line 166 won't get executed. I also updated the test "SPARK-45579: abort for other errors" to show this situation.
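To illustrate the control flow being described, here is a generic Scala sketch (not the Spark code): an exception thrown from inside one `case` of a `catch` block is not handled by a sibling `case` of the same block; it propagates to the caller.

```scala
import java.io.{FileNotFoundException, IOException}

object CatchSemanticsSketch extends App {
  try {
    try {
      throw new IOException("initial failure")
    } catch {
      case _: IOException =>
        // Analogous to FallbackStorage.copy failing inside the handler:
        throw new FileNotFoundException("thrown from within the catch block")
      case e: Exception =>
        // Analogous to line 166: never reached for the exception above.
        println(s"sibling case caught: $e")
    }
  } catch {
    case e: FileNotFoundException =>
      println(s"escaped the inner catch block: $e") // this is what prints
  }
}
```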
…missioner.scala Co-authored-by: Mridul Muralidharan <1591700+mridulm@users.noreply.github.com>
Hmm, looks like the SQL test just timed out, and I've retried a couple of times already. cc @dongjoon-hyun
I think I can answer 2). It seems shuffle blocks are deleted in between the
@ukby1234 thanks
@dongjoon-hyun friendly bump
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
As documented in the JIRA ticket, FallbackStorage.copy sometimes throws FileNotFoundException even though we check that the file exists. This causes the BlockManagerDecommissioner to get stuck in an endless loop and prevents executors from exiting.
We should ignore any FileNotFoundException in this case, and set keepRunning to false for all other exceptions so the failed blocks can be retried.
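A minimal, runnable sketch of the proposed handling. `copyToFallback` and the surrounding names are hypothetical stand-ins for FallbackStorage.copy and the migrating block; the real change lives in BlockManagerDecommissioner's shuffle-migration loop:

```scala
import java.io.FileNotFoundException
import scala.util.control.NonFatal

object FallbackCopySketch extends App {
  val shuffleBlockInfo = "migrate_shuffle_0_0"
  var keepRunning = true

  // Stand-in for FallbackStorage.copy, which can race with shuffle cleanup.
  def copyToFallback(block: String): Unit =
    throw new FileNotFoundException(s"$block was deleted concurrently")

  try {
    copyToFallback(shuffleBlockInfo)
  } catch {
    case e: FileNotFoundException =>
      // The file vanished between the existence check and the copy (the race
      // from the JIRA); treat the block as handled instead of looping forever.
      println(s"Ignoring missing block $shuffleBlockInfo: ${e.getMessage}")
    case NonFatal(e) =>
      // Any other error: stop this migration thread so the block is retried
      // rather than silently dropped.
      println(s"Fallback storage for $shuffleBlockInfo failed: $e")
      keepRunning = false
  }
}
```

Note that the FileNotFoundException case must come before the `NonFatal` case, since `NonFatal` would otherwise match it as well.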
Why are the changes needed?
Fixes a bug documented in the JIRA ticket.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Tests weren't added due to the difficulty of replicating the race condition.
Was this patch authored or co-authored using generative AI tooling?
No