[SPARK-18406][CORE][Backport-2.1] Race between end-of-task and completion iterator read lock release #18099
Conversation
…read lock release When a TaskContext is not propagated properly to all child threads for the task, just like the reported cases in this issue, we fail to get the TID from TaskContext, which makes it impossible to release the lock and causes assertion failures. To resolve this, we have to explicitly pass the TID value to the `unlock` method. Add a new failing regression test case in `RDDSuite`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes apache#18076 from jiangxb1987/completion-iterator.
cc @gatorsmile
@@ -454,14 +454,20 @@ private[spark] class BlockManager(
  case Some(info) =>
    val level = info.level
    logDebug(s"Level for block $blockId is $level")
    val taskAttemptId = Option(TaskContext.get()).map(_.taskAttemptId())
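The `taskAttemptId` line in the hunk above captures the task attempt id eagerly on the task's main thread so that the eventual `unlock` can be given the TID explicitly. Below is a minimal, self-contained sketch of that pattern; `ReadLockManager`, `ExplicitTidRelease`, and their signatures are hypothetical stand-ins for illustration, not the actual `BlockInfoManager` API:

```scala
import org.apache.spark.TaskContext

// Hypothetical stand-in for the locking surface of BlockInfoManager; the
// names and signatures below are illustrative, not Spark's private API.
trait ReadLockManager {
  def lockForReading(blockId: String): Iterator[Int]
  def unlock(blockId: String, taskAttemptId: Option[Long]): Unit
}

object ExplicitTidRelease {
  def readWithLock(locks: ReadLockManager, blockId: String): Iterator[Int] = {
    // Capture the task attempt id on the calling (task main) thread.
    // TaskContext.get() is thread-local, so resolving it later from a child
    // thread that never received the context would yield null, and the read
    // lock could not be matched back to this task.
    val taskAttemptId: Option[Long] =
      Option(TaskContext.get()).map(_.taskAttemptId())

    val underlying = locks.lockForReading(blockId)
    var released = false

    // Wrap the iterator so the lock is released with the captured id once
    // the data is fully consumed, regardless of which thread drains it.
    new Iterator[Int] {
      override def hasNext: Boolean = {
        val more = underlying.hasNext
        if (!more && !released) {
          released = true
          locks.unlock(blockId, taskAttemptId)
        }
        more
      }
      override def next(): Int = underlying.next()
    }
  }
}
```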
Hi, @jiangxb1987. This looks like the same way `BlockInfoManager` gets the `taskAttemptId`. What's the difference between getting the `taskAttemptId` here and using `BlockInfoManager.currentTaskAttemptId`?
This gets the `taskAttemptId` from the main thread. If a TaskContext is not propagated properly to all child threads of the task, we would fail to get the `taskAttemptId` in `BlockInfoManager`; see the test case added in this PR.
Got it, thanks for your explanation.
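For readers following this exchange, here is a small local-mode sketch of why the two lookups can differ: `TaskContext.get()` is backed by a thread-local, so a child thread spawned inside a task sees no context unless it is handed over explicitly. The object and app names below are illustrative only:

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

// Local-mode demonstration of the thread-local behaviour discussed above:
// inside a task, TaskContext.get() is set on the task's main thread, but a
// child thread spawned by the task sees null unless the context is handed
// over explicitly.
object TaskContextPropagationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[1]").setAppName("taskcontext-sketch"))
    try {
      sc.parallelize(Seq(1), 1).foreachPartition { _ =>
        // On the task's main thread the context is available.
        val mainThreadTid = Option(TaskContext.get()).map(_.taskAttemptId())

        var childThreadTid: Option[Long] = Some(-1L) // sentinel
        val child = new Thread(new Runnable {
          override def run(): Unit = {
            // The context is not inherited here, so this resolves to None.
            childThreadTid = Option(TaskContext.get()).map(_.taskAttemptId())
          }
        })
        child.start()
        child.join()

        // Typically prints something like: main=Some(0), child=None
        println(s"main=$mainThreadTid, child=$childThreadTid")
      }
    } finally {
      sc.stop()
    }
  }
}
```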
Test build #77311 has finished for PR 18099 at commit.
thanks, merging to 2.1!
…tion iterator read lock release

This is a backport PR of #18076 to 2.1.

## What changes were proposed in this pull request?

When a TaskContext is not propagated properly to all child threads for the task, just like the reported cases in this issue, we fail to get the TID from TaskContext, which makes it impossible to release the lock and causes assertion failures. To resolve this, we have to explicitly pass the TID value to the `unlock` method.

## How was this patch tested?

Add a new failing regression test case in `RDDSuite`.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #18099 from jiangxb1987/completion-iterator-2.1.
Is this fix available in Spark 2.3.1?
The following occurs when I run a lab with ALS in Spark:

8/08/22 21:24:14 ERROR Utils: Uncaught exception in thread stdout writer for python
Same issue in Spark 2.2.1.
This is a backport PR of #18076 to 2.1.

What changes were proposed in this pull request?

When a TaskContext is not propagated properly to all child threads for the task, just like the reported cases in this issue, we fail to get the TID from TaskContext, which makes it impossible to release the lock and causes assertion failures. To resolve this, we have to explicitly pass the TID value to the `unlock` method.

How was this patch tested?

Add a new failing regression test case in `RDDSuite`.
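A runnable sketch of the kind of scenario such a regression test exercises is shown below. It is only an approximation under local mode, not the actual test added to `RDDSuite`: a cached partition's completion iterator is drained by a child thread that has no TaskContext attached, which is exactly the situation where the read lock used to go unreleased before this fix.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the reported scenario (not the actual RDDSuite test): the cached
// block is read back through a completion iterator that is consumed on a
// child thread without a propagated TaskContext.
object CompletionIteratorChildThreadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("completion-iterator-sketch"))
    try {
      val rdd = sc.parallelize(1 to 1000, 2).cache()
      rdd.count() // materialize the cached blocks

      // Second pass reads the cached blocks; each partition is consumed on a
      // child thread to which the TaskContext is deliberately not propagated.
      val total = rdd.mapPartitions { iter =>
        var partial = 0L
        val child = new Thread(new Runnable {
          override def run(): Unit = iter.foreach(partial += _)
        })
        child.start()
        child.join()
        Iterator.single(partial)
      }.reduce(_ + _)

      println(s"sum = $total") // expected: 500500
    } finally {
      sc.stop()
    }
  }
}
```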