HADOOP-16798. S3A Committer thread pool shutdown problems. #1963
Conversation
Test run against london with exactly the same flags which triggered the failure; you have to have an overloaded test system to cause the execution delays which trigger speculative task execution and then this failure. All good, at least on this run.
Rerunning with 12 threads to see if that can trigger it.
💔 -1 overall
This message was automatically generated.
Ran the full suite 2x more; no recreation of the failure.
I probably missed something but I fail to see how this'll address the problem.
IMO the problem is that buildThreadPool() returns a reference to the thread pool object to its caller, and therefore the caller can use the thread pool without holding the lock on the AbstractS3ACommitter.
If that's the case, immediately resetting the threadPool reference in destroyThreadPool() isn't going to help here, because the caller can still access the thread pool object at any time (e.g. during shutdown).
Hmm, you've probably got a point. What if I make the submit() operation part of the sync block, so you can never submit work while another thread is shutting it down?
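To make that proposal concrete, here is a minimal, hypothetical sketch (not the actual AbstractS3ACommitter code; the class name, pool size and pool type are illustrative) of a submit() that shares a lock with the shutdown path, so work can never be submitted while another thread is destroying the pool:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SynchronizedSubmitSketch {
  private ExecutorService threadPool;   // guarded by "this"

  /** Submit under the lock; cannot interleave with destroyThreadPool(). */
  synchronized Future<?> submit(Runnable work) {
    if (threadPool == null) {
      threadPool = Executors.newFixedThreadPool(4);  // illustrative pool size
    }
    return threadPool.submit(work);
  }

  /** Shutdown takes the same lock, so a submit can never slip in part-way. */
  synchronized void destroyThreadPool() {
    if (threadPool != null) {
      threadPool.shutdown();
      threadPool = null;
    }
  }
}
```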
Yetus didn't run the last patch. I will rebase and resubmit to force it.
Contributed by Steve Loughran.

Fixes a condition which can cause job commit to fail if a task was aborted < 60s before the job commit commenced: the task abort will shut down the thread pool with a hard exit after 60s; the job commit POST requests would be scheduled through the same pool, so be interrupted and fail. At present the access is synchronized, but presumably the executor shutdown code is calling wait() and releasing locks.

Task abort is triggered from the AM when task attempts succeed but there are still active speculative task attempts running. Thus it only surfaces when speculation is enabled and the final tasks are speculating, which, given they are the stragglers, is not unheard of.

The fix copies and clears the threadPool field in a synchronized block, then shuts it down; job commit will encounter the empty field and demand-create a new one. As would a sequence of task aborts.

Change-Id: I23fc1c4bf112d2c36bd9f1baf76cf5057e4f10e1
Rather than pass around the thread pool, a "Submitter" is supplied to Tasks; there's a disabled one, which triggers the downgrade to single-threaded execution, and a multithreaded one which submits work through the committer's (now private) thread pool. As submission is in a synchronized block, the race condition should be blocked.

Change-Id: I9dbb6f6eb5a32c3b7bc8d638eae93d38261c9d6c
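As a rough illustration of that design, the sketch below shows the shape of such a Submitter and how a run loop might fall back to single-threaded execution when none is supplied. The interface and class names here are assumptions for illustration only; the real Tasks framework in the S3A committer differs in detail, and its pooled submitter would route through a synchronized submit like the one sketched above.

```java
import java.util.concurrent.Future;

/** The shape of the submitter handed to the task runner in this sketch. */
interface Submitter {
  Future<?> submit(Runnable task);
}

/** Simplified run loop: no submitter means single-threaded, inline execution. */
class TasksSketch {
  static void run(Iterable<Runnable> items, Submitter submitter) {
    for (Runnable item : items) {
      if (submitter == null) {
        item.run();                 // downgrade to single-threaded execution
      } else {
        submitter.submit(item);     // parallel, via the committer's private pool
      }
    }
    // the real code also waits for completion and aggregates failures
  }
}
```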
Rework the previous commit for a more minimal design, one which reduces the diff with the current code.

Change-Id: Ic2a1a05d29f6974eb14f9279fbb0575e3f4ddd24
Force-pushed the branch from 0c86f35 to 64e31d0.
🎊 +1 overall
This message was automatically generated.
@jojochuang have you had a chance to look at the latest changes?
Sorry I was late. LGTM. Thank you!
Contributed by Steve Loughran.

Fixes a condition which can cause job commit to fail if a task was aborted < 60s before the job commit commenced: the task abort will shut down the thread pool with a hard exit after 60s; the job commit POST requests would be scheduled through the same pool, so be interrupted and fail. At present the access is synchronized, but presumably the executor shutdown code is calling wait() and releasing locks.

Task abort is triggered from the AM when task attempts succeed but there are still active speculative task attempts running. Thus it only surfaces when speculation is enabled and the final tasks are speculating, which, given they are the stragglers, is not unheard of.

Note: this problem has never been seen in production; it has surfaced in the hadoop-aws tests on a heavily overloaded desktop.

Change-Id: I3b433356d01fcc50d88b4353dbca018484984bc8
Thanks; merged to trunk & 3.3.
HADOOP-16798. S3A Committer thread pool shutdown problems. (apache#1963)

Contributed by Steve Loughran.

Fixes a condition which can cause job commit to fail if a task was aborted < 60s before the job commit commenced: the task abort will shut down the thread pool with a hard exit after 60s; the job commit POST requests would be scheduled through the same pool, so be interrupted and fail. At present the access is synchronized, but presumably the executor shutdown code is calling wait() and releasing locks.

Task abort is triggered from the AM when task attempts succeed but there are still active speculative task attempts running. Thus it only surfaces when speculation is enabled and the final tasks are speculating, which, given they are the stragglers, is not unheard of.

Note: this problem has never been seen in production; it has surfaced in the hadoop-aws tests on a heavily overloaded desktop.

Change-Id: Ib1574f2ca696e60594bb06f05e06193b74fd9a4e
Contributed by Steve Loughran.
Fixes a condition which can cause job commit to fail if a task was
aborted < 60s before the job commit commenced: the task abort
will shut down the thread pool with a hard exit after 60s; the
job commit POST requests would be scheduled through the same pool,
so be interrupted and fail. At present the access is synchronized,
but presumably the executor shutdown code is calling wait() and releasing
locks.
Task abort is triggered from the AM when task attempts succeed but
there are still active speculative task attempts running. Thus it
only surfaces when speculation is enabled and the final tasks are
speculating, which, given they are the stragglers, is not
unheard of.
The fix copies and clears the threadPool field in a synchronized block,
then shuts it down; job commit will encounter the empty field and
demand-create a new one. As would a sequence of task aborts.
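A minimal sketch of that copy-and-clear pattern, assuming an illustrative class and a hypothetical getThreadPool() helper (the real AbstractS3ACommitter code differs in naming and pool construction): the field is copied and nulled under the lock so that a later job commit demand-creates a fresh pool instead of reusing one that is being shut down.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PoolLifecycleSketch {
  private ExecutorService threadPool;   // guarded by "this"

  /** Demand-create: a caller arriving after destroyThreadPool() gets a fresh pool. */
  synchronized ExecutorService getThreadPool(int threads) {
    if (threadPool == null) {
      threadPool = Executors.newFixedThreadPool(threads);
    }
    return threadPool;
  }

  /** Copy and clear the field in the synchronized block, then shut the copy down. */
  synchronized void destroyThreadPool() {
    ExecutorService pool = threadPool;
    threadPool = null;                  // next caller demand-creates a new pool
    if (pool != null) {
      pool.shutdown();
    }
  }
}
```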