Skip to content

HADOOP-16798. S3A Committer thread pool shutdown problems. #1963

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

steveloughran
Copy link
Contributor

Contributed by Steve Loughran.

Fixes a condition which can cause job commit to fail if a task was
aborted < 60s before the job commit commenced: the task abort
will shut down the thread pool with a hard exit after 60s; the
job commit POST requests would be scheduled through the same pool,
so be interrupted and fail. At present the access is synchronized,
but presumably the executor shutdown code is calling wait() and releasing
locks.

Task abort is triggered from the AM when task attempts succeed but
there are still active speculative task attempts running. Thus it
only surfaces when speculation is enabled and the final tasks are
speculating, which, given they are the stragglers, is not
unheard of.

The fix copies and clears the threadPool field in a synchronized block,
then shuts it down; job commit will encounter the empty field and
demand-create a new one. As would a sequence of task aborts.

@steveloughran
Copy link
Contributor Author

Test run against london with exactly the same flags which triggered the failure -you have to have an overloaded test system to cause the execution delays which trigger speculative task execution and then this failure. All good, at least on this run.

-Dparallel-tests -DtestsThreadCount=8 -Dscale -Dmarkers=keep -Dauth

Rerunning with 12 threads to see if that can trigger it

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 39s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 19m 53s trunk passed
+1 💚 compile 0m 34s trunk passed
+1 💚 checkstyle 0m 26s trunk passed
+1 💚 mvnsite 0m 40s trunk passed
+1 💚 shadedclient 15m 5s branch has no errors when building and testing our client artifacts.
+1 💚 javadoc 0m 29s trunk passed
+0 🆗 spotbugs 1m 0s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 0m 57s trunk passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 32s the patch passed
+1 💚 compile 0m 28s the patch passed
+1 💚 javac 0m 28s the patch passed
+1 💚 checkstyle 0m 19s the patch passed
+1 💚 mvnsite 0m 33s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 shadedclient 13m 51s patch has no errors when building and testing our client artifacts.
+1 💚 javadoc 0m 26s the patch passed
+1 💚 findbugs 1m 3s the patch passed
_ Other Tests _
+1 💚 unit 1m 27s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 32s The patch does not generate ASF License warnings.
59m 24s
Subsystem Report/Notes
Docker Client=19.03.8 Server=19.03.8 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1963/1/artifact/out/Dockerfile
GITHUB PR #1963
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
uname Linux f3bf80f4b706 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 3601054
Default Java 1.8.0_242
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-1963/1/testReport/
Max. process+thread count 415 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-1963/1/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

@steveloughran
Copy link
Contributor Author

run full suite 2x more; no recreation of the failure

@jojochuang jojochuang self-requested a review April 20, 2020 20:15
Copy link
Contributor

@jojochuang jojochuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably missed something but I fail to see how this'll address the problem.

IMO the problem is that buildThreadPool() returns a reference to the thread pool object to its caller, and therefore the caller can use the thread pool without holding the lock on the AbstractS3ACommitter.

If that's the case, immediately reseting the threadpool reference in the destroyThreadPool() isn't going to help here because the caller can always access the thread pool object at anytime (e.g. during shutdown).

@steveloughran
Copy link
Contributor Author

hmm. you've probably got a point.

what If I make the submit() operation part of the sync block so you can never submit work while another thread is shutting it down?

@apache apache deleted a comment from hadoop-yetus Jun 2, 2020
@steveloughran
Copy link
Contributor Author

yetus didnt run the last patch. i will rebase and resubmit to force it

Contributed by Steve Loughran.

Fixes a condition which can cause job commit to fail if a task was
aborted < 60s before the job commit commenced: the task abort
will shut down the thread pool with a hard exit after 60s; the
job commit POST requests would be scheduled through the same pool,
so be interrupted and fail. At present the access is synchronized,
but presumably the executor shutdown code is calling wait() and releasing
locks.

Task abort is triggered from the AM when task attempts succeed but
there are still active speculative task attempts running. Thus it
only surfaces when speculation is enabled and the final tasks are
speculating, which, given they are the stragglers, is not
unheard of.

The fix copies and clears the threadPool field in a synchronized block,
then shuts it down; job commit will encounter the empty field and
demand-create a new one. As would a sequence of task aborts.

Change-Id: I23fc1c4bf112d2c36bd9f1baf76cf5057e4f10e1
rather than pass around the thread pool, a "Submitter' is supplied
to Tasks; there's a disabled one -which triggers the downgrade to single
threaded execution, and a multithreaded one which submits work through
the committer's (now private) thread pool. As submission is in
a synchronized block, the race condition should be blocked

Change-Id: I9dbb6f6eb5a32c3b7bc8d638eae93d38261c9d6c
Rework previous commit for a more minimal design -one which reduces the diff
with the current code.

Change-Id: Ic2a1a05d29f6974eb14f9279fbb0575e3f4ddd24
@steveloughran steveloughran force-pushed the s3/HADOOP-16798-committer-race branch from 0c86f35 to 64e31d0 Compare June 2, 2020 12:53
@hadoop-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 20m 57s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 21m 22s trunk passed
+1 💚 compile 0m 32s trunk passed
+1 💚 checkstyle 0m 23s trunk passed
+1 💚 mvnsite 0m 35s trunk passed
+1 💚 shadedclient 15m 58s branch has no errors when building and testing our client artifacts.
+1 💚 javadoc 0m 25s trunk passed
+0 🆗 spotbugs 1m 0s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 0m 58s trunk passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 31s the patch passed
+1 💚 compile 0m 25s the patch passed
+1 💚 javac 0m 25s the patch passed
-0 ⚠️ checkstyle 0m 17s hadoop-tools/hadoop-aws: The patch generated 1 new + 2 unchanged - 0 fixed = 3 total (was 2)
+1 💚 mvnsite 0m 31s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 shadedclient 15m 8s patch has no errors when building and testing our client artifacts.
+1 💚 javadoc 0m 22s the patch passed
+1 💚 findbugs 1m 3s the patch passed
_ Other Tests _
+1 💚 unit 1m 26s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 28s The patch does not generate ASF License warnings.
82m 30s
Subsystem Report/Notes
Docker ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1963/4/artifact/out/Dockerfile
GITHUB PR #1963
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
uname Linux 5c76d9e4f62f 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 9fe4c37
Default Java Private Build-1.8.0_252-8u252-b09-1~18.04-b09
checkstyle https://builds.apache.org/job/hadoop-multibranch/job/PR-1963/4/artifact/out/diff-checkstyle-hadoop-tools_hadoop-aws.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-1963/4/testReport/
Max. process+thread count 346 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-1963/4/console
versions git=2.17.1 maven=3.6.0 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran
Copy link
Contributor Author

@jojochuang you had a chance to look at the latest changes?

Copy link
Contributor

@jojochuang jojochuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I was late. LGTM. Thank you!

@steveloughran steveloughran merged commit 4249c04 into apache:trunk Jun 30, 2020
asfgit pushed a commit that referenced this pull request Jun 30, 2020
Contributed by Steve Loughran.

Fixes a condition which can cause job commit to fail if a task was
aborted < 60s before the job commit commenced: the task abort
will shut down the thread pool with a hard exit after 60s; the
job commit POST requests would be scheduled through the same pool,
so be interrupted and fail. At present the access is synchronized,
but presumably the executor shutdown code is calling wait() and releasing
locks.

Task abort is triggered from the AM when task attempts succeed but
there are still active speculative task attempts running. Thus it
only surfaces when speculation is enabled and the final tasks are
speculating, which, given they are the stragglers, is not unheard of.

Note: this problem has never been seen in production; it has surfaced
in the hadoop-aws tests on a heavily overloaded desktop

Change-Id: I3b433356d01fcc50d88b4353dbca018484984bc8
@steveloughran
Copy link
Contributor Author

thx. merged to trunk & 3.3

@steveloughran steveloughran deleted the s3/HADOOP-16798-committer-race branch October 15, 2021 19:41
jojochuang pushed a commit to jojochuang/hadoop that referenced this pull request May 23, 2023
apache#1963)

Contributed by Steve Loughran.

Fixes a condition which can cause job commit to fail if a task was
aborted < 60s before the job commit commenced: the task abort
will shut down the thread pool with a hard exit after 60s; the
job commit POST requests would be scheduled through the same pool,
so be interrupted and fail. At present the access is synchronized,
but presumably the executor shutdown code is calling wait() and releasing
locks.

Task abort is triggered from the AM when task attempts succeed but
there are still active speculative task attempts running. Thus it
only surfaces when speculation is enabled and the final tasks are
speculating, which, given they are the stragglers, is not unheard of.

Note: this problem has never been seen in production; it has surfaced
in the hadoop-aws tests on a heavily overloaded desktop

Change-Id: Ib1574f2ca696e60594bb06f05e06193b74fd9a4e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants