HADOOP-16798. S3A Committer thread pool shutdown problems. #1963

steveloughran · 2020-04-17T17:40:59Z

Contributed by Steve Loughran.

Fixes a condition which can cause job commit to fail if a task was
aborted < 60s before the job commit commenced: the task abort
will shut down the thread pool with a hard exit after 60s; the
job commit POST requests would be scheduled through the same pool,
so be interrupted and fail. At present the access is synchronized,
but presumably the executor shutdown code is calling wait() and releasing
locks.

Task abort is triggered from the AM when task attempts succeed but
there are still active speculative task attempts running. Thus it
only surfaces when speculation is enabled and the final tasks are
speculating, which, given they are the stragglers, is not
unheard of.

The fix copies and clears the threadPool field in a synchronized block,
then shuts it down; job commit will encounter the empty field and
demand-create a new one. As would a sequence of task aborts.

steveloughran · 2020-04-17T17:43:34Z

Test run against london with exactly the same flags which triggered the failure -you have to have an overloaded test system to cause the execution delays which trigger speculative task execution and then this failure. All good, at least on this run.

-Dparallel-tests -DtestsThreadCount=8 -Dscale -Dmarkers=keep -Dauth

Rerunning with 12 threads to see if that can trigger it

hadoop-yetus · 2020-04-17T18:41:24Z

💔 -1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	0m 39s	Docker mode activated.
		_ Prechecks _
+1 💚	dupname	0m 0s	No case conflicting files found.
+1 💚	@author	0m 0s	The patch does not contain any @author tags.
-1 ❌	test4tests	0m 0s	The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
		_ trunk Compile Tests _
+1 💚	mvninstall	19m 53s	trunk passed
+1 💚	compile	0m 34s	trunk passed
+1 💚	checkstyle	0m 26s	trunk passed
+1 💚	mvnsite	0m 40s	trunk passed
+1 💚	shadedclient	15m 5s	branch has no errors when building and testing our client artifacts.
+1 💚	javadoc	0m 29s	trunk passed
+0 🆗	spotbugs	1m 0s	Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚	findbugs	0m 57s	trunk passed
		_ Patch Compile Tests _
+1 💚	mvninstall	0m 32s	the patch passed
+1 💚	compile	0m 28s	the patch passed
+1 💚	javac	0m 28s	the patch passed
+1 💚	checkstyle	0m 19s	the patch passed
+1 💚	mvnsite	0m 33s	the patch passed
+1 💚	whitespace	0m 0s	The patch has no whitespace issues.
+1 💚	shadedclient	13m 51s	patch has no errors when building and testing our client artifacts.
+1 💚	javadoc	0m 26s	the patch passed
+1 💚	findbugs	1m 3s	the patch passed
		_ Other Tests _
+1 💚	unit	1m 27s	hadoop-aws in the patch passed.
+1 💚	asflicense	0m 32s	The patch does not generate ASF License warnings.
		59m 24s

Subsystem	Report/Notes
Docker	Client=19.03.8 Server=19.03.8 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1963/1/artifact/out/Dockerfile
GITHUB PR	#1963
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
uname	Linux f3bf80f4b706 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	personality/hadoop.sh
git revision	trunk / 3601054
Default Java	1.8.0_242
Test Results	https://builds.apache.org/job/hadoop-multibranch/job/PR-1963/1/testReport/
Max. process+thread count	415 (vs. ulimit of 5500)
modules	C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output	https://builds.apache.org/job/hadoop-multibranch/job/PR-1963/1/console
versions	git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by	Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

steveloughran · 2020-04-20T19:49:22Z

run full suite 2x more; no recreation of the failure

jojochuang

I probably missed something but I fail to see how this'll address the problem.

IMO the problem is that buildThreadPool() returns a reference to the thread pool object to its caller, and therefore the caller can use the thread pool without holding the lock on the AbstractS3ACommitter.

If that's the case, immediately reseting the threadpool reference in the destroyThreadPool() isn't going to help here because the caller can always access the thread pool object at anytime (e.g. during shutdown).

steveloughran · 2020-04-24T12:49:48Z

hmm. you've probably got a point.

what If I make the submit() operation part of the sync block so you can never submit work while another thread is shutting it down?

steveloughran · 2020-06-02T12:52:03Z

yetus didnt run the last patch. i will rebase and resubmit to force it

Contributed by Steve Loughran. Fixes a condition which can cause job commit to fail if a task was aborted < 60s before the job commit commenced: the task abort will shut down the thread pool with a hard exit after 60s; the job commit POST requests would be scheduled through the same pool, so be interrupted and fail. At present the access is synchronized, but presumably the executor shutdown code is calling wait() and releasing locks. Task abort is triggered from the AM when task attempts succeed but there are still active speculative task attempts running. Thus it only surfaces when speculation is enabled and the final tasks are speculating, which, given they are the stragglers, is not unheard of. The fix copies and clears the threadPool field in a synchronized block, then shuts it down; job commit will encounter the empty field and demand-create a new one. As would a sequence of task aborts. Change-Id: I23fc1c4bf112d2c36bd9f1baf76cf5057e4f10e1

rather than pass around the thread pool, a "Submitter' is supplied to Tasks; there's a disabled one -which triggers the downgrade to single threaded execution, and a multithreaded one which submits work through the committer's (now private) thread pool. As submission is in a synchronized block, the race condition should be blocked Change-Id: I9dbb6f6eb5a32c3b7bc8d638eae93d38261c9d6c

Rework previous commit for a more minimal design -one which reduces the diff with the current code. Change-Id: Ic2a1a05d29f6974eb14f9279fbb0575e3f4ddd24

hadoop-yetus · 2020-06-02T14:17:12Z

🎊 +1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	20m 57s	Docker mode activated.
		_ Prechecks _
+1 💚	dupname	0m 0s	No case conflicting files found.
+1 💚	@author	0m 0s	The patch does not contain any @author tags.
+1 💚	test4tests	0m 0s	The patch appears to include 1 new or modified test files.
		_ trunk Compile Tests _
+1 💚	mvninstall	21m 22s	trunk passed
+1 💚	compile	0m 32s	trunk passed
+1 💚	checkstyle	0m 23s	trunk passed
+1 💚	mvnsite	0m 35s	trunk passed
+1 💚	shadedclient	15m 58s	branch has no errors when building and testing our client artifacts.
+1 💚	javadoc	0m 25s	trunk passed
+0 🆗	spotbugs	1m 0s	Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚	findbugs	0m 58s	trunk passed
		_ Patch Compile Tests _
+1 💚	mvninstall	0m 31s	the patch passed
+1 💚	compile	0m 25s	the patch passed
+1 💚	javac	0m 25s	the patch passed
-0 ⚠️	checkstyle	0m 17s	hadoop-tools/hadoop-aws: The patch generated 1 new + 2 unchanged - 0 fixed = 3 total (was 2)
+1 💚	mvnsite	0m 31s	the patch passed
+1 💚	whitespace	0m 0s	The patch has no whitespace issues.
+1 💚	shadedclient	15m 8s	patch has no errors when building and testing our client artifacts.
+1 💚	javadoc	0m 22s	the patch passed
+1 💚	findbugs	1m 3s	the patch passed
		_ Other Tests _
+1 💚	unit	1m 26s	hadoop-aws in the patch passed.
+1 💚	asflicense	0m 28s	The patch does not generate ASF License warnings.
		82m 30s

Subsystem	Report/Notes
Docker	ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1963/4/artifact/out/Dockerfile
GITHUB PR	#1963
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
uname	Linux 5c76d9e4f62f 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	personality/hadoop.sh
git revision	trunk / `9fe4c37`
Default Java	Private Build-1.8.0_252-8u252-b09-1~18.04-b09
checkstyle	https://builds.apache.org/job/hadoop-multibranch/job/PR-1963/4/artifact/out/diff-checkstyle-hadoop-tools_hadoop-aws.txt
Test Results	https://builds.apache.org/job/hadoop-multibranch/job/PR-1963/4/testReport/
Max. process+thread count	346 (vs. ulimit of 5500)
modules	C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output	https://builds.apache.org/job/hadoop-multibranch/job/PR-1963/4/console
versions	git=2.17.1 maven=3.6.0 findbugs=3.1.0-RC1
Powered by	Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

steveloughran · 2020-06-11T13:18:47Z

@jojochuang you had a chance to look at the latest changes?

jojochuang

Sorry I was late. LGTM. Thank you!

Contributed by Steve Loughran. Fixes a condition which can cause job commit to fail if a task was aborted < 60s before the job commit commenced: the task abort will shut down the thread pool with a hard exit after 60s; the job commit POST requests would be scheduled through the same pool, so be interrupted and fail. At present the access is synchronized, but presumably the executor shutdown code is calling wait() and releasing locks. Task abort is triggered from the AM when task attempts succeed but there are still active speculative task attempts running. Thus it only surfaces when speculation is enabled and the final tasks are speculating, which, given they are the stragglers, is not unheard of. Note: this problem has never been seen in production; it has surfaced in the hadoop-aws tests on a heavily overloaded desktop Change-Id: I3b433356d01fcc50d88b4353dbca018484984bc8

steveloughran · 2020-06-30T09:53:36Z

thx. merged to trunk & 3.3

apache#1963) Contributed by Steve Loughran. Fixes a condition which can cause job commit to fail if a task was aborted < 60s before the job commit commenced: the task abort will shut down the thread pool with a hard exit after 60s; the job commit POST requests would be scheduled through the same pool, so be interrupted and fail. At present the access is synchronized, but presumably the executor shutdown code is calling wait() and releasing locks. Task abort is triggered from the AM when task attempts succeed but there are still active speculative task attempts running. Thus it only surfaces when speculation is enabled and the final tasks are speculating, which, given they are the stragglers, is not unheard of. Note: this problem has never been seen in production; it has surfaced in the hadoop-aws tests on a heavily overloaded desktop Change-Id: Ib1574f2ca696e60594bb06f05e06193b74fd9a4e

jojochuang self-requested a review April 20, 2020 20:15

jojochuang reviewed Apr 21, 2020

View reviewed changes

apache deleted a comment from hadoop-yetus Jun 2, 2020

steveloughran added 3 commits June 2, 2020 13:53

HADOOP-16798 committer race condition.

64e31d0

Rework previous commit for a more minimal design -one which reduces the diff with the current code. Change-Id: Ic2a1a05d29f6974eb14f9279fbb0575e3f4ddd24

steveloughran force-pushed the s3/HADOOP-16798-committer-race branch from 0c86f35 to 64e31d0 Compare June 2, 2020 12:53

jojochuang approved these changes Jun 26, 2020

View reviewed changes

steveloughran merged commit 4249c04 into apache:trunk Jun 30, 2020

steveloughran deleted the s3/HADOOP-16798-committer-race branch October 15, 2021 19:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HADOOP-16798. S3A Committer thread pool shutdown problems. #1963

HADOOP-16798. S3A Committer thread pool shutdown problems. #1963

Uh oh!

steveloughran commented Apr 17, 2020

Uh oh!

steveloughran commented Apr 17, 2020

Uh oh!

hadoop-yetus commented Apr 17, 2020

Uh oh!

steveloughran commented Apr 20, 2020

Uh oh!

jojochuang left a comment

Uh oh!

steveloughran commented Apr 24, 2020

Uh oh!

steveloughran commented Jun 2, 2020

Uh oh!

hadoop-yetus commented Jun 2, 2020

Uh oh!

steveloughran commented Jun 11, 2020

Uh oh!

jojochuang left a comment

Uh oh!

steveloughran commented Jun 30, 2020

Uh oh!

Uh oh!

HADOOP-16798. S3A Committer thread pool shutdown problems. #1963

HADOOP-16798. S3A Committer thread pool shutdown problems. #1963

Uh oh!

Conversation

steveloughran commented Apr 17, 2020

Uh oh!

steveloughran commented Apr 17, 2020

Uh oh!

hadoop-yetus commented Apr 17, 2020

Uh oh!

steveloughran commented Apr 20, 2020

Uh oh!

jojochuang left a comment

Choose a reason for hiding this comment

Uh oh!

steveloughran commented Apr 24, 2020

Uh oh!

steveloughran commented Jun 2, 2020

Uh oh!

hadoop-yetus commented Jun 2, 2020

Uh oh!

steveloughran commented Jun 11, 2020

Uh oh!

jojochuang left a comment

Choose a reason for hiding this comment

Uh oh!

steveloughran commented Jun 30, 2020

Uh oh!

Uh oh!