This repository was archived by the owner on Apr 22, 2023. It is now read-only.
-
-
Notifications
You must be signed in to change notification settings - Fork 7.3k
test: make cluster tests more time tolerant #9431
Closed
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
I had discussed this with Gireesh earlier so lgtm. The one question we had was whether to limit the extension to aix, which is what we did or just extend for all platforms. |
IMO the timeout should just be extended for all platforms. |
+1 and LGTM |
Gireesh can you update to just extend the timeout for all platforms |
@mdawsonibm, thanks, done - extended the new timeout to all the platforms. |
Ok can you squash into one commit and then I'll pull in. |
simple tests test-cluster-master-error.js, test-cluster-master-kill.js fails in AIX with assertion failure indicating that the workers were alive even after the master terminated. A 200ms leeway is provided for the workers to actually terminate, but the isAlive check returns true in both the cases. In AIX, the workers were actually terminating, but they took more time - as much as 800ms (normal) to 1000ms (in rare cases). Based on a C test we ran, it is found that the exit routines in AIX is a bit more longer than that in Linux. There are a number of cleanup activities performed in exit() system call, and depending on when the signal handlers are shutdown in that sequence, the process will be deemed as dead or alive, from another process's perspective. process.kill(pid) is used in the test case to check the liveliness of the worker, and when the kill() call is issued, even if the target process is in it's exit sequences, if the signal handlers are not shut down, it will respond to external signals, causing those calls to pass. This fix extends the additional timeout for all platforms
@mdawsonibm, squashed the commits into one, thanks. |
will pull this one in |
mhdawson
pushed a commit
that referenced
this pull request
Mar 31, 2015
simple tests test-cluster-master-error.js, test-cluster-master-kill.js fails in AIX with assertion failure indicating that the workers were alive even after the master terminated. A 200ms leeway is provided for the workers to actually terminate, but the isAlive check returns true in both the cases. In AIX, the workers were actually terminating, but they took more time - as much as 800ms (normal) to 1000ms (in rare cases). Based on a C test we ran, it is found that the exit routines in AIX is a bit more longer than that in Linux. There are a number of cleanup activities performed in exit() system call, and depending on when the signal handlers are shutdown in that sequence, the process will be deemed as dead or alive, from another process's perspective. process.kill(pid) is used in the test case to check the liveliness of the worker, and when the kill() call is issued, even if the target process is in it's exit sequences, if the signal handlers are not shut down, it will respond to external signals, causing those calls to pass. This fix extends the additional timeout for all platforms Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com> PR-URL: #9431
Landed in f3f4e28, thank you! |
Moved to the 0.12.3 milestone as f3f4e28 landed after the v0.12.2-release branch was cut. It's not too significant though since tests don't have any direct impact on Node.js users. |
3 tasks
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
On AIX we were seeing failures in the below Node tests:
simple/test-cluster-master-error.js
with an assertion failure:
AssertionError: The workers did not die after an error in the master
and
simple/test-cluster-master-kill.js
with an assertion failure:
AssertionError: worker was alive after master died
The root cause for both the failures are same: the worker nodes of the
cluster was found to be alive when it was not expected to be. The isAlive
check is performed upon the master's exit, in both the cases.
A 200ms leeway is provided for the workers to actually terminate.
In AIX, the workers were actually terminating, but they took more
time - as much as 800ms (normal) to 1000ms (in rare cases).
Based on a C test we ran, it is found that the exit routines in AIX
is a bit more longer than that in Linux. There are a number of cleanup
activities performed in exit() system call, and depending on when the
signal handlers are shutdown in that sequence, the process will be
deemed as dead or alive, from another process's perspective -
process.kill(pid) is what is used in the test case to check the
liveliness of the worker, and when the kill() call is issued, even if the
target process is in it's exit sequences, if the signal handlers are not
shut down, it will respond to external signals, causing those calls to pass.
This pull request is to essentially increase the wait timeout to 1000ms for AIX.
in order to allow the node tests simple/test-cluster-master-error.js and
simple/test-cluster-master-kill.js to pass.