Skip to content
This repository was archived by the owner on Apr 22, 2023. It is now read-only.

Conversation

gireeshpunathil
Copy link
Member

On AIX we were seeing failures in the below Node tests:

simple/test-cluster-master-error.js
with an assertion failure:
AssertionError: The workers did not die after an error in the master

and
simple/test-cluster-master-kill.js
with an assertion failure:
AssertionError: worker was alive after master died

The root cause for both the failures are same: the worker nodes of the
cluster was found to be alive when it was not expected to be. The isAlive
check is performed upon the master's exit, in both the cases.

A 200ms leeway is provided for the workers to actually terminate.

In AIX, the workers were actually terminating, but they took more
time - as much as 800ms (normal) to 1000ms (in rare cases).

Based on a C test we ran, it is found that the exit routines in AIX
is a bit more longer than that in Linux. There are a number of cleanup
activities performed in exit() system call, and depending on when the
signal handlers are shutdown in that sequence, the process will be
deemed as dead or alive, from another process's perspective -
process.kill(pid) is what is used in the test case to check the
liveliness of the worker, and when the kill() call is issued, even if the
target process is in it's exit sequences, if the signal handlers are not
shut down, it will respond to external signals, causing those calls to pass.

This pull request is to essentially increase the wait timeout to 1000ms for AIX.
in order to allow the node tests simple/test-cluster-master-error.js and
simple/test-cluster-master-kill.js to pass.

@mhdawson
Copy link
Member

I had discussed this with Gireesh earlier so lgtm. The one question we had was whether to limit the extension to aix, which is what we did or just extend for all platforms.

@cjihrig
Copy link

cjihrig commented Mar 18, 2015

IMO the timeout should just be extended for all platforms.

@jasnell
Copy link
Member

jasnell commented Mar 18, 2015

+1 and LGTM

@mhdawson mhdawson added this to the 0.12.2 milestone Mar 19, 2015
@mhdawson
Copy link
Member

Gireesh can you update to just extend the timeout for all platforms

@gireeshpunathil
Copy link
Member Author

@mdawsonibm, thanks, done - extended the new timeout to all the platforms.

@mhdawson
Copy link
Member

Ok can you squash into one commit and then I'll pull in.

simple tests test-cluster-master-error.js, test-cluster-master-kill.js
fails in AIX with assertion failure indicating that the workers were
alive even after the master terminated. A 200ms leeway is provided for
the workers to actually terminate, but the isAlive check returns
true in both the cases.

In AIX, the workers were actually terminating, but they took more time
- as much as 800ms (normal) to 1000ms (in rare cases).

Based on a C test we ran, it is found that the exit routines in AIX
is a bit more longer than that in Linux. There are a number of cleanup
activities performed in exit() system call, and depending on when the
signal handlers are shutdown in that sequence, the process will be
deemed as dead or alive, from another process's perspective.

process.kill(pid) is used in the test case to check the liveliness of
the worker, and when the kill() call is issued, even if the target
process is in it's exit sequences, if the signal handlers are not shut
down, it will respond to external signals, causing those calls to pass.

This fix extends the additional timeout for all platforms
@gireeshpunathil
Copy link
Member Author

@mdawsonibm, squashed the commits into one, thanks.

@mhdawson mhdawson self-assigned this Mar 31, 2015
@mhdawson
Copy link
Member

will pull this one in

mhdawson pushed a commit that referenced this pull request Mar 31, 2015
simple tests test-cluster-master-error.js, test-cluster-master-kill.js
fails in AIX with assertion failure indicating that the workers were
alive even after the master terminated. A 200ms leeway is provided for
the workers to actually terminate, but the isAlive check returns
true in both the cases.

In AIX, the workers were actually terminating, but they took more time
- as much as 800ms (normal) to 1000ms (in rare cases).

Based on a C test we ran, it is found that the exit routines in AIX
is a bit more longer than that in Linux. There are a number of cleanup
activities performed in exit() system call, and depending on when the
signal handlers are shutdown in that sequence, the process will be
deemed as dead or alive, from another process's perspective.

process.kill(pid) is used in the test case to check the liveliness of
the worker, and when the kill() call is issued, even if the target
process is in it's exit sequences, if the signal handlers are not shut
down, it will respond to external signals, causing those calls to pass.

This fix extends the additional timeout for all platforms

Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
PR-URL: #9431
@misterdjules misterdjules modified the milestones: 0.12.3, 0.12.2 Apr 1, 2015
@misterdjules
Copy link

Landed in f3f4e28, thank you!

@misterdjules
Copy link

Moved to the 0.12.3 milestone as f3f4e28 landed after the v0.12.2-release branch was cut.

It's not too significant though since tests don't have any direct impact on Node.js users.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants