-
Notifications
You must be signed in to change notification settings - Fork 29.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: make cluster test more time tolerant on more platforms #5056
Conversation
// AIX needs more time due to default exit performance | ||
timeout = 1000; | ||
} | ||
// AIX needs more time due to default exit performance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd drop the comment if it's no longer AIX specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense.
@nodejs/build @nodejs/testing anyone have a reason we should care about this 5-fold increase? I find it very strange that any platform needs this long but it doesn't seem like exit performance has anything to do with what this test is addressing so I guess we're not supposed to mind? |
ping @Trott as well |
Here is a strace: https://gist.github.com/ncopa/ca39045d23f841a5de6b What happens is: master process is forked. master process forks worker processes. master process dies. At this point there are 2 worker processes on the system. When they exit they become zombies because their parent did not have a chance to reap them. Calling kill(pid,0) will return 'alive' because nothing has reaped the exited process. since the parent (the master process) died the worker processes are reparented to pid 1, and pid 1 will eventually reap the worker processes. This is what takes 200ms on some systems and on others more. On my system I happens to use busybox init which needs more than 200ms to do the cleanup work. In other words, the timeout needed depends completely on how fast the pid 1 (the init process) cleans up zombie processes. |
Not necessarily a deal-breaker, but these changes have the effect of making the tests take longer to execute on all platforms, not just the platforms that need it. That might be OK, but perhaps it would be possible to have a setInterval that polls the status every 200ms or so and then a setTimeout that gives up and throws an error after 1000ms, or something like that. Even better, as always, would be if we could just find a way to get rid of the arbitrary timeout values entirely. But that's not always straightforward either. |
Also, if the reaction to my previous comment is "Let's not over-engineer a test just to shave off a few hundred milliseconds", I can respect that too. |
Technically, when the master exits with error, it needs to kill its children and waitpid() for each of them. When master exits without error, we need to either waitpid() each child or kill(2) and waitpid() each child. If we do that we can totally remove the timeout. It does not appear to be any child_process.waitpid() though. |
e2abbc2
to
5cb4ae3
Compare
I rebased the patch and included some explaining comments on why it takes so long time. |
I also know why busybox init is slow. It has a subtimal implementation of reaping the children. node will trigger a sleep(1) in busybox init that is there to protect it from respawning too fast. I guess AIX init has similar issue. |
LGTM if CI is green https://ci.nodejs.org/job/node-test-pull-request/1544/ |
Note that the timeout is a hack in the first place. The parent process (the master) should wait for the children (the workers) using wait/waitpid before it exits instead of letting init (pid 1) do the cleanup. |
@nodejs/testing ... any further thoughts on this one? |
I dislike the change in As I said above, I won't stand in the way if others feel this is acceptable or desirable. But if any of them are possible, I would much prefer any of these solutions:
|
// AIX needs more time due to default exit performance | ||
timeout = 1000; | ||
} | ||
// normally 200ms is enough AIX's init and busybox init needs more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: .
after enough
EDIT: While we're in Nit Mode: needs
-> need
@Trott I agree that waiting 1 sec always on all platforms is ugly. We can do better than that. I don't like checking for specific platforms because the affected platforms may be fixed in future. Possible alternative: @@ -58,15 +62,20 @@ if (cluster.isWorker) {
// make sure that the master died by purpose
assert.equal(code, 0);
- // check worker process status
- var timeout = 200;
- if (common.isAix) {
- // AIX needs more time due to default exit performance
- timeout = 1000;
- }
- setTimeout(function() {
+ // wait for init (pid 1) to collect the worker process
+ // normally 200ms is enough, but it depends on the init (pid 1)
+ // implementation. AIX's init and busybox init need more. We wait
+ // up to 1 sec (1000ms) before we trigger error.
+ var timeout = 1000;
+
+ function waitWorker() {
alive = isAlive(pid);
- }, timeout);
+ if (alive && timeout > 0) {
+ setTimeout(waitWorker, 10);
+ timeout -= 10;
+ }
+ }
+ waitWorker();
});
process.once('exit', function() { As mentioned the proper fix is to keep the master process collect all children before exit, but i can not see that |
Alternative using @@ -58,15 +62,18 @@ if (cluster.isWorker) {
// make sure that the master died by purpose
assert.equal(code, 0);
- // check worker process status
- var timeout = 200;
- if (common.isAix) {
- // AIX needs more time due to default exit performance
- timeout = 1000;
- }
- setTimeout(function() {
+ // wait for init (pid 1) to collect the worker process
+ // normally 200ms is enough, but it depends on the init (pid 1)
+ // implementation. AIX's init and busybox init need more. We wait
+ // up to 1 sec (1000ms) before we trigger error.
+ var timeout = 1000;
+
+ var waitWorker = setInterval(function() {
+ timeout -= 10;
alive = isAlive(pid);
- }, timeout);
+ if (!alive || (timeout <= 0))
+ clearInterval(waitWorker);
+ }, 10);
});
process.once('exit', function() { which do you prefer? |
@ncopa Either is fine by me. Anyone object to polling like that instead of the current "wait 1000ms and check once" approach? @nodejs/testing EDIT: "Either" meaning "either of the two polling approaches you came up with". |
7da4fd4
to
c7066fb
Compare
The problem nodejs/node-v0.x-archive@f3f4e28 resolves also affects Alpine Linux (musl libc) so wait up to 1 second for all platforms. But poll regularily so we don´t need to wait more than necessary. Also add some explaining comments on why it takes so long for the child processes to die. Signed-off-by: Natanael Copa <ncopa@alpinelinux.org>
5cb4ae3
to
6e7ef0e
Compare
Changed to polling with |
@santigimeno Just pointed out I made a very similar PR here: #6531. @ncopa With regards to this PR, my concern is that even the 1 second interval is sometimes insufficient, so in #6531 I made the these two tests wait indefinitely, with the idea being that the (Python) test harness will kill them if they hang and fail them anyway. It's inelegant, but timeout-based liveness checks are inherently brittle - we may as well centralize on the single master timeout in the test harness instead of picking arbitrary numbers in the tests themselves. Edit: My original approach relied more on serializing the master/worker exit, but it used undocumented API so I adopted the suggest interval approach used here too. |
@stefanmb the idea with setting a timeout is to detect breakages as early as possible. I´d be ok with 10 seconds but is it more than that then i´d say something is broken. I also think it does not hurt to poll more frequent that 2 times/sec. Timeout or wait indefinite are both equally ugly hacks so for me either is ok. Maybe indefinite wait is better since its simpler. |
I wonder if #6193 is related. |
@stefanmb has the ability to merge PRs and both PRs have |
The problem
nodejs/node-v0.x-archive@f3f4e28
resolves also affects Alpine Linux (musl libc) so use increased timeout
for all platforms and not only for AIX.