Description
- Version: v12.13.1
- Platform: Linux 5.3.11
- Subsystem: child_process
See https://gist.github.com/jgehrcke/ab4656353c1155173d2dde5ffceb0d0b for repro code. The output when executing this:
$ node kill_no_pid_repro_js.js
2019-11-26T16:49:50.509Z info: waiting for the startup error to be handled
2019-11-26T16:49:50.511Z info: loop iteration
2019-11-26T16:49:50.514Z error: 'error' handler: Error: spawn does-not-exist-- ENOENT
at Process.ChildProcess._handle.onexit (internal/child_process.js:264:19)
at onErrorNT (internal/child_process.js:456:16)
at processTicksAndRejections (internal/process/task_queues.js:80:21) {
errno: 'ENOENT',
code: 'ENOENT',
syscall: 'spawn does-not-exist--',
path: 'does-not-exist--',
spawnargs: []
}
2019-11-26T16:49:50.521Z error: child process startup error
2019-11-26T16:49:50.521Z info: send SIGTERM
2019-11-26T16:49:51.523Z info: send SIGTERM
The log output with millisecond resolution shows the chronological order of events.
In the repro code I use an asynchronous startup error detection technique, using the "error" event. In the log output we can see that this event is being caught, indicating ENOENT (command/executable not found, as desired in this repro). That's great.
After this one can still call the kill()
method on the process. The call "succeeds", silently swallowing a problem. The problem as I would put it into words: "there is no process that you can kill here", and that problem should not be silently swallowed, because it makes it too easy to write code with race conditions.
I believe that this should be considered a bug in NodeJS: this is a programmer error, i.e. this should throw an Error, as it would in other programming environments. If unhandled, this should crash the code, indicating to the programmer that their assumption that the process is alive was wrong. The programmer should be required to explicitly handle an error thrown by kill()
.
It's not necessary to demonstrate the struggle, but maybe makes it easier to understand: the repro code calls kill() another time, about 1 second after the startup error had happened. That kill() also swallows the problem.
I don't know if internally the kill()
method just is a noop in this case (where the runtime knows that there is no PID to issue a kill()
system call to), or if it actually calls the system call and then swallows the ENOENT. But in both cases it makes a conscious choice, knows about the absence of the process, and hides the erroneous attempt from the programmer.
There could be two ways to check for the error synchronously:
- The underlying
kill()
system call does fail with ENOENT, if it is even executed. - If the underlying
kill()
system call is not executed then the runtime seems to have internal state about the fact that the process is not there (I am pretty sure that this is what's happening, see my code comments in the repro code). That state could be used.
It is documented that
The ChildProcess object may emit an 'error' event if the signal cannot be delivered
If this is meant to be the only, reliable, documented way to find out that a kill()
failed (not quite clear from the documentation) then I think the repro code is also quite insightful: in the repro code that event handler is not called after the erroneous kill()
. That's in fact proven by the additional 1-second wait, during which I would expect that handler to be called.
In the repro code comments I have pointed out that the runtime magically detects that the child process is gone, and it terminates the code pre-maturely, to prevent an indefinitely long wait from happening.
That is, I think this issue mainly reveals a rather mean inconsistency where the runtime sometimes seems to consider the fact that there is no process, and sometimes it doesn't, making it difficult to write robust child process management code.
Note: currently there does not seem to be a documented synchronous way to detect a child process startup error (upon common system call errors such as ENOENT and EACCES), and no documented synchronous way to check for process "liveness", although both could be done synchronously with "fast" system calls. This might be strongly related to this topic here. Related: eclipse-theia/theia#3447 and nodejs/help#1191.