Description
Foundation.Process on Linux uses a trick (that doesn't actually work...) to detect whether the child process has exited: it passes one end of a socketpair into the child and expects that socket to be closed when the child exits. In simple scenarios that holds, but UNIX by default inherits all file descriptors into child processes. That means that if the subprocess itself spawns another process, the special socket is inherited by that process too.
That's a huge issue, because the parent process can no longer detect that the child has died: the child's child still holds that file descriptor open...
Attached, please find a reproduction which does the following:
The parent process spawns a /bin/sh as its child process. That child process spawns another process ('childs child'), which runs sleep 12345678, a very, very long sleep. After one second, parent kills child with SIGKILL, which means that child exits immediately. Then parent calls process.waitUntilExit(), which should return immediately (because the child is dead). Alas, Foundation.Process does not realise that child is dead, because that special socketpair is also inherited into 'childs child' (and further subprocesses)...
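The scenario can also be reduced to the underlying system calls. The following is a minimal C sketch, not Foundation's actual code: it assumes that exit detection amounts to waiting for EOF on the parent's end of a socketpair. The helper name eof_seen_after_kill and the "ready" pipe used for synchronisation are invented for illustration. It is written in C rather than Swift because the mechanism lives at the fork/file-descriptor level.

```c
#include <poll.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child holding one end of a socketpair, kill it,
 * and report whether the parent sees EOF on the other end within timeout_ms.
 * If spawn_grandchild is nonzero, the child first forks a long-sleeping
 * grandchild, which inherits the socket.
 * Returns 1 (EOF seen), 0 (no EOF within the timeout), or -1 (setup failed). */
int eof_seen_after_kill(int spawn_grandchild, int timeout_ms)
{
    int sv[2], ready[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;
    if (pipe(ready) != 0) return -1;

    pid_t child = fork();
    if (child < 0) return -1;
    if (child == 0) {
        close(sv[0]);                    /* child keeps only sv[1] */
        close(ready[0]);
        pid_t grandchild = 0;
        if (spawn_grandchild) {
            grandchild = fork();
            if (grandchild == 0) {       /* grandchild: inherits sv[1] */
                close(ready[1]);
                sleep(12345678);
                _exit(0);
            }
        }
        /* Tell the parent we are set up and pass along the grandchild pid. */
        if (write(ready[1], &grandchild, sizeof grandchild) < 0) _exit(1);
        for (;;) pause();                /* wait to be killed */
    }

    close(sv[1]);                        /* parent keeps only sv[0] */
    close(ready[1]);
    pid_t grandchild = 0;
    ssize_t got = read(ready[0], &grandchild, sizeof grandchild);
    close(ready[0]);

    kill(child, SIGKILL);
    waitpid(child, NULL, 0);             /* child is now dead and reaped */

    int eof = -1;
    if (got == (ssize_t)sizeof grandchild) {
        struct pollfd pfd = { .fd = sv[0], .events = POLLIN };
        eof = 0;
        if (poll(&pfd, 1, timeout_ms) > 0) {
            char byte;
            eof = (recv(sv[0], &byte, 1, 0) == 0);  /* 0 bytes means EOF */
        }
    }

    if (grandchild > 0) kill(grandchild, SIGKILL);  /* clean up the sleeper */
    close(sv[0]);
    return eof;
}
```

Under these assumptions, eof_seen_after_kill(0, 500) should report EOF, while eof_seen_after_kill(1, 500) should time out even though the child has already been killed and reaped, because the grandchild still holds the other end of the socket open.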
Expected behaviour (observed on Darwin)
$ swift /tmp/process_bug_repro.swift
[in parent: 11427] start subprocess 'child'
[in parent: 11427] waiting 1 second (for child with pid 11428)
[in child: 11428] start subprocess 'childs child'
[in child: 11428] waiting for childs child (with pid 11429)
[in childs child: 11429] start
[in parent: 11427] kill SIGKILL child with pid 11428)
[in parent: 11427] kill successful
[in parent: 11427] waiting for child with pid 11428 to exit
[in parent: 11427] done
Actual behaviour (observed on Linux, Swift 5.8)
[in parent: 13] start subprocess 'child'
[in parent: 13] waiting 1 second (for child with pid 35)
[in child: 35] start subprocess 'childs child'
[in child: 35] waiting for childs child (with pid 36)
[in childs child: 36] start
[in parent: 13] kill SIGKILL child with pid 35)
[in parent: 13] kill successful
[in parent: 13] waiting for child with pid 35 to exit
[in parent: 13] WEIRD (THIS IS THE BUG), still waiting at 2023-07-12 14:23:08 +0000. Running ps uw -p 13 -p 35 -p 36
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 13 11.3 4.3 591104 175580 pts/0 Sl+ 14:23 0:00 /usr/bin/swift-frontend -frontend -interpret process_bug_repro.swif
root 35 0.0 0.0 0 0 pts/0 Z 14:23 0:00 [sh] <defunct> <<--- JW: THIS IS THE CHILD THAT's a zombie now
root 36 0.0 0.0 2308 832 pts/0 S 14:23 0:00 /bin/sh -c echo "[in childs child: $$] start"; sleep 12345678; echo
[in parent: 13] WEIRD (THIS IS THE BUG), still waiting at 2023-07-12 14:23:13 +0000. Running ps uw -p 13 -p 35 -p 36
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 13 8.3 4.3 664932 175584 pts/0 Sl+ 14:23 0:00 /usr/bin/swift-frontend -frontend -interpret process_bug_repro.swift -Xllvm -aarch64-use-tbi -disable-objc-interop
root 35 0.0 0.0 0 0 pts/0 Z 14:23 0:00 [sh] <defunct>
root 36 0.0 0.0 2308 832 pts/0 S 14:23 0:00 /bin/sh -c echo "[in childs child: $$] start"; sleep 12345678; echo "[in childs child: $$] done"
[...] output continues "forever"
Fix
The special socketpair approach has two issues:
- As demonstrated above, it can lead to false negatives (because the fd gets inherited further).
- It can also lead to false positives (because the child process could close all its file descriptors, making Foundation.Process think that the child has exited when it hasn't).
To fix both of these, Foundation.Process should use either pidfd_open, or signalfd on SIGCHLD, to get an epoll-able signal when the child process dies.
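As a rough illustration of the pidfd_open variant, here is a minimal C sketch rather than a proposed patch. It assumes Linux 5.3 or newer and invokes the system call through syscall(), since older C libraries have no wrapper. The fallback syscall number 434 is an assumption that applies to x86-64 and aarch64, and the helper name pidfd_signals_exit is invented for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434   /* assumption: x86-64 / aarch64 with old headers */
#endif

/* Hypothetical helper: fork a child that itself forks a long-sleeping
 * grandchild, kill the child, and report whether a pidfd for the child
 * becomes readable within timeout_ms.
 * Returns 1 (exit detected), 0 (timed out), or -1 (setup failed, for example
 * because pidfd_open is not available). */
int pidfd_signals_exit(int timeout_ms)
{
    int ready[2];
    if (pipe(ready) != 0) return -1;

    pid_t child = fork();
    if (child < 0) return -1;
    if (child == 0) {
        close(ready[0]);
        pid_t grandchild = fork();
        if (grandchild == 0) {           /* grandchild outlives the child */
            close(ready[1]);
            sleep(12345678);
            _exit(0);
        }
        if (write(ready[1], &grandchild, sizeof grandchild) < 0) _exit(1);
        for (;;) pause();                /* wait to be killed */
    }

    close(ready[1]);
    /* The pidfd refers to the child process itself, so it does not matter
     * which descriptors the child or grandchild hold open or have closed. */
    int pidfd = (int)syscall(SYS_pidfd_open, child, 0);

    pid_t grandchild = 0;
    ssize_t got = read(ready[0], &grandchild, sizeof grandchild);
    close(ready[0]);

    kill(child, SIGKILL);

    int result = -1;
    if (pidfd >= 0 && got == (ssize_t)sizeof grandchild) {
        struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
        /* A pidfd polls readable once the process has terminated. */
        result = poll(&pfd, 1, timeout_ms) > 0 ? 1 : 0;
    }

    waitpid(child, NULL, 0);             /* reap the child */
    if (grandchild > 0) kill(grandchild, SIGKILL);
    if (pidfd >= 0) close(pidfd);
    return result;
}
```

On a kernel with pidfd support, pidfd_signals_exit(2000) should return 1 even though the grandchild is still running. The same descriptor could be registered with epoll instead of poll.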