Description
Modern Unix systems appear to have a fundamental design flaw in the interaction between multithreaded programs, fork+exec, and the prohibition on executing a program if that program is open for writing.
Below is a simple multithreaded C program. It creates 20 threads all doing the same thing: write an exit 0 shell script to /var/tmp/fork-exec-N (for different N), and then fork and exec that script. Repeat ad infinitum. Note that the shell script fds are opened O_CLOEXEC, so that an fd being written by one thread does not leak into the fork+exec's shell script of a different thread.
On my Linux workstation, this program produces a never-ending stream of ETXTBSY errors. The problem is that O_CLOEXEC is not enough. The fd being written by one thread can leak into the forked child of a second thread, and it stays there until that child calls exec. If the first thread closes the fd and calls exec before the second thread's child does exec, then the first thread's exec will get ETXTBSY, because somewhere in the system (specifically, in the child of the second thread), there is an fd still open for writing the first thread's shell script, and according to modern Unix rules, one must not exec a program if there exists any fd anywhere open for writing that program.
Five years ago this bit us because cmd/go installed cmd/cgo (that is, copied the binary from a temporary location to somewhere/bin/cgo) and then executed it. To fix this we put a sleep+retry loop around the fork+exec of cgo when it gets ETXTBSY. Now (as of last week or so) we don't ever install cmd/cgo and execute it in the same cmd/go process, so that specific race is gone, although as I write this cmd/go still has the sleep+retry loop, which I intend to remove.
Last week this bit us again because cmd/go updated a build stamp in the binary, closed it, and executed it. The resulting flaky ETXTBSY failures were reported as #22220. A pending CL fixes this by not updating the build stamp in temporary binaries, which are the main ones we execute. There's still one case where we write+execute a program, which is go test -cpuprofile x.prof pkg
. The cpuprofile flag (and a few others) cause cmd/go to leave the pkg.test in the current directory for debugging purposes but also run the test. Luckily running the test is currently the final thing cmd/go does, and it waits for any other fork+exec'ed programs to finish before fork+exec'ing the test. So the race cannot happen in this case.
In general this race is going to happen every time anyone writes a program that both writes and executes a program. It's easy to imagine other build systems running into this, but also programs that do things like unzip a zip file and then run a program inside it - think a program supervisor or mini container runtime. As soon as there are multiple threads doing fork+exec at the same time, and one of them is doing fork+exec of a program that was previously open for write in the same process, you have a mysterious flaky problem.
It seems like maybe Go should take care of this, if possible. We've now hit it twice in cmd/go, five years apart, and at least this past time it took the better part of a day to figure out. (I don't remember how long it took five years ago, in part because I don't remember anything about discovering it five years ago. I also don't want to rediscover all this five years from now.)
There are a few hacks we could use:
- In os.StartProcess, if we see ETXTBSY, sleep 100ms and try again, maybe a few times, up to say 1 second of sleeping. In general we don't know how long to sleep.
- Arrange with a locking mechanism that close must never complete during a fork+exec sequence. The end of the fork+exec sequence needs to be the point where we know the close-on-exec fds have been closed. Unfortunately there is no portable way to identify that point.
- If the exec fails and the child tells us and exits, we can wait for the exit. That's easy.
- If the exec succeeds, we find out because the exec closes the child's end of the status pipe, and we get EOF.
- If we know that an OS does close-on-exec work in increasing fd order, then we could also track the maximum fd we've opened and move the status pipe above that. Then seeing the status pipe close would mean all other fds are closed too.
- If the OS had a "close all fds above x", we could use that. (I don't know of any that do, but it sure would help.)
- It may not be OK to block all closes on a wedged fork+exec (in general an exec'ed program may be loaded from some slow network server).
- Note that vfork(2) is not a solution. Vfork is defined as the parent does not continue executing until the child is no longer using the parent's memory image. In the case of a successful exec, at least on Linux, vfork releases the memory image before doing any of the close-on-exec work, so the parent continues running before the child has closed the fds we care about.
None of these seem great. The ETXTBSY sleep, up to 1 second, might be the best option. It would certainly reduce the flake rate and in many cases would probably make it undetectable. It would not help exec of very slow-to-load programs, but that's not the common case.
I wondered how Java deals with this, and the answer seems to be that Java doesn't deal with this. https://bugs.openjdk.java.net/browse/JDK-8068370 was filed in 2014 and is still open.
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <errno.h>
#include <stdint.h>
void* runner(void*);
int
main(void)
{
int i;
pthread_t pid[20];
for(i=1; i<20; i++)
pthread_create(&pid[i], 0, runner, (void*)(uintptr_t)i);
runner(0);
return 0;
}
char script[] = "#!/bin/sh\nexit 0\n";
void*
runner(void *v)
{
int i, fd, pid, status;
char buf[100], *argv[2];
i = (int)(uintptr_t)v;
snprintf(buf, sizeof buf, "/var/tmp/fork-exec-%d", i);
argv[0] = buf;
argv[1] = 0;
for(;;) {
fd = open(buf, O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0777);
if(fd < 0) {
perror("open");
exit(2);
}
write(fd, script, strlen(script));
close(fd);
pid = fork();
if(pid < 0) {
perror("fork");
exit(2);
}
if(pid == 0) {
execve(buf, argv, 0);
exit(errno);
}
if(waitpid(pid, &status, 0) < 0) {
perror("waitpid");
exit(2);
}
if(!WIFEXITED(status)) {
perror("waitpid not exited");
exit(2);
}
status = WEXITSTATUS(status);
if(status != 0)
fprintf(stderr, "exec: %d %s\n", status, strerror(status));
}
return 0;
}