Description
On my x86 Mac laptop using macOS 12.6.1, all.bash often hangs in the os/exec test. In particular, this never finishes:
% cd go/src/os/exec
% for (i in `{seq 100}) go test -short -count=1
The chance of a hang in any given iteration is something like 50%. It's possible this is related to #33565, but I'm opening a separate bug just in case, and to focus the discussion on the fact that our own os/exec tests don't pass.
If I attach to the hung process in lldb, I was originally seeing backtraces like:
* frame #0: 0x00007ff801638f85 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
frame #1: 0x00007ff8016347c1 libsystem_platform.dylib`_os_once_gate_wait + 212
frame #2: 0x00007ff8016327e9 libsystem_platform.dylib`_os_alloc_once + 42
frame #3: 0x00007ff804089144 libsystem_notify.dylib`_notify_fork_child + 349
frame #4: 0x00007ff80c41ac89 libSystem.B.dylib`libSystem_atfork_child + 58
frame #5: 0x00007ff80151382d libsystem_c.dylib`fork + 84
frame #6: 0x000000000106d55f exec.test`runtime.syscall.abi0 + 31
frame #7: 0x000000000106b3e4 exec.test`runtime.asmcgocall.abi0 + 100
frame #8: 0x000000000106824b exec.test`syscall.rawSyscall + 139
frame #9: 0x00000000010771b0 exec.test`syscall.forkAndExecInChild + 240
This specific hang seems to match dart-lang/sdk#29539. Inspection of the Apple libc code shows that the problem is a race: if an os_alloc_once is in progress in the parent at the moment fork duplicates the address space, the same call dies in the child. I changed the Go runtime to make an early call to notify_is_valid_token(0) in osinit. That call is a no-op except that it guarantees the os_alloc_once has already completed, so it cannot race with any future fork.
With that fix, I get a different hang:
* frame #0: 0x00007ff8015e53ea libsystem_kernel.dylib`__psynch_cvwait + 10
frame #1: 0x00007ff80161fa6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
frame #2: 0x00007ff8014c2aca libobjc.A.dylib`WAITING_FOR_ANOTHER_THREAD_TO_FINISH_CALLING_+initialize + 115
frame #3: 0x00007ff8014b4194 libobjc.A.dylib`initializeNonMetaClass + 646
frame #4: 0x00007ff8014b3f6a libobjc.A.dylib`initializeNonMetaClass + 92
frame #5: 0x00007ff8014b3f6a libobjc.A.dylib`initializeNonMetaClass + 92
frame #6: 0x00007ff8014b3c18 libobjc.A.dylib`initializeAndMaybeRelock(objc_class*, objc_object*, mutex_tt<false>&, bool) + 232
frame #7: 0x00007ff8014b3995 libobjc.A.dylib`lookUpImpOrForward + 1087
frame #8: 0x00007ff8014b2f9b libobjc.A.dylib`_objc_msgSend_uncached + 75
frame #9: 0x00007ff80137145e libxpc.dylib`xpc_atfork_child + 125
frame #10: 0x00007ff80c41ac8e libSystem.B.dylib`libSystem_atfork_child + 63
frame #11: 0x00007ff80151382d libsystem_c.dylib`fork + 84
frame #12: 0x000000000106d5df exec.test`runtime.syscall.abi0 + 31
frame #13: 0x000000000106b444 exec.test`runtime.asmcgocall.abi0 + 100
frame #14: 0x0000000001069589 exec.test`runtime.systemstack.abi0 + 73
frame #15: 0x00000000010682ab exec.test`syscall.rawSyscall + 139
frame #16: 0x0000000001077250 exec.test`syscall.forkAndExecInChild + 240
This one seems to match what @jacobvosmaer posted in #33565 (comment).
I can't find the libobjc source code, so I'm not sure what a workaround for xpc_atfork_child might be.
It must be that C programs on macOS do not use fork. I looked into posix_spawn, but it looks like none of our other ports use it.
We need to figure something out for Go 1.20 though.