os/exec: tests hang on macOS due to Apple libc fork bugs #56784

@rsc

On my x86 Mac laptop running macOS 12.6.1, all.bash often hangs in the os/exec test. In particular, this loop never runs to completion:

% cd go/src/os/exec
% for (i in `{seq 100}) go test -short -count=1

The chance of a hang in any given iteration is something like 50%. It's possible this is related to #33565, but I'm opening a separate bug just in case, and to focus the discussion on the fact that our own os/exec tests don't pass.
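To put a number on the hang rate instead of waiting on a stuck shell loop, the repetition above can be sketched as a small Go harness that runs a command repeatedly under a timeout and counts the runs that had to be killed. This is a minimal sketch, not something from the Go tree: the program name `hangcount`, the helpers `runOnce` and `countHangs`, and the 100-run and 2-minute defaults are all assumptions chosen for illustration.

```go
// hangcount is a hypothetical helper (not part of the Go tree): it runs a
// command repeatedly under a timeout and counts how many runs hung.
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/exec"
	"time"
)

const (
	defaultRuns    = 100             // assumed: mirrors the 100 iterations in the shell loop
	defaultTimeout = 2 * time.Minute // assumed: generous bound for one short test run
)

// runOnce runs the command once. It reports timedOut=true if the command had
// to be killed because the timeout expired; otherwise it returns the
// command's own error (nil on success). The command's output is discarded.
func runOnce(timeout time.Duration, name string, args ...string) (timedOut bool, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// CommandContext kills only the direct child when the deadline passes;
	// a hung grandchild (such as the test binary started by "go test") may
	// be left behind and need separate cleanup.
	err = exec.CommandContext(ctx, name, args...).Run()
	if err != nil && errors.Is(ctx.Err(), context.DeadlineExceeded) {
		return true, nil
	}
	return false, err
}

// countHangs runs the command n times and returns how many runs timed out
// and how many failed for any other reason.
func countHangs(n int, timeout time.Duration, name string, args ...string) (hangs, failures int) {
	for i := 0; i < n; i++ {
		timedOut, err := runOnce(timeout, name, args...)
		switch {
		case timedOut:
			hangs++
		case err != nil:
			failures++
		}
	}
	return hangs, failures
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: hangcount command [args...]")
		return
	}
	hangs, failures := countHangs(defaultRuns, defaultTimeout, os.Args[1], os.Args[2:]...)
	fmt.Printf("%d runs: %d hung, %d failed\n", defaultRuns, hangs, failures)
}
```

Built as a binary and invoked from go/src/os/exec as `hangcount go test -short -count=1`, it would repeat the same test run as the loop above, with hung iterations counted rather than blocking the whole loop.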

When I attached to the hung process in lldb, I originally saw backtraces like:

  * frame #0: 0x00007ff801638f85 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007ff8016347c1 libsystem_platform.dylib`_os_once_gate_wait + 212
    frame #2: 0x00007ff8016327e9 libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007ff804089144 libsystem_notify.dylib`_notify_fork_child + 349
    frame #4: 0x00007ff80c41ac89 libSystem.B.dylib`libSystem_atfork_child + 58
    frame #5: 0x00007ff80151382d libsystem_c.dylib`fork + 84
    frame #6: 0x000000000106d55f exec.test`runtime.syscall.abi0 + 31
    frame #7: 0x000000000106b3e4 exec.test`runtime.asmcgocall.abi0 + 100
    frame #8: 0x000000000106824b exec.test`syscall.rawSyscall + 139
    frame #9: 0x00000000010771b0 exec.test`syscall.forkAndExecInChild + 240

This specific hang seems to match dart-lang/sdk#29539. Inspection of the Apple libc code shows that the problem is a race: if an os_alloc_once is in progress in the parent at the moment fork splits the address space, the child inherits the half-finished once state, and the same call dies in the child. I changed the Go runtime to make an early call to notify_is_valid_token(0) in osinit. That call is a no-op except that it guarantees the os_alloc_once has already completed, so it cannot race with any future fork.

With that fix, I get a different hang:

  * frame #0: 0x00007ff8015e53ea libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff80161fa6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
    frame #2: 0x00007ff8014c2aca libobjc.A.dylib`WAITING_FOR_ANOTHER_THREAD_TO_FINISH_CALLING_+initialize + 115
    frame #3: 0x00007ff8014b4194 libobjc.A.dylib`initializeNonMetaClass + 646
    frame #4: 0x00007ff8014b3f6a libobjc.A.dylib`initializeNonMetaClass + 92
    frame #5: 0x00007ff8014b3f6a libobjc.A.dylib`initializeNonMetaClass + 92
    frame #6: 0x00007ff8014b3c18 libobjc.A.dylib`initializeAndMaybeRelock(objc_class*, objc_object*, mutex_tt<false>&, bool) + 232
    frame #7: 0x00007ff8014b3995 libobjc.A.dylib`lookUpImpOrForward + 1087
    frame #8: 0x00007ff8014b2f9b libobjc.A.dylib`_objc_msgSend_uncached + 75
    frame #9: 0x00007ff80137145e libxpc.dylib`xpc_atfork_child + 125
    frame #10: 0x00007ff80c41ac8e libSystem.B.dylib`libSystem_atfork_child + 63
    frame #11: 0x00007ff80151382d libsystem_c.dylib`fork + 84
    frame #12: 0x000000000106d5df exec.test`runtime.syscall.abi0 + 31
    frame #13: 0x000000000106b444 exec.test`runtime.asmcgocall.abi0 + 100
    frame #14: 0x0000000001069589 exec.test`runtime.systemstack.abi0 + 73
    frame #15: 0x00000000010682ab exec.test`syscall.rawSyscall + 139
    frame #16: 0x0000000001077250 exec.test`syscall.forkAndExecInChild + 240

This one seems to match what @jacobvosmaer posted in #33565 (comment).

I can't find the libobjc source code so I'm not sure what a workaround for xpc_atfork_child might be.

It must be that C programs on macOS do not use fork. I looked into posix_spawn but it looks like we don't have any other ports that use that.

We need to figure something out for Go 1.20 though.
