Description
Dear maintainers,
We still observe some very-hard-to-reproduce races when building libc/rt prerequisites.
Some anecdotal evidence:
error: FileNotFound
When executed on a fresh installation, zig build-exe toolchain/launcher.zig (exact command) sometimes fails with:
error: FileNotFound
This happens only on a fresh $ZIG_GLOBAL_CACHE_DIR (which we keep in /tmp/bazel-zig-cc). We have seen this happen on Darwin x86_64 and Darwin M1. We may have seen it on Linux as well, but I no longer have the logs to verify.
I tried to reproduce this on my macOS machine overnight, without success. However, we have consistently received a couple of complaints a week over the last few weeks. Note that the sample size is quite large.
libcompiler_rt.a: No such file or directory
This happened on our CI yesterday:
/tmp/bazel-zig-cc/o/4421bb1adcf01feee7185ccb98640027/libcompiler_rt.a: No such file or directory
Unfortunately, I can no longer access the build host or its global cache dir. This failure may be related to the FileNotFound error above.
Summary
I understand this is very little information to troubleshoot effectively. Here are the steps I am planning to take:
- Capture the first error (FileNotFound) on any Linux machine and instruct the engineer to re-run the command under strace, to see which file is missing. However, this was not reported on Linux for the last week or so: either it did not happen, or people learned to remove the cache directory and move on. Since this happens more often on macOS, it would make sense to debug it there. However, our engineers cannot run dtruss for compliance reasons.
- I will try to repro this on my macOS machine again, but slightly differently.
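For anyone who wants to help with the repro, here is a rough sketch of the kind of loop I have in mind. It is only an illustration: the helper name find_failure, the attempts and parallel parameters, and the idea of starting several copies of the command at once against one fresh cache are my own assumptions. I do not know whether concurrency is actually needed to trigger the failure. Each attempt points $ZIG_GLOBAL_CACHE_DIR at a brand-new empty directory, and the cache of the first failing attempt is kept on disk for inspection.

```python
import os
import shutil
import subprocess
import tempfile


def find_failure(cmd, attempts=100, parallel=4):
    """Run cmd repeatedly, each attempt against a fresh, empty cache dir.

    Hypothetical helper: on every attempt, `parallel` copies of cmd are
    started at once and share one new ZIG_GLOBAL_CACHE_DIR. Returns None if
    every copy succeeded on every attempt, otherwise a dict describing the
    first failing attempt. The failing cache dir is left on disk.
    """
    for attempt in range(attempts):
        cache_dir = tempfile.mkdtemp(prefix="zig-cache-repro-")
        env = dict(os.environ, ZIG_GLOBAL_CACHE_DIR=cache_dir)

        # Start all copies before waiting on any of them, so they overlap.
        procs = [
            subprocess.Popen(
                cmd,
                env=env,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.PIPE,
                text=True,
            )
            for _ in range(parallel)
        ]

        failures = []
        for proc in procs:
            _, stderr = proc.communicate()
            if proc.returncode != 0:
                failures.append(stderr)

        if failures:
            return {"attempt": attempt, "cache_dir": cache_dir, "stderr": failures}

        # Clean attempt: discard the cache so the next one starts fresh again.
        shutil.rmtree(cache_dir, ignore_errors=True)

    return None


# Example call (not executed here; assumes zig is on PATH):
#   find_failure(["zig", "build-exe", "toolchain/launcher.zig"], attempts=500)
```

If this ever returns a result, the retained cache_dir can be compared against a healthy cache to see which file is missing. Note that parallel copies of the same build-exe command also write the same output file in the working directory, so parallel=1 may be the safer starting point.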
Food for thought: is it time to reconsider how error context is propagated during the build phase, so errors could be augmented with additional context?