Why is a transient error in http download_and_extract not retried? #23687
Comments
I'm going to look into this myself on Monday. If anyone wants to drop any hints about how to schedule retries of async actions in the Bazel HTTP downloader code, that would be welcome.
Out of curiosity (and also because it's an assumption we rely upon elsewhere), which filesystem have you observed this on?
It’s most noticeable on NFS (I know, I know), specifically on some scale-out caching appliances.
On NFS, we tracked down the root cause on our system. It's not so much atomic delete/unlink; it's that if you have an open file inside that directory and ask NFS to delete that file, NFS will keep it around as a hidden file. On non-NFS filesystems, you can rely on the kernel to keep that file resource around until the file handle is closed, so you can delete a file even if it's still open somewhere (that's how tempfiles work on Linux). But NFS uses that hidden-file technique since multiple kernels might have open file handles to deleted files (I haven't looked too deeply at the kernel code, though; maybe this could be fixed).
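To make the contrast above concrete, here is a minimal sketch of the "delete while still open" pattern. It assumes a local POSIX filesystem on Linux; the class and method names (`UnlinkWhileOpen`, `readAfterDelete`) are hypothetical and not taken from Bazel. On NFS, as described above, the second delete may fail because a hidden placeholder can keep the directory non-empty.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration, not Bazel code. */
final class UnlinkWhileOpen {
  private UnlinkWhileOpen() {}

  /** Writes a file into {@code dir}, deletes file and directory while the file is open, then reads it. */
  static String readAfterDelete(Path dir) throws IOException {
    Path file = dir.resolve("scratch.txt");
    Files.write(file, "still here".getBytes(StandardCharsets.UTF_8));
    try (InputStream in = Files.newInputStream(file)) {
      // On a local Linux filesystem the name disappears at once, while the
      // kernel keeps the data alive until the open handle is closed.
      Files.delete(file);
      // Expected to succeed locally. On NFS this is the call that may throw
      // DirectoryNotEmptyException while the hidden placeholder still exists.
      Files.delete(dir);
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
  }
}
```

If this picture is right, closing every handle before cleanup should avoid the NFS placeholder entirely, which is what the follow-up question in the thread is getting at.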
Possibly my patch would solve your situation too, if it’s a race between Bazel and itself.
That NFS behaves in that way makes sense to me. But if it's a race between Bazel and itself, why can't we arrange for the file to be closed before we attempt the cleanup? Or is there some other process holding the file open?
I can't speak to @mark64's problem, but in my case, a file is correctly removed, but that removal is not visible (to Bazel, for example) for a few tens or sometimes hundreds of milliseconds, while the deletion is socialized to all of the cache cluster members. So the Bazel sequence of "remove all the files in a subdirectory" then "remove the subdirectory" throws an exception because at step 2, sometimes, the subdirectory is not yet empty. My patch above simply retries step 2 until it succeeds, and only re-throws the exception and aborts if the delete is still failing after five seconds.
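The retry described above can be sketched roughly as follows. This is not the actual patch: `RetryingDelete`, `deleteDirectoryWithRetry`, and the pause interval are hypothetical names and values chosen for illustration, and the sketch uses plain `java.nio.file` rather than Bazel's own filesystem classes.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

/** Hypothetical helper, not part of Bazel: bounded retry of a directory delete. */
final class RetryingDelete {
  private RetryingDelete() {}

  /**
   * Tries to delete {@code dir}, retrying while it is reported as non-empty.
   * Returns the number of attempts made; re-throws the last
   * DirectoryNotEmptyException once {@code timeout} has elapsed.
   */
  static int deleteDirectoryWithRetry(Path dir, Duration timeout, Duration pause)
      throws IOException, InterruptedException {
    long deadline = System.nanoTime() + timeout.toNanos();
    int attempts = 0;
    while (true) {
      attempts++;
      try {
        Files.deleteIfExists(dir);
        return attempts;
      } catch (DirectoryNotEmptyException e) {
        // The children were already removed by the caller, but the removal
        // may not be visible yet. Give up only after the deadline has passed.
        if (System.nanoTime() - deadline >= 0) {
          throw e;
        }
        Thread.sleep(pause.toMillis());
      }
    }
  }
}
```

A call matching the five-second budget mentioned above might look like `RetryingDelete.deleteDirectoryWithRetry(tmpDir, Duration.ofSeconds(5), Duration.ofMillis(50))`. Only the "directory not empty" case is retried, so other I/O errors still fail immediately.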
Thanks, that was the missing detail. I just wanted to make sure the retries weren't simply papering over a Bazel bug (failure to issue operations in the right order). I am a bit concerned that this pattern does exist elsewhere in the codebase (we generally assume the filesystem is POSIX-compliant) but I guess we can cross that bridge when we get to it.
Something that occurred to me yesterday is that "clean --expunge" never seems to have this issue, so however that is implemented, it is doing something different.
Description of the bug:
Downloading artifacts using http_archive, as rules_rust does for downloading cargo crates, can sometimes fail to delete a temporary directory because it's not yet empty:
This error seems to originate from bazel/src/main/java/com/google/devtools/build/lib/bazel/repository/starlark/StarlarkBaseExternalContext.java, line 1061 in ce64b1a.
Which category does this issue belong to?
Starlark Interpreter
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Build a rust project with lots of crate dependencies and an output_user_root on a filesystem that does not guarantee atomic delete/unlink visibility.
Which operating system are you running Bazel on?
Linux
What is the output of bazel info release?
release 7.3.1
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse HEAD?
No response
If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.
No response
Have you found anything relevant by searching the web?
#20013 seems like the same problem - it just seems to me that exceptions marked TRANSIENT which are to do with cleaning up things like temporary scratch directories should be retried instead of killing the entire build.
Any other information, logs, or outputs that you want to share?
No response