
Why is a transient error in http download_and_extract not retried? #23687

Open
sthornington opened this issue Sep 19, 2024 · 9 comments
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. type: bug

Comments

@sthornington
Copy link
Contributor

sthornington commented Sep 19, 2024

Description of the bug:

Downloading artifacts using http_archive, as rules_rust does when fetching Cargo crates, can sometimes fail to delete a temporary directory because it is not yet empty:

INFO: Repository crate_index__fastrand-2.1.1 instantiated at:
  /scratch/simont/src/quadcap/WORKSPACE:122:19: in <toplevel>
  /stuff/simont/bazel_base/4fd9a0903ffbf1e501ab97de5c381dcc/external/crate_index/defs.bzl:579:10: in crate_repositories
  /stuff/simont/bazel_base/4fd9a0903ffbf1e501ab97de5c381dcc/external/bazel_tools/tools/build_defs/repo/utils.bzl:268:18: in maybe
Repository rule http_archive defined at:
  /stuff/simont/bazel_base/4fd9a0903ffbf1e501ab97de5c381dcc/external/bazel_tools/tools/build_defs/repo/http.bzl:382:31: in <toplevel>
ERROR: An error occurred during the fetch of repository 'crate_index__fastrand-2.1.1':
   Traceback (most recent call last):
        File "/stuff/simont/bazel_base/4fd9a0903ffbf1e501ab97de5c381dcc/external/bazel_tools/tools/build_defs/repo/http.bzl", line 131, column 45, in _http_archive_impl
                download_info = ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Couldn't delete temporary directory (/stuff/simont/bazel_base/4fd9a0903ffbf1e501ab97de5c381dcc/external/crate_index__fastrand-2.1.1/temp1055349778882693560): /stuff/simont/bazel_base/4fd9a0903ffbf1e501ab97de5c381dcc/external/crate_index__fastrand-2.1.1/temp1055349778882693560 (Directory not empty)
ERROR: no such package '@@crate_index__fastrand-2.1.1//': java.io.IOException: Couldn't delete temporary directory (/stuff/simont/bazel_base/4fd9a0903ffbf1e501ab97de5c381dcc/external/crate_index__fastrand-2.1.1/temp1055349778882693560): /stuff/simont/bazel_base/4fd9a0903ffbf1e501ab97de5c381dcc/external/crate_index__fastrand-2.1.1/temp1055349778882693560 (Directory not empty)
ERROR: /stuff/simont/bazel_base/4fd9a0903ffbf1e501ab97de5c381dcc/external/crate_index__tempfile-3.12.0/BUILD.bazel:16:13: @@crate_index__tempfile-3.12.0//:tempfile depends on @@crate_index__fastrand-2.1.1//:fastrand in repository @@crate_index__fastrand-2.1.1 which failed to fetch. no such package '@@crate_index__fastrand-2.1.1//': java.io.IOException: Couldn't delete temporary directory (/stuff/simont/bazel_base/4fd9a0903ffbf1e501ab97de5c381dcc/external/crate_index__fastrand-2.1.1/temp1055349778882693560): /stuff/simont/bazel_base/4fd9a0903ffbf1e501ab97de5c381dcc/external/crate_index__fastrand-2.1.1/temp1055349778882693560 (Directory not empty)
Use --verbose_failures to see the command lines of failed build steps.
ERROR: Analysis of target '//rust/qcrs_link_demo:qcrs_link_demo' failed; build aborted: Analysis failed
INFO: Elapsed time: 28.348s, Critical Path: 0.91s
INFO: 269 processes: 108 remote cache hit, 161 internal.
ERROR: Build did NOT complete successfully

This error seems to originate from

and if one's filesystem behaves this way, there does not seem to be any possible mitigation. Perhaps temporary downloads could be written to a real temporary directory, outside the output_base, before being emplaced in their final location?
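The download-to-scratch-then-emplace idea suggested above might look roughly like the following Python sketch. This is illustrative only, not Bazel's implementation; `download_then_emplace` is a hypothetical name, and the staging file is created next to the destination (rather than in /tmp) because rename(2) is only atomic within a single filesystem:

```python
import os
import tempfile
import urllib.request


def download_then_emplace(url, dest):
    """Fetch url into a scratch file, then atomically rename it to dest.

    Hypothetical sketch of the mitigation suggested above. Staging in the
    same directory as dest keeps os.replace() a same-filesystem rename,
    so readers never observe a partially written file at dest.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest) or ".")
    try:
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as resp:
            out.write(resp.read())
        os.replace(tmp, dest)  # atomic within one filesystem
    except BaseException:
        os.unlink(tmp)  # clean up the scratch file on any failure
        raise
```

A real implementation would stream the response in chunks and verify checksums before the rename, but the emplace step is the part relevant to this issue.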

Which category does this issue belong to?

Starlark Interpreter

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Build a rust project with lots of crate dependencies and an output_user_root on a filesystem that does not guarantee atomic delete/unlink visibility.

Which operating system are you running Bazel on?

linux

What is the output of bazel info release?

release 7.3.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

#20013 seems like the same problem. In my view, exceptions marked TRANSIENT that relate to cleaning up things like temporary scratch directories should be retried instead of killing the entire build.

Any other information, logs, or outputs that you want to share?

No response

@github-actions github-actions bot added the team-Starlark-Interpreter Issues involving the Starlark interpreter used by Bazel label Sep 19, 2024
@meteorcloudy meteorcloudy added team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. and removed team-Starlark-Interpreter Issues involving the Starlark interpreter used by Bazel labels Sep 23, 2024
@Wyverald Wyverald added P3 We're not considering working on this, but happy to review a PR. (No assignee) and removed untriaged labels Sep 24, 2024
@sthornington
Copy link
Contributor Author

I'm going to look into this myself on Monday - if anyone wants to drop any hints about how to schedule retries of async actions in the bazel http downloader code, that would be welcome.

@tjgq
Copy link
Contributor

tjgq commented Nov 15, 2024

on a filesystem that does not guarantee atomic delete/unlink visibility

Out of curiosity (and also because it's an assumption we rely upon elsewhere), which filesystem have you observed this on?

@sthornington
Copy link
Contributor Author

sthornington commented Nov 15, 2024

It’s most noticeable on NFS (I know I know) specifically on some scale-out caching appliances.

@mark64
Copy link

mark64 commented Nov 15, 2024

On NFS we tracked down the root cause on our system. It's not so much atomic delete/unlink; it's actually that if you have an open file inside that directory and ask NFS to delete that file, NFS will keep it around as a .nfs-<some hash> hidden file inside the directory until you close that file reference. That makes it impossible to remove the directory containing that hidden file, and so Bazel's downloader fails, saying it tried to remove a non-empty directory. But by the time you go check whether the file is still around, Bazel has closed its open reference to that file, and so the directory is indeed empty.

On non-NFS filesystems, you can rely on the kernel to keep that file resource around until the file handle is closed, so you can delete a file even if it's still open somewhere (that's how tempfiles work on Linux). But NFS uses that hidden-file technique since it's got multiple kernels that might have open file handles to deleted files (I haven't looked too deeply at the kernel code, though; maybe this could be fixed).
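The local-filesystem behavior described above (unlinking a file while it is still open, the basis of Linux tempfiles) can be demonstrated with a small Python sketch. This is illustrative only; on NFS the unlink would instead leave a .nfs-* silly-rename file in the directory until the handle closes:

```python
import os
import tempfile

# On a local POSIX filesystem an open file can be unlinked; the kernel
# keeps the inode alive until the last handle to it is closed.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.unlink(path)                   # the name is gone from the directory...
assert not os.path.exists(path)
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 5)             # ...but the open handle still reads the data
os.close(fd)
print(data)  # b'hello'
```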

@sthornington
Copy link
Contributor Author

Possibly my patch would solve your situation too, if it’s a race between Bazel and itself.

@tjgq
Copy link
Contributor

tjgq commented Nov 15, 2024

it's actually that if you have an open file inside that directory and ask NFS to delete that file, NFS will keep it around as a .nfs- hidden file inside the directory until you close that file reference.

That NFS behaves in that way makes sense to me. But if it's a race between Bazel and itself, why can't we arrange for the file to be closed before we attempt the cleanup? Or is there some other process holding the file open?

@sthornington
Copy link
Contributor Author

sthornington commented Nov 15, 2024

That NFS behaves in that way makes sense to me. But if it's a race between Bazel and itself, why can't we arrange for the file to be closed before we attempt the cleanup? Or is there some other process holding the file open?

I can't speak to @mark64 's problem, but in my case, a file is correctly removed, yet that removal is not visible (to Bazel, for example) for a few tens or sometimes hundreds of milliseconds, as the deletion is socialized to all of the cache cluster members. So the Bazel sequence of "remove all the files in a subdirectory" then "remove the subdirectory" throws an exception because at step 2, sometimes, the subdirectory is not empty yet.

My patch above simply retries step 2 until it succeeds, only ultimately re-throwing the exception and aborting if it remains impossible for five seconds.
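The retry-with-timeout idea described above might look roughly like the following Python sketch. The function name, polling interval, and five-second budget are illustrative; the actual patch is against Bazel's Java codebase:

```python
import errno
import os
import shutil
import time


def rmtree_with_retry(path, timeout_s=5.0, poll_s=0.05):
    """Remove a directory tree, retrying transient ENOTEMPTY failures.

    On filesystems where unlink visibility lags (e.g. some caching NFS
    appliances), the directory can still look non-empty for a short
    window after its contents were deleted. Retry until the removal
    succeeds or timeout_s elapses, then re-raise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            shutil.rmtree(path)
            return
        except OSError as e:
            # Only ENOTEMPTY is treated as transient; anything else,
            # or running out of time, propagates to the caller.
            if e.errno != errno.ENOTEMPTY or time.monotonic() >= deadline:
                raise
            time.sleep(poll_s)
```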

@tjgq
Copy link
Contributor

tjgq commented Nov 15, 2024

as the deletion is socialized to all of the cache cluster members

Thanks, that was the missing detail. I just wanted to make sure the retries weren't simply papering over a Bazel bug (failure to issue operations in the right order). I am a bit concerned that this pattern does exist elsewhere in the codebase (we generally assume the filesystem is POSIX-compliant) but I guess we can cross that bridge when we get to it.

@sthornington
Copy link
Contributor Author

Something that occurred to me yesterday is that `bazel clean --expunge` never seems to have this issue, so however that is implemented, it is doing something different.
