
Conversation

@ranger-ross (Member) commented Oct 26, 2025

This PR adds fine grain locking for the build cache using build unit level locking.
I'd recommend reading the design details in this description and then reviewing commit by commit.
Part of #4282

Previous attempt: #16089

Design decisions / rationale

  • Using build unit level locking instead of a temporary working directory.
    • After experimenting with multiple approaches, I am currently leaning towards build unit level locking.
    • The working directory approach introduces a fair bit of uplifting complexity, and the further along I pushed my prototype, the more I ran into unexpected issues:
      • mtime changes in fingerprints due to uplifting/downlifting order
      • tests/benches need to be run before being uplifted OR uplifted and locked during execution, which leads to more locking design being needed. (Also, running pre-uplift introduces other potential side effects, like the path displayed to the user being deleted since it is temporary.)
    • The trade-off here is that with build unit level locks, we need a more advanced locking mechanism and we will have more open locks at once.
    • The reason I think this is a worthwhile trade-off is that the locking complexity can largely be contained to a single module, whereas the uplifting complexity would be spread throughout the cargo codebase anywhere we do uplifting. The increased lock count, while unavoidable, can be mitigated (see below for more details).
  • Risk of too many locks (file descriptors)
    • On Linux, 1024 is a fairly common default soft limit. Windows is even lower at 256.
    • Having 2 locks per build unit makes it possible to hit the limit with a moderate number of dependencies.
    • There are a few mitigations I could think of for this problem (that are included in this PR):
      • Increasing the file descriptor limit based on the number of build units (if the hard limit is high enough)
      • Sharing file descriptors for shared locks across jobs (within a single process) using a virtual lock
        • This could be implemented using reference counting (see the sketch after this list).
      • Falling back to coarse grain locking if some heuristic is not met
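
A minimal sketch of the reference-counting idea (RcFileLock and FileLockInterner are names from this PR, but the implementation below is illustrative, not the PR's actual code):

use std::collections::HashMap;
use std::fs::File;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex};

/// Interns shared file locks so concurrent jobs in one process reuse a
/// single file descriptor per lock path instead of each opening their own.
#[derive(Default)]
struct FileLockInterner {
    locks: Mutex<HashMap<PathBuf, Arc<File>>>,
}

impl FileLockInterner {
    /// Returns a reference-counted handle to the shared lock for `path`,
    /// opening and locking the file only on first use.
    fn shared(&self, path: &Path) -> std::io::Result<Arc<File>> {
        let mut locks = self.locks.lock().unwrap();
        if let Some(fd) = locks.get(path) {
            return Ok(Arc::clone(fd));
        }
        let file = File::create(path)?;
        file.lock_shared()?; // std file locking (stable since Rust 1.89)
        let fd = Arc::new(file);
        locks.insert(path.to_path_buf(), Arc::clone(&fd));
        Ok(fd)
    }
}

Dropping the last Arc closes the descriptor and releases the lock; holding the interner for the whole process (as this PR does) keeps shared locks alive until exit.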

Implementation details

  • We have a stateful lock per build unit made up of two file locks, primary.lock and secondary.lock (see the locking.rs module docs for more details on the states, and the sketch after this list)
    • This is needed to enable pipelined builds
  • We fall back to coarse grain locking if fine grain locking is determined to be unsafe (see determine_locking_mode())
  • Fine grain locking continues to take the existing .cargo-lock lock as a read-only shared lock, to keep working with older cargo versions while allowing multiple newer cargo instances to run in parallel.
  • Locking is disabled on network filesystems. (This keeps the existing behavior from Don't use flock on NFS mounts #2623.)
  • cargo clean continues to use coarse grain locking for simplicity.
  • File descriptors
    • I added functionality to increase the file descriptor limit if cargo detects that there will not be enough descriptors based on the number of build units in the UnitGraph.
    • If we aren't able to raise the limit to a threshold (currently the number of build units * 10), we automatically fall back to coarse grain locking and display a warning to the user.
      • I picked 10 times the number of build units as a conservative estimate for now. I think lowering this number may be reasonable.
    • While testing, I was seeing a peak of ~3,200 open file descriptors while compiling Zed. This is approximately 2x the number of build units.
      • Without the RcFileLock I was seeing peaks of ~12,000 open fds, which I felt was quite high even for a large project like Zed.
  • We use a global FileLockInterner that holds on to the file descriptors (RcFileLock) until the end of the process. (We could potentially add it to JobState if preferred; it would just be a bit more plumbing.)
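
To make the pipelining role of the two locks concrete, here is a rough sketch of the acquire/downgrade flow, simplified from the lock and downgrade code in this PR (the method names below are my own):

use std::fs::File;
use std::io;

/// Sketch of a per-build-unit lock built from two lock files.
struct UnitLock {
    primary: File,   // primary.lock
    secondary: File, // secondary.lock
}

impl UnitLock {
    /// Take both locks exclusively before compiling the unit.
    fn lock_for_compile(&self) -> io::Result<()> {
        self.primary.lock()?;
        self.secondary.lock()?;
        Ok(())
    }

    /// Once the rmeta exists, downgrade the primary lock to shared so
    /// pipelined dependents can start, while the still-exclusive
    /// secondary lock signals that the rlib is not finished yet.
    fn downgrade_after_rmeta(&self) -> io::Result<()> {
        // Relies on flock() converting an existing lock in place; as
        // discussed in the review below, std does not guarantee this today.
        self.primary.lock_shared()
    }
}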

Open Questions

  • How do we want to handle locking on the artifact directory?
    • We could simply continue using coarse grain locking, locking and unlocking when files are uplifted.
    • One downside of locking/unlocking multiple times per invocation is that the artifact-dir is touched many times across the compilation process (for example, there is a pre-rustc clean-up step). We also need to take into account other commands like cargo doc.
    • Another option would be to only take a lock on the artifact-dir for commands that we know will uplift files (e.g. cargo check would not take a lock on artifact-dir but cargo build would). This would mean that 2 cargo build invocations would not run in parallel, because one of them would hold the artifact-dir lock (blocking the other). This might actually be ideal to avoid 2 instances fighting over the CPU while recompiling the same crates.
  • What should our testing strategy for locking be?
    • My testing strategy thus far has been to run cargo on dummy projects to verify the locking.
    • For the max file descriptor testing, I have been using the Zed codebase as a testbed, as it has over 1,500 build units, which is more than the default ulimit on my Linux system. (I am happy to test this on other large codebases that we think would be good to verify against.)
    • It's not immediately obvious to me how to create repeatable unit tests for this or what those tests should be testing for.
    • For performance testing, I have been using hyperfine to benchmark builds with and without -Zbuild-dir-new-layout. With the current implementation I am not seeing any perf regression on Linux, but I have yet to test on Windows/macOS.
  • Should we expose an option in the future to allow users to force coarse grain locking?
    • In the event a user’s system can’t support fine grain locking for some reason, should we provide some way to control the locking mode like CARGO_BUILD_LOCKING_MODE=coarse?
    • If so, would this be something that would be temporary during the transition period or a permanent feature?
  • What should the heuristics to disable fine grain locking be?
    • Currently, fine grain locking is disabled if the max file descriptor limit is less than 10 times the number of build units. This is pretty conservative; from my testing, usage generally peaks around 2 times the number of build units. (See the sketch after this list.)
    • I wonder if there is any other information in the crate graph that we could use as a heuristic?
  • How should we interface with the file system lock? See Implement fine grain locking for the target directory #16155 (comment)
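
As a reference point for the heuristics question above, a minimal sketch of the current check (determine_locking_mode() is the real function name from this PR; the signature and body below are simplified assumptions):

enum LockingMode {
    Coarse,
    FineGrain,
}

/// Simplified sketch: fall back to coarse grain locking unless the file
/// descriptor limit comfortably exceeds the expected peak usage.
fn determine_locking_mode(build_unit_count: u64, max_fds: u64) -> LockingMode {
    // 10x is conservative; observed peaks were closer to 2x the unit count.
    if max_fds >= build_unit_count * 10 {
        LockingMode::FineGrain
    } else {
        LockingMode::Coarse
    }
}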

@rustbot (Collaborator) commented Oct 26, 2025

r? @ehuss

rustbot has assigned @ehuss.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot added the A-build-execution, A-filesystem, A-layout, Command-clean, and S-waiting-on-review labels on Oct 26, 2025
Comment on lines +53 to +57
/// Coarse grain locking (Profile level)
Coarse,
Contributor

Falling back to coarse grain locking if some heuristic is not met

  • This prevents us from reducing the layers in the new build dir layout. Not a requirement, but I sure was hoping for that.
  • What does this mean for Per-user compiled artifact cache #5931 as grabbing a coarse lock for the entire shared cache may be a no-go?

Member Author

This prevents us from reducing the layers in the new build dir layout. Not a requirement, but I sure was hoping for that.

I think the benefits of having coarse grain locking as a fallback are worth the trade-off of the extra nesting, at least in the short term. If there is some system configuration that can't do fine grain locking for some reason, users would still at least be able to build their project. It also makes our lives easier for commands other than build/check (thinking about cargo clean, which does not construct a BuildRunner that has a lot of the information we use for fine grain locking).

What does this mean for #5931 as grabbing a coarse lock for the entire shared cache may be a no-go?

For the artifact cache, the locking mode does not need to match the locking mode of build-dir.
I agree that coarse grain locking would certainly not be ideal for the artifact cache.
Given that the artifact cache will be green field, I am hopeful that we can have a fairly simple flat structure with fine grain locking (with a single lock per build unit since pipelined builds will not be a problem for the cache)

Contributor

For the artifact cache, the locking mode does not need to match the locking mode of build-dir.

Part of the purpose of this is to explore what we'll need for that.

Given that the artifact cache will be green field, I am hopeful that we can have a fairly simple flat structure with fine grain locking (with a single lock per build unit since pipelined builds will not be a problem for the cache)

How will it not have a problem with pipelined builds?

Member Author

How will it not have a problem with pipelined builds?

How I am imagining it in my head is:

  1. lock build-dir unit
  2. compile rmeta
  3. downgrade build-dir unit lock (unblocking other build units)
  4. compile rlib
  5. acquire artifact-cache unit lock
  6. uplift completed unit
  7. release both artifact cache and build-dir locks

Contributor

Doesn't that apply equally to this?

Member Author

I think artifact-cache and build-dir are different in that build-dir will have the "rmeta exists but not rlib" state.
artifact-cache will likely have an immutable, atomically created unit (copied/linked from build-dir).

When we have a build that needs to pull from the cache it would:

  1. take a write lock on the build-dir unit
  2. a shared lock on the artifact-cache unit
  3. uplift from artifact-cache into build-dir
  4. unlock artifact-cache
  5. unlock build-dir
  6. proceed with further build units

since the artifact cache will only handle immutable units, pipelined builds should not be a problem, right?

Comment on lines +53 to +57
/// Coarse grain locking (Profile level)
Coarse,
Contributor

mtime changes in fingerprints due to uplifting/downlifting order

This will also be a concern for #5931

For non-local packages, we don't check mtimes. Unsure what we do for their build script runs.

Member Author

I haven't quite thought through how we will handle mtime for the artifact cache.

Checksum freshness (#14136) would sure make this easier as mtimes are painful to deal with.
Might be worth exploring pushing that forward before starting on the artifact cache...

Contributor

Note that mtimes are still used for build scripts

Comment on lines +53 to +57
/// Coarse grain locking (Profile level)
Coarse,
Contributor

tests/benches need to be run before being uplifted OR uplifted and locked during execution, which leads to more locking design being needed. (Also, running pre-uplift introduces other potential side effects, like the path displayed to the user being deleted since it is temporary.)

Do we hold any locks during test execution today? I'm not aware of any.

Member Author

hmmm, I was under the assumption that the lock in Layout was held while running the tests, but I never explicitly tested that. Upon further inspection, we do indeed release the lock before executing the tests.

/// The lockfile for a build (`.cargo-lock`). Will be unlocked when this
/// struct is `drop`ped.
_lock: FileLock,
_lock: Option<FileLock>,
Contributor

Is this a change that would be worth moving out into a refactor that gets merged independent of the rest of this?

Contributor

Note: this is referring to the work dealing with is_on_nfs_mount

Member Author

pulled this out into #16177

Contributor

How do we want to handle locking on the artifact directory?

Isn't resolving this a pre-requisite since without it, we are still blocking and can't see this in action? I had thought I had already brought up doing this first but can't seem to find it.

I feel like the first step is "only locking what's used". That would ideally be small and something we can insta-stabilize independent of this. We could explore any of the other ideas in the future as we find we need them. The results of this approach are analyzed in #4282 (comment). In the short term, it lets users think in terms of splitting their build-dir for r-a rather than splitting their target-dir.

Member Author

Isn't resolving this a pre-requisite since without it, we are still blocking and can't see this in action? I had thought I had already brought up doing this first but can't seem to find it.

Yes. Currently we aren't getting any real benefits until this is figured out since coarse grain locking on artifact-dir will block 2 cargo instances from running in parallel.

only locking what's used

Agreed. I'd be happy to completely remove locking on the artifact-dir and indirectly rely on the build-dir locking to mitigate the risk of 2 cargo builds uplifting at the same time. Though others may have different opinions on this.

There are approaches to locking the artifact-dir that I mentioned in the PR description, but they will require some more effort.
I wanted to post this PR to get some initial feedback to make sure the overall direction was good before investing more time in artifact-dir locking.

Contributor

Agreed. I'd be happy to completely remove locking on the artifact-dir and indirectly rely on the build-dir locking to mitigate the risk of 2 cargo builds uplifting at the same time. Though others may have different opinions on this.

The artifact-dir can be completely independent of the build-dir, and so we can't indirectly rely on it. We still need to lock the artifact-dir, just only when needed. I feel like that should be a fairly uncontroversial change.

Member Author

The artifact-dir can be completely independent of the build-dir, and so we can't indirectly rely on it. We still need to lock the artifact-dir, just only when needed. I feel like that should be a fairly uncontroversial change.

Sure, I think we can keep coarse locking on the artifact-dir and lock it as needed.
I just think the code changes to do this might be more than expected since uplifting happens in many places. (correct me if I am wrong)
Would you prefer to include that in this PR or split out into a follow up?

Contributor

I just think the code changes to do this might be more than expected since uplifting happens in many places. (correct me if I am wrong)

The problem is less where we do uplifting and more whether we know if uplifting will happen (at least for the solution I've been highlighting as a minimal step that can go in now). I can see the number of places we uplift possibly complicating that, but I'm hoping we have a good rough feel for this. check doesn't need a lock, while likely everything else does.

Would you prefer to include that in this PR or split out into a follow up?

I was actually thinking we should do it first:

Isn't resolving this a pre-requisite since without it, we are still blocking and can't see this in action? I had thought I had already brought up doing this first but can't seem to find it.

I feel like the first step is "only locking what's used". That would ideally be small and something we can insta-stabilize independent of this. We could explore any of the other ideas in the future as we find we need them.

Member Author

ahh okay, I see where you are coming from. I suppose that is what you were hinting at in this comment.

Seems like a reasonable strategy. I could open a separate PR to focus on that first, then come back to this one.

Contributor

ahh okay, I see where you are coming from. I suppose that is what you were hinting at in #16155 (comment).

I added a clarifying note that it is the is_on_nfs_mount work

//! The locking scheme is based on build unit level locking.
//! Each build unit consists of a primary and secondary lock used to represent multiple lock states.
//!
//! | State | Primary | Secondary |
Contributor

Can we give semantic names to the locks, rather than using Primary and Secondary?

Member Author

I was going back and forth on what to call these.
I went with generic names for now because I was afraid that picking semantic names may lead to the names becoming out of date over time as cargo/rustc evolves and semantics change.

But I don't have a super strong opinion on this if we can think of names that better communicate the meaning of the lock.

Member Author

I gave some thought to this and I came up with a few potential lock names.

  • primary.lock -> partial.lock and secondary.lock -> full.lock
    • Given that the first lock, partial.lock, is enough to take a shared lock and read, but to write (or to guarantee no one else is writing) you have to take full.lock.
    • This also aligns with the semantics of the SharedLockType enum.
  • primary.lock -> read.lock and secondary.lock -> write.lock
    • You can read if you have any kind of lock on read.lock, but you must have a write lock on write.lock to write.
    • This gets a bit weird because to take a "shared full" you take a read lock on write.lock 😅

open to more suggestions if you have any

Contributor

Some thoughts

  • partial, interface, or rmeta (being specific means we can lock if we ever break this up into smaller pieces)
  • full, all, or complete

No strong preferences. I feel like read and write are conveyed by the type of lock we acquire, so it doesn't make sense to use those as names.

Member Author

I think partial and full seem good to me

//! The locking scheme is based on build unit level locking.
//! Each build unit consists of a primary and secondary lock used to represent multiple lock states.
//!
//! | State | Primary | Secondary |
Contributor

Would also be good to consistently use the lockfile name, rather than switching between primary.lock and Primary

Comment on lines +61 to +67
/// A shared lock that might still be compiling a .rlib
Partial,
/// A shared lock that is guaranteed to not be compiling
Full,
Contributor

These seem to correspond to the locks, should we tie the names together?

Comment on lines 133 to 143
let primary_lock = open_file(&self.primary)?;
primary_lock.lock()?;

let secondary_lock = open_file(&self.secondary)?;
secondary_lock.lock()?;

self.guard = Some(UnitLockGuard {
primary: primary_lock,
_secondary: Some(secondary_lock),
});
Ok(())
Contributor

Have we double checked if we run into problems like #15698?

Comment on lines 167 to 193
pub fn downgrade(&mut self) -> CargoResult<()> {
let guard = self
.guard
.as_ref()
.context("guard was None while calling downgrade")?;

// NOTE:
// > Subsequent flock() calls on an already locked file will convert an existing lock to the new lock mode.
// https://man7.org/linux/man-pages/man2/flock.2.html
//
// However, `std::fs::File::lock/lock_shared` is allowed to change this in the
// future. So it's probably up to us if we are okay with using this or if we want to use a
// different interface to flock.
guard.primary.lock_shared()?;

Ok(())
}
}
Contributor

We should rely on advertised behavior, especially as I'm assuming not all platforms are backed by flock, like Windows.

@ranger-ross (Member Author) commented Oct 28, 2025

We could probably move away from the std interface, or perhaps open an issue to see if t-libs would be willing to clarify the behavior of calling lock_shared() while holding an exclusive lock.

side note: I came across this crate which advertises the behavior we need. It's fairly small (and MIT), so we could potentially reuse part of this code for cargo's use case (or use it directly, though I don't know cargo's policy on taking dependencies on third-party crates).

Any concerns with keeping the std lock for now and having an action item for this prior to stabilization?

Contributor

We can move this to an Unresolved issue.

@ranger-ross force-pushed the multi-locking-attempt branch from b234db5 to 1deca86 on October 28, 2025 07:17
@rustbot rustbot added A-documenting-cargo-itself Area: Cargo's documentation A-unstable Area: nightly unstable support labels Oct 28, 2025
Comment on lines 1519 to 1525
fn fail_if_incompatible_opts(&self) -> CargoResult<()> {
if self.fine_grain_locking && !self.build_dir_new_layout {
bail!("-Z fine-grain-locking requires -Z build-dir-new-layout");
}

Ok(())
}
Contributor

Or we could have fine-grain-locking set build-dir-new-layout when we set it. Unsure which way to go

Member Author

I considered that. I went with the more conservative option of requiring both flags.

Though practically speaking, I think implicitly enabling it would make users' lives easier, since they probably wouldn't be using this unstable feature without being aware of the build-dir layout change requirement.

I'd be okay with fine-grain-locking implicitly enabling build-dir-new-layout

}

let display_fd_warning = || -> CargoResult<()> {
bcx.gctx.shell().verbose(|shell| shell.warn("ulimit was too low to safely enable fine grain locking. falling back to coarse grain locking"))
Contributor

bcx.gctx.shell().verbose(|shell| shell.warn("ulimit was too low to safely enable fine grain locking, falling back to coarse grain locking"))

This commit adds handling to attempt to increase the max file
descriptors if we detect that there is a risk of hitting the limit.

If we cannot increase the max file descriptors, we fall back to coarse
grain locking as that is better than a build crashing due to resource
limits.
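
A sketch of how such a raise can work on Unix via setrlimit (this uses the libc crate directly; the commit's actual implementation may differ):

#[cfg(unix)]
fn try_raise_fd_limit(needed: libc::rlim_t) -> std::io::Result<libc::rlim_t> {
    use std::io;

    let mut rlim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
    // SAFETY: we pass a valid pointer to an rlimit struct.
    if unsafe { libc::getrlimit(libc::RLIMIT_NOFILE, &mut rlim) } != 0 {
        return Err(io::Error::last_os_error());
    }
    if rlim.rlim_cur >= needed {
        return Ok(rlim.rlim_cur);
    }
    // The soft limit can only be raised up to the hard limit without
    // extra privileges; the caller falls back to coarse grain locking
    // if the result is still below the threshold.
    rlim.rlim_cur = needed.min(rlim.rlim_max);
    // SAFETY: rlim is fully initialized and RLIMIT_NOFILE is valid.
    if unsafe { libc::setrlimit(libc::RLIMIT_NOFILE, &rlim) } != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(rlim.rlim_cur)
}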