Implement fine grain locking for the target directory
#16155
base: master
Conversation
/// Coarse grain locking (Profile level)
Coarse,
Falling back to coarse grain locking if some heuristic is not met:
- This prevents us from reducing the layers in the new build-dir layout. Not a requirement, but I was sure hoping for that.
- What does this mean for Per-user compiled artifact cache (#5931), as grabbing a coarse lock for the entire shared cache may be a no-go?
This prevents us from reducing the layers in the new build dir layout. Not a requirement but sure was hoping for that
I think the benefits of having coarse grain locking as a fallback are worth the trade of the extra nesting, at least in the short term. If there is some system configuration that can't do fine grain locking for some reason, they would still at least be able to build their project. It also makes our lives easier for commands other than build/check (thinking about cargo clean, which does not construct a BuildRunner that has a lot of the information we use for fine grain locking).
What does this mean for #5931 as grabbing a coarse lock for the entire shared cache may be a no-go?
For the artifact cache, the locking mode does not need to match the locking mode of build-dir.
I agree that coarse grain locking would certainly not be ideal for the artifact cache.
Given that the artifact cache will be green field, I am hopeful that we can have a fairly simple flat structure with fine grain locking (with a single lock per build unit since pipelined builds will not be a problem for the cache)
For the artifact cache, the locking mode does not need to match the locking mode of build-dir.
Part of the purpose of this is to explore what we'll need for that.
Given that the artifact cache will be green field, I am hopeful that we can have a fairly simple flat structure with fine grain locking (with a single lock per build unit since pipelined builds will not be a problem for the cache)
How will it not have a problem with pipelined builds?
How will it not have a problem with pipelined builds?
How I am imagining it in my head is
- lock build-dir unit
- compile rmeta
- downgrade build-dir unit lock (unblocking other build units)
- compile rlib
- acquire artifact-cache unit lock
- uplift completed unit
- release both artifact cache and build-dir locks
Doesn't that apply equally to this?
I think artifact-cache and build-dir are different in that build-dir will have the "rmeta exists but not rlib" state.
artifact-cache will likely have an immutable atomically created (copied/linked from build-dir) unit.
When we have a build that needs to pull from the cache it would:
- take a write lock on the build-dir unit
- a shared lock on the artifact-cache unit
- uplift from artifact-cache into build-dir
- unlock artifact-cache
- unlock build-dir
- proceed with further build units
since the artifact cache will only handle immutable units, pipelined builds should not be a problem, right?
/// Coarse grain locking (Profile level)
Coarse,
mtime changes in fingerprints due to uplifting/downlifting order
This will also be a concern for #5931
For non-local packages, we don't check mtimes. Unsure what we do for their build script runs.
I haven't quite thought through how we will handle mtime for the artifact cache.
Checksum freshness (#14136) would sure make this easier as mtimes are painful to deal with.
Might be worth exploring pushing that forward before starting on the artifact cache...
Note that mtimes are still used for build scripts
/// Coarse grain locking (Profile level)
Coarse,
tests/benches need to be run before being uplifted, OR uplifted and locked during execution, which requires more locking design. (Also, running pre-uplift introduces other potential side effects, like the path displayed to the user being deleted since it's temporary.)
Do we hold any locks during test execution today? I'm not aware of any.
hmmm, I was under the assumption that the lock in Layout was held while running the tests, but I never explicitly tested that. Upon further inspection, we do indeed release the lock before executing the tests.
/// The lockfile for a build (`.cargo-lock`). Will be unlocked when this
/// struct is `drop`ped.
- _lock: FileLock,
+ _lock: Option<FileLock>,
Is this a change that would be worth moving out into a refactor that gets merged independent of the rest of this?
Note: this is referring to the work dealing with is_on_nfs_mount
pulled this out into #16177
How do we want to handle locking on the artifact directory?
Isn't resolving this a pre-requisite, since without it we are still blocking and can't see this in action? I had thought I had already brought up doing this first but can't seem to find it.
I feel like the first step is "only locking what's used". That would ideally be small and something we can insta-stabilize independent of this. We could explore any of the other ideas in the future as we find we need them. The results of this approach are analyzed in #4282 (comment). In the short term, it lets users think in terms of splitting their build-dir for r-a rather than splitting their target-dir.
Isn't resolving this a pre-requisite, since without it we are still blocking and can't see this in action? I had thought I had already brought up doing this first but can't seem to find it.
Yes. Currently we aren't getting any real benefits until this is figured out since coarse grain locking on artifact-dir will block 2 cargo instances from running in parallel.
only locking what's used
Agreed. I'd be happy to completely remove locking on the artifact-dir and indirectly rely on the build-dir locking to mitigate the risk of 2 cargo builds uplifting at the same time. Though others may have different opinions on this.
There are approaches to locking the artifact-dir I mentioned in the PR description but will require some more effort.
I wanted to post this PR to get some initial feedback to make sure the overall direction was good before investing more time with artifact-dir locking.
Agreed. I'd be happy to completely remove locking on the artifact-dir and indirectly rely on the build-dir locking to mitigate the risk of 2 cargo builds uplifting at the same time. Though others may have different opinions on this.
The artifact-dir can be completely independent of the build-dir and so we can't indirectly rely on it. We still need to lock the artifact-dir, just only when needed. I feel like that should be fairly uncontroversial of a change.
The artifact-dir can be completely independent of the build-dir and so we can't indirectly rely on it. We still need to lock the artifact-dir, just only when needed. I feel like that should be fairly uncontroversial of a change.
Sure, I think we can keep coarse locking on the artifact-dir and lock it as needed.
I just think the code changes to do this might be more than expected since uplifting happens in many places. (correct me if I am wrong)
Would you prefer to include that in this PR or split out into a follow up?
I just think the code changes to do this might be more than expected since uplifting happens in many places. (correct me if I am wrong)
The problem is less where we do uplifting but more do we know if uplifting will happen (at least for the solution I've been highlighting as a minimal step that can go in now). I can see the number of places we uplift possibly complicating that but I'm hoping we have a good rough feel for this. check doesn't need a lock while likely everything else does.
Would you prefer to include that in this PR or split out into a follow up?
I was actually thinking we should do it first:
Isn't resolving this a pre-requisite since without it, we are still blocking and can't see this in action? I had thought I had already brought up doing this first but can't seem to find it.
I feel like the first step is "only locking what's used". That would ideally be small and something we can insta-stabilize independent of this.
ahh okay, I see where you are coming from. I suppose that is what you were hinting at in this comment.
Seems like a reasonable strategy. I could open a separate PR to focus on that first, then come back to this one.
ahh okay, I see where you are coming from. I suppose that is what you were hinting at in #16155 (comment).
I added a clarifying note that it is the is_on_nfs_mount work
src/cargo/core/compiler/locking.rs
Outdated
//! The locking scheme is based on build unit level locking.
//! Each build unit consists of a primary and secondary lock used to represent multiple lock states.
//!
//! | State | Primary | Secondary |
Can we give semantic names to the locks, rather than using Primary and Secondary?
I was going back and forth on what to call these.
I went with generic names for now because I was afraid that picking semantic names may lead to the names becoming out of date over time as cargo/rustc evolve and semantics change.
But I don't have a super strong opinion on this if we can think of names that better communicate the meaning of the lock.
I gave some thought to this and I came up with a few potential lock names.
- primary.lock -> partial.lock and secondary.lock -> full.lock
  - Given that the first, partial.lock is enough to take a shared lock and read, but to write or guarantee no one else is writing you have to take the full.lock.
  - This also aligns with the semantics of the SharedLockType enum.
- primary.lock -> read.lock and secondary.lock -> write.lock
  - You can read if you have any kind of lock on read.lock, but you must have a write lock on write.lock to write.
  - This gets a bit weird because to take a "shared full" you take a read lock on write.lock 😅
open to more suggestions if you have any
Some thoughts
- partial, interface, or rmeta (being specific means we can lock if we ever break this up into smaller pieces)
- full, all, or complete
No strong preferences. I feel like read and write are already conveyed by the type of lock we acquire, so it doesn't make sense to use those as names
I think partial and full seem good to me
src/cargo/core/compiler/locking.rs
Outdated
//! The locking scheme is based on build unit level locking.
//! Each build unit consists of a primary and secondary lock used to represent multiple lock states.
//!
//! | State | Primary | Secondary |
Would also be good to consistently use the lockfile name, rather than switching between primary.lock and Primary
/// A shared lock that might still be compiling a .rlib
Partial,
/// A shared lock that is guaranteed to not be compiling
Full,
These seem to correspond to the locks, should we tie the names together?
src/cargo/core/compiler/locking.rs
Outdated
let primary_lock = open_file(&self.primary)?;
primary_lock.lock()?;

let secondary_lock = open_file(&self.secondary)?;
secondary_lock.lock()?;

self.guard = Some(UnitLockGuard {
    primary: primary_lock,
    _secondary: Some(secondary_lock),
});
Ok(())
Have we double checked if we run into problems like #15698?
pub fn downgrade(&mut self) -> CargoResult<()> {
    let guard = self
        .guard
        .as_ref()
        .context("guard was None while calling downgrade")?;

    // NOTE:
    // > Subsequent flock() calls on an already locked file will convert an existing lock to the new lock mode.
    // https://man7.org/linux/man-pages/man2/flock.2.html
    //
    // However, the `std::file::File::lock/lock_shared` is allowed to change this in the
    // future. So its probably up to us if we are okay with using this or if we want to use a
    // different interface to flock.
    guard.primary.lock_shared()?;

    Ok(())
}
We should rely on advertised behavior, especially as I'm assuming not all platforms are backed by flock, like windows
We could probably move away from the std interface, or perhaps open an issue to see if t-libs would be willing to clarify the behavior of calling lock_shared() while holding an exclusive lock.
side note: I came across this crate which advertises the behavior we need. It's fairly small (and MIT) so we could potentially reuse part of this code for cargo's use case. (or use it directly, though I don't know cargo's policy on taking dependencies on third party crates)
Any concerns with keeping the std lock for now and having an action item for this prior to stabilization?
We can move this to an Unresolved issue.
(force-pushed from b234db5 to 1deca86)
src/cargo/core/features.rs
Outdated
fn fail_if_incompatible_opts(&self) -> CargoResult<()> {
    if self.fine_grain_locking && !self.build_dir_new_layout {
        bail!("-Z fine-grain-locking requires -Z build-dir-new-layout");
    }

    Ok(())
}
Or we could have fine-grain-locking set build-dir-new-layout when we set it. Unsure which way to go
I considered that. I went with the more conservative option of requiring both flags.
Though practically speaking, I think it would make users' lives easier, since they probably wouldn't be using this unstable feature without being aware of the build-dir layout change requirement.
I'd be okay with fine-grain-locking implicitly enabling build-dir-new-layout
}

let display_fd_warning = || -> CargoResult<()> {
    bcx.gctx.shell().verbose(|shell| shell.warn("ulimit was to low to safely enable fine grain locking. falling back to coarse grain locking"))
bcx.gctx.shell().verbose(|shell| shell.warn("ulimit was to low to safely enable fine grain locking, falling back to coarse grain locking"))
This commit adds handling to attempt to increase the max file descriptors if we detect that there is a risk of hitting the limit. If we cannot increase the max file descriptors, we fall back to coarse grain locking as that is better than a build crashing due to resource limits.
(force-pushed from 1deca86 to eed6ccd)
This PR adds fine grain locking for the build cache using build unit level locking.
I'd recommend reading the design details in this description and then reviewing commit by commit.
Part of #4282
Previous attempt: #16089
Design decisions / rationale
Implementation details
Implementation details
- Each build unit has a primary.lock and secondary.lock (see the locking.rs module docs for more details on the states)
- The locking mode is selected at runtime (see determine_locking_mode())
- The .cargo-lock lock is taken as RO shared to continue working with older cargo versions while allowing multiple newer cargo instances to run in parallel.
- cargo clean continues to use coarse grain locking for simplicity.
- UnitGraph
- If the file descriptor limit is below the heuristic threshold (number of build units * 10), we automatically fall back to coarse grain locking and display a warning to the user.
- Without RcFileLock I was seeing peaks of ~12,000 open fds, which I felt was quite high even for a large project like Zed.
- Added a FileLockInterner that holds on to the file descriptors (RcFileLock) until the end of the process. (We could potentially add it to JobState if preferred, it would just be a bit more plumbing.)
Open Questions
- cargo doc / cargo check would not take a lock on the artifact-dir, but cargo build would. This would mean that 2 cargo build invocations would not run in parallel because one of them would hold the artifact-dir lock (blocking the other). This might actually be ideal, to avoid 2 instances fighting over the CPU while recompiling the same crates.
- -Zbuild-dir-new-layout: with the current implementation I am not seeing any perf regression on linux, but I have yet to test on windows/macos.
- CARGO_BUILD_LOCKING_MODE=coarse?
- target directory #16155 (comment)