-
Notifications
You must be signed in to change notification settings - Fork 712
WIP: concurrent store updates #4400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This will develop into a fix for #3741. |
@dcoutts do you want reviews now or later? |
@ezyang now would be perfect. It's just tests to finish now. Though I'll also do a bit of combining/splitting patches before merging. |
@ezyang oh actually the appveyor test failure points out a missing piece. I need to register with ghc-pkg with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any use of the new locking code, so I guess this isn't done yet.
{-# LANGUAGE MultiWayIf #-} | ||
{-# LANGUAGE DeriveDataTypeable #-} | ||
|
||
module Distribution.Client.Compat.FileLock ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love it if, for all future Compat modules we add, we also put an 'expiration date' on them; e.g., when they can be safely dropped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really good idea!
#else | ||
|
||
-- The remainder of this file is a modified copy | ||
-- of GHC.IO.Handle.Lock from ghc-8.2.x |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any substantive modifications, or just dropping some functions that weren't used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly adjusting the imports since we can depend on Prelude and other non GHC.* modules. But also removing the reliance on ./configure scripts to detect HAVE_FLOCK, and hard-coding that it's solaris2 that does not support flock.
@@ -107,14 +111,20 @@ data DistDirLayout = DistDirLayout { | |||
} | |||
|
|||
|
|||
data StoreDirLayout = StoreDirLayout { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about storing all dynamic libraries in a single directory (as is needed on Mac OS X, because of RPATH length limit?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That idea is not really compatible with the notion of a nix style store. A nix style store really means putting all files for a package in their own dirs, and not having shared stuff. I realise we sort-of break that with the shared package db, but one can see it as a thing derived from the store.
Also, it's not clear that for the OSX limits that we must use a scheme with a single shared dir for all libs. It may well be possible to use a scheme with a single RPATH for the store and incorporate the store subdir names into the library install names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB this comment is in reference to #4400 (comment)
Whatever we end up doing, we should have a source code note or ticket mentioning this so we can solve this eventually.
The strategy suggested here is similar to what chak described at https://ghc.haskell.org/trac/ghc/ticket/11587#comment:4 (which you have commented in.) Probably the biggest obstacle is this: according to rwbarton, it's not possible to store subdir names in the NEEDED entries if the SONAME is present but does not encode the subdir. darchon points out some other problems at https://ghc.haskell.org/trac/ghc/ticket/12479#comment:24 and it seems that in the ensuing discussion, it was agreed that it is better to go a different route. Also CC'ing @christiaanb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The strategy suggested here is similar to what chak described at https://ghc.haskell.org/trac/ghc/ticket/11587#comment:4 (which you have commented in.)
Yes, this is indeed the strategy I was thinking of for OSX.
It's true that for ELF we cannot include the dir names. I wonder what nix does. At least on ELF we don't have stupid limitations, the downside of a poor scheme is just that it slows down the loading process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, apparently having really long rpaths can slow start times down a bit, so it would be nice if we could get the best of both worlds. :/
The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This basically looks good. There are some things that I would quite like done because I suspect they will help us debug problems in the future.
@@ -107,14 +111,20 @@ data DistDirLayout = DistDirLayout { | |||
} | |||
|
|||
|
|||
data StoreDirLayout = StoreDirLayout { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB this comment is in reference to #4400 (comment)
Whatever we end up doing, we should have a source code note or ticket mentioning this so we can solve this eventually.
The strategy suggested here is similar to what chak described at https://ghc.haskell.org/trac/ghc/ticket/11587#comment:4 (which you have commented in.) Probably the biggest obstacle is this: according to rwbarton, it's not possible to store subdir names in the NEEDED entries if the SONAME is present but does not encode the subdir. darchon points out some other problems at https://ghc.haskell.org/trac/ghc/ticket/12479#comment:24 and it seems that in the ensuing discussion, it was agreed that it is better to go a different route. Also CC'ing @christiaanb
-> IO a -> IO a | ||
withIncomingUnitIdLock StoreDirLayout{storeIncomingLock} compid unitid action = do | ||
bracket | ||
(do h <- openFile lockFile ReadWriteMode; hLock h ExclusiveLock; return h) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Occasionally on Windows, bad process control can result in zombie cabal.exe processes living way longer than you expect them to (this generally doesn't happen to me on Linux.) If this occurs, and the process had taken out a file lock, then all other cabal processes will block perpetually trying to take the lock. This sounds very confusing to debug.
The easiest way I can think to make it easier to debug in this situation is to add logging, emitting when we are taking out a lock, and on which file. This way if you rerun cabal in verbose mode you can see that it hangs immediately after trying to take the lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. And/or we can warn if taking the lock blocks for too long. We only expect to hold the lock while registering.
then do | ||
removeDirectoryRecursive incomingEntryDir | ||
|
||
debug verbosity $ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this debug verbosity? I think this should be output by default: it's a pretty unusual thing to occur!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a failure or anything to act on. I don't think it justifies a warning. Is it informative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The primary case I'm thinking of if a user runs into a strange concurrency bug while operating at non-debug mode; if they paste the log and we see this message, that gives an important clue when debugging.
-- If the entry does not exist then we won the race and can proceed. | ||
else do | ||
|
||
-- Register the package into the package db (if appropriate). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if there was a reference to the $concurrency comment above here. I wrote a comment asking for an explanation why it's done this way and then realized you had written the comment for it far away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could move the $concurrency comment to be immediately before the impl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, the current location is fine; I just want a -- See $concurrency
:)
-- If the entry exists then we lost the race and we must abandon, | ||
-- unlock and re-use the existing store entry. | ||
then do | ||
removeDirectoryRecursive incomingEntryDir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't necesary, right, because withTempIncomingDir will remove the directory when we exit the bracket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only if we throw an exception. Or we could use normal bracket rather than bracketOnError and make sure that the cleanup does not fail if the dir no longer exists anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds safer!
finalEntryDir = storePackageDirectory compid unitid | ||
|
||
|
||
withTempIncomingDir :: StoreDirLayout -> CompilerId |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc should mention that the temporary allocated directory can be moved away without causing an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That wasn't the way I wrote it, but we could make it that way, as you suggested above.
Test failures look legit. |
Remove the variants reregister and registerMultiInstance and generalise the main register variant to cover them all. Introduce a RegisterOptions record for the variations. Eliminate an unused form where we supply a file rather than an InstalledPackageInfo value. The motivation is so we can more easily add yet more variations shortly. This is an API change.
Turns out we need this for the approach we want to take for concurrent store updates.
This code is backported from base from ghc-8.2. It will be used as part of the concurrent store updates. It may also be used in future to work around the lack of file locking in ghc-pkg prior to version 8.2.
So we have a spec of the layout of the store, independent of the instance of the store within ~/.cabal/store. This will make new store handling code cleaner by not entangling it with other global ~/.cabal things. We may also in future want to allow global and per-user stores, and we certainly want the ability to specify a store location outside of ~/.cabal (though this could be done with the existing code too).
If the dir does not exist then the result is [] and it monitors the dir for non-existance. Previously it would just fail in this case. The new behaviour is more useful.
30b34d7
to
7aa01d5
Compare
@ezyang hopefully this will fix the build failures. It's also restructured into cleaner patches. It should address the other issues you raised, except for doing anything about dyn lib layouts which I consider orthogonal. |
Failure looks legit. (Side note: we should figure out how to make cabal-testsuite run the tests automatically when you run cabal new-tests.) |
OK, well, merge it when the tests pass :) |
Ok, this now includes unit tests. I cannot currently see how to sanely automate the parallel test, since due to H98 file locking this cannot be done conveniently from within a single process. In the absence of that I've tested it manually in two ghci sessions, with the following test code:
and with output like this:
and
The two interactive sessions are run so that while one process is blocking in the register action the other one also tries to enter the register step, and so has to wait on the file lock. So I think this demonstrates at least in the straightforward concurrent case that it behaves as expected. |
Test failure indicates I need to think again about when the ghc package db is created. It cannot be as late as I moved it (registration) since the read (ghc-pkg dump) fails when the db has not been initialised yet. |
Ok, there's another problem with location of files when installing new entries. Will have to rethink things slightly, especially to keep the path lengths within limits. |
Adjust things so that instead of creating the store package db during planning, we instead create them at the beginning of the building phase along with the other package dbs we create at that point. That is, create it in a write operation rather than in what is morally a read operation. Originally we had to do it in the planning phase because we used to read the compiler package db to see what was installed, whereas now we just use the store dir listing. In fact we were already creating some package dbs during the building phase, just not all of them, so this cleans things up in that respect.
A new module with utilities for managing the store. This includes a new approach for store updates that allows for concurrent updates to the store from different processes. This relies on the new file locking code. We log a message if we wait for a store file lock. This should be a rare occurrence in practice, but would help debugging if some other zombie process was holding the file lock.
Sadly we cannot do proper parallel tests within a single process becuase our fancy inter-process file locking code is thrwarted by the stricter normal intra-process file locking.
This should mean that concurrent updates to the store are now safe. In practice it means working on separate projects at the same time is safe, not concurrent builds within the same project.
1dc750f
to
b75a162
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged it. Some other thoughts.
, (pkgdb:_) <- map reverse [ elabBuildPackageDBStack elab, | ||
elabRegisterPackageDBStack elab, | ||
elabSetupPackageDBStack elab ] | ||
, pkgdb <- concat [ elabBuildPackageDBStack elab |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has nothing to do with the current patch, but I wonder, should we really be in the business of creating the global package database (which I imagine is recorded in the database stack?) NB: this list comprension is being used to compute packageDBsToUse, the list of package dbs we may need to create.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
createPackageDBIfMissing
is a no-op for the global and user dbs.
createDirectoryMonitored True (storeDirectory compid) | ||
liftIO $ createPackageDBIfMissing verbosity | ||
compiler progdb | ||
(storePackageDB compid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good.
We need to be able to safely do concurrent builds of different projects that share the store. This code documents and implements a concurrency strategy for reading and writing entries in the store.
This patch includes new compat code for file locks, taken from the base library from ghc-8.2.x.
It is not yet used or indeed tested. This PR is here for feedback on the concurrency strategy (which is clearly documents in the haddock docs in the module) and to check that the locking code does indeed build on windows.