Skip to content

WIP: concurrent store updates #4400

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Apr 2, 2017
Merged

Conversation

dcoutts
Copy link
Contributor

@dcoutts dcoutts commented Mar 15, 2017

We need to be able to safely do concurrent builds of different projects that share the store. This code documents and implements a concurrency strategy for reading and writing entries in the store.

This patch includes new compat code for file locks, taken from the base library from ghc-8.2.x.

It is not yet used or indeed tested. This PR is here for feedback on the concurrency strategy (which is clearly documents in the haddock docs in the module) and to check that the locking code does indeed build on windows.

@mention-bot
Copy link

@dcoutts, thanks for your PR! By analyzing the history of the files in this pull request, we identified @ezyang, @23Skidoo and @grayjay to be potential reviewers.

@dcoutts
Copy link
Contributor Author

dcoutts commented Mar 15, 2017

This will develop into a fix for #3741.

@ezyang
Copy link
Contributor

ezyang commented Mar 16, 2017

@dcoutts do you want reviews now or later?

@dcoutts
Copy link
Contributor Author

dcoutts commented Mar 16, 2017

do you want reviews now or later?

@ezyang now would be perfect. It's just tests to finish now. Though I'll also do a bit of combining/splitting patches before merging.

@dcoutts
Copy link
Contributor Author

dcoutts commented Mar 16, 2017

@ezyang oh actually the appveyor test failure points out a missing piece. I need to register with ghc-pkg with the --force-files (or equivalent) flag because we have to update the package db before moving the files into their final place.

Copy link
Contributor

@ezyang ezyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any use of the new locking code, so I guess this isn't done yet.

{-# LANGUAGE MultiWayIf #-}
{-# LANGUAGE DeriveDataTypeable #-}

module Distribution.Client.Compat.FileLock (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love it if, for all future Compat modules we add, we also put an 'expiration date' on them; e.g., when they can be safely dropped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good idea!

#else

-- The remainder of this file is a modified copy
-- of GHC.IO.Handle.Lock from ghc-8.2.x
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any substantive modifications, or just dropping some functions that weren't used?

Copy link
Contributor Author

@dcoutts dcoutts Mar 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly adjusting the imports since we can depend on Prelude and other non GHC.* modules. But also removing the reliance on ./configure scripts to detect HAVE_FLOCK, and hard-coding that it's solaris2 that does not support flock.

@@ -107,14 +111,20 @@ data DistDirLayout = DistDirLayout {
}


data StoreDirLayout = StoreDirLayout {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about storing all dynamic libraries in a single directory (as is needed on Mac OS X, because of RPATH length limit?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That idea is not really compatible with the notion of a nix style store. A nix style store really means putting all files for a package in their own dirs, and not having shared stuff. I realise we sort-of break that with the shared package db, but one can see it as a thing derived from the store.

Also, it's not clear that for the OSX limits that we must use a scheme with a single shared dir for all libs. It may well be possible to use a scheme with a single RPATH for the store and incorporate the store subdir names into the library install names.

Copy link
Contributor

@ezyang ezyang Mar 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB this comment is in reference to #4400 (comment)

Whatever we end up doing, we should have a source code note or ticket mentioning this so we can solve this eventually.

The strategy suggested here is similar to what chak described at https://ghc.haskell.org/trac/ghc/ticket/11587#comment:4 (which you have commented in.) Probably the biggest obstacle is this: according to rwbarton, it's not possible to store subdir names in the NEEDED entries if the SONAME is present but does not encode the subdir. darchon points out some other problems at https://ghc.haskell.org/trac/ghc/ticket/12479#comment:24 and it seems that in the ensuing discussion, it was agreed that it is better to go a different route. Also CC'ing @christiaanb

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The strategy suggested here is similar to what chak described at https://ghc.haskell.org/trac/ghc/ticket/11587#comment:4 (which you have commented in.)

Yes, this is indeed the strategy I was thinking of for OSX.

It's true that for ELF we cannot include the dir names. I wonder what nix does. At least on ELF we don't have stupid limitations, the downside of a poor scheme is just that it slows down the loading process.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, apparently having really long rpaths can slow start times down a bit, so it would be nice if we could get the best of both worlds. :/

@dcoutts
Copy link
Contributor Author

dcoutts commented Mar 16, 2017

I don't see any use of the new locking code, so I guess this isn't done yet.

The hLock function is used in withIncomingUnitIdLock.

Copy link
Contributor

@ezyang ezyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This basically looks good. There are some things that I would quite like done because I suspect they will help us debug problems in the future.

@@ -107,14 +111,20 @@ data DistDirLayout = DistDirLayout {
}


data StoreDirLayout = StoreDirLayout {
Copy link
Contributor

@ezyang ezyang Mar 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB this comment is in reference to #4400 (comment)

Whatever we end up doing, we should have a source code note or ticket mentioning this so we can solve this eventually.

The strategy suggested here is similar to what chak described at https://ghc.haskell.org/trac/ghc/ticket/11587#comment:4 (which you have commented in.) Probably the biggest obstacle is this: according to rwbarton, it's not possible to store subdir names in the NEEDED entries if the SONAME is present but does not encode the subdir. darchon points out some other problems at https://ghc.haskell.org/trac/ghc/ticket/12479#comment:24 and it seems that in the ensuing discussion, it was agreed that it is better to go a different route. Also CC'ing @christiaanb

-> IO a -> IO a
withIncomingUnitIdLock StoreDirLayout{storeIncomingLock} compid unitid action = do
bracket
(do h <- openFile lockFile ReadWriteMode; hLock h ExclusiveLock; return h)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Occasionally on Windows, bad process control can result in zombie cabal.exe processes living way longer than you expect them to (this generally doesn't happen to me on Linux.) If this occurs, and the process had taken out a file lock, then all other cabal processes will block perpetually trying to take the lock. This sounds very confusing to debug.

The easiest way I can think to make it easier to debug in this situation is to add logging, emitting when we are taking out a lock, and on which file. This way if you rerun cabal in verbose mode you can see that it hangs immediately after trying to take the lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. And/or we can warn if taking the lock blocks for too long. We only expect to hold the lock while registering.

then do
removeDirectoryRecursive incomingEntryDir

debug verbosity $
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this debug verbosity? I think this should be output by default: it's a pretty unusual thing to occur!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a failure or anything to act on. I don't think it justifies a warning. Is it informative?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary case I'm thinking of if a user runs into a strange concurrency bug while operating at non-debug mode; if they paste the log and we see this message, that gives an important clue when debugging.

-- If the entry does not exist then we won the race and can proceed.
else do

-- Register the package into the package db (if appropriate).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if there was a reference to the $concurrency comment above here. I wrote a comment asking for an explanation why it's done this way and then realized you had written the comment for it far away.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could move the $concurrency comment to be immediately before the impl.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, the current location is fine; I just want a -- See $concurrency :)

-- If the entry exists then we lost the race and we must abandon,
-- unlock and re-use the existing store entry.
then do
removeDirectoryRecursive incomingEntryDir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necesary, right, because withTempIncomingDir will remove the directory when we exit the bracket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if we throw an exception. Or we could use normal bracket rather than bracketOnError and make sure that the cleanup does not fail if the dir no longer exists anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds safer!

finalEntryDir = storePackageDirectory compid unitid


withTempIncomingDir :: StoreDirLayout -> CompilerId
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc should mention that the temporary allocated directory can be moved away without causing an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That wasn't the way I wrote it, but we could make it that way, as you suggested above.

@ezyang
Copy link
Contributor

ezyang commented Mar 17, 2017

Test failures look legit.

dcoutts added 6 commits March 22, 2017 02:25
Remove the variants reregister and registerMultiInstance and generalise
the main register variant to cover them all. Introduce a RegisterOptions
record for the variations.

Eliminate an unused form where we supply a file rather than an
InstalledPackageInfo value.

The motivation is so we can more easily add yet more variations shortly.

This is an API change.
Turns out we need this for the approach we want to take for
concurrent store updates.
This code is backported from base from ghc-8.2.

It will be used as part of the concurrent store updates. It may also be
used in future to work around the lack of file locking in ghc-pkg prior
to version 8.2.
So we have a spec of the layout of the store, independent of the
instance of the store within ~/.cabal/store. This will make new store
handling code cleaner by not entangling it with other global ~/.cabal
things.

We may also in future want to allow global and per-user stores, and we
certainly want the ability to specify a store location outside of
~/.cabal (though this could be done with the existing code too).
If the dir does not exist then the result is [] and it monitors the dir
for non-existance. Previously it would just fail in this case. The new
behaviour is more useful.
@dcoutts
Copy link
Contributor Author

dcoutts commented Mar 26, 2017

@ezyang hopefully this will fix the build failures. It's also restructured into cleaner patches. It should address the other issues you raised, except for doing anything about dyn lib layouts which I consider orthogonal.

@ezyang
Copy link
Contributor

ezyang commented Mar 27, 2017

Failure looks legit.

(Side note: we should figure out how to make cabal-testsuite run the tests automatically when you run cabal new-tests.)

@ezyang
Copy link
Contributor

ezyang commented Mar 27, 2017

OK, well, merge it when the tests pass :)

@dcoutts
Copy link
Contributor Author

dcoutts commented Mar 27, 2017

Ok, this now includes unit tests. I cannot currently see how to sanely automate the parallel test, since due to H98 file locking this cannot be done conveniently from within a single process.

In the absence of that I've tested it manually in two ghci sessions, with the following test code:

testInstallParallel n = do
    let storeDirLayout = defaultStoreDirLayout ("." </> "store")

    let copyFiles dir = do
          writeFile (dir </> "file") (show n)
          print ("copyFiles", n, dir)
          () <- readLn
          return ()

        register = do
          print ("register", n)
          () <- readLn
          return ()
    o <- newStoreEntry verbosity storeDirLayout
                       compid unitid
                       copyFiles register
    print o
  where
    compid = CompilerId GHC (mkVersion [1,0])
    unitid = mkUnitId "foo-1.0-xyz"

and with output like this:

> testInstallParallel 1
("copyFiles",1,"./store/ghc-1.0/incoming/new-5336")
()
("register",1)
()
UseNewStoreEntry

and

> testInstallParallel 2
("copyFiles",2,"./store/ghc-1.0/incoming/new-5505")
()
Waiting for file lock on store entry ghc-1.0/foo-1.0-xyz
Concurrent build race: abandoning build in favour of existing store entry
ghc-1.0/foo-1.0-xyz
UseExistingStoreEntry

The two interactive sessions are run so that while one process is blocking in the register action the other one also tries to enter the register step, and so has to wait on the file lock. So I think this demonstrates at least in the straightforward concurrent case that it behaves as expected.

@dcoutts
Copy link
Contributor Author

dcoutts commented Mar 27, 2017

Test failure indicates I need to think again about when the ghc package db is created. It cannot be as late as I moved it (registration) since the read (ghc-pkg dump) fails when the db has not been initialised yet.

@dcoutts
Copy link
Contributor Author

dcoutts commented Mar 28, 2017

Ok, there's another problem with location of files when installing new entries. Will have to rethink things slightly, especially to keep the path lengths within limits.

dcoutts added 4 commits April 1, 2017 19:23
Adjust things so that instead of creating the store package db during
planning, we instead create them at the beginning of the building phase
along with the other package dbs we create at that point. That is,
create it in a write operation rather than in what is morally a read
operation.

Originally we had to do it in the planning phase because we used to read
the compiler package db to see what was installed, whereas now we just
use the store dir listing.

In fact we were already creating some package dbs during the building
phase, just not all of them, so this cleans things up in that respect.
A new module with utilities for managing the store. This includes a new
approach for store updates that allows for concurrent updates to the
store from different processes.

This relies on the new file locking code. We log a message if we wait
for a store file lock. This should be a rare occurrence in practice,
but would help debugging if some other zombie process was holding the
file lock.
Sadly we cannot do proper parallel tests within a single process
becuase our fancy inter-process file locking code is thrwarted by the
stricter normal intra-process file locking.
This should mean that concurrent updates to the store are now safe.
In practice it means working on separate projects at the same time is
safe, not concurrent builds within the same project.
@dcoutts dcoutts force-pushed the store-concurrent branch from 1dc750f to b75a162 Compare April 2, 2017 13:08
@dcoutts
Copy link
Contributor Author

dcoutts commented Apr 2, 2017

@ezyang @23Skidoo it's green finally. Ok to merge? Or any last review comments?

@ezyang ezyang merged commit 3b8ee83 into haskell:master Apr 2, 2017
Copy link
Contributor

@ezyang ezyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged it. Some other thoughts.

, (pkgdb:_) <- map reverse [ elabBuildPackageDBStack elab,
elabRegisterPackageDBStack elab,
elabSetupPackageDBStack elab ]
, pkgdb <- concat [ elabBuildPackageDBStack elab
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has nothing to do with the current patch, but I wonder, should we really be in the business of creating the global package database (which I imagine is recorded in the database stack?) NB: this list comprension is being used to compute packageDBsToUse, the list of package dbs we may need to create.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createPackageDBIfMissing is a no-op for the global and user dbs.

createDirectoryMonitored True (storeDirectory compid)
liftIO $ createPackageDBIfMissing verbosity
compiler progdb
(storePackageDB compid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good.

@hvr hvr mentioned this pull request Aug 2, 2017
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants