Prevent registered names from leaking #322
Conversation
I see a very suspicious chunk in the code. @hyperthunk, can you please check whether that code is correct?
If it is not, can we extend our test cases so that they trigger the error before fixing it?
Hold tight... Fix to the broken test(s) coming...
The latter seems to be more likely to trigger race conditions, unsurprisingly.
Right... I'm happy that this functionality is implemented properly. I keep seeing intermittent test failures on Travis and having to fiddle around to make them less racy, but not with the tests in this patch! Although, to be fair, the races span across the tests. Probably what I need to do is convert the test suite to create a new local node for each test case, at least in the Mx case, so that we don't get agents interfering with one another with regard to registrations and whatnot (see the sketch below). A major issue in fixing the tests for this PR has been that we have no idea when we will see events related to agents stopping, their names getting unregistered, and so on. I might fix that now (in this PR), or file a separate bug and do a separate PR. Let me sleep on it and decide tomorrow.
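For reference, a minimal sketch of the per-test-case node setup described above. withTestNode is a hypothetical helper name, and it assumes the suite's TestTransport record (with its testTransport field) is in scope, as it is in CH.hs; the RecordWildCards extension is already used by the existing tests.

```haskell
-- A minimal sketch (not part of this patch) of running each test case on a
-- fresh LocalNode, so mx agents and name registrations cannot leak between
-- cases. withTestNode is a hypothetical helper; TestTransport is assumed in
-- scope, as in CH.hs.
import Control.Exception (bracket)
import Control.Distributed.Process.Node
  (LocalNode, closeLocalNode, initRemoteTable, newLocalNode)

withTestNode :: TestTransport -> (LocalNode -> IO ()) -> IO ()
withTestNode TestTransport{..} action =
  bracket (newLocalNode testTransport initRemoteTable)  -- fresh node per test
          closeLocalNode                                 -- always torn down
          action
```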
Hoo-bloomin-rah! Looks like we've finally got some stability in the tests... @qnikst, whenever you've got time, can you take a look at the updated patch set please? Also, we should probably squash these commits down, as the PR is a right mess of interventions now...
Some nitpicks, but great work!
* master:
  and do the same for 7.10.3
  Make CI work again (temporary workaround for n-t-tcp API changes)
  Refactor tests to handle lack of MonadFail on our stack when supporting the latest GHC (tested with 8.6.1)
  adjust to deal with the latest network-transport-tcp API changes and refactor benchmarks and test runner accordingly
  Give more informative messages about invariant violations
  fiddle around with the lack of a MonadFail in the stack... make it possible to verify stack build --stack-yaml stack-nightly.yaml --resolver nightly --haddock --test --bench --no-run-benchmarks -v
…ocess
This does seem to indicate that either the tests are not really validating anything useful, or the issue was a non-issue to begin with. This patch nevertheless contains useful improvements to our tracing of registration events, so I think it's still valid.
diff --git a/distributed-process-tests/src/Control/Distributed/Process/Tests/CH.hs b/distributed-process-tests/src/Control/Distributed/Process/Tests/CH.hs
index 91e1323..7430562 100644
--- a/distributed-process-tests/src/Control/Distributed/Process/Tests/CH.hs
+++ b/distributed-process-tests/src/Control/Distributed/Process/Tests/CH.hs
@@ -1476,6 +1476,37 @@ testExitRemote TestTransport{..} = do
takeMVar supervisedDone
takeMVar supervisorDone
+testRegisterRemote :: TestTransport -> Assertion
+testRegisterRemote TestTransport{..} = do
+ node1 <- newLocalNode testTransport initRemoteTable
+ node2 <- newLocalNode testTransport initRemoteTable
+ done <- newEmptyMVar
+
+ pid <- forkProcess node1 $ getSelfPid >>= runUntilRegistered
+
+ runProcess node2 $ do
+ register regName pid
+ liftIO $ putMVar done ()
+
+ takeMVar done
+ threadDelay 400000 -- TODO: this is racy, but we want to avoid using mx for now...
+
+ regHere <- newEmptyMVar
+
+ runProcess node2 $ whereis regName >>= liftIO . putMVar regHere
+ res <- takeMVar regHere
+ assertBool "expected Nothing, but process id was still registered" (res == Nothing)
+
+ where
+ runUntilRegistered us = do
+ mPid <- whereis regName
+ case mPid of
+ Nothing -> runUntilRegistered us
+ Just p
+ | p == us -> return ()
+
+ regName = "testRegisterRemote"
+
testUnsafeSend :: TestTransport -> Assertion
testUnsafeSend TestTransport{..} = do
serverAddr <- newEmptyMVar
@@ -1677,6 +1708,7 @@ tests testtrans = return [
, testCase "MaskRestoreScope" (testMaskRestoreScope testtrans)
, testCase "ExitLocal" (testExitLocal testtrans)
, testCase "ExitRemote" (testExitRemote testtrans)
+ , testCase "RegisterDiedRemote" (testRegisterRemote testtrans)
, testCase "TextCallLocal" (testCallLocal testtrans)
-- Unsafe Primitives
, testCase "TestUnsafeSend" (testUnsafeSend testtrans) Right... This patch (against master) seems to fail reliably. Let's incorporate that into this PR and see if I can demonstrate that the issue is resolved. I'm also going to see what happens when we drop the more complex test case into master. |
Indeed, the original test cases still pass when I don't start the mx agent in the node controller, whereas the simpler test case fails reliably. I'm going to remove the more complex test case, though we can roll it back in if you have suggestions about how to make it work.
Oddly, the monitoring doesn't fix the failing test case. Tracing shows that the map is updated correctly, so I'm going to trace the …
Oh good heavens... I feel like I've really wasted my time on this one... Take a look at our register implementation...

ncEffectRegister :: ProcessId -> String -> NodeId -> Maybe ProcessId -> Bool -> NC ()
ncEffectRegister from label atnode mPid reregistration = do
  node <- ask
  currentVal <- gets (^. registeredHereFor label)
  isOk <-
    case mPid of
      Nothing -> -- unregister request
        return $ isJust currentVal
      Just thepid -> -- register request
        do isvalidlocal <- isValidLocalIdentifier (ProcessIdentifier thepid)
           return $ (isNothing currentVal /= reregistration) &&
                    (not (isLocal node (ProcessIdentifier thepid) ) || isvalidlocal ) -- WHAT!!??
  if isLocal node (NodeIdentifier atnode)
    then do when isOk $
              do modify' $ registeredHereFor label ^= mPid
                 updateRemote node currentVal mPid
                 case mPid of
                   (Just p) -> do
                     if reregistration
                       then liftIO $ trace node (MxUnRegistered (fromJust currentVal) label)
                       else return ()
                     liftIO $ trace node (MxRegistered p label)
                   Nothing -> liftIO $ trace node (MxUnRegistered (fromJust currentVal) label)
            newVal <- gets (^. registeredHereFor label)
            ncSendToProcess from $ unsafeCreateUnencodedMessage $
              RegisterReply label isOk newVal
    else let operation =
               case reregistration of
                 True -> flip decList
                 False -> flip incList
         in case mPid of
              Nothing -> return ()
              Just pid -> modify' $ registeredOnNodesFor pid ^: (maybeify $ operation atnode)
  where updateRemote node (Just oldval) (Just newval) | processNodeId oldval /= processNodeId newval =
          do forward node (processNodeId oldval) (Register label atnode (Just oldval) True)
             forward node (processNodeId newval) (Register label atnode (Just newval) False)
        updateRemote node Nothing (Just newval) =
          forward node (processNodeId newval) (Register label atnode (Just newval) False)
        updateRemote node (Just oldval) Nothing =
          forward node (processNodeId oldval) (Register label atnode (Just oldval) True)
        updateRemote _ _ _ = return ()
        maybeify f Nothing = unmaybeify $ f []
        maybeify f (Just x) = unmaybeify $ f x
        unmaybeify [] = Nothing
        unmaybeify x = Just x
        incList [] tag = [(tag,1)]
        incList ((atag,acount):xs) tag | tag==atag = (atag,acount+1) : xs
        incList (x:xs) tag = x : incList xs tag
        decList [] _ = []
        decList ((atag,1):xs) tag | atag == tag = xs
        decList ((atag,n):xs) tag | atag == tag = (atag,n-1):xs
        decList (x:xs) tag = x:decList xs tag
        forward node to reg =
          when (not $ isLocal node (NodeIdentifier to)) $
            ncSendToNode to $ NCMsg { ctrlMsgSender = ProcessIdentifier from
                                    , ctrlMsgSignal = reg
                                    }

You'll notice that in the case where we're storing a name for a remote process, we notify the remote node of the fact, and we also track our own processes that are registered elsewhere. Note the call to updateRemote above.

So... I've really no idea why we need a monitoring agent. The node controllers coordinate this between them, which is why the test cases, once written properly, pass without the monitoring agent running. My inclination is to merge this patch with only the test cases and the tracing fixes to the NC. I cannot see how the monitoring agent is needed. I'm also not sure whether both test cases really add much value, or if we should just merge one. I'm going to comment out …
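For context, what the monitoring agent under discussion buys us is roughly the following at the user level. This is a hedged sketch only, not the agent's actual implementation; watchRegistration is a hypothetical name.

```haskell
-- A sketch of the monitoring-based alternative: watch the registered pid and
-- drop the local name once the remote process dies or becomes unreachable
-- (disconnects and node failures also arrive as monitor notifications).
import Control.Monad (when)
import Control.Distributed.Process

watchRegistration :: String -> ProcessId -> Process ()
watchRegistration label pid = do
  ref <- monitor pid
  ProcessMonitorNotification ref' pid' _reason <- expect
  when (ref' == ref && pid' == pid) $ unregister label
```

As the NC code above shows, though, registrations for remote processes are already forwarded to the owning node, so the node controllers can coordinate this between themselves without any agent.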
Oh yes, they do indeed.
@hyperthunk, what you wrote makes sense. If I recall correctly, there were some problems at the time the initial PR was written, but maybe there were other issues, since fixed, that prevented the logic from working correctly back then.
Possibly, yes. I can leave the monitoring in, as it certainly won't do any harm, but it is not essential to this patch set. You can comment out the …

I now have what I believe is a viable patch anyway. I will push in a moment and we can review the final product.
…ing from ncEffectRegister

The NC does in fact notify remote peers when one of their processes is registered against a name locally. This should ensure that the remote's ncEffectDied triggers a message to the registry-hosting node. Since there are some potential situations where that might fail, we still maintain monitors for remote registered processes, but they are not an essential function afaict.

The original test case has been removed, since if we deliberately break the NC code to trigger failure, the test code will hang indefinitely (and chew CPU resources heavily), which isn't helpful or informative on CI. The replacement test case does make use of delays and timeouts; however, this has been done in a structured fashion, breaking gracefully if we trigger failure and passing relatively quickly on the happy path. It also seems to behave consistently.
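For illustration, the structured delay/timeout approach just described could look something like the following. waitUnregistered is a hypothetical helper, not necessarily the one used in the patch.

```haskell
-- A sketch (assumed names) of polling for a name to disappear under a bounded
-- timeout, instead of a bare threadDelay: a broken NC makes the test fail
-- within the timeout rather than hang indefinitely.
import Control.Concurrent (threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad.IO.Class (liftIO)
import System.Timeout (timeout)
import Control.Distributed.Process (whereis)
import Control.Distributed.Process.Node (LocalNode, runProcess)

waitUnregistered :: LocalNode -> String -> Int -> IO Bool
waitUnregistered node label micros =
  fmap (== Just ()) . timeout micros $ go
  where
    go = do
      box <- newEmptyMVar
      runProcess node $ whereis label >>= liftIO . putMVar box
      m <- takeMVar box
      case m of
        Nothing -> return ()                 -- the name is gone: success
        Just _  -> threadDelay 50000 >> go   -- still registered: retry
```

A test could then assert, for example, that waitUnregistered node2 regName 2000000 returns True once the registered process has died (the two-second bound is illustrative).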
We're getting intermittent failures in CI (and I've seen some locally) for the tracing suite, but only when run with the TCP transport. I will open a bug to investigate, but since we're not focusing on that in this patch set, I am rolling this change (from master) back and we will test tracing only with the in-memory transport for now.
Okay, so assuming CI is happy, I think we're good to go. We have fixed the tracing of … There are also some generic fixes to the … Finally, I have rolled back the move to run the …

@qnikst - since I've made quite a few changes, would you mind doing the merge once you're happy this all makes sense? I'm probably tree blind by now, having worked on it.
How does the code before this patch ensure that when there is a connection failure, the label pointing to the remote process is unregistered? Is …
Hm, spent some time looking at the code and it looks ok.
The NC produces a …

If the monitor agent is not necessary, I'd leave it out to avoid confusion. Every person reading the code would ask what it is needed for, and there is no motivation to maintain it.
Agreed with all that, @facundominguez - I'll remove the code. We seem to have picked up an odd CI failure anyway, so I'll resolve both and ping back.
Right... @facundominguez @qnikst - if you're both happy, I think we can finally get this one merged!
This patch subsumes #184, and incorporates #321 (which I have submitted separately, since it's less contentious).
I have taken the code @qnikst wrote for #184, but refactored it to use the label-tracking approach suggested by @mboes, which is not only simpler but also avoids the incorrect Map handling code in the tweag:DP-100 branch. I have also kept the agent code inside the Node.hs module, as suggested elsewhere in the review for that PR.
This patch also incorporates the tests written by @qnikst (before the repos were reorganised), and I've included some additional tests in the Mx.hs test suite, which verify that the ordering constraints are observed.
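To give a flavour of what those Mx tests observe, here is a hedged sketch (not the code from Mx.hs) of an agent that listens for the MxRegistered / MxUnRegistered events this patch traces and forwards them to an interested process, which can then assert on the order in which they arrive. The agent id and registrationListener name are made up for illustration.

```haskell
-- A sketch only: forward registration events for one label to a test process.
import Control.Distributed.Process
import Control.Distributed.Process.Management
  ( MxAgentId(..), MxEvent(..), liftMX, mxAgent, mxReady, mxSink )

registrationListener :: String -> ProcessId -> Process ProcessId
registrationListener label interested =
  mxAgent (MxAgentId "registration-listener") () [
      mxSink $ \ev -> do
        case ev of
          MxRegistered pid l
            | l == label -> liftMX $ send interested ("registered", l, pid)
          MxUnRegistered pid l
            | l == label -> liftMX $ send interested ("unregistered", l, pid)
          _ -> return ()   -- ignore unrelated management events
        mxReady
    ]
```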
Although I have not managed to close out #241 yet, there is no reason to believe that the management event bus does not hold to the following assertion (made in the comments for that issue/ticket):