
Prevent registered names from leaking #322


Merged

hyperthunk merged 20 commits into master from reg-mx on Nov 15, 2018

Conversation

@hyperthunk (Member) commented Nov 12, 2018

This patch subsumes #184, and incorporates #321 (which I have submitted separately, since it's less contentious).

I have taken the code @qnikst wrote for #184, but refactored it to use the label tracking approach suggested by @mboes, which is not only simpler but also avoids the incorrect Map handling code in the tweag:DP-100 branch. I have also kept the agent code inside the Node.hs module, as suggested elsewhere in the review of that PR.

This patch also incorporates the tests written by @qnikst (before the repos were reorganised), and I've included some additional tests in the Mx.hs test suite, which verify that the ordering constraints are observed.

Although I have not managed to close out #241 yet, there is no reason to believe that the management event bus violates the following assertion (made in the comments for that issue):

messages dispatched on an STM broadcast channel (i.e., the management event bus) are delivered with the same FIFO ordering guarantees that exist between two communicating processes, such that communication from the node controller threads (i.e., MxEvents) will never be re-ordered, but messages dispatched to the event bus by other processes (including, but not limited to, agents) are only guaranteed to be ordered between one sender and one receiver.
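A minimal sketch of that per-sender FIFO property, using the stm package's broadcast TChan; this is illustrative only, not code from this patch, and the names and values are made up:

import Control.Concurrent.STM
  (atomically, newBroadcastTChanIO, dupTChan, writeTChan, readTChan)
import Control.Monad (replicateM)

main :: IO ()
main = do
  bus    <- newBroadcastTChanIO          -- stands in for the event bus
  reader <- atomically (dupTChan bus)    -- each subscriber dups the channel
  mapM_ (atomically . writeTChan bus) [1 .. 5 :: Int]
  xs <- replicateM 5 (atomically (readTChan reader))
  -- a single writer's messages arrive in the order that writer produced them
  print (xs == [1 .. 5])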

@qnikst (Contributor) left a comment

I see a very suspicious chunk in the code; @hyperthunk, can you please check whether that code is correct?
If it is not, can we extend our test cases so that they trigger the error before fixing it?

@hyperthunk (Member, Author)

Hold tight... Fix to the broken test(s) coming...

@hyperthunk (Member, Author)

Right... I'm happy that this functionality is implemented properly. I keep seeing intermittent test failures on Travis and having to fiddle around to make them less racy, but not with the tests in this patch!

Although, to be fair, the races span multiple tests.

Probably the thing I need to do is to convert the test suite to create a new local node for each test case, at least in the Mx case, so that we don't get agents interfering with one another with regard to registrations and whatnot (there's a sketch of that idea at the end of this comment). A major issue in fixing the tests for this PR has been the fact that we have no idea when we will see events related to agents stopping, their names getting unregistered, and so on.

I might fix that now (in this PR), or file a separate bug and do a separate PR. Let me sleep on it and decide tomorrow.
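A minimal sketch of the per-test node idea mentioned above; the helper name is illustrative and this is not code from this patch:

import Control.Exception (bracket)
import Control.Distributed.Process.Node
  (LocalNode, newLocalNode, closeLocalNode, initRemoteTable)
import Network.Transport (Transport)

-- Run a test body against its own LocalNode, closing the node afterwards
-- even if the body throws, so agents registered in one case cannot
-- interfere with another.
withFreshNode :: Transport -> (LocalNode -> IO a) -> IO a
withFreshNode transport =
  bracket (newLocalNode transport initRemoteTable) closeLocalNode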

@hyperthunk (Member, Author)

Hoo-bloomin-rah! Looks like we've finally got some stability in the tests...

@qnikst, whenever you've got time, can you take a look at the updated patch set please? Also, we should probably squash these commits down, as the PR is a right mess of interventions now...

@qnikst (Contributor) left a comment

Some nitpicks, but great work!

* master:
  and do the same for 7.10.3
  Make CI work again (temporary workaround for n-t-tcp API changes)
  Refactor tests to handle lack of MonadFail on our stack when supporting the latest GHC (tested with 8.6.1)
  adjust to deal with the latest network-transport-tcp API changes and refactor benchmarks and test runner accordingly
  Give more informative messages about invariant violations
  fiddle around with the lack of a MonadFail in the stack...
  make it possible to verify stack build --stack-yaml stack-nightly.yaml --resolver nightly --haddock --test --bench --no-run-benchmarks -v
…ocess

This does seem to indicate that either the tests are not really validating anything useful, or the issue was a non-issue to begin with. This patch nevertheless contains useful improvements to our tracing of registration events, so I think it's still valid.

@hyperthunk (Member, Author)

diff --git a/distributed-process-tests/src/Control/Distributed/Process/Tests/CH.hs b/distributed-process-tests/src/Control/Distributed/Process/Tests/CH.hs
index 91e1323..7430562 100644
--- a/distributed-process-tests/src/Control/Distributed/Process/Tests/CH.hs
+++ b/distributed-process-tests/src/Control/Distributed/Process/Tests/CH.hs
@@ -1476,6 +1476,37 @@ testExitRemote TestTransport{..} = do
   takeMVar supervisedDone
   takeMVar supervisorDone
 
+testRegisterRemote :: TestTransport -> Assertion
+testRegisterRemote TestTransport{..} = do
+  node1 <- newLocalNode testTransport initRemoteTable
+  node2 <- newLocalNode testTransport initRemoteTable
+  done <- newEmptyMVar
+
+  pid <- forkProcess node1 $ getSelfPid >>= runUntilRegistered
+
+  runProcess node2 $ do
+    register regName pid
+    liftIO $ putMVar done ()
+
+  takeMVar done
+  threadDelay 400000  -- TODO: this is racy, but we want to avoid using mx for now...
+
+  regHere <- newEmptyMVar
+
+  runProcess node2 $ whereis regName >>= liftIO . putMVar regHere
+  res <- takeMVar regHere
+  assertBool "expected Nothing, but process id was still registered" (res == Nothing)
+
+  where
+    runUntilRegistered us = do
+      mPid <- whereis regName
+      case mPid of
+        Nothing -> runUntilRegistered us
+        Just p
+          | p == us -> return ()
+
+    regName = "testRegisterRemote"
+
 testUnsafeSend :: TestTransport -> Assertion
 testUnsafeSend TestTransport{..} = do
   serverAddr <- newEmptyMVar
@@ -1677,6 +1708,7 @@ tests testtrans = return [
       , testCase "MaskRestoreScope"    (testMaskRestoreScope    testtrans)
       , testCase "ExitLocal"           (testExitLocal           testtrans)
       , testCase "ExitRemote"          (testExitRemote          testtrans)
+      , testCase "RegisterDiedRemote"  (testRegisterRemote      testtrans)
       , testCase "TextCallLocal"       (testCallLocal           testtrans)
       -- Unsafe Primitives
       , testCase "TestUnsafeSend"      (testUnsafeSend          testtrans)

Right... This patch (against master) seems to fail reliably. Let's incorporate that into this PR and see if I can demonstrate that the issue is resolved. I'm also going to see what happens when we drop the more complex test case into master.

@hyperthunk (Member, Author)

> Right... This patch (against master) seems to fail reliably. Let's incorporate that into this PR and see if I can demonstrate that the issue is resolved. I'm also going to see what happens when we drop the more complex test case into master.

Indeed, the original test cases still pass when I don't start the mx agent in the node controller, whereas the simpler test case fails reliably. I'm going to remove the more complex test case, though we can roll it back in if you have suggestions about how to make it work.

@hyperthunk (Member, Author)

Oddly the monitoring doesn't fix the failing test case. Tracing shows that the map is updated correctly, so I'm going to trace the whereIs handling in the NC to see if there's a race in the test case...

@hyperthunk (Member, Author)

Oh good heavens... I feel like I've really wasted my time on this one...

Take a look at our register implementation...

ncEffectRegister :: ProcessId -> String -> NodeId -> Maybe ProcessId -> Bool -> NC ()
ncEffectRegister from label atnode mPid reregistration = do
  node <- ask
  currentVal <- gets (^. registeredHereFor label)
  isOk <-
       case mPid of
         Nothing -> -- unregister request
           return $ isJust currentVal
         Just thepid -> -- register request
           do isvalidlocal <- isValidLocalIdentifier (ProcessIdentifier thepid)
              return $ (isNothing currentVal /= reregistration) &&
                (not (isLocal node (ProcessIdentifier thepid) ) || isvalidlocal ) -- WHAT!!??
  if isLocal node (NodeIdentifier atnode)
     then do when isOk $
               do modify' $ registeredHereFor label ^= mPid
                  updateRemote node currentVal mPid
                  case mPid of
                    (Just p) -> do
                      if reregistration
                        then liftIO $ trace node (MxUnRegistered (fromJust currentVal) label)
                        else return ()
                      liftIO $ trace node (MxRegistered p label)
                    Nothing  -> liftIO $ trace node (MxUnRegistered (fromJust currentVal) label)
             newVal <- gets (^. registeredHereFor label)
             ncSendToProcess from $ unsafeCreateUnencodedMessage $
               RegisterReply label isOk newVal
     else let operation =
                 case reregistration of
                    True -> flip decList
                    False -> flip incList
           in case mPid of
                Nothing -> return ()
                Just pid -> modify' $ registeredOnNodesFor pid ^: (maybeify $ operation atnode)
      where updateRemote node (Just oldval) (Just newval) | processNodeId oldval /= processNodeId newval =
              do forward node (processNodeId oldval) (Register label atnode (Just oldval) True)
                 forward node (processNodeId newval) (Register label atnode (Just newval) False)
            updateRemote node Nothing (Just newval) =
                 forward node (processNodeId newval) (Register label atnode (Just newval) False)
            updateRemote node (Just oldval) Nothing =
                 forward node (processNodeId oldval) (Register label atnode (Just oldval) True)
            updateRemote _ _ _ = return ()
            maybeify f Nothing = unmaybeify $ f []
            maybeify f (Just x) = unmaybeify $ f x

            unmaybeify [] = Nothing
            unmaybeify x = Just x
            incList [] tag = [(tag,1)]
            incList ((atag,acount):xs) tag | tag==atag = (atag,acount+1) : xs
            incList (x:xs) tag = x : incList xs tag
            decList [] _ = []
            decList ((atag,1):xs) tag | atag == tag = xs
            decList ((atag,n):xs) tag | atag == tag = (atag,n-1):xs
            decList (x:xs) tag = x:decList xs tag
            forward node to reg =
              when (not $ isLocal node (NodeIdentifier to)) $
                ncSendToNode to $ NCMsg { ctrlMsgSender = ProcessIdentifier from
                                        , ctrlMsgSignal = reg
                                        }

You'll notice that in the case where we're storing a name for a remote process, we notify the remote node of the fact, and also track our own processes that are registered elsewhere. Note the call to updateRemote in the local node's case, and the use of registeredOnNodesFor, which is interrogated during ncEffectDied.

So... I've really no idea why we need a monitoring agent. The node controllers coordinate this between them, which is why the test cases, once written properly, pass without the monitoring agent running.

My inclination is to merge this patch with only the test cases, and the tracing fixes to the NC. I cannot see how the monitoring agent is needed. I'm also not sure whether both test cases really add much value, or if we should just merge one. I'm going to comment out updateRemote in ncEffectRegister and see which test case, if either, breaks...

@hyperthunk (Member, Author)

> I'm going to comment out updateRemote in ncEffectRegister and see which test case, if either, breaks...

Oh yes, they do indeed.

@qnikst (Contributor) commented Nov 13, 2018

@hyperthunk what you wrote makes sense. If I recall correctly, there were some problems at the time the initial PR was written, but perhaps the issues that prevented this logic from working correctly were fixed after that.

@hyperthunk (Member, Author)

> @hyperthunk what you wrote makes sense. If I recall correctly, there were some problems at the time the initial PR was written, but perhaps the issues that prevented this logic from working correctly were fixed after that.

Possibly, yes. I can leave the monitoring in, as it certainly won't do any harm. But it is not essential to this patch set. You can comment out the void $ registryMonitorAgent expression in startServiceProcesses and all the test cases pass.

I now have what I believe is a viable patch anyway. I will push in a moment and we can review the final product.

…ing from ncEffectRegister

The NC does in fact notify remote peers when one of their processes is registered against a name locally. This should ensure that the remote's ncEffectDied triggers a message to the registry hosting node. Since there are some potential situations where that might fail, we still maintain monitors for remote registered processes, but they are not an essential function afaict.

The original test case has been removed, since if we deliberately break the NC code to trigger failure, the test code will hang indefinitely (and chew CPU resources heavily), which isn't helpful or informative on CI. The replacement test case does make use of delays and timeouts; however, this has been done in a structured fashion, breaking gracefully if we trigger failure and passing relatively quickly on the happy path. It also seems to behave consistently.
We're getting intermittent failures in CI (and I've seen some locally) for the tracing suite, but only when run with the TCP transport. I will open a bug to investigate, but since we're not focusing on that in this patch set, I am rolling this change (from master) back and we will test tracing only with the in-memory transport for now.

@hyperthunk (Member, Author) commented Nov 14, 2018

Okay, so assuming CI is happy, I think we're good to go. We have fixed the tracing of MxUnRegistered when a registered (local or remote) process dies, and we have a test case for that in the Mx.hs suite. We have a new test case verifying that the node controllers collaborate to inform us if a remote process registered on our local node dies, and this test case fails in (what I think is) a reasonable way if we deliberately break the code.

There are also some generic fixes to the Mx.hs test suite to remove potential races and name clashes, and a few cosmetic changes. This PR also runs the Mx test suite with both transports.

Finally, I have rolled back the move to run Tracing.hs with the TCP transport, since in master we only use the in-memory one, and I'm getting intermittent failures in one test case, both locally and in CI. I have raised #323 to address this separately.
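For reference, here is a sketch of a user-level agent that observes the MxRegistered / MxUnRegistered events whose tracing is fixed here. It assumes the mxAgent / mxSink API from Control.Distributed.Process.Management; the agent id and the logging are illustrative, and this is not code from this patch:

import Control.Distributed.Process (Process, ProcessId)
import Control.Distributed.Process.Management
  (MxAgentId(..), MxEvent(..), mxAgent, mxSink, mxReady, liftMX)
import Control.Monad.IO.Class (liftIO)

-- Logs every registration and unregistration the management event bus reports.
registrationWatcher :: Process ProcessId
registrationWatcher =
  mxAgent (MxAgentId "registration-watcher") () [
    mxSink $ \ev -> do
      case ev of
        MxRegistered   pid label ->
          liftMX $ liftIO $ putStrLn (label ++ " registered to " ++ show pid)
        MxUnRegistered pid label ->
          liftMX $ liftIO $ putStrLn (label ++ " unregistered from " ++ show pid)
        _ -> return ()
      mxReady
  ]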

@qnikst - since I've made quite a few changes, would you mind doing the merge once you're happy this all makes sense? I'm probably tree blind by now, having worked on it.

@facundominguez (Contributor) commented Nov 14, 2018

How does the code before this patch ensure that when there is a connection failure, the label pointing to the remote process is unregistered?

Is ncEffectDied going to be called if no process in the local node is monitoring the remote node?

@facundominguez (Contributor) commented Nov 14, 2018

Hm, spent some time looking at the code and it looks ok.

> How does the code before this patch ensure that when there is a connection failure, the label pointing to the remote process is unregistered?
>
> Is ncEffectDied going to be called if no process in the local node is monitoring the remote node?

The NC produces a Died message to itself when the transport detects an error. ncEffectDied removes the affected entries from registeredHereFor and registeredOnNodesFor.

If the monitor agent is not necessary, I'd leave it out to avoid confusion. Every person reading the code would ask what it is needed for and there is no motivation to maintain it.

@hyperthunk (Member, Author)

Agreed with all of that, @facundominguez; I'll remove the code. We seem to have picked up an odd CI failure anyway, so I'll resolve both and ping back.

@hyperthunk (Member, Author)

Right... @facundominguez @qnikst - if you're both happy, I think we can finally get this one merged!

@hyperthunk merged commit 003e522 into master on Nov 15, 2018
@hyperthunk deleted the reg-mx branch on November 15, 2018 at 23:57