Skip to content

Stop all processes in closeLocalNode. #187

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
47 changes: 36 additions & 11 deletions src/Control/Distributed/Process/Node.hs
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ import Data.Maybe (isJust, fromJust, isNothing, catMaybes)
import Data.Typeable (Typeable)
import Control.Category ((>>>))
import Control.Applicative (Applicative, (<$>))
import Control.Monad (void, when)
import Control.Monad (void, when, join, replicateM_)
import Control.Monad.IO.Class (MonadIO, liftIO)
import Control.Monad.State.Strict (MonadState, StateT, evalStateT, gets)
import qualified Control.Monad.State.Strict as StateT (get, put)
Expand Down Expand Up @@ -198,6 +198,8 @@ import Control.Distributed.Process.Internal.Primitives
, sendChan
, catch
, unwrapMessage
, monitor
, kill
)
import Control.Distributed.Process.Internal.Types (SendPort, Tracer(..))
import qualified Control.Distributed.Process.Internal.Closure.BuiltIn as BuiltIn (remoteTable)
Expand Down Expand Up @@ -255,12 +257,6 @@ createBareLocalNode endPoint rtable = do
(stopNC node)

return tracedNode
where
stopNC node =
writeChan (localCtrlChan node) NCMsg
{ ctrlMsgSender = NodeIdentifier (localNodeId node)
, ctrlMsgSignal = SigShutdown
}

startMxAgent :: LocalNode -> IO LocalNode
startMxAgent node = do
Expand Down Expand Up @@ -319,11 +315,40 @@ startServiceProcesses node = do

-- | Force-close a local node
--
-- TODO: for now we just close the associated endpoint
-- This function first try to brutally kill all processes, this
-- process is not guaranteed to continue. So this is application
-- responsibility to either close all it's processes before dying
-- or being able to gracefully shutdown.
closeLocalNode :: LocalNode -> IO ()
closeLocalNode node =
-- TODO: close all our processes, surely!?
NT.closeEndPoint (localEndPoint node)
closeLocalNode node = join $ terminateProcesses
where
terminateProcesses :: IO (IO ())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the potentially endless loop hammering processes exit exceptions? Surely one only gets a ProcessMonitorNotification after the process in fact no longer exists? If not, that's a bug, which I don't think it's justified to paper over by looping in this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a new version it doesn't, because this function doesn't start new iteration unless all processes from current generation were killed. this means that for each process there will be only one exception. and then when all processes were killed then there will be a new check if there are new processes. If so then termination procedure continues. There are few other possible solutions:

  1. Invent new procedure that will not allow staring new processes when node is closing
  2. Use agent as agent has information about processes born and death.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About why this solution was chosen. Currently we have following solutions:

  1. original - same as this one but without looping at all. Basically this means that if any process will be spawned after start of terminationProcess they will survive. This solution may block any of process will refuse to die and may exit even if there are existing processes with a very large time frame for premature exit
  2. my first attempt - had a loop but had no monitoring notifications. This solution had a tiny frame for prepature exit, but could loop if this process is in race with self restarting process. I.e. process that traps kill and spawn new process.
  3. my second attempt (current one) - this is the same as 1 but with looping. Such looping reduces time frame for premature exit (and basically could entirely remote it), but can potentially block if there is self-restarting process.

Summing all this up. I think that we need to either decide to implement some heavyweight solution, or 3 could be the best fit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about new processes, but the current solution is insufficient in addition to being dangerous: the fact that there are no processes any longer on the node does not mean that a process on another node can't spawn a process on this node before the node is actually shut down.

The proper fix in my mind is neither of the 3 solutions above: clearly nodes need one additional state, say a "closing" state. When a node is in the closing state, no process is allowed to assign any new resources (e.g. newly spawned processes) to it.

terminateProcesses = do
st <- readMVar (localState node)
let mxACPid = case localEventBus node of
MxEventBus pid _ _ _ -> pid
MxEventBusInitialising -> error "closeLocalNode: The given node is not initialized."
pids = filter (/=mxACPid) $ map processId $ Map.elems (st ^. localProcesses)
if pids == []
then do runProcess node $ do
-- Trying to kill the management agent controller prevents other processes
-- from terminating.
mapM_ monitor pids
mapM_ (flip kill "closing node") pids
replicateM_ (length pids) $ receiveWait
[ match $ \(ProcessMonitorNotification _ _ _) -> return () ]
return (join terminateProcesses)
else return $ stopNC node

-- | Send stop signal to node controller
stopNC :: LocalNode -> IO ()
stopNC node =
writeChan (localCtrlChan node) NCMsg
{ ctrlMsgSender = NodeIdentifier (localNodeId node)
, ctrlMsgSignal = SigShutdown
}



-- | Run a process on a local node and wait for it to finish
runProcess :: LocalNode -> Process () -> IO ()
Expand Down