Improve exception handling in callLocal. #180


Merged
merged 1 commit into from
Jun 22, 2015

Conversation

qnikst
Contributor

@qnikst qnikst commented Feb 11, 2015

callLocal had a few flaws:

  1. It could leak the worker process: if the parent process received an
     exception but handled it, the worker was not killed, so linking the
     worker to the parent didn't help.
  2. It was not possible to handle asynchronous exceptions in the worker
     process.

This patch solves both problems together by adding a correct mechanism
for exception handling: if the worker has not finished yet, all exceptions
are rethrown to the worker process, so the worker is able to handle them.
If the worker has already finished, the exception is thrown from the
callLocal code. The resulting semantics are the same as when the code is
executed in a single process.
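The described semantics can be sketched with plain Control.Concurrent primitives. This is a base-only sketch, not the actual patch; callLocalIO is a hypothetical name standing in for the Process-based implementation:

```haskell
import Control.Concurrent
import Control.Exception

-- Sketch: run the action in a worker thread. If the caller is interrupted
-- while waiting, forward the exception to the worker so the worker can
-- handle it, then wait again for the worker's final outcome.
callLocalIO :: IO a -> IO a
callLocalIO action = mask $ \restore -> do
  mv  <- newEmptyMVar
  tid <- forkIO $ try (restore action) >>= putMVar mv
  let awaitResult = takeMVar mv
  r <- awaitResult `catch` \e -> do
         throwTo tid (e :: SomeException)  -- rethrow to the worker
         awaitResult                       -- the worker decides the outcome
  either (throwIO :: SomeException -> IO a) return r
```

If the worker catches the forwarded exception and completes, the caller gets the worker's result; if the worker dies, the caller rethrows, just as if the code ran in a single process.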

throwTo tid (e::SomeException)
putMVar lock ()
release' fetchResult)
release $ liftIO $ fetchResult
Contributor

Since fetchResult is interruptible there is no need to unmask exceptions. Then the mask in the catch handler becomes unnecessary too.

throwTo is an interruptible operation, so should be protected with an uninterruptible mask or the spawned thread could leak.

unmask seems a better name than release.
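The point about interruptible operations can be checked with a small base-only program (demoInterruptible is a hypothetical demo name, not part of the patch): even under mask_, a blocked takeMVar can receive an asynchronous exception, which is why protecting throwTo-style cleanup needs uninterruptibleMask.

```haskell
import Control.Concurrent
import Control.Exception

-- A thread blocks on takeMVar under mask_. Because takeMVar on an empty
-- MVar is interruptible, killThread still gets through; code that must
-- not be interrupted needs uninterruptibleMask instead.
demoInterruptible :: IO Bool
demoInterruptible = do
  done <- newEmptyMVar
  tid  <- forkIO $ mask_ $ do
            hole <- newEmptyMVar
            r <- try (takeMVar hole)          -- interruptible even when masked
            putMVar done (r :: Either SomeException ())
  threadDelay 100000                          -- let the thread block first
  killThread tid                              -- delivered despite mask_
  r <- takeMVar done
  return (either (const True) (const False) r)
```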

@facundominguez
Contributor

The Exception and MVar spaghetti is difficult to follow. Could we avoid spawning a thread to implement callLocal?

@qnikst has mentioned to me an implementation of something similar to

inlineRunProcess :: LocalNode -> Process a -> IO a

which would run the action in the caller thread and in the given node.

Does this sound reasonable?

@hyperthunk
Member

I don't understand. What's the point of callLocal if we don't spawn a process/thread? The haddock says this function is meant to "isolate [the worker] from messages sent to the caller process, and also allows silently dropping late or duplicate messages sent to the isolated process after it exits.". So to do that we have to spawn.

What would remove the spaghetti IMO would be to use a typed channel instead of mvars. This would utilise STM instead and we could get rid of the whole mask/unmask dance we're doing here. Why is it impossible to handle asynchronous exceptions in the spawned process? I don't understand that at all. Surely we don't care what the worker is doing. We can surround that with a try and just use a channel typed as Maybe a for the result!?

And what exactly do we think a worker should do if the parent gets an asynchronous exception and handles it? Why not just leave the caller unmasked (ie., let the parent die) and link as before, then asynchronous exceptions will propagate to the worker as expected. I think we're setting policy at the wrong level by handling exceptions here.

Overall I think we're overcooking the broth a bit. I would also strongly suggest that this API, which is quite useful, is totally subsumed by https://github.com/haskell-distributed/distributed-process-async. I'd prefer people use that API in general, since it decouples the caller from the worker completely and the exception handling behaviour is simple and unsurprising (if the worker dies, we get a failure record back; if the parent dies, the worker dies too) and callLocal is a simple composition of async and wait.

@qnikst
Contributor Author

qnikst commented Feb 12, 2015

I don't understand. What's the point of callLocal if we don't spawn a process/thread? The haddock says this function is meant to "isolate [the worker] from messages sent to the caller process, and also allows silently dropping late or duplicate messages sent to the isolated process after it exits.". So to do that we have to spawn.

The idea was that an IO thread would be associated with another Process (ProcessId and mailbox) for the worker's lifetime; that would solve the isolation task. But I don't want to mix discussion about it with this PR.

What would remove the spaghetti IMO would be to use a typed channel instead of mvars.

No, the reason for MVars here is to have sane Exception behaviour and remove races, channels can't solve this problem.

Why is it impossible to handle asynchronous exceptions in the spawned process? I don't understand that at all.

The idea is the following: if you send an exception to the "wrapper", then that exception should be transferred to the spawned process; the old code didn't implement that at all.

We can surround that with a try and just use a channel typed as Maybe a for the result!?

I don't understand; it seems to be a discussion of a different problem.

And what exactly do we think a worker should do if the parent gets an asynchronous exception and handles it?

Consider following code:

another <- forkProcess node $ do
   them <- expect
   throwTo them MyException -- maybe not exact
   "Boo!" <- expect
   return ()
runProcess node $ do
   self <- getSelfPid
   send another self
   callLocal $ doSmth `catch` \MyException -> send another "Boo!"

I'm expecting that another will send an exception to "wrapper" process, and this exception will be handled by "worker" and we will receive "Boo!".

Why not just leave the caller unmasked (ie., let the parent die) and link as before, then asynchronous exceptions will propagate to the worker as expected.

This is wrong: if the parent handles an exception and does not die, then the spawned process will continue to work, but its result will never be received by the parent.

I think we're setting policy at the wrong level by handling exceptions here.

Can you elaborate, please?

I'd prefer people use that API in general, since it decouples the caller from the worker completely and the exception handling behaviour is simple and unsurprising (if the worker dies, we get a failure record back, if the parent dies, the worker dies too) and callLocal is a simple composition of async and wait.

The idea here is that the worker and the wrapper are not decoupled; on the contrary, there are very strong requirements on how exceptions should be handled so that they are not lost and the correct process handles them. Async gives another set of guarantees and slightly different behaviour than this code.

@hyperthunk
Member

I don't understand why an exception thrown to the parent should be thrown to the child/worker process. What purpose does that serve? I do understand why you're trying to avoid the caller losing the result. Let's consider some code like this:

example = do
  callLocal (fmap Just runComplexThing)
    `catch` \(_ :: SomeException) -> return Nothing

The problem here is that if someone throws an exception to the caller, the worker (runComplexThing) will carry on and its result will be lost. But that is exactly what I mean by setting policy: the code above chooses to potentially lose the result of runComplexThing by handling exceptions in that way. It could equally not handle exceptions, and then linking the caller and the worker would have the expected result.

No, the reason for MVars here is to have sane Exception behaviour and remove races, channels can't solve this problem

Why? How would using typed channels introduce races here? Why does the exception handling behaviour improve using MVars? The exception handling code we have in this PR is quite involved, whereas a channel based version seems like it would be much simpler to me.

And the point I make about decoupling versus coupling is this: if you implement this using Async then you get exactly the exception handling semantics you've got here. But you could equally cancel if the parent gets an exception, such as

exampleAsync = do
  hAsync <- async $ task runComplexThing
  res <- try (wait hAsync)
  case res of
    Left (_ :: SomeException) -> cancel hAsync >> return Nothing
    Right dat                 -> return (Just dat)

I cannot see where there is a race in the above async example and it seems much easier to follow.
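The cancellation pattern in exampleAsync can be exercised with base primitives only. The miniature Handle/asyncIO/waitIO/cancelIO names below are hypothetical stand-ins for the d-p-async API, not its real implementation:

```haskell
import Control.Concurrent
import Control.Exception

-- A miniature async handle: a worker thread plus an MVar for its result.
data Handle a = Handle ThreadId (MVar (Either SomeException a))

asyncIO :: IO a -> IO (Handle a)
asyncIO act = do
  mv  <- newEmptyMVar
  tid <- forkIO $ try act >>= putMVar mv
  return (Handle tid mv)

waitIO :: Handle a -> IO a
waitIO (Handle _ mv) = readMVar mv >>= either throwIO return

cancelIO :: Handle a -> IO ()
cancelIO (Handle tid _) = killThread tid

-- Wait for the worker; on any exception from wait, cancel and give up.
example :: IO (Maybe Int)
example = do
  h <- asyncIO (threadDelay 100000 >> return 42)
  r <- try (waitIO h) :: IO (Either SomeException Int)
  case r of
    Left _  -> cancelIO h >> return Nothing
    Right v -> return (Just v)
```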

@qnikst
Contributor Author

qnikst commented Feb 13, 2015

Tim, it seems that we are on different pages about the semantics that we want to see in callLocal. I'm arguing that exceptions should be propagated to the worker thread, because it's the expected behaviour and could save us from a potential process leak. It seems that you, on the contrary, say that the processes should be isolated (exceptions should not be propagated), because that's the expected behaviour. (Offtopic: I also think that I could express the semantics I want using d-p-async, but it would have the same level of complexity. I also think that the d-p-async primitives have the same guarantees and semantics, but they're a bit different from the call* ones.)

So we have the following questions about semantics that we should agree upon:

  1. What should happen with the worker process if the parent process receives an exception? There are 3 variants:
    a. nothing at all
    b. nothing, but the worker should be linked to the parent (current behaviour without the patch)
    c. the worker should be killed anyway

I'd say that a. is just insane; b. is more or less OK, but it leaves the worker process alive if the parent
handles an exception. For example, in the first code above runComplexThing will still be alive
and in some cases may harm the system, and we don't have its ProcessId or ThreadId in order
to communicate with it. exampleAsync is much better, but it doesn't handle DiedException,
so here we are still leaking a process.

  2. Should the exception be propagated to the worker thread? Here we have 2 variants:
    a. it should
    b. it shouldn't

I don't have a strong opinion about this question, but for me the temporary isolation of the worker's
mailbox from the parent doesn't mean that the worker should be isolated from the parent's exceptions.
Really, as far as I understand, we don't have a use scenario to prove that a. is better.

Once we are in agreement on 1. and 2., we can continue the implementation discussion,
because the solutions heavily depend on the semantics we want.

What do you think about the questions above? Is there a way to convince you that 2.a is better
than 2.b? For example, I can try to write semantics rules for both, or prepare an example that
can be expressed using 2.a but will be broken with 2.b.

P.S. @facundominguez mentioned that call has the same behaviour, so maybe I should give up on the full PR and resubmit a simpler version :)

@facundominguez
Contributor

If exceptions are not forwarded to callLocal p, and the worker refuses to be canceled or if it is slow to die, for whatever reason, the only way to avoid the leaking process is to have callLocal wait until the worker finishes.

@hyperthunk
Member

What should happen with the worker process if the parent process receives an exception?

That's the thing here. In the Erlang model (which we're trying to stick to here) the way that you handle this at all times is that if you want the child to exit then you use linking. How the parent and child handle exit signals is entirely up to them.

I'm arguing that exceptions should be propagated to the worker thread, because it's the expected behaviour and could save us from a potential process leak. It seems that you, on the contrary, say that the processes should be isolated (exceptions should not be propagated), because that's the expected behaviour.

It is expected behaviour guys, because in erlang process isolation is a key principle. You only leak a process if you're expecting the child/worker to be bound to the lifetime of the parent, in which case linking is the correct approach. This idea that a spawned process has somehow leaked is, as I said earlier, very application specific. Look at how the generic rpc call works in erlang for example:

do_call(Node, Request, Timeout) ->
    Tag = make_ref(),
    {Receiver,Mref} =
    erlang:spawn_monitor(
      fun() ->
          %% Middleman process. Should be unsensitive to regular
          %% exit signals.
          process_flag(trap_exit, true),
          Result = gen_server:call({?NAME,Node}, Request, Timeout),
          exit({self(),Tag,Result})
      end),
    receive
    {'DOWN',Mref,_,_,{Receiver,Tag,Result}} ->
        rpc_check(Result);
    {'DOWN',Mref,_,_,Reason} ->
        %% The middleman code failed. Or someone did 
        %% exit(_, kill) on the middleman process => Reason==killed
        rpc_check_t({'EXIT',Reason})
    end.

So the middleman catches exit signals (which is the equivalent to masking asynchronous exceptions) and the parent deals with the result. If the parent dies the middleman just continues. The actual worker in this case is a gen_server, to which the worker is making a blocking call. The assumption these APIs all make is that the worker is either a long running process (such as a gen_server) or that you'll explicitly link to the parent yourself to avoid leaking.

If the parent catches an asynchronous exception, the parent can choose to cancel the async task. This is done (in d-p-async) by throwing an exception to a middleman process that isn't blocking on the actual task, which in turn kills the worker using an asynchronous exception. So unless the worker is masking it will die quickly. If the worker is masking, the caller (of cancel) can be made to block until the worker has exited, avoiding the leakage. That way, you avoid all these leaking problems to begin with. The relevant code in async works exactly this way:

    pollUntilExit wpid result' = do
      r <- receiveWait [
          match (\c@(CancelWait) -> kill wpid "cancel" >> return (Left c))
        , match (\(ProcessMonitorNotification _ pid' r) ->
                  return (Right (pid', r)))
        ]
      case r of
          Left CancelWait
            -> liftIO $ atomically $ putTMVar result' AsyncCancelled
          Right (fpid, d)
            | fpid == wpid
              -> case d of
                     DiedNormal -> return ()
                     _          -> liftIO $ atomically $ putTMVar result' (AsyncFailed d)
            | otherwise -> kill wpid "linkFailed"

So now if you wish to, you can utilise that in exampleAsync:

exampleAsync = do
  hAsync <- async $ task $ runComplexThing
  res <- try (wait hAsync)
  case res of
    Nothing -> cancelWait hAsync >> return Nothing
    Just dat -> return $ Just dat 

exampleAsync is much better but it doesn't handle DiedException so here we are still leaking a process

I don't understand this. In what way does exampleAsync not handle DiedException? Do you mean that it doesn't deal with the result? We can easily change that by just returning the AsyncResult:

exampleAsync = (async $ task runComplexThing) >>= \hAsync ->
  try (wait hAsync) >>= either (const (cancelWait hAsync)) return

So anyway, I think the choice of what to do with the worker if the parent gets an exception really depends on what the worker is doing. If it is preparing to launch nuclear missiles then we should kill it asap (and the async API is the best approach for that IMHO), whereas if it's just calculating a bill and not changing state anywhere, we can just let it run and ignore the lost result. Also, if the lifespan of the parent is important to the child, then linking deals with that neatly.

for me the temporary isolation of the worker's mailbox from the parent doesn't mean that the worker should be isolated from the parent's exceptions

That runs completely counter to the Erlang philosophy. If the two are to depend on one another, you must link them. This re-throwing of exceptions is not a normal behaviour in Erlang. That's not to say that it is wrong or not useful - indeed lots of Erlang code does this, but it's not the default expected behaviour.

P.S. @facundominguez mentioned that call has the same behaviour, so maybe I should give up on the full PR and resubmit a simpler version :)

If we give this function exactly the same semantics as the original call API then I'll accept that verbatim. :)

@hyperthunk
Member

But overall, I would really recommend that people use the async API for this kind of stuff. It's been carefully put together to cover all the nasty edge cases.

@qnikst
Contributor Author

qnikst commented Feb 18, 2015

I tend to agree that using d-p-async it is possible to express more solutions and cover some corner cases. However, the idea for callLocal was to have the cheapest possible solution for temporary isolation. The problem is that if we use a simplified API we don't have the level of control that d-p-async has, but there are applications for this case. Now, about the proposed stuff, here is a diagram:

  Process
    |
   callLocal/start -----------------------------+------+
    |-----------------------------> Worker      |      |
    ..                                |         | B-1  | B-2
    .. <------------------------------+         |      |
    |                                           |      |
   callLocal/end   -----------------------------+      |
    |                                                  |
   process/end ----------------------------------------+

  1. The idea is that if we use link, then we are binding the worker's lifetime to the process lifetime (B-2);
  2. however, it seems that the sane behaviour is binding the worker's lifetime to the callLocal lifetime (B-1).
     This could be solved as simply as: waitForResult `onException` sendTo workerPid >>= waitForResult

Almost the same as what you wrote (exampleAsync = (async $ task $ runComplexThing) >>= try (wait hAsync) >>= maybe return (cancelWait hAsync)), except for 2 things:

  1. try will not catch asynchronous exceptions, and in the case of an async exception the worker will not be killed;
  2. d-p-async uses heavier primitives.

That was the first proposed thing. The second was about exception handling, and it was the root of all the additional complexity.

I'll try to investigate how call works in this case and report back.

@hyperthunk
Member

try will not catch asynchronous exceptions

I did not know that. Oops!

This could be solved as simply as: 'waitForResult `onException` sendTo workerPid >>= waitForResult'

That's probably ok, though I don't think you should use sendTo here. I think it should be

waitForResult `onException` exit workerPid >>= waitForResult

@qnikst
Contributor Author

qnikst commented Feb 18, 2015

That's probably ok, though I don't think you should use sendTo here. I think it should be

The idea here (in callLocal) was that it was an action in IO; in case of SomeException the worker will be killed, and the worker is always local, so we can use async exceptions to kill it.

Seems we are converging on how things should look here :). So will it be OK if we say:
callLocal spawns a temporary process on the local node and waits for its result; in case of any exception in the parent process, the worker process will be killed, and the incoming exception will be rethrown?

@hyperthunk
Member

The idea here (in callLocal) was that it was an action in IO; in case of SomeException the worker will be killed, and the worker is always local, so we can use async exceptions to kill it

So don't we mean throwTo rather than sendTo here?

Seems we are converging on how things should look like here :).

Yep, I think we're getting there. :)

So will it be OK if we say: callLocal spawns a temporary process on the local node and waits for its result; in case of any exception in the parent process the worker process will be killed, and the incoming exception will be rethrown?

That sounds reasonable to me. Just to confirm... When you say "the incoming exception will be re-thrown", do you mean re-thrown into the child thread? Or re-thrown in the parent? I would prefer that we kill the child using exit, because this shows up in the management/tracing infrastructure, rather than just using throwTo. So we'd end up with

doCall child `catch` (exit child)

Or some such. Does that make sense?

@qnikst
Contributor Author

qnikst commented Feb 19, 2015

So don't we mean throwTo rather than sendTo here?

My bad, I mean Control.Concurrent.throwTo.

When you say "incoming exception will be re-thrown" do you mean re-thrown into the child thread? Or re-thrown in the parent?

I mean rethrown in the parent. I.e. (I don't know how to express this formally) if process P receives an exception E while it evaluates callLocal, then the result of the evaluation will be exception E.

I would prefer that we kill the child using exit - because this shows up in the management/tracing infrastructure rather than just using throwTo.

I didn't think about management/tracing, so killing with exit looks reasonable. So I'd say we want:

doCall child `onException` (exit child)

However, there is one more thing that I'm worried about: assume some process kills ours with a call to exit. Then there is no way to catch (recover the process) or handle (do some actions if an exception happened) that exception, which means that exit child will never be called. The worker will eventually be killed because it's linked to the parent, but in the IO solution process P will exit from the callLocal call only after the worker receives an exception. So if I shouldn't worry about that, then I agree with the solution proposed above (modulo minor fixes); otherwise the solution should be slightly more complex so it has the semantics I'm looking for. (To note, all my worries are specific to the local case, as in the distributed case those guarantees come at a higher price.)

@facundominguez please also check our discussion, as you understand the constraints for callLocal that are used in the project.

@facundominguez
Contributor

doCall child `onException` (exit child)

This is almost enough. In our use cases we also need the parent to actually wait for the worker to die.

@hyperthunk
Member

This is almost enough. In our use cases we also need the parent to actually wait for the worker to die

Ok, so we want this...

callLocal = do
  blah blah...
  doCall child `onException` (exit child >> waitForExit child)
  where
    waitForExit child = do
      mRef <- monitor child
      receiveWait [ matchMonitor mRef ]

@hyperthunk
Member

However, there is one more thing that I'm worried about: assume some process kills ours with a call to exit. Then there is no way to catch (recover the process) or handle (do some actions if an exception happened) that exception, which means that exit child will never be called. The worker will eventually be killed because it's linked to the parent, but in the IO solution process P will exit from the callLocal call only after the worker receives an exception. So if I shouldn't worry about that, then I agree with the solution proposed above (modulo minor fixes); otherwise the solution should be slightly more complex so it has the semantics I'm looking for. (To note, all my worries are specific to the local case, as in the distributed case those guarantees come at a higher price.)

That's not correct. You can catch exit in the parent too, why not? It's only kill that is untrappable and that's where the linking comes in. So I don't think we need to worry here.

@qnikst
Contributor Author

qnikst commented Feb 20, 2015

Yes, you are right, I had confused exit and die. However, with linking we have a sort of race condition, as linking says that a process linked to another will eventually be killed. And as @facundominguez mentioned, this breaks the constraints that we need.

So additional guarantee is that we want:

if process leaves callLocal then worker process is guaranteed to be dead.

So the last question we need to agree upon is whether this guarantee is sane for the general case.

@qnikst
Contributor Author

qnikst commented Feb 22, 2015

I have improved callLocal; it looks like it's safe now. @facundominguez @hyperthunk please take a look.

where
waitForExit child = do
mRef <- monitor child
receiveWait
Contributor

unmonitor if there is another exception?

Contributor Author

Not sure that I understood.

parent <- getSelfPid
mv <- liftIO newEmptyMVar :: Process (MVar (Either SomeException a))
child <- spawnLocal $ mask $ \release -> do
link parent
Contributor

Linking is not needed.

Contributor Author

You are right, fixed.

either (throwIO :: SomeException -> IO a) return

callLocal proc = mask_ $ do
parent <- getSelfPid
Contributor

parent is unused.

@qnikst qnikst force-pushed the fix-call-local branch 4 times, most recently from 79d79b1 to cd9dff2 on February 24, 2015 15:43
@qnikst
Contributor Author

qnikst commented Feb 25, 2015

Tests: everything except the latest one, which should be handled separately, I think:

haskell-distributed/distributed-process-tests#13

@qnikst
Contributor Author

qnikst commented Feb 25, 2015

Rebased version. I think everything is OK now.

@qnikst qnikst closed this Jun 16, 2015
@qnikst qnikst deleted the fix-call-local branch June 16, 2015 16:42
@qnikst qnikst restored the fix-call-local branch June 17, 2015 12:01
@qnikst qnikst reopened this Jun 17, 2015
@qnikst
Contributor Author

qnikst commented Jun 17, 2015

Reopened the request; I closed it accidentally, thinking it was my temporary branch with no PR.

waitForExit child = do
bracket (monitor child)
(unmonitor)
(\mRef -> receiveWait
Contributor

What if we use the MVar mv here? receiveWait can leak the monitor notification if it is interrupted.

Contributor Author

I think that interrupting receiveWait on an exception has the correct semantics here; if we want the thread to exit immediately, we could just send another exception. However, I don't insist on that.

@mboes
Contributor

mboes commented Jun 19, 2015

try will not catch asynchronous exceptions

I did not know that. Oops!

@hyperthunk for the record, the statement you saw above is incorrect. try will catch an exception whether it is thrown synchronously or thrown asynchronously.
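The claim is easy to check with base only (catchesAsync is a hypothetical demo name): a try in the target thread catches a ThreadKilled delivered from another thread just like a synchronous exception.

```haskell
import Control.Concurrent
import Control.Exception

-- The forked thread wraps a blocking call in try; the main thread then
-- delivers ThreadKilled asynchronously. try catches it all the same.
catchesAsync :: IO Bool
catchesAsync = do
  box <- newEmptyMVar
  tid <- forkIO $ do
    r <- try (threadDelay 1000000)        -- interrupted by killThread below
    putMVar box (r :: Either SomeException ())
  threadDelay 100000                      -- give the thread time to block
  killThread tid
  r <- takeMVar box
  return (either (const True) (const False) r)
```

Note this does not contradict the follow-up point: the handler here runs unmasked, so a second asynchronous exception could still slip in; that is the mask-the-handler caveat, not a limitation of try itself.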

@qnikst
Contributor Author

qnikst commented Jun 19, 2015

Yes, I was not correct there, but it's still not correct to use try for catching asynchronous exceptions; see "Applying mask to an exception handler".

@facundominguez
Contributor

Here's a version we agreed on in chat. I cannot update the PR because the branch is in @qnikst's repo.

callLocal :: Process a -> Process a
callLocal proc = mask $ \release -> do
  mv    <- liftIO newEmptyMVar :: Process (MVar (Either SomeException a))
  child <- spawnLocal $ try (release proc) >>= liftIO . putMVar mv
  rs <- liftIO (takeMVar mv) `onException`
          (exit child "exception in parent process" >> liftIO (takeMVar mv))
  either (liftIO . throwIO) return rs
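For intuition, the same shape in plain IO (a hedged, base-only analogue, not the Process version: forkIO stands in for spawnLocal, and throwTo ThreadKilled approximates exit/kill):

```haskell
import Control.Concurrent
import Control.Exception

-- Base-only analogue of the agreed callLocal: if the caller is interrupted
-- while waiting, kill the worker and wait again, so the worker is
-- guaranteed to be dead before control leaves callLocalIO.
callLocalIO :: IO a -> IO a
callLocalIO action = mask $ \release -> do
  mv  <- newEmptyMVar
  tid <- forkIO $ try (release action) >>= putMVar mv
  rs  <- takeMVar mv `onException`
           (throwTo tid ThreadKilled >> takeMVar mv)
  either (throwIO :: SomeException -> IO a) return rs
```

The second takeMVar in the onException handler is what gives the "parent leaves callLocal only after the worker is dead" guarantee discussed above; onException then re-raises the original exception in the caller.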

@qnikst
Contributor Author

qnikst commented Jun 19, 2015

@facundominguez @mboes I've updated the code (with a version like the one in the comment above, modulo fixing errors).

@mboes
Contributor

mboes commented Jun 19, 2015

Looks much better to me. 👍

@mboes
Contributor

mboes commented Jun 19, 2015

@qnikst btw my username on Github is @mboes.

@mboes
Contributor

mboes commented Jun 19, 2015

Though we should be using kill rather than exit, should we not?

@qnikst
Contributor Author

qnikst commented Jun 19, 2015

@mboes sorry for using the incorrect nickname. And yes, you are right, kill is the right solution here.

callLocal had the following flaws:

1. It could leak the worker process if the parent process was
   interrupted but not killed.
   Linking the worker to the parent wouldn't help.

2. If the parent was interrupted, `callLocal` would return
   control before the worker was terminated. This would make
   it slightly harder to synchronize with whatever task the
   worker is doing.

This patch fixes both issues.
@qnikst
Contributor Author

qnikst commented Jun 19, 2015

Fixed!

@facundominguez
Contributor

LGTM

qnikst added a commit that referenced this pull request Jun 22, 2015
Improve exception handling in callLocal.
@qnikst qnikst merged commit 13428ac into haskell-distributed:master Jun 22, 2015
@qnikst qnikst deleted the fix-call-local branch June 22, 2015 08:52
4 participants