-
Notifications
You must be signed in to change notification settings - Fork 98
Improve exception handling in callLocal. #180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
throwTo tid (e::SomeException) | ||
putMVar lock () | ||
release' fetchResult) | ||
release $ liftIO $ fetchResult |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since fetchResult
is interruptible there is no need to unmask exceptions. Then the mask in the catch
handler becomes unnecessary too.
throwTo
is an interruptible operation, so should be protected with an uninterruptible mask or the spawned thread could leak.
unmask
seems a better name than release
.
The @qnikst has mentioned to me an implementation of something similar to
which would run the action in the caller thread and in the given node. Does this sound reasonable? |
I don't understand. What's the point of What would remove the spaghetti IMO would be to use a typed channel instead of mvars. This would utilise STM instead and we could get rid of the whole mask/unmask dance we're doing here. Why is it impossible to handle asynchronous exceptions in the spawned process? I don't understand that at all. Surely we don't care what the worker is doing. We can surround that with a try and just use a channel typed as And what exactly do we think a worker should do if the parent gets an asynchronous exception and handles it? Why not just leave the caller unmasked (ie., let the parent die) and link as before, then asynchronous exceptions will propagate to the worker as expected. I think we're setting policy at the wrong level by handling exceptions here. Overall I think we're overcooking the broth a bit. I would also strongly suggest that this API, which quite useful, is totally subsumed by https://github.com/haskell-distributed/distributed-process-async. I'd prefer people use that API in general, since it decouples the caller from the worker completely and the exception handling behaviour is simple and unsurprising (if the worker dies, we get a failure record back, if the parent dies, the worker dies too) and |
the idea was that IO Thread will be associated with another Process (ProcessId and mailbox) for the worker livetime, that will solve isolation task. But I don't want to mix discussion about it with this PR.
No, the reason for MVars here is to have sane Exception behaviour and remove races, channels can't solve this problem.
Idea is the following, if you send an exception to "wrapper" then that exception should be transfered to spawned process, old code didn't implement it at all.
I don't understand it seems a discussion of a different problem..
Consider following code:
I'm expecting that another will send an exception to "wrapper" process, and this exception will be handled by "worker" and we will receive "Boo!".
This is wrong: if parent will handle and exception and will not die, then spawned process will continue to work, but his result will never be received by parent.
Can you elaborate, please?
The idea here is that worker and wrapper do not decoupled, on the contrary there are a very strong requirements on how exceptions should be handled so it will not be lost, and correct process handles it. Async gives another set of guarantees and slightly different behaviour that this code. |
I don't understand why an exception thrown to the parent should be thrown to the child/worker process. What purpose does that serve? I do understand why you're trying to avoid the caller loosing the result. Let's consider some code like this example = do
callLocal (runComplexThing >>= return . Just) `catch` (const $ return Nothing) Problem here is that if someone throws an exception to the caller, the worker (runComplexThing) will carry on and its result be will be lost. But that is exactly what I mean by setting policy. That code above chooses to potentially loose the result of
Why? How would using typed channels introduce races here? Why does the exception handling behaviour improve using MVars? The exception handling code we have in this PR is quite involved, whereas a channel based version seems like it would be much simpler to me. And the point I make about decoupling versus coupling is this: if you implement this using Async then you get exactly the exception handling semantics you've got here. But you could equally cancel if the parent gets an exception, such as exampleAsync = do
hAsync <- async $ task $ runComplexThing
res <- try (wait hAsync)
case res of
Nothing -> cancel hAsync >> return Nothing
Just dat -> return $ Just dat I cannot see where there is a race in the above async example and it seems much easier to follow. |
Tim, it seems that we are on the different pages about semantics that we want to see in So we have following questions about semantics, that we should agree upon:
I'd say that a. is just insane, b. - is more or less ok, but it leaves
I don't have strong opinion about this question, but for me temporary isolation of 'workers' Once we will be in agreement on What do you think about questions above? Is there a way to convince you that P.S. @facundominguez mentioned that |
If exceptions are not forwarded to |
That's the thing here. In the Erlang model (which we're trying to stick to here) the way that you handle this at all times is that if you want the child to exit then you use linking. How the parent and child handle exit signals is entirely up to them.
It is expected behaviour guys, because in erlang process isolation is a key principle. You only leak a process if you're expecting the child/worker to be bound to the lifetime of the parent, in which case linking is the correct approach. This idea that a spawned process has somehow leaked is, as I said earlier, very application specific. Look at how the generic rpc call works in erlang for example: do_call(Node, Request, Timeout) ->
Tag = make_ref(),
{Receiver,Mref} =
erlang:spawn_monitor(
fun() ->
%% Middleman process. Should be unsensitive to regular
%% exit signals.
process_flag(trap_exit, true),
Result = gen_server:call({?NAME,Node}, Request, Timeout),
exit({self(),Tag,Result})
end),
receive
{'DOWN',Mref,_,_,{Receiver,Tag,Result}} ->
rpc_check(Result);
{'DOWN',Mref,_,_,Reason} ->
%% The middleman code failed. Or someone did
%% exit(_, kill) on the middleman process => Reason==killed
rpc_check_t({'EXIT',Reason})
end. So the middleman catches exit signals (which is the equivalent to masking asynchronous exceptions) and the parent deals with the result. If the parent dies the middleman just continues. The actual worker in this case is a gen_server, to which the worker is making a blocking call. The assumption these APIs all make is that the worker is either a long running process (such as a gen_server) or that you'll explicitly link to the parent yourself to avoid leaking. If the parent catches an asynchronous exception, the parent can choose to cancel the async task. This is done (in d-p-async) by throwing an exception to a middleman process that isn't blocking on the actual task, which in turn kills the worker using an asynchronous exception. So unless the worker is masking it will die quickly. If the worker is masking, the caller (of cancel) can be made to block until the worker has exited, avoiding the leakage. That way, you avoid all these leaking problems to begin with. The relevant code in async works exactly this way: pollUntilExit wpid result' = do
r <- receiveWait [
match (\c@(CancelWait) -> kill wpid "cancel" >> return (Left c))
, match (\(ProcessMonitorNotification _ pid' r) ->
return (Right (pid', r)))
]
case r of
Left CancelWait
-> liftIO $ atomically $ putTMVar result' AsyncCancelled
Right (fpid, d)
| fpid == wpid
-> case d of
DiedNormal -> return ()
_ -> liftIO $ atomically $ putTMVar result' (AsyncFailed d)
| otherwise -> kill wpid "linkFailed" So now if you wish to, you can utilise that in exampleAsync: exampleAsync = do
hAsync <- async $ task $ runComplexThing
res <- try (wait hAsync)
case res of
Nothing -> cancelWait hAsync >> return Nothing
Just dat -> return $ Just dat
I don't understand this. In what way does exampleAsync = (async $ task $ runComplexThing) >>= try (wait hAsync) >>= maybe return (cancelWait hAsync) So anyway, I think choice of what to do with the worker if the parent gets an exception really depends on what the worker is doing. If it is preparing to launch nuclear missiles then we should kill it asap (and the async API is the best approach for that IMHO) whereas if it's just calculating a bill and not changing state anywhere, we can just let it run and ignore the lost result. Also, if the lifespan of the parent is important to the child then linking deals with that neatly.
That runs completely counter to the Erlang philosophy. If the two are to depend on one another, you must link them. This re-throwing of exceptions is not a normal behaviour in Erlang. That's not to say that it is wrong or not useful - indeed lots of Erlang code does this, but it's not the default expected behaviour.
If we give this function exactly the same semantics as the original |
But overall, I would really recommend that people use the async API for this kind of stuff. It's been carefully put together to cover all the nasty edge cases. |
I tend to agree that with using
Almost the same as you wrote
It was a first proposed thing. The second was about exceptions handling and it was a root of all additional complexity.. I'll try to investigate how call works it this case and report you back. |
I did not know that. Oops!
That's probably ok, though I don't think you should use waitForResult `onException` exit workerPid >>= waitForResult |
The idea here (in Seems we are converging on how things should look like here :). So will it be ok if we will say: |
So don't we mean
Yep, I think we're getting there. :)
That sounds reasonable to me. Just to confirm... When you say "incoming exception will be re-thrown" do you mean re-thrown into the child thread? Or re-thrown in the parent? I would prefer that we kill the child using doCall child `catch` (exit child) Or some such. Does that make sense? |
My bad, I mean
I mean resthrown in parent. I.e. (don't know how to express that
I didn't think about management/tracing, so killing with exit looks reasonable. So I'd say we want:
However there is one more thing that I'm worry about, assume some process killing our with call to @facundominguez please also check our discussion as you understand constrains for |
This is almost enough. In our use cases we also need the parent to actually wait for the worker to die. |
Ok, so we want this... callLocal = do
blah blah...
doCall child `onException` (waitForExit child << exit child)
where
waitForExit child = do
mRef <- monitor child
receiveWait [ matchMonitor mRef ] |
That's not correct. You can catch |
Yes you are right I have confused So additional guarantee is that we want:
So the last question we need to agree upon is if this guarantee is sane for general case? |
I have improved |
where | ||
waitForExit child = do | ||
mRef <- monitor child | ||
receiveWait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unmonitor if there is another exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure that I understood..
b5b41e2
to
b892b69
Compare
parent <- getSelfPid | ||
mv <- liftIO newEmptyMVar :: Process (MVar (Either SomeException a)) | ||
child <- spawnLocal $ mask $ \release -> do | ||
link parent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linking is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, fixed.
b892b69
to
bb618f1
Compare
either (throwIO :: SomeException -> IO a) return | ||
|
||
callLocal proc = mask_ $ do | ||
parent <- getSelfPid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parent
is unused.
79d79b1
to
cd9dff2
Compare
Tests, everything except latest that should be handled separately, I think: |
a05ea06
to
c26928d
Compare
rebased version. I think now everything is ok. |
Reopened request, closed accidentally as I was thinking it was my temporary branch with no PR. |
waitForExit child = do | ||
bracket (monitor child) | ||
(unmonitor) | ||
(\mRef -> receiveWait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we use the MVar mv
here? receiveWait
can leak the monitor notification if it is interrupted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that interrupt on exception in receiveWait
will have correct semantics here, this means that we want thread to exit immediately we could just send another exception. However, I'm not insist on that.
@hyperthunk for the record, the statement you saw above is incorrect. |
Yes, I was not correct there, but it's still not correct to use |
Here's a version we agree on chat. I cannot update the PR because the branch is in the @qnikst's repo. callLocal :: Process a -> Process a
callLocal proc = mask $ \release -> do
mv <- liftIO newEmptyMVar :: Process (MVar (Either SomeException a))
child <- spawnLocal $ try (release proc) >> liftIO . putMVar mv
rs <- liftIO (takeMVar mv) `onException`
(exit child "exception in parent process" >> takeMVar mv)
either throwIO return rs |
@facundominguez @mboes I've updated code (with a version like in comment above, module fixing errors). |
Looks much better to me. 👍 |
Though we should be using |
@mboes sorry, for using incorrect nickname, and yes, you are right |
callLocal had the following flaws: 1. It could leak the worker process if the parent process was interrupted but not killed. Linking the worker to the parent wouldn't help. 2. If the parent was interrupted, `callLocal` would return the control before the worker is terminated. This would make slightly harder to synchronize with whatever task the worker is doing. This patch fixes both issues.
Fixed! |
LGTM |
Improve exception handling in callLocal.
callLocal had few flaws,
exception but handles it. Then process was not killed, so
linking worker to parent didn't help
process.
This this patch both things are solved alltogether, this patch
as a correct mechanism for exceptin handling, so if worker was
not finished then all exceptions are rethrown to that process.
In this case worker is capable for exception handling. However
if worker already finished, then exception in thrown from the
callLocal code. The resulting semantics is the same as wheb
code is executed in a single process.