Stop machine monitoring on Session shutdown #90
base: master
Conversation
exec/session.go
Outdated
```diff
@@ -272,6 +272,8 @@ var statusMu sync.Mutex

 func (s *Session) run(ctx context.Context, calldepth int, funcv *bigslice.FuncValue, args ...interface{}) (*Result, error) {
 	location := "<unknown>"
+	runContext, runContextCancel := context.WithCancel(ctx)
+	defer runContextCancel()
```
It seems like this will cause machine management to stop when this cancellation runs on return from this method, as this context is now the context passed to `(*sliceMachine).Go`. This is undesirable, as `Session`s can be used to run multiple invocations.

I think we want to continue to log the `lost machine...` message as long as the session has not been shut down, as task results can be reused by future invocations, and it's reasonably informative to know that they will need to be recomputed. I think we should look to only clean up the messages that folks see after session shutdown, as lost tasks have no observable effect at that point.
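A minimal sketch of the behavior being suggested here, using hypothetical names (`shutdownc`, `logLostMachine`, and the addresses are illustrative, not bigslice's actual identifiers): keep emitting the lost-machine message while the session is alive, and suppress it once shutdown has begun.

```go
package main

import (
	"errors"
	"log"
)

// Hypothetical sketch, not bigslice's actual code: log lost machines while the
// session is still live, but stay quiet once shutdownc (closed on session
// shutdown) has been closed, since lost tasks have no observable effect then.
func logLostMachine(shutdownc <-chan struct{}, addr string, err error) {
	select {
	case <-shutdownc:
		// Session has shut down; suppress the message.
	default:
		log.Printf("lost machine %s: %v", addr, err)
	}
}

func main() {
	shutdownc := make(chan struct{})
	logLostMachine(shutdownc, "machine-0", errors.New("connection reset")) // logged
	close(shutdownc)
	logLostMachine(shutdownc, "machine-1", errors.New("connection reset")) // suppressed
}
```

The point of the sketch is only the gating: the same log call keeps firing during normal operation and becomes a no-op after shutdown.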
Alright. This brings up another issue, however. Do you think it is alright to store a context on the session struct? (I earlier found "Do not store Contexts inside a struct type" in the documentation.) If not, do you think there is another way to do this? The only other way I could think of is to require the user to create their own context and pass it to each session method.
One possibility is to keep a channel that we close on shutdown.
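A minimal sketch of that idea, with hypothetical names (`session`, `shutdownc`, and `newSession` are illustrative, not the real bigslice `Session`): the session owns a channel, `Shutdown` closes it, and long-lived goroutines watch the channel instead of a context stored on the struct.

```go
package main

import "fmt"

// Hypothetical sketch of the close-on-shutdown channel pattern.
type session struct {
	shutdownc chan struct{} // closed exactly once, on shutdown
}

func newSession() *session {
	return &session{shutdownc: make(chan struct{})}
}

// Shutdown signals every watcher that the session is done.
func (s *session) Shutdown() {
	close(s.shutdownc)
}

func main() {
	s := newSession()
	done := make(chan struct{})
	go func() {
		<-s.shutdownc // a long-lived worker waiting for shutdown
		fmt.Println("worker stopping")
		close(done)
	}()
	s.Shutdown()
	<-done
}
```

This sidesteps storing a context on the struct while still giving long-lived goroutines a shutdown signal they can select on.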
I've just put up a channel-based version of this. Let me know what you think.
I think the problem is unresolved. My reading is as follows:

1. `(*Session).run` now creates a `Context`, which I'll call `ctx` henceforth. It is cancelled when `(*Session).run` returns.
2. `(*Session).run` passes `ctx` to `Eval`.
3. `Eval` passes `ctx` to `executor.Run`. Suppose `executor` is a `*bigmachineExecutor`.
4. `(*bigmachineExecutor).Run` passes `ctx` to `(*bigmachineExecutor).manager`.
5. `(*bigmachineExecutor).manager` passes `ctx` to `(*machineManager).Do`.
6. `(*machineManager).Do` passes `ctx` to `startMachines`.
7. `startMachines` passes `ctx` to `(*sliceMachine).Go`.

When `(*Session).run` returns, both:

- `(*machineManager).Do` will return, thereby no longer managing machines.
- `(*sliceMachine).Go` will return (for all machines).

Neither of these things should happen, as the `*Session` needs to continue to be usable.
As of the current diff, `(*sliceMachine).Go` returns based on a channel which is closed on session shutdown. Pushing up a diff now which will make `(*machineManager).Do` exit based on this channel as well.
I've left the context plumbed through, but it won't deactivate those processes when it's closed.
When the context is cancelled, the channel returned by `ctx.Done()` will be closed, so the `select` will unblock. When we check the `for`-loop condition, `ctx.Err()` will be non-`nil` and `(*sliceMachine).Go` will return. Could you help me understand how this reasoning is invalid?
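For reference, a self-contained sketch of the loop shape being described (illustrative only; `run` and `eventc` are invented names, not the actual `(*sliceMachine).Go` body): cancellation unblocks the `select` via `ctx.Done()`, and the `for`-loop condition then observes a non-`nil` `ctx.Err()` and returns.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// run models a loop like the one under discussion: it blocks in a select and
// re-checks ctx.Err() as the for-loop condition, so cancelling ctx both
// unblocks the select and terminates the loop.
func run(ctx context.Context, eventc <-chan string) error {
	for ctx.Err() == nil {
		select {
		case <-ctx.Done():
			// Cancelled: the next loop-condition check sees ctx.Err() != nil.
		case ev := <-eventc:
			fmt.Println("handled", ev)
		}
	}
	return ctx.Err()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	eventc := make(chan string)
	go func() {
		time.Sleep(10 * time.Millisecond)
		cancel() // simulate shutdown
	}()
	fmt.Println(run(ctx, eventc)) // prints "context canceled"
}
```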
Ah, I forgot about the for-loop condition; I was just looking at the select at the bottom. I've just updated it to use the background context and double-checked on adhoc.
Is there still a reason for changing anything about context usage? If not, let's not change anything related to context. If so, could you explain?
No, there is no reason anymore. The only change I've left in is moving the cancel from `Eval` to `Session.run()`, since it cleans up the call to `maintainSliceGroup`.
Why touch anything related to the contexts at all? Is it fixing some issue?
Just a refactor, trying to leave the code a bit nicer than I found it (no contexts will be canceled at any time different than before). I can revert those changes as well if you aren't comfortable with them.
In general, I would try to do any refactoring in a separate PR, as it makes it clear that there should be no behavior change and allows us to roll things back more easily. In this specific case, I don't think it's actually the right thing to do.
Thanks for the explanation. I've removed those changes.
Relatedly, the code is a bit difficult to reason about because we've now got functions that have both a context and a channel that affect control flow. I think we should instead look to make the interaction between these mechanisms as clear as possible. I also think this implementation still has a race, albeit one that's probably not too harmful given current implementations. Specifically, on shutdown we both call the …

Here's another possible approach. Instead of returning …:

```go
type bigmachineExecutor struct {
	...
	shutdownc  <-chan struct{}
	machinesWG sync.WaitGroup // Things we should wait for before bigmachine shutdown.
}

func (b *bigmachineExecutor) Start() func() {
	...
	shutdownc := make(chan struct{})
	b.shutdownc = shutdownc
	return func() {
		close(shutdownc)
		b.machinesWG.Wait() // Address the race.
		b.b.Shutdown()
	}
}

func (b *bigmachineExecutor) manager(i int) *machineManager {
	...
	// Bridge to contexts.
	ctx, cancel := context.WithCancel(backgroundcontext.Get())
	go func() {
		<-b.shutdownc
		cancel()
	}()
	b.machinesWG.Add(1)
	go func() {
		defer b.machinesWG.Done()
		b.managers[i].Do(ctx)
	}()
}
```

Then when we run the manager loops, we can continue to manage them with a context, e.g. something like:

```go
type machineManager struct {
	...
	machinesWG sync.WaitGroup
}

func (m *machineManager) Do(ctx context.Context) {
	...
	machines := startMachines(ctx, ...) // Lift the call to Go out of this; see below.
	for _, machine := range machines {
		m.machinesWG.Add(1)
		go func() {
			defer m.machinesWG.Done()
			machine.Go(ctx) // Clearly attach the machine lifetime to the manager loop.
			...
		}()
	}
	...
}
```

Note that I haven't attempted to compile any of this code, so caveat emptor. This seems quite a bit cleaner to me, though.

Let me know what you think, as perhaps I'm missing something.
This is much cleaner, thank you. I've just pushed these changes.
Co-authored-by: Jaran Charumilind <src@me.jcharum.com>
With the code as-is, we will see the machine-lost logs on shutdown immediately, meaning users will see this more often now. Should we do something else to hide those before landing this?
Creates a context for each invocation of Session.run and cancels it once the run completes. Also threads this context through to all calls down the chain.
Took another look and I think doing things this way will work out fine. There are still some bigmachine logs that linger after shutdown is called, which I can clean up as well; I'll follow up with another PR there.
Still WIP, adding tests.