
Stop machine monitoring on Session shutdown #90

Open
awiss wants to merge 17 commits into master

Conversation

@awiss (Contributor) commented Jul 10, 2020

Creates a context for each invocation of Session.run and cancels it once the run completes. Also threads this context through to all calls down the chain.

Took another look and I think doing things this way will work out fine. There are still some bigmachine logs that linger after shutdown is called, which I can clean up as well; I'll follow up with another PR there.

Still WIP, adding tests.

@awiss requested a review from jcharum — July 10, 2020 03:39
exec/bigmachine.go (outdated, resolved)
exec/session.go (outdated)
@@ -272,6 +272,8 @@ var statusMu sync.Mutex
 
 func (s *Session) run(ctx context.Context, calldepth int, funcv *bigslice.FuncValue, args ...interface{}) (*Result, error) {
 	location := "<unknown>"
+	runContext, runContextCancel := context.WithCancel(ctx)
+	defer runContextCancel()
jcharum (Contributor):

It seems like this will cause machine management to stop when this cancellation is executed on return from this method, as this context is now the context passed to (*sliceMachine).Go. This is undesirable, as Sessions can be used to run multiple invocations.

I think we want to continue to log the lost machine... message as long as the session has not been shut down, as task results can be reused by future invocations, and it's reasonably informative to know that they will need to be recomputed. I think we should look to only clean up the messages that folks see after session shutdown, as lost tasks have no observable effect at that point.

awiss (Contributor, Author):

Alright, but this brings up another issue. Do you think it is alright to store a context on the Session struct? (I earlier found "Do not store Contexts inside a struct type" in the documentation.) If not, do you think there is another way to do this? The only other way I could think of is to require the user to create their own context and pass it to each session method.

jcharum (Contributor):

One possibility is to keep a channel that we close on shutdown.
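A minimal sketch of this shutdown-channel pattern, with hypothetical names (not code from this PR):

package exec

import "time"

// Session holds a channel that is closed exactly once, on shutdown.
type Session struct {
	shutdownc chan struct{}
}

func newSession() *Session {
	return &Session{shutdownc: make(chan struct{})}
}

// Shutdown signals all listeners: a closed channel unblocks every receive.
func (s *Session) Shutdown() {
	close(s.shutdownc)
}

// monitor runs until the session is shut down.
func (s *Session) monitor() {
	for {
		select {
		case <-s.shutdownc:
			return // session shut down; stop monitoring
		case <-time.After(time.Second):
			// periodic work, e.g. machine health checks
		}
	}
}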

awiss (Contributor, Author):

I've just put up a channel based version of this. Let me know what you think.

jcharum (Contributor) — Jul 14, 2020:

I think the problem is unresolved. My reading is as follows:

  1. (*Session).run now creates a Context, which I'll call ctx henceforth. It is cancelled when (*Session).run returns.
  2. (*Session).run passes ctx to Eval.
  3. Eval passes ctx to executor.Run. Suppose executor is a *bigmachineExecutor.
  4. (*bigmachineExecutor).Run passes ctx to (*bigmachineExecutor).manager.
  5. (*bigmachineExecutor).manager passes ctx to (*machineManager).Do.
  6. (*machineManager).Do passes ctx to startMachines.
  7. startMachines passes ctx to (*sliceMachine).Go.

When (*Session).run returns, both:

  • (*machineManager).Do will return, thereby no longer managing machines.
  • (*sliceMachine).Go will return (for all machines).

Neither of these things should happen, as the *Session needs to continue to be usable.

awiss (Contributor, Author):

As of the current diff, (*sliceMachine).Go returns based on a channel which is closed on session shutdown. I'm pushing up a diff now which will make (*machineManager).Do exit based on this channel as well.

awiss (Contributor, Author):

I've left the context plumbed through, but it won't deactivate those processes when it's cancelled.

jcharum (Contributor):

When the context is cancelled, the channel returned by ctx.Done() will be closed, so the select will unblock. When we check the for-loop condition, ctx.Err() will be non-nil and (*sliceMachine).Go will return. Could you help me understand how this reasoning is invalid?
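The loop shape under discussion looks roughly like this (a paraphrase for illustration, not the actual (*sliceMachine).Go source; eventc is a hypothetical stand-in for the machine's other event sources):

package exec

import "context"

type sliceMachine struct {
	eventc chan struct{} // hypothetical stand-in for other event sources
}

func (m *sliceMachine) Go(ctx context.Context) error {
	for ctx.Err() == nil { // non-nil once ctx is cancelled, so the loop exits
		select {
		case <-ctx.Done(): // closed on cancellation; unblocks the select
		case <-m.eventc:
			// handle a machine event
		}
	}
	return ctx.Err()
}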

awiss (Contributor, Author):

Ah, I forgot about the for-loop condition; I was just looking at the select at the bottom. I've updated it to use the background context and double-checked on adhoc.

jcharum (Contributor) left a comment:

Is there still a reason for changing anything about context usage? If not, let's not change anything related to context. If so, could you explain?

@awiss (Contributor, Author) commented Jul 14, 2020

No, there is no reason anymore. The only change I've left in is moving the cancel from Eval to Session.run(), since it cleans up the call to maintainSliceGroup.

@jcharum (Contributor) commented Jul 14, 2020

> No, there is no reason anymore. The only change I've left in is moving the cancel from Eval to Session.run(), since it cleans up the call to maintainSliceGroup.

Why touch anything related to the contexts at all? Is it fixing some issue?

@awiss (Contributor, Author) commented Jul 14, 2020

> Why touch anything related to the contexts at all? Is it fixing some issue?

Just a refactor, trying to leave the code a bit nicer than I found it (no contexts will be cancelled at a different time than before). I can revert those changes as well if you aren't comfortable with them.

@jcharum (Contributor) commented Jul 14, 2020

> Why touch anything related to the contexts at all? Is it fixing some issue?

> Just a refactor, trying to leave the code a bit nicer than I found it (no contexts will be cancelled at a different time than before). I can revert those changes as well if you aren't comfortable with them.

In general, I would try to do any refactoring in a separate PR, as it makes it clear that there should be no behavior change and allows us to roll things back more easily.

In this specific case, I don't think it's actually the right thing to do. Eval makes a context that it cancels on return, because this context is used by all the goroutines that Eval spawns to monitor the tasks with which it is concerned. Eval internally wants to stop these goroutines on return, as the work they are doing is no longer necessary. With the proposed change, this control is moved out to the caller, which means that those goroutines will persist until the caller decides to cancel ctx. In the case of (*Session).run, that happens to be immediately after Eval returns, but there's nothing that requires that; e.g. another call of Eval would leave the context uncancelled for meaningfully longer.
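Roughly the pattern being defended (a sketch for illustration, not the actual Eval source):

package exec

import (
	"context"
	"time"
)

// Eval owns the goroutines it spawns, so it owns their cancellation too:
// cancelling on return stops the monitors exactly when their work becomes
// unnecessary, regardless of when the caller cancels its own ctx.
func Eval(ctx context.Context) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stop the monitoring goroutines as soon as Eval returns

	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	go func() {
		for {
			select {
			case <-ctx.Done():
				return // Eval returned (or caller cancelled); stop monitoring
			case <-ticker.C:
				// poll task state ...
			}
		}
	}()

	// ... evaluate tasks ...
	return nil
}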

@awiss (Contributor, Author) commented Jul 14, 2020

Thanks for the explanation. I've removed those changes.

@awiss changed the title from "Use a unique context for each call of Session.Run" to "Stop machine monitoring on Session shutdown" — Jul 14, 2020
@jcharum (Contributor) commented Jul 15, 2020

(Executor).Start already returns a function that is meant to be called to shut down the Executor. This PR exposes a second way. Between the two, I prefer the existing one, as I think having a function to call is less awkward than a channel to close to signal shutdown. I think it's better to leave the mechanism of shutdown entirely up to the Executor implementation.

Relatedly, the code is a bit difficult to reason about because we've now got functions that have both a context and a channel that affect control flow. I think we should instead look to make the interaction between these mechanisms as clear as possible.

I also think this implementation still has a race, albeit one that's probably not too harmful given current implementations. Specifically, on shutdown we both call the bigmachine shutdown and close the channel. I don't think anything guarantees that the close will be handled before the transition to bigmachine.Stopped is handled. I think we're just lucky right now that bigmachine takes its time before machines transition, for whatever reason.

Here's another possible approach. Instead of returning b.b.Shutdown directly from (*bigmachineExecutor).Start, do something like:

type bigmachineExecutor struct {
	...
	shutdownc  <-chan struct{}
	managersWG sync.WaitGroup // Manager loops we should wait for before bigmachine shutdown.
}

func (b *bigmachineExecutor) Start() func() {
	...
	shutdownc := make(chan struct{})
	b.shutdownc = shutdownc
	return func() {
		close(shutdownc)
		b.managersWG.Wait() // Address the race.
		b.b.Shutdown()
	}
}

func (b *bigmachineExecutor) manager(i int) *machineManager {
	...
	// Bridge to contexts.
	ctx, cancel := context.WithCancel(backgroundcontext.Get())
	go func() {
		<-b.shutdownc
		cancel()
	}()
	b.managersWG.Add(1)
	go func() {
		defer b.managersWG.Done()
		b.managers[i].Do(ctx)
	}()
Then when we run the manager loops, we can continue to manage them with a context, e.g. something like:

type machineManager struct {
	...
	machinesWG sync.WaitGroup
}

func (m *machineManager) Do(ctx context.Context) {
	...
	machines := startMachines(ctx, ...) // Lift the call to Go out of this; see below.
	for _, machine := range machines {
		machine := machine // capture the loop variable for the goroutine
		m.machinesWG.Add(1)
		go func() {
			defer m.machinesWG.Done()
			machine.Go(ctx) // Clearly attach the machine lifetime to the manager loop.
			...
		}()
	}
	...
}

Note that I haven't attempted to compile any of this code, so caveat emptor. This seems quite a bit cleaner to me though:

  • No change to Executor interface. I think the change only involves two files.
  • Continue consistent pattern of control loop methods just taking a context for external cancellation control.
  • Very limited scope for the channel signaled on shutdown.
  • Correct handling of the race mentioned above, no matter what we do with the (*bigmachine.B).Shutdown implementation.

Let me know what you think, as perhaps I'm missing something.

@awiss (Contributor, Author) commented Jul 16, 2020

This is much cleaner, thank you. I've just pushed these changes.

Resolved review threads (outdated): exec/bigmachine.go, exec/slicemachine.go (×4)
awiss and others added 2 commits — July 20, 2020 13:33
Co-authored-by: Jaran Charumilind <src@me.jcharum.com> (both commits)
@awiss (Contributor, Author) commented Jul 21, 2020

With the code as is, we will see the machine-lost logs immediately on shutdown, meaning users will see them more often now. Should we do something else to hide those before landing this?
