[VC-35738] Use errgroup for all go routines#601
Conversation
| group, gctx := errgroup.WithContext(ctx) | ||
| defer func() { | ||
| // TODO: replace Fatalf log calls with Errorf and return the error | ||
| cancel() | ||
| if err := group.Wait(); err != nil { | ||
| logs.Log.Fatalf("failed to wait for controller-runtime component to stop: %v", err) | ||
| } | ||
| }() |
There was a problem hiding this comment.
This code was cut and pasted earlier in the file so that the errgroup can be initialized and used for the management server (/metrics, /healthz, /debug/pprof) and for the VenafiConnection manager; both of which were previously run in unsupervised goroutines.
a6d67f4 to
0cb53e6
Compare
| case <-gctx.Done(): | ||
| return | ||
| case <-time.After(config.Period): | ||
| } |
There was a problem hiding this comment.
If any of the errgroup go routines exit (with nil or error) the main context will be cancelled, which will cause this blocking loop to exit immediately instead of waiting for the time period.
| // The agent must stop if the management server stops | ||
| cancel() | ||
| return nil | ||
| }) |
There was a problem hiding this comment.
In a future PR we should add a signal handler to catch sigterm (ctrl-c) and cancel the parent context, and add another Go routine which calls server.Stop after the context is cancelled.
As it stands, sigterm will cause the process (all threads) to immediately exit without any graceful shutdown.
0cb53e6 to
54dc42e
Compare
| // TODO(wallrj): Pass a context to gatherAndOutputData, so that we don't | ||
| // have to wait for it to finish before exiting the process. |
There was a problem hiding this comment.
This can be done in another PR.
Instead of using log.Fatal in some of them Signed-off-by: Richard Wall <richard.wall@venafi.com>
54dc42e to
864edd3
Compare
| } | ||
| }() | ||
| // The agent must stop if the management server stops | ||
| cancel() |
There was a problem hiding this comment.
Instead of calling cancel, can we return an error when the server is stopped unexpectedly early?
Normally when the context is not canceled, the server should keep running, and when the context is canceled; we do not have to call cancel.
There was a problem hiding this comment.
Also, when we return an error, the group context will be automatically canceled.
There was a problem hiding this comment.
I think I agree, that an error should be returned.
Something like fmt.Errorf("the API server stopped unexpectedly").
But I'll come back to that in another PR.
This cancel will work for now.
tfadeyi
left a comment
There was a problem hiding this comment.
lgtm
In the a future PR it might be good to also move the main server definition in a different function/file, and just have the server's start and stop calls in the go routines.
| return fmt.Errorf("failed to start %q data gatherer %q: %v", kind, dgConfig.Name, err) | ||
| } | ||
| // The agent must stop if any of the data gatherers stops | ||
| cancel() |
There was a problem hiding this comment.
This caused a bug.
Most of the DataGatherer.Run implementations return immediately.
But one (Dynamic) runs an informer which blocks until the supplied channel is closed.
By calling cancel here, I stop all the errgroup go routines including the VenafiConnection helper which is needed later when the data gets posted to Venafi.
As a result, the POST hangs and the agent gets stuck.
jetstack-secure/pkg/datagatherer/local/local.go
Lines 41 to 44 in 1f00f09
jetstack-secure/pkg/datagatherer/k8s/discovery.go
Lines 50 to 53 in 1f00f09
jetstack-secure/pkg/datagatherer/k8s/dynamic.go
Lines 263 to 287 in 1f00f09
There was a problem hiding this comment.
This is part of the effort to remove the use of the legacy
logmodule and replace it withklog.Instead of creating free goroutines and relying on
log.Fatalto terminate the process if any of them fail,this PR instead adds those goroutines to the existing
errgroupand replaces those instances oflog.Fatalwith returned errors which can be bubbled up to the parent Cobra command and handled centrally, in #599.These changes have been extracted from #593 for easier review.
Testing
Occupy the server listening port and observe that the process exits with a meaningful error message.