Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpc shutdown and child process handling #88

Merged
merged 9 commits into from
Dec 12, 2018
Merged

Conversation

jbardin
Copy link
Member

@jbardin jbardin commented Nov 30, 2018

This PR's main goal is to fix the shutdown process of grpc plugins, which would always timeout.
In the process we also clean up the subprocess output handling, which speeds up the start/stop of the child process. This broke some tests that were dependent on the shutdown taking more time, but those were patched with a slight delay and can be fixed with a more robust solution at a later time.

The grpc server couldn't exit at all, because the Serve method was waiting on a channel that was never closed. The Serve method is what should be closing the channel to notify when the server is done.

The close signal is implemented by adding a small grpc service (GRPCController) with a Shutdown method to indicate that the process should exit. There is still something blocking a GracefulStop attempt, but the implementation can be extended to try a graceful stop with a timeout once that is resolved.

The GRPCController and GRPCBroker protobuf definitions were moved into an internal package, which prevents the generated code from polluting the public API.

Copy link
Contributor

@mitchellh mitchellh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests pass and the changes look good to me.

The tests didn't fail BEFORE this patch though which makes me feel weird... is there a way we can exercise that failure to verify this change? I'm okay merging before that just to get the fix in, but would be really great to get a failing test before that is fixed after.

All the sleeps also feel kind of weird.

@jbardin
Copy link
Member Author

jbardin commented Dec 11, 2018

Yes, let me get a failing test for the before case. I don't like the sleeps either, but they were just inserted until I had time to refactor the tests (which I do now that the next TF alpha is out).

I'll update the tests in this PR before merging.

@jbardin jbardin force-pushed the jbardin/grpc-shutdown branch 2 times, most recently from 26e1831 to fa01c2f Compare December 11, 2018 20:18
@jbardin
Copy link
Member Author

jbardin commented Dec 11, 2018

Rebased on the latest master. This is closer now, but I've still managed to induce some odd ordering during shutdown, so there will be at least one more commit incoming.

The Serve method was waiting on a channel that was never closes. The
Serve method is what should be closing the channel to notify when the
server is done.

The stdrr read loop only checks for EOF, so if there's another type of
error, it ends up in a busy loop.
The reattach proess poller runs on a 1s ticker, which means it can
possibly take almost 2 seconds to get a response.
This uses proper os.File file descriptors for stdin and stdout rather
than io.Pipe, removing extra copy goroutines from each, and
automatically closes the fds when necessary.

Don't scan with slices from stdout, as they are handled concurrently and
the backing array may be re-used.
Simplify the goroutine management by removing multiple channels and
explicitly waiting for _all_ client goroutines to return on Kill().

Make sure Context cancellation funcs are called in all paths.

Process termination it detected using the existing cancellation context.
This makes detection of early termination and termination during Kill
use the same mechanism, rather than the latter relying to the
side-effect of the logging channel being closed.

Add a flag to mark when a process was forcefully killed for tests. (this
addition makes the pre-refactor grpc tests fail, to validate some of the
PR changes)
Unlike the netRPC server, the gRPC server cannot terminate when the
client closes its connection. Since we want to use the server as a
plugin with control over its lifecycle, there needs to be another
channel of communication to notify the server when to shutdown.

This adds a small grpc service, currently with only a Shutdown method,
which calls Stop on the GRPCServer. This will allow the server to
shutdown normally without the client timing out and calling Kill on the
process.
Adds a record of how the protobuf code was built, and keeps it inline
with the controller code.
Remove the protobuf files from the plugin package, to avoid polluting
the public API.
@jbardin jbardin force-pushed the jbardin/grpc-shutdown branch from ffafcca to cd0df79 Compare December 11, 2018 21:39
@jbardin
Copy link
Member Author

jbardin commented Dec 11, 2018

All set now. Cleaned up the temporary commits in the middle, and all tests all pass under -race too 🏎 🏎

Copy link
Contributor

@mitchellh mitchellh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this add the failing test w/o this PR?

Looks a lot cleaner though.

@jbardin
Copy link
Member Author

jbardin commented Dec 12, 2018

Sorry, the failing test change was squashed into the rest of the test fixes, since it was pretty ugly to rebase in before the rest of the commits. Any of the grpc tests that call killed() will fail prior to the shutdown changes.

@jbardin jbardin merged commit f444068 into master Dec 12, 2018
@jbardin jbardin deleted the jbardin/grpc-shutdown branch December 12, 2018 15:08
endocrimes added a commit to hashicorp/nomad that referenced this pull request Dec 12, 2018
This fixes a bug related to shutting down of GRPC plugin interfaces
(more info: hashicorp/go-plugin#88)

This does not yet fix all test cases for subprocess leaking, but is a
useful independant change.
endocrimes added a commit to hashicorp/nomad that referenced this pull request Jan 8, 2019
This fixes a bug related to shutting down of GRPC plugin interfaces
(more info: hashicorp/go-plugin#88)

This does not yet fix all test cases for subprocess leaking, but is a
useful independant change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants