Skip to content

Conversation

lorenzo
Copy link
Contributor

@lorenzo lorenzo commented Sep 4, 2019

Description

Implements graceful shutdown for websockets. When the sockets are not actively receiving data they will receive a closing message and the server will promptly shut down.

Affected components

  • Server

Solution and Design

The general design I followed:

  • Keep the knowledge of how to close connections inside the Websocket
    module
  • Add a field to WsServer to hold a channel, the channel signals when
    the application is stopping
  • Return a callback to Main that can be invoked to send a message to
    the channel

Steps to test and verify

  • Start the server
  • Use gql with any subscription
  • Send a TERM signal to the process. It should stop immediately

The general design I followed:

* Keep the knowledge of how to close connections inside the Websocket
  module
* Add a field to `WsServer` to hold a channel, the channel signals when
  the appliction is stopping
* Return a callback to `Main` that can be invoked to send a message to
  the channel
@lorenzo lorenzo requested a review from lexi-lambda as a code owner September 4, 2019 15:27
@netlify
Copy link

netlify bot commented Sep 4, 2019

Deploy preview for hasura-docs ready!

Built with commit c35b47e

https://deploy-preview-2827--hasura-docs.netlify.com

@hasura-bot
Copy link
Contributor

Review app for commit dc2906b deployed to Heroku: https://hge-ci-pull-2827.herokuapp.com
Docker image for server: hasura/graphql-engine:pull2827-dc2906b7

@marionschleifer marionschleifer added the c/community Related to community content label Sep 4, 2019
Copy link
Contributor

@lexi-lambda lexi-lambda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this generally looks great to me, though I left one small comment.

@@ -435,6 +435,14 @@ initErrExit e = do
<> T.unpack (qeError e)
exitFailure

data HasuraApp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for pulling this out.

@lorenzo
Copy link
Contributor Author

lorenzo commented Sep 5, 2019

@lexi-lambda added your suggestions, thanks for the review!

@hasura-bot
Copy link
Contributor

Review app for commit b64cda2 deployed to Heroku: https://hge-ci-pull-2827.herokuapp.com
Docker image for server: hasura/graphql-engine:pull2827-b64cda21

lorenzo and others added 2 commits September 5, 2019 11:29
Before it was not rejecting new connection. By moving the logic one
layer up, we're now able to reject connections as they come in.
@lorenzo
Copy link
Contributor Author

lorenzo commented Sep 5, 2019

@lexi-lambda I figured that my implementation was not entirely correct as new WS connections were not being rejected properly. By moving the logic one layer above I was able to achieve that goal.

@hasura-bot
Copy link
Contributor

Review app for commit d15cf86 deployed to Heroku: https://hge-ci-pull-2827.herokuapp.com
Docker image for server: hasura/graphql-engine:pull2827-d15cf86e

Copy link
Contributor

@lexi-lambda lexi-lambda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes. I’m not opposed to the new approach, but I’ll admit: I don’t understand it (and I’d like to make sure I do). I’d have thought that closing the socket would be sufficient to prevent new websocket connections, since, well… how could new connections be received at all if the server isn’t listening anymore? Can you explain to me why that wasn’t enough?

@hasura-bot
Copy link
Contributor

Review app for commit cdcc5bf deployed to Heroku: https://hge-ci-pull-2827.herokuapp.com
Docker image for server: hasura/graphql-engine:pull2827-cdcc5bf6

@lorenzo
Copy link
Contributor Author

lorenzo commented Sep 5, 2019

@lexi-lambda That's a great question and I was also mystified at how new connections were apparently making their way in. My working theory right now is that those connections were already accepted before closing the socket but had not been yet handled by the Websockets server code. Those still in transit connections would apparently cause a race condition where they were accepted and then served as live queries after the call to closing all connections was already made.

I'm actually not sure how to validate my theory as true, but it seemed to solve an edge case when rapidly trying to create new WS connections as I send the TERM signal. Does this make any sense at all to you, sorry for not being more clear about this myself?

@lexi-lambda
Copy link
Contributor

My working theory right now is that those connections were already accepted before closing the socket but had not been yet handled by the Websockets server code. Those still in transit connections would apparently cause a race condition where they were accepted and then served as live queries after the call to closing all connections was already made.

Aha, that makes sense to me, thanks for explaining. The guard does seem necessary, then, but I have a couple small requests:

  1. Could you add support for shutting the server down to Hasura.GraphQL.Transport.WebSocket.Server instead of Hasura.GraphQL.Transport.WebSocket? The former module is somewhat poorly named, as it doesn’t really have anything to do with GraphQL, but I think this functionality belongs there, since it’s really agnostic to what the server is being used for.

    The approach I have in mind is to add a _wssIsShuttingDown :: !(TVar Bool) field to WSServer (and though this is subjective, I think a Bool is fine here; I think introducing a different sum type is premature), then add a shutdownWSServer export.

  2. I think it would be nice to return an HTTP 503 Service Unavailable response here rather than HTTP 500 Internal Server Error, since it’s more descriptive.

    I would also include a Retry-After: 0 header in the response, since in a cluster of Hasura instances, we’d like the client to retry connecting immediately. (Unfortunately, AFAIK no browsers will actually automatically retry the request even if the header is present, so it probably isn’t actually that useful, but it feels like good HTTP hygiene.)

@lorenzo
Copy link
Contributor Author

lorenzo commented Sep 6, 2019

  1. ; I think introducing a different sum type is premature)

That was actually my first attempt and I immediately confused myself by writing the function assertAcceptingConnection True = ... did True mean accepting connections or the opposite? (it was the opposite)

In the light that it took 5 seconds for me to get boolean blindness I resolved to make a cheap sum type

  1. The approach I have in mind is to add a _wssIsShuttingDown :: !(TVar Bool) field to WSServer

You may remember that was the first version of this PR you reviewed. I got confused by the module naming thinking that it actually implemented a server-like structure, or that at least the code handling the very beginnings of a connection happened there.

I think that those 2 modules should swap names: Hasura.GraphQL.Transport.WebSocket is actually implementing the server (accepts connections, rejects, checks for the right route, and actaully create a server-like struct which in the case of WAI is a curried function expecting a connection)

I will attempt to put the logic back there in the best way it makes sense to me now. Let me know if this is what you had in mind.

@hasura-bot
Copy link
Contributor

Review app for commit 66356fd deployed to Heroku: https://hge-ci-pull-2827.herokuapp.com
Docker image for server: hasura/graphql-engine:pull2827-66356fd4

@lorenzo
Copy link
Contributor Author

lorenzo commented Sep 6, 2019

@lexi-lambda can you please review again?

@hasura-bot
Copy link
Contributor

Review app for commit e408213 deployed to Heroku: https://hge-ci-pull-2827.herokuapp.com
Docker image for server: hasura/graphql-engine:pull2827-e408213e

This new design prevents accepting new WS connections while we are in
the process of shutting the server down. We enforce this by guarding the
connections map in a TVar that we need to check for every time we need
to use the connections map.
@lorenzo
Copy link
Contributor Author

lorenzo commented Sep 6, 2019

@lexi-lambda Alright, I tried your suggestion and really liked how much safer the code looks like now. I had to do a few changes in order to keep things atomic, but I'm feeling more confident now that this is the right design.

@hasura-bot
Copy link
Contributor

Review app for commit 48a96ec deployed to Heroku: https://hge-ci-pull-2827.herokuapp.com
Docker image for server: hasura/graphql-engine:pull2827-48a96ecc

@lorenzo lorenzo requested a review from lexi-lambda September 6, 2019 19:22
@hasura-bot
Copy link
Contributor

Review app for commit d377c9a deployed to Heroku: https://hge-ci-pull-2827.herokuapp.com
Docker image for server: hasura/graphql-engine:pull2827-d377c9a5

@hasura-bot
Copy link
Contributor

Review app for commit 0f7bd96 deployed to Heroku: https://hge-ci-pull-2827.herokuapp.com
Docker image for server: hasura/graphql-engine:pull2827-0f7bd96b

@hasura-bot
Copy link
Contributor

Review app for commit 768baf6 deployed to Heroku: https://hge-ci-pull-2827.herokuapp.com
Docker image for server: hasura/graphql-engine:pull2827-768baf6a

@lorenzo
Copy link
Contributor Author

lorenzo commented Sep 9, 2019

@lexi-lambda can you please review again?

@hasura-bot
Copy link
Contributor

Review app for commit c35b47e deployed to Heroku: https://hge-ci-pull-2827.herokuapp.com
Docker image for server: hasura/graphql-engine:pull2827-c35b47e5

Copy link
Contributor

@lexi-lambda lexi-lambda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me; many thanks for your patience working out the subtleties!

@lorenzo
Copy link
Contributor Author

lorenzo commented Sep 9, 2019

Thanks @lexi-lambda I’m actually glad that you were so thorough, I’m happy with his more robust solution.

@lexi-lambda lexi-lambda merged commit 5609fba into hasura:master Sep 9, 2019
@hasura-bot
Copy link
Contributor

Review app https://hge-ci-pull-2827.herokuapp.com is deleted

@lorenzo lorenzo deleted the graceful-shutdown branch September 9, 2019 20:36
polRk pushed a commit to polRk/graphql-engine that referenced this pull request Feb 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/community Related to community content
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants