Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add health endpoint #73

Closed
iffyio opened this issue Jul 16, 2020 · 9 comments · Fixed by #221
Closed

Add health endpoint #73

iffyio opened this issue Jul 16, 2020 · 9 comments · Fixed by #221
Labels
area/operations Installation, updating, metrics etc good first issue Good for newcomers help wanted Extra attention is needed

Comments

@iffyio
Copy link
Collaborator

iffyio commented Jul 16, 2020

#65 Includes a http server with endpoint for serving metrics. We want to reuse this server to include an endpoint for health checks

@iffyio iffyio added area/operations Installation, updating, metrics etc good first issue Good for newcomers help wanted Extra attention is needed labels Jul 17, 2020
@markmandel
Copy link
Member

Pretty sure this is done, yeah? @iffyio

@iffyio
Copy link
Collaborator Author

iffyio commented Feb 23, 2021

It hasn't been worked on yet. coming to think of it not sure what a health check of the proxy would look like, but at least the server is still a metrics server in the code rather than a generic admin server

@markmandel
Copy link
Member

Weird, I swore we did. Clearly I'm working on too many things at the same time.

@markmandel
Copy link
Member

Thought I'd take a look at this one too - even if was just something super basic we could expand upon later, and had a couple of thoughts:

If we make the metric port configurable as per #101 - that will come under admin.metrics.port - if we want the health endpoint to run on the same port as the metrics, should we change the config to something more generic?

Or, should the health endpoint run on it's own port? And be configured through something like admin.liveness.port, which could also be expanded as needed over time.

That is easier config, and also easier code, because from review - passing around the hyper server may end up being more tricky than it's worth - otherwise it may be worth moving back to warp. I'm not sure if you would even want to separate access to a health check endpoint from metrics -- but maybe someone will?

Thoughts?

@iffyio
Copy link
Collaborator Author

iffyio commented Mar 16, 2021

I think we can have them all under the same admin server/port. it'll likely be simpler and we don't want to create a server for each use case either

@markmandel
Copy link
Member

coming to think of it not sure what a health check of the proxy would look like, but at least the server is still a metrics server in the code rather than a generic admin server

We can use:
https://doc.rust-lang.org/beta/std/panic/fn.set_hook.html

To track if a panic has occurred. If it has, we should probably (a) log it 😄 but also (b) respond on the health point that we are unhealthy, since we don't know what part of the system has been broken.

Sound good?

Down the line, we can extra specific checks to it, but I think it's a good start.

@iffyio
Copy link
Collaborator Author

iffyio commented Mar 19, 2021

I don't see panicking being a proper use case for health check, I think if we panic we should let the program crash as usual and restart rather than handle it specially

@markmandel
Copy link
Member

markmandel commented Mar 19, 2021

I don't see panicking being a proper use case for health check, I think if we panic we should let the program crash as usual and restart rather than handle it specially

Ah that's a good point. I'm showing my Kubernetes bias 😄 in which case, we should definitely add a set_hook to power that, as if a panic happens within a tokio::spawn it's not going to crash the system, it could just exit the worker loop that it happened within (and most of our tokio::spawn operations start worker loops).

In which case, I'll just setup a simple /live endpoint that returns 200, which can allow Kubernetes (and others) to check if the process is at least actively responding. We can later expand it to include more internal checks as needed.

Probably makes sense to have the panic hook in the same health module as well, just so that all healthiness operations sit in the same place.

I've got the admin server split pretty much working for #101 over in https://github.com/googleforgames/quilkin/tree/mm/admin-server -- will probably see how this parts fits into it, and then start taking it apart and submitting PRs.

@iffyio
Copy link
Collaborator Author

iffyio commented Mar 19, 2021

Ah that makes sense, we can have set_hook e.g log the stack trace and mark the proxy as unhealthy or something like that to cover panicking from other threads yeah

markmandel added a commit that referenced this issue Mar 30, 2021
Create a `Health` module which tracks if a panic occurs
anywhere in the code base (which may or may not be on the main thread),
and moves the system to unhealthy.

In the future we could add extra checks to this module as we discover
more things that impact proxy health.

Closes #73
markmandel added a commit that referenced this issue Mar 30, 2021
Create a `Health` module which tracks if a panic occurs
anywhere in the code base (which may or may not be on the main thread),
and moves the system to unhealthy.

In the future we could add extra checks to this module as we discover
more things that impact proxy health.

Closes #73
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/operations Installation, updating, metrics etc good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants