Add health endpoint #73

iffyio · 2020-07-16T11:18:25Z

#65 includes an HTTP server with an endpoint for serving metrics. We want to reuse this server to include an endpoint for health checks.

markmandel · 2021-02-22T21:05:22Z

Pretty sure this is done, yeah? @iffyio

iffyio · 2021-02-23T09:28:03Z

It hasn't been worked on yet. Come to think of it, I'm not sure what a health check of the proxy would look like, but at least the server is still a metrics server in the code rather than a generic admin server.

markmandel · 2021-02-23T15:59:34Z

Weird, I swore we did. Clearly I'm working on too many things at the same time.

markmandel · 2021-03-15T23:29:44Z

Thought I'd take a look at this one too, even if it was just something super basic we could expand upon later, and had a couple of thoughts:

If we make the metrics port configurable as per #101, that will come under admin.metrics.port. If we want the health endpoint to run on the same port as the metrics, should we change the config to something more generic?

Or should the health endpoint run on its own port, and be configured through something like admin.liveness.port, which could also be expanded as needed over time?

That is easier config, and also easier code, because from review - passing around the hyper server may end up being more tricky than it's worth - otherwise it may be worth moving back to warp. I'm not sure if you would even want to separate access to a health check endpoint from metrics -- but maybe someone will?

Thoughts?

iffyio · 2021-03-16T07:48:32Z

I think we can have them all under the same admin server/port. It'll likely be simpler, and we don't want to create a server for each use case either.

markmandel · 2021-03-18T23:59:19Z

> Come to think of it, I'm not sure what a health check of the proxy would look like, but at least the server is still a metrics server in the code rather than a generic admin server

We can use:
https://doc.rust-lang.org/beta/std/panic/fn.set_hook.html

To track if a panic has occurred. If it has, we should probably (a) log it 😄 but also (b) respond on the health endpoint that we are unhealthy, since we don't know what part of the system has been broken.

Sound good?

Down the line, we can add extra specific checks to it, but I think it's a good start.
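
A minimal sketch of the panic-tracking idea above, using only the standard library. The names `HEALTHY`, `install_panic_hook`, and `is_healthy` are hypothetical and not taken from the Quilkin code base. Chaining to the previously installed hook is an assumption made here so that the panic message is still printed.

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Hypothetical process-wide flag; the panic hook flips it to false.
static HEALTHY: AtomicBool = AtomicBool::new(true);

/// Installs a panic hook that marks the process unhealthy, then defers
/// to the previously installed hook so the panic is still reported.
fn install_panic_hook() {
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        HEALTHY.store(false, Ordering::SeqCst);
        previous(info);
    }));
}

/// Returns false once any thread has panicked since the hook was installed.
fn is_healthy() -> bool {
    HEALTHY.load(Ordering::SeqCst)
}

fn main() {
    install_panic_hook();
    assert!(is_healthy());

    // A panic on a spawned thread does not abort the process,
    // but the hook still runs and flips the flag.
    let worker = thread::spawn(|| panic!("worker failed"));
    assert!(worker.join().is_err());

    assert!(!is_healthy());
}
```

A health endpoint could then consult `is_healthy()` when deciding what to report.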

iffyio · 2021-03-19T07:14:30Z

I don't see panicking as a proper use case for a health check. I think if we panic, we should let the program crash as usual and restart rather than handle it specially.

markmandel · 2021-03-19T16:32:30Z

> I don't see panicking as a proper use case for a health check. I think if we panic, we should let the program crash as usual and restart rather than handle it specially

Ah, that's a good point. I'm showing my Kubernetes bias 😄 In that case, we should definitely add a `set_hook` to power that: if a panic happens within a `tokio::spawn`, it's not going to crash the system. It could just exit the worker loop that it happened within (and most of our `tokio::spawn` operations start worker loops).

In which case, I'll just set up a simple /live endpoint that returns 200, which lets Kubernetes (and others) check that the process is at least actively responding. We can later expand it to include more internal checks as needed.

Probably makes sense to have the panic hook in the same health module as well, just so that all healthiness operations sit in the same place.

I've got the admin server split pretty much working for #101 over in https://github.com/googleforgames/quilkin/tree/mm/admin-server -- will probably see how this part fits into it, and then start taking it apart and submitting PRs.
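
A rough sketch of a `/live` endpoint that returns 200, written against `std::net` only so that it runs without dependencies. It is a stand-in, not the hyper-based admin server discussed in this thread. `respond`, `handle`, `serve`, and `get` are hypothetical helpers, and the `ok` body and the 404 for other paths are assumptions.

```rust
use std::io::{BufRead, BufReader, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Picks a minimal HTTP/1.1 response for the given request line.
/// Only `GET /live` is recognised; everything else gets a 404.
fn respond(request_line: &str) -> &'static str {
    if request_line.starts_with("GET /live ") {
        "HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok"
    } else {
        "HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"
    }
}

/// Handles one connection: read the request head, write the response.
fn handle(stream: TcpStream) -> std::io::Result<()> {
    let mut reader = BufReader::new(stream);
    let mut request_line = String::new();
    reader.read_line(&mut request_line)?;

    // Drain the remaining headers up to the blank line that ends them.
    let mut header = String::new();
    loop {
        header.clear();
        let read = reader.read_line(&mut header)?;
        if read == 0 || header == "\r\n" || header == "\n" {
            break;
        }
    }

    reader.get_mut().write_all(respond(&request_line).as_bytes())
}

/// Serves connections one at a time for as long as the listener accepts them.
fn serve(listener: TcpListener) {
    for stream in listener.incoming() {
        if let Ok(stream) = stream {
            // Per-connection I/O errors are ignored in this sketch.
            let _ = handle(stream);
        }
    }
}

/// Tiny client used to exercise the endpoint; returns the raw response.
fn get(addr: SocketAddr, path: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    write!(
        stream,
        "GET {} HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n",
        path
    )?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}

fn main() -> std::io::Result<()> {
    // Port 0 asks the OS for any free port.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    thread::spawn(move || serve(listener));

    let response = get(addr, "/live")?;
    assert!(response.starts_with("HTTP/1.1 200 OK"));
    Ok(())
}
```

A liveness probe only needs the status line, so the body here is incidental.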

iffyio · 2021-03-19T16:47:02Z

Ah, that makes sense. We can have `set_hook` e.g. log the stack trace and mark the proxy as unhealthy, or something like that, to cover panicking from other threads, yeah.

Create a `Health` module which tracks if a panic occurs anywhere in the code base (which may or may not be on the main thread), and moves the system to unhealthy. In the future we could add extra checks to this module as we discover more things that impact proxy health. Closes #73

iffyio added the area/operations, good first issue, and help wanted labels Jul 17, 2020

markmandel mentioned this issue Mar 30, 2021

Add /live health endpoint to admin server #221

Merged

markmandel closed this as completed in #221 Mar 30, 2021
