
Aliveness/Readiness Probe improvements #12292

Open
applike-ss opened this issue Mar 1, 2022 · 3 comments

Motivation

Currently we are facing the issue that our historicals are restarting every now and then (using Kubernetes).
This can have multiple causes, but one of them is unexpected exceptions in the Curator.
The scenario is that we have volumes already filled with segments and Druid starts to initialize/read them.
This takes longer than 60 seconds, so the kubelet kills the pod before the "SERVER" stage can be reached and the health route becomes available.

Proposed changes

I would like to propose a separate HTTP server that is already running in the "NORMAL" stage and serves routes such as:

  • /status/ready (which is only ready when all segments are loaded)
  • /status/alive

Another option would be to have a separate HTTP server serving the regular status route. In that case, however, I would suggest that the response include a JSON body with fields from which a Druid administrator can tell whether Druid is alive and ready. Again, I would treat missing segments (segments still to load) as a non-ready state. This might make a newly initializing historical look flaky, but I don't see that as an issue.
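
To make the second option more concrete, here is a minimal sketch (in Go) of the kind of check an operator-side probe could run against such a route. Everything in it is hypothetical: the /status/ready path is only what is proposed above, port 8083 is just the default historical port, and the JSON field names (alive, ready, segmentsToLoad) are invented for illustration rather than taken from any existing Druid API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// readyStatus mirrors the kind of JSON payload proposed above.
// All field names are hypothetical; the proposal only asks that the
// response expose enough information to tell "alive" apart from "ready".
type readyStatus struct {
	Alive          bool `json:"alive"`
	Ready          bool `json:"ready"`
	SegmentsToLoad int  `json:"segmentsToLoad"`
}

func main() {
	// The route does not exist yet; it is the one proposed in this issue.
	resp, err := http.Get("http://localhost:8083/status/ready")
	if err != nil {
		fmt.Fprintln(os.Stderr, "probe failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var status readyStatus
	if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
		fmt.Fprintln(os.Stderr, "could not decode response:", err)
		os.Exit(1)
	}

	// Treat pending segments as "not ready", as suggested above.
	if !status.Alive || !status.Ready || status.SegmentsToLoad > 0 {
		fmt.Printf("not ready: %+v\n", status)
		os.Exit(1)
	}
	fmt.Println("ready")
}
```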

Rationale

A discussion of why this particular solution is the best one. One good way to approach this is to discuss other alternative solutions that you considered and decided against. This should also include a discussion of any specific benefits or drawbacks you are aware of.

Operational impact

  • there shouldn't be any backwards incompatibilities
  • the old status route could either be kept or moved to the new HTTP server (the latter would introduce an incompatibility)
  • cluster operators would need to adjust their Helm charts to use the new route and parse the JSON

Fryuni commented Nov 9, 2022

Currently we are facing the issue that our historicals are restarting every now and then (using Kubernetes).
This can have multiple causes, but one of them is unexpected exceptions in the Curator.

This portion seems related to #13167

The scenario is that we have volumes already filled with segments and Druid starts to initialize/read them.
This takes longer than 60 seconds, so the kubelet kills the pod before the "SERVER" stage can be reached and the health route becomes available.

A /status/alive route that starts independently of the historical service would be somewhat redundant. The historical process already exits completely when there is a problem that would mark it as dead, and a failure that does not exit the program but leaves the process in an unrecoverable state would likely not be detected by a route that is decoupled from the problem.

One thing you can do is configure the existing /status/health route as a readiness and startup probe, but not as a liveness probe.
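
To illustrate, here is a minimal sketch of that setup expressed with the k8s.io/api Go types (the same fields you would otherwise write directly in the pod spec; recent versions embed ProbeHandler, older ones call that field Handler). The only part taken from the suggestion above is pointing both probes at /status/health and omitting the liveness probe; port 8083 (the default historical port) and all timing values are assumptions to tune for your cluster.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Both probes hit the existing /status/health route on the historical.
	// 8083 is the default Druid historical port; adjust if yours differs.
	health := corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/status/health",
			Port: intstr.FromInt(8083),
		},
	}

	// Startup probe: give the historical enough time to read its segment
	// cache before any other probe can get the pod killed.
	startup := &corev1.Probe{
		ProbeHandler:     health,
		PeriodSeconds:    10,
		FailureThreshold: 60, // up to ~10 minutes of startup time
	}

	// Readiness probe: keep the pod out of the service until it is healthy.
	readiness := &corev1.Probe{
		ProbeHandler:  health,
		PeriodSeconds: 10,
	}

	// Deliberately no liveness probe, per the comment above.
	fmt.Printf("startupProbe:   %+v\nreadinessProbe: %+v\n", startup, readiness)
}
```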

But I do like the idea of a route that only returns OK when there are no pending segments to be loaded or dropped. It would flip between error and OK while new segments are ingested, but that does fit the use cases I have in mind.

OliveBZH commented Mar 1, 2023

@Fryuni can you confirm that these restarts are fixed by #13167?
We are facing the same issue and are looking for the root cause.

Fryuni commented Mar 1, 2023

Yes, the updated Curator fixed #13161 for me, as I mentioned there.
We are running without a liveness probe, as I suggested above.
