When gateway discovery is enabled, the liveness probe should be 👍 even if the adminapi service has 0 ready endpoints #3592

mflendrich · 2023-02-21T16:50:29Z

Is there an existing issue for this?

I have searched the existing issues

Problem Statement

Today KIC (with GW discovery enabled) won't make the liveness probe "healthy" if the proxy service has 0 endpoints.

It does not make sense to restart the KIC pod if it is the Gateways that are down.

Proposed Solution

In the non-GW-discovery case, the liveness probe DOES require the adminapi endpoint to be up.
In the GW-discovery case, the liveness probe becomes "healthy" if:
- the adminapi service is watchable
- it is possible to connect to the k8s API
- it does not matter how many ready endpoints the adminapi service has (can be 0)

Currently the last bullet point does not hold.
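
To make the intended semantics concrete, here is a minimal sketch of the decision as a pure function. It is only an illustration, not KIC code: `ProbeInputs`, its fields, and `is_live` are hypothetical names, and the non-discovery branch models only the one condition stated above.

```python
from dataclasses import dataclass


@dataclass
class ProbeInputs:
    """Hypothetical snapshot of what the controller can observe."""
    gateway_discovery: bool        # is GW discovery enabled?
    k8s_api_reachable: bool        # can we connect to the k8s API?
    adminapi_watchable: bool       # can we watch the adminapi service?
    ready_adminapi_endpoints: int  # may legitimately be 0 with discovery


def is_live(p: ProbeInputs) -> bool:
    """Return True if the liveness probe should report healthy."""
    if p.gateway_discovery:
        # Discovery mode: the number of ready endpoints is deliberately ignored.
        return p.k8s_api_reachable and p.adminapi_watchable
    # Non-discovery mode: the adminapi endpoint must be up.
    return p.ready_adminapi_endpoints > 0


# Discovery mode with 0 ready endpoints should still be live.
print(is_live(ProbeInputs(True, True, True, 0)))
# Non-discovery mode with 0 ready endpoints should not be live.
print(is_live(ProbeInputs(False, True, True, 0)))
```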

Additional information

No response

Acceptance Criteria

The semantics under Proposed Solution are implemented for the liveness probe.

rainest · 2023-02-22T01:03:48Z

I feel kinda iffy on this because this is a case where failing, even if failing in a rather non-specific way, probably makes sense--it's a bit of a "yes, technically code can try to handle it, but probably if you do manage to wind up in this situation it's more a 'tell whoever did it not to do that/explain why' scenario".

While the controller does enter crashloop backoff if you start it when no Kong instances will ever become ready, that's probably okay. If there are no Kong instances ready, we can make KIC become live despite that, but KIC won't be able to actually do anything in that state. KIC will happily go live and then do nothing forever, because until you fix the lack of ready Kong instances, there's nothing for KIC to push to.

Documentation and examples should avoid this. We're saying "if you're using discovery, deploy your Kong instance and point KIC to it" as the happy path for discovery mode, and we expect Kong instances to come online under normal circumstances. While you could deploy KIC and a Kong Service, but no Kong Deployment for that Service, or a broken Kong Deployment for that, that's a bit of a contrived situation where you know it'll break in a particular way. We, the application authors, know that you can create this situation, but it doesn't feel like something end users would naturally do on their own--absent evidence that users are taking that strange path, we can reasonably expect that most won't.

Crash loop backoff is a reasonable approach for handling odd situations. Kong may fail to come online quickly and send the controller into backoff, but backoff isn't dead, it's just increasingly delayed retries. Hypothetically, you may install KIC and a Kong Service with no live endpoints, take an hour lunch break, come back, create live Kong endpoints, and then puzzle over why KIC doesn't instantly start pushing configuration, but that seems somewhat unlikely. If you decide to wait an hour, backoff will eventually restart KIC and KIC will come online successfully on its own. In practice I'd expect users to maybe initially install a broken state, recognize it's broken shortly after, and then either ask for assistance or redo the entire thing to get back into a happy state.
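
For a rough sense of what "increasingly delayed retries" means, here is a small sketch of the restart delays. It assumes the kubelet defaults described in the Kubernetes documentation (delays of 10s, 20s, 40s, and so on, capped at 300 seconds) and ignores how long each attempt runs before failing. `backoff_delays` is a hypothetical helper written for this comment, not a Kubernetes API.

```python
from typing import List


def backoff_delays(restarts: int, base_s: int = 10, cap_s: int = 300) -> List[int]:
    """Delay in seconds before each of the first `restarts` restarts.

    Assumes exponential doubling from `base_s`, capped at `cap_s`.
    """
    return [min(base_s * 2 ** i, cap_s) for i in range(restarts)]


delays = backoff_delays(7)
print(delays)       # [10, 20, 40, 80, 160, 300, 300]
print(sum(delays))  # 910 seconds of waiting across the first seven restarts
```

Under those assumptions, once the delay has reached the cap, the wait between creating live Kong endpoints and the next KIC restart is at most about five minutes.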

Do we indeed have stories where we expect this scenario will likely happen in practice, and where we definitely need code logic to recover from it automatically? This feels like a situation that is within the realm of technical possibility, but where realistically your environment is so borked anyway that we don't necessarily need targeted automatic recovery. Hitting CrashLoopBackoff here is a reasonable "yes, your environment is broken, you need to fix several things to make it not broken, and having done so you've probably re-rolled your KIC Deployment anyway".

randmonkey · 2023-03-01T08:45:16Z

Considering the behavior when KIC gets 0 endpoints of the Kong admin service while initializing Kong clients, we now have two options:

- Let KIC wait for available endpoints to appear. This differs little from the current behavior of letting KIC crash; the only difference is whether KIC retries continuously or the kubelet restarts the KIC pods. This does not make much sense.
- Modify the components to receive changes to gateway clients. This includes modifying the initialization of DBMode (which affects whether leader election is needed), the configuration in dataplane/sendconfig, and the Kong version (used in sendconfig and also in telemetry reports).
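
As a rough illustration of the first option only, waiting for endpoints amounts to a retry loop like the sketch below. This is not how KIC implements it: `wait_for_endpoints`, the `discover` callback, and the example address are all hypothetical.

```python
import time
from typing import Callable, List, Optional


def wait_for_endpoints(
    discover: Callable[[], List[str]],
    interval_s: float = 1.0,
    max_attempts: Optional[int] = None,  # None means retry without limit
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll `discover` until it returns at least one endpoint."""
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        endpoints = discover()
        if endpoints:
            return endpoints
        attempt += 1
        sleep(interval_s)
    raise TimeoutError(f"no admin API endpoints after {attempt} attempts")


# Example: the hypothetical endpoint shows up on the third poll.
results = iter([[], [], ["10.0.0.1:8444"]])
print(wait_for_endpoints(lambda: next(results), sleep=lambda _: None))
```

With `max_attempts=None` the process never exits on its own, so the retries happen inside KIC instead of through kubelet restarts.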

Actually, the Kong version may change while KIC is running: customers may upgrade Kong gateway, after which the Kong version will differ from the initial version. Maybe this should be considered together with #3590.

mflendrich added this to the KIC v2.9.0 milestone Feb 21, 2023

mflendrich changed the title ~~When gateway discovery is enabled, the liveness probe should be 👍 even if the adminapi service has 0 ready endpoints~~ When gateway discovery is enabled, the liveness probe should be 👍 even if the adminapi service has 0 ready endpoints Feb 22, 2023

randmonkey self-assigned this Feb 24, 2023

mflendrich mentioned this issue Feb 27, 2023

Support single controller deployments #702

Closed

12 tasks

randmonkey mentioned this issue Mar 3, 2023

fix: unlimit retry of getting gateway admin API endpoint and move to standalone healthz server #3654

Merged

1 task

randmonkey closed this as completed in #3654 Mar 7, 2023
