When gateway discovery is enabled, the liveness probe should be 👍 even if the adminapi service has 0 ready endpoints #3592

Closed
1 of 2 tasks
mflendrich opened this issue Feb 21, 2023 · 2 comments · Fixed by #3654

@mflendrich
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Problem Statement

Today, KIC (with GW discovery enabled) won't report the liveness probe as "healthy" if the adminapi service has 0 ready endpoints.

It does not make sense to restart the KIC pod when it is the Gateways that are down.

Proposed Solution

  • In the non-GW-discovery case, the liveness probe DOES require the adminapi endpoint to be up
  • In the GW-discovery case, the liveness probe becomes "healthy" if:
    • the adminapi service is watchable
    • the k8s API is reachable
    • it does not matter how many ready endpoints the adminapi service has (can be 0)

Currently the last bullet point does not hold.
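
The proposed semantics can be written down as a small decision function. This is only a sketch in Go: the `probeInputs` type, its fields, and the `live` function are hypothetical names made up for illustration and are not taken from the KIC codebase.

```go
package main

import "fmt"

// probeInputs is a hypothetical snapshot of controller state at probe time.
// None of these field names are taken from the KIC codebase.
type probeInputs struct {
	gatewayDiscovery      bool // GW discovery enabled
	adminAPIUp            bool // non-discovery case: the adminapi endpoint responds
	k8sAPIReachable       bool // discovery case: the k8s API can be connected to
	adminServiceWatchable bool // discovery case: the adminapi service can be watched
	readyAdminEndpoints   int  // discovery case: informational only, may be 0
}

// live applies the semantics listed under Proposed Solution.
func live(in probeInputs) bool {
	if !in.gatewayDiscovery {
		// Without discovery, the adminapi endpoint must be up.
		return in.adminAPIUp
	}
	// With discovery, readyAdminEndpoints is deliberately not consulted.
	return in.k8sAPIReachable && in.adminServiceWatchable
}

func main() {
	zeroEndpoints := probeInputs{
		gatewayDiscovery:      true,
		k8sAPIReachable:       true,
		adminServiceWatchable: true,
		readyAdminEndpoints:   0,
	}
	fmt.Println("discovery, 0 ready endpoints:", live(zeroEndpoints))
	fmt.Println("no discovery, adminapi down:", live(probeInputs{}))
}
```

With these inputs, the discovery case with 0 ready endpoints is live, while the non-discovery case with the adminapi endpoint down is not.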

Additional information

No response

Acceptance Criteria

  • the semantics under Proposed Solution are implemented for the liveness probe
@mflendrich mflendrich added this to the KIC v2.9.0 milestone Feb 21, 2023
@rainest
Contributor

rainest commented Feb 22, 2023

I feel kinda iffy on this because this is a case where failing, even if failing in a rather non-specific way, probably makes sense. It's a bit of a "yes, technically code can try to handle it, but if you do manage to wind up in this situation it's more a 'tell whoever did it not to do that and explain why' scenario".

While the controller does enter crash loop backoff if you start it when no Kong instances will ever become ready, that's probably okay. If there are no Kong instances ready, we can make KIC become live regardless, but KIC won't be able to actually do anything in that state. KIC will happily go live and then do nothing forever, because until you fix the lack of ready Kong instances, there's nothing for KIC to push to.

Documentation and examples should avoid this. We're saying "if you're using discovery, deploy your Kong instance and point KIC to it" as the happy path for discovery mode, and we expect Kong instances to come online under normal circumstances. While you could deploy KIC and a Kong Service with no Kong Deployment behind that Service, or with a broken Kong Deployment behind it, that's a bit of a contrived situation where you know it'll break in a particular way. We, the application authors, know that you can create this situation, but it doesn't feel like something end users would naturally do on their own--absent evidence that users are taking that strange path, we can reasonably expect that most won't.

Crash loop backoff is a reasonable approach for handling odd situations. Kong may fail to come online quickly and send the controller into backoff, but backoff isn't dead, it's just increasingly delayed retries. Hypothetically, you may install KIC and a Kong Service with no live endpoints, take an hour-long lunch break, come back, create live Kong endpoints, and then puzzle over why KIC doesn't instantly start pushing configuration, but that seems somewhat unlikely. Even if you do wait an hour, backoff will eventually restart KIC and KIC will come online successfully on its own. In practice I'd expect users to maybe initially install a broken state, recognize it's broken shortly after, and then either ask for assistance or redo the entire thing to get back into a happy state.
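
To put rough numbers on "increasingly delayed retries", here is a sketch in Go of the restart delays. It assumes the kubelet defaults described in the Kubernetes documentation (10s initial delay, doubling on each restart, capped at five minutes); those values are an assumption about the cluster and are not taken from this thread, and `restartDelaySeconds` is a hypothetical helper.

```go
package main

import "fmt"

const (
	baseDelaySeconds = 10  // assumed initial kubelet restart delay
	maxDelaySeconds  = 300 // assumed cap (five minutes)
)

// restartDelaySeconds returns the assumed delay before the next restart
// after the container has already been restarted `restarts` times.
func restartDelaySeconds(restarts int) int {
	delay := baseDelaySeconds
	for i := 0; i < restarts; i++ {
		delay *= 2
		if delay >= maxDelaySeconds {
			return maxDelaySeconds
		}
	}
	return delay
}

func main() {
	for i := 0; i < 7; i++ {
		fmt.Print(restartDelaySeconds(i), " ")
	}
	fmt.Println()
	// prints: 10 20 40 80 160 300 300
}
```

Under those assumptions, a KIC pod that has been in backoff for an hour is retried at most about five minutes after Kong endpoints become ready.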

Do we indeed have stories where we expect this scenario will likely happen in practice, and where we definitely need code logic to recover from it automatically? This feels like a situation that is within the realm of technical possibility, but where realistically your environment is so borked anyway that we don't necessarily need targeted automatic recovery. Hitting CrashLoopBackOff here is a reasonable response: "yes, your environment is broken, you need to fix several things to make it not broken, and having done so you've probably re-rolled your KIC Deployment anyway".

@mflendrich mflendrich changed the title When gateway discovery is enabled, the liveness probe should be 👍 even if the adminapi service has 0 ready endpoints When gateway discovery is enabled, the liveness probe should be 👍 even if the adminapi service has 0 ready endpoints Feb 22, 2023
@randmonkey randmonkey self-assigned this Feb 24, 2023
@randmonkey
Contributor

randmonkey commented Mar 1, 2023

Considering the behavior when KIC gets 0 endpoints of the Kong admin service while initializing the Kong clients, we have 2 options:

  • Let KIC wait for available endpoints to appear. This differs little from the current behavior of letting KIC crash: the only difference is whether KIC retries continuously or kubelet restarts the KIC pods. This does not make much sense.
  • Modify the components to receive changes to the gateway clients. This includes modifying the initialization of DBMode (which affects whether leader election is needed), the configuration in dataplane/sendconfig, and the Kong version (used in sendconfig and in telemetry reports).
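
A rough sketch of what the second option could look like, in Go. The `clientSet` and `configSender` types and their methods are hypothetical and do not exist in KIC under these names; the point is only that a component is told about endpoint changes instead of reading them once at initialization.

```go
package main

import "fmt"

// clientSet is a hypothetical holder of discovered admin API endpoints.
type clientSet struct {
	endpoints   []string
	subscribers []func(endpoints []string)
}

// subscribe registers a callback that runs on every endpoint change.
func (c *clientSet) subscribe(fn func(endpoints []string)) {
	c.subscribers = append(c.subscribers, fn)
}

// update stores a copy of the new endpoints and notifies all subscribers.
func (c *clientSet) update(endpoints []string) {
	c.endpoints = append([]string(nil), endpoints...)
	for _, fn := range c.subscribers {
		fn(c.endpoints)
	}
}

// configSender is a hypothetical stand-in for a component such as the one
// in dataplane/sendconfig that needs to know the current gateway clients.
type configSender struct {
	targets []string
}

// onEndpoints replaces the sender's targets with the latest endpoints.
func (s *configSender) onEndpoints(endpoints []string) {
	s.targets = endpoints
}

// canPush reports whether there is at least one gateway to push to.
func (s *configSender) canPush() bool {
	return len(s.targets) > 0
}

func main() {
	cs := &clientSet{}
	sender := &configSender{}
	cs.subscribe(sender.onEndpoints)

	fmt.Println("at startup:", sender.canPush())
	cs.update([]string{"https://10.0.0.1:8444"}) // made-up address
	fmt.Println("after an endpoint appears:", sender.canPush())
}
```

The same notification could carry the Kong version and DB mode, which is where the larger changes mentioned in the second option would come in.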

Actually, the Kong version can change while KIC is running: customers may upgrade Kong gateway, after which the Kong version differs from the version seen at initialization. Maybe this should be considered together with #3590.
