Ingress Status: Improve checking for rollover or shutdown. #10686

nefelim4ag · 2023-11-28T15:47:34Z

What this PR does / why we need it:

Filter out pods in terminating state from pod discovery.
Reuse common labels filter for pod discovery between ReplicaSets.

Types of changes

Bug fix (non-breaking change which fixes an issue)

Which issue/s this PR fixes

fixes #10680 - incorrect handling of pod rollout without publishService
fixes #10681 - missing handling of pods in terminating state

How Has This Been Tested?

Build image, replace current 1.9.4 image with patched.
Make kubectl -n ingress-nginx rollout restart deployment ingress-nginx-controller for RS handling tests.
Patch LifeCycle.preStop to:

lifecycle:
  preStop:
    exec:
      command:
      - sh
      - -ec
      - sleep 180; kill 1

And monitoring ingress status for correct changes:

Pods from new ReplicaSets are shown in Ingress Status
Pods in terminating state are missing from Ingress Status

Checklist:

I've read the CONTRIBUTION guide
I have added unit and/or e2e tests to cover my changes.
All new and existing tests passed.

Current status_test.go are a little fragile and I currently struggle with just adding pods with new statuses and adding tests. But I'm working on it, any help will be appreciated.

Thanks!

netlify · 2023-11-28T15:47:41Z

✅ Deploy Preview for kubernetes-ingress-nginx canceled.

Name	Link
🔨 Latest commit	`8216d4a`
🔍 Latest deploy log	https://app.netlify.com/sites/kubernetes-ingress-nginx/deploys/66d5ae9e4bd27700087d3331

k8s-ci-robot · 2023-11-28T15:47:43Z

This issue is currently awaiting triage.

If Ingress contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2023-11-28T15:47:44Z

Welcome @nefelim4ag!

It looks like this is your first PR to kubernetes/ingress-nginx 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/ingress-nginx has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2023-11-28T15:47:45Z

Hi @nefelim4ag. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Gacko

Added some suggestions.

internal/ingress/status/status.go

Gacko · 2023-12-01T09:31:38Z

Looks great! Unfortunately I'm not eligible for approving controller PRs, yet. But if you really wanna go the extra mile, you could have a look at this issue and check if your PR can (at least partly) improve what's not working in there. I know, your change is focus on not having a published service, but maybe you can at least have a look. :)

nefelim4ag · 2023-12-01T15:41:04Z

@Gacko You are right, this is the same problem here.

But I'm not sure they are directly related to each other because helm upgrade path and drop of LB status must be handled by isRunningMultiplePods(). I need to spend some more time on get LB working and reproducing that issue with enabled debug.

UPD:
It is similar, someone adds helm version labels to the pod:

app.kubernetes.io/version: 1.8.4
helm.sh/chart: ingress-nginx-4.7.3

They are creating problems. As quick fix, it is possible to make them "common" and add them to the filter.

But right solution will be to make some more magic & rewrite that discovery logic completely.
Maybe always use publishService, or follow by ownerReferences in pod data and use selector from deployment/daemonset. I think that selector usage will not break anything to users, at least behavior will look similar.

Anyway, only maintainers have the right to choose a solution, and this is out of the scope of this PR (IMHO).
I can only append rules to the current filter.

Gacko · 2023-12-02T12:29:34Z

internal/ingress/status/status.go

+		if pod.GetDeletionTimestamp() != nil {
+			klog.InfoS("POD is terminating", "pod", klog.KObj(&pod), "node", pod.Spec.NodeName)
+			continue
+		}


Isn't the deletion timestamp set by the API server? So if a pod has finalizers or connections are still draining, I'm not sure if the controller is still running while already having the timestamp set.

It's sets equals to delete request time

When you create a resource using a manifest file, you can specify finalizers in the metadata.finalizers field. When you attempt to delete the resource, the API server handling the delete request notices the values in the finalizers field and does the following: Modifies the object to add a metadata.deletionTimestamp field with the time you started the deletion. Prevents the object from being removed until all items are removed from its metadata.finalizers field

Yes, I'm aware of that. But setting deletionTimestamp does not mean the pod is actually terminating as the actual shutdown is initiated by kubelet at last and, in case of ingress-nginx might take longer due to the preStopHook. So the container itself is unready at that point, which means it's not eligible for new connections, but still running and able to take over leadership until it's getting asked to terminate after either the preStopHook completion or timeout (240s IIRC).

Yes, but if Nginx doesn't have connections it will exit immediately, so the container itself becomes useless almost instantly.

I1203 00:53:44.834420 8 nginx.go:387] "Stopping admission controller" E1203 00:53:44.834468 8 nginx.go:326] "Error listening for TLS connections" err="http: Server closed" I1203 00:53:44.834473 8 nginx.go:395] "Stopping NGINX process" 2023/12/03 00:53:44 [notice] 310#310: signal process started I1203 00:53:45.867820 8 nginx.go:408] "NGINX process has stopped" I1203 00:53:45.867839 8 sigterm.go:44] Handled quit, delaying controller exit for 10 seconds

Because of this, I use sleep before sending signals to process.

Anyway, that is how service endpoints work. Pod with terminating status will be removed from endpoints, ignoring ready status. But because we emulating this logic in this specific case (publishService disabled). The same termination handling must be here, so the container can be in terminating status for some time, ready for the same amount of time (to process connections).

So, usage example for here, where it can have shortlived requests, and current logic will not work.

Amazon ELB - it awfully slow to update endpoints.

DNS, where LoadBalancer status is several A records, so I want to wait for record TTL before killing the pod.

Same for K8s services, because of "eventual consistency" but update of IPTables/IPVS requires several seconds, so sleep 15 everywhere makes internal networking work nice (of course it depends on the application, I'm talking about HTTP here).

I mean, this is not an Ingress nginx specific issue, this is how the k8s are designed from the ground up.

Termination handling eventually will be reimplemented everywhere, where pod discovery mimics service discovery logic.

The same issue was in External-dns for quite some time: kubernetes-sigs/external-dns#3447 (comment)

Gacko · 2023-12-03T14:13:39Z

/assign

nefelim4ag · 2023-12-04T21:12:47Z

Added tests, and now helm labels will be ignored, which can work as duct tape for:

Ingress Status: Wrong labels when checking for rollover or shutdown. #10475

I can squash/rebase on request, np.

Gacko · 2024-02-01T21:54:15Z

/retitle Ingress Status: Improve checking for rollover or shutdown.

nefelim4ag · 2024-05-07T10:55:17Z

Hello, how can I assist somehow with this PR?

k8s-ci-robot · 2024-07-01T09:51:32Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nefelim4ag
Once this PR has been reviewed and has the lgtm label, please ask for approval from gacko. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

longwuyuan · 2024-07-01T10:24:36Z

Is it feasible to do real tests on minikube and/or kind clusters to first show the problem and then to deploy controller with changed code to show the solution (as well as lack of impact on any other part of the controller from the change)
Squash commits

nefelim4ag · 2024-07-02T13:39:45Z

@longwuyuan I did not quite understand how I could provide that to someone (testing cluster), I used the patched version this whole time because it fixed my issues on my dev/prod clusters.
But my words are worth nothing, there are unit tests for that.

I described the problems (we talk about the usage of ingress nginx without LB, where this code applies):

Ingress Status: Wrong labels when checking for rollover or shutdown. #10475. Helm upgrade is 1.8.4 -> 1.9.1, or 1.9.1 -> 1.10.1 is a simple reproduction to check how the controller will not detect new nodes with new labels and not set them to ingress status.
Missing check for pod 'terminating' status.

There are images: https://gallery.ecr.aws/xflow-ltd/ingress-nginx, so 1.9.4-p2, 1.10.1-p1
Just in case someone needed them.

Yes, squash sounds good after that amount of time.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>

Co-authored-by: Marco Ebert <marco_ebert@icloud.com> Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 28, 2023

k8s-ci-robot requested review from strongjz and tao12345666333 November 28, 2023 15:47

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 28, 2023

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 28, 2023

k8s-ci-robot added needs-priority size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 28, 2023

nefelim4ag force-pushed the fix/internal-ingress-status-go branch 3 times, most recently from 8c0b5d5 to 9f092d3 Compare November 29, 2023 14:50

Gacko reviewed Nov 30, 2023

View reviewed changes

internal/ingress/status/status.go Outdated Show resolved Hide resolved

internal/ingress/status/status.go Outdated Show resolved Hide resolved

internal/ingress/status/status.go Outdated Show resolved Hide resolved

internal/ingress/status/status.go Outdated Show resolved Hide resolved

nefelim4ag force-pushed the fix/internal-ingress-status-go branch from fb259e1 to 55619e7 Compare December 1, 2023 02:03

Gacko reviewed Dec 2, 2023

View reviewed changes

k8s-ci-robot assigned Gacko Dec 3, 2023

nefelim4ag force-pushed the fix/internal-ingress-status-go branch from 55619e7 to eef2079 Compare December 4, 2023 21:05

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 4, 2023

nefelim4ag force-pushed the fix/internal-ingress-status-go branch from eef2079 to abdd6ba Compare December 4, 2023 21:08

Gacko mentioned this pull request Dec 13, 2023

Helm fix: make status update permission configurable #10659

Merged

10 tasks

k8s-ci-robot changed the title ~~Fix corner cases behavior without publishService for ingress load balancer status~~ Ingress Status: Improve checking for rollover or shutdown. Feb 1, 2024

nefelim4ag force-pushed the fix/internal-ingress-status-go branch from abdd6ba to 200ae7f Compare July 1, 2024 09:51

nefelim4ag force-pushed the fix/internal-ingress-status-go branch from 200ae7f to 01f38e4 Compare July 2, 2024 13:18

nefelim4ag force-pushed the fix/internal-ingress-status-go branch from 01f38e4 to f685bd2 Compare July 2, 2024 14:17

nefelim4ag force-pushed the fix/internal-ingress-status-go branch from f685bd2 to 3402260 Compare July 22, 2024 10:51

nefelim4ag and others added 4 commits September 2, 2024 14:24

fix: status.go - check pod terminating state

a267337

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>

fix: status.go - use filtered labels for pod discovery

c4766f7

Co-authored-by: Marco Ebert <marco_ebert@icloud.com> Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>

fix: status.go - ignore terminating pods as leaders

a0344d6

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>

chore: status_test.go check pod labels & terminating status

8216d4a

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>

nefelim4ag force-pushed the fix/internal-ingress-status-go branch from 3402260 to 8216d4a Compare September 2, 2024 12:25

frittentheke mentioned this pull request Oct 30, 2024

Ingress Status: Wrong labels when checking for rollover or shutdown. #10475

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ingress Status: Improve checking for rollover or shutdown. #10686

Ingress Status: Improve checking for rollover or shutdown. #10686

nefelim4ag commented Nov 28, 2023 •

edited

Loading

netlify bot commented Nov 28, 2023 •

edited

Loading

k8s-ci-robot commented Nov 28, 2023

k8s-ci-robot commented Nov 28, 2023

k8s-ci-robot commented Nov 28, 2023

Gacko left a comment

Gacko commented Dec 1, 2023

nefelim4ag commented Dec 1, 2023 •

edited

Loading

Gacko Dec 2, 2023

nefelim4ag Dec 2, 2023

Gacko Dec 2, 2023

nefelim4ag Dec 3, 2023 •

edited

Loading

nefelim4ag Dec 3, 2023

Gacko commented Dec 3, 2023

nefelim4ag commented Dec 4, 2023 •

edited

Loading

Gacko commented Feb 1, 2024

nefelim4ag commented May 7, 2024

k8s-ci-robot commented Jul 1, 2024

longwuyuan commented Jul 1, 2024

nefelim4ag commented Jul 2, 2024 •

edited

Loading

Ingress Status: Improve checking for rollover or shutdown. #10686

Are you sure you want to change the base?

Ingress Status: Improve checking for rollover or shutdown. #10686

Conversation

nefelim4ag commented Nov 28, 2023 • edited Loading

What this PR does / why we need it:

Types of changes

Which issue/s this PR fixes

How Has This Been Tested?

Checklist:

netlify bot commented Nov 28, 2023 • edited Loading

✅ Deploy Preview for kubernetes-ingress-nginx canceled.

k8s-ci-robot commented Nov 28, 2023

k8s-ci-robot commented Nov 28, 2023

k8s-ci-robot commented Nov 28, 2023

Gacko left a comment

Choose a reason for hiding this comment

Gacko commented Dec 1, 2023

nefelim4ag commented Dec 1, 2023 • edited Loading

Gacko Dec 2, 2023

Choose a reason for hiding this comment

nefelim4ag Dec 2, 2023

Choose a reason for hiding this comment

Gacko Dec 2, 2023

Choose a reason for hiding this comment

nefelim4ag Dec 3, 2023 • edited Loading

Choose a reason for hiding this comment

nefelim4ag Dec 3, 2023

Choose a reason for hiding this comment

Gacko commented Dec 3, 2023

nefelim4ag commented Dec 4, 2023 • edited Loading

Gacko commented Feb 1, 2024

nefelim4ag commented May 7, 2024

k8s-ci-robot commented Jul 1, 2024

longwuyuan commented Jul 1, 2024

nefelim4ag commented Jul 2, 2024 • edited Loading

nefelim4ag commented Nov 28, 2023 •

edited

Loading

netlify bot commented Nov 28, 2023 •

edited

Loading

nefelim4ag commented Dec 1, 2023 •

edited

Loading

nefelim4ag Dec 3, 2023 •

edited

Loading

nefelim4ag commented Dec 4, 2023 •

edited

Loading

nefelim4ag commented Jul 2, 2024 •

edited

Loading