
Ingress-nginx unbalanced traffic #10061

Open
simonemilano opened this issue Jun 8, 2023 · 8 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-priority
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
triage/needs-information: Indicates an issue needs more information in order to work on it.

Comments

@simonemilano

Hi,
we are experiencing unbalanced traffic using ingress-nginx on Google Kubernetes Engine. We are using ingress-nginx v1.1.1 to expose a deployment that, at the moment, makes an HTTPS call to an external service and then returns the answer to the caller.
With round robin (the default for ingress-nginx) the traffic is very unbalanced. We observe that the load is initially symmetrical across the pods, but when the deployment scales up, ingress-nginx sends more and more traffic to the new pod. In some cases the new pod is at 90% CPU usage while the others are at 30%.
Switching to EWMA seems to fix the imbalance, although the CPU usage swings up and down when observed on a scale of seconds.
Any idea why round robin behaves like that?

[Attachment: Screenshot 2023-06-05 at 16:48:01]
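
For context, the algorithm in question is selected via the `load-balance` key in the controller ConfigMap. A minimal sketch of switching it, assuming the commonly used `ingress-nginx` namespace and `ingress-nginx-controller` ConfigMap names (adjust for your installation):

```
# Minimal sketch, assuming the ConfigMap is named "ingress-nginx-controller"
# in the "ingress-nginx" namespace (adjust for your installation).
# Valid values for load-balance are "round_robin" (the default) and "ewma".
kubectl -n ingress-nginx patch configmap ingress-nginx-controller \
  --type merge -p '{"data":{"load-balance":"ewma"}}'

# Check the value currently set (empty output means the default, round_robin).
kubectl -n ingress-nginx get configmap ingress-nginx-controller \
  -o jsonpath='{.data.load-balance}'
```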
@simonemilano simonemilano added the kind/bug Categorizes issue or PR as related to a bug. label Jun 8, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Jun 8, 2023
@bmv126

bmv126 commented Jun 8, 2023

@simonemilano
How many worker processes do you have in nginx.conf?
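
(A minimal way to check this on a running controller; the pod name and namespace below are placeholders:)

```
# Sketch: inspect worker_processes in the rendered nginx.conf of a running
# controller pod (pod name and namespace are placeholders).
kubectl -n ingress-nginx exec <ingress-nginx-controller-pod> -- \
  grep worker_processes /etc/nginx/nginx.conf
```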

@longwuyuan
Contributor

@simonemilano Please provide answers to the questions asked in the new-issue template. You have not even copy/pasted the output of the kubectl commands that describe the controller, ingress, service, the curl request, etc., so any discussion here is going to be based on guesswork.

If there is a problem in the code, a developer needs something actionable: some way to reproduce the break in round-robin load balancing, or at least a deep understanding of your requests and ingress, together with the added complexity that the response time depends on a call made by the backend pod to some internet endpoint.

/remove-kind bug
/kind support

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jun 8, 2023
@strongjz
Member

strongjz commented Jun 8, 2023

Can you provide the exact network setup, the external cluster policy, and the workers? Are they the defaults?

Can you explain this a little more to understand the traffic routing?

deployment that at the moment is making an https call to an external service and then returns the answer to the caller.
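
(Assuming "the external cluster policy" above refers to the controller Service's externalTrafficPolicy, a sketch of how to check it; the Service name and namespace are placeholders:)

```
# Sketch: check whether the controller Service uses the default "Cluster"
# externalTrafficPolicy or "Local" (Service name and namespace are placeholders).
kubectl -n ingress-nginx get svc ingress-nginx-controller \
  -o jsonpath='{.spec.externalTrafficPolicy}'
```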

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Jun 8, 2023
@simonemilano
Author

Hi,
@longwuyuan I will provide it ASAP. Unfortunately it is a client's cluster and I have to wait for authorization. Is it also possible to send it directly to you without publishing it publicly?

The problem seems very similar to the one described here https://technology.lastminute.com/ingress-nginx-bug-makes-comeback/.

@strongjz
the GKE setup is the following:
Ingress-nginx -> Microservice1 -> Microservice2 (with mocked answers for testing)

Ingress nginx is in a dedicated namespace. Microservice1 and Microservice2 are in the same namespace.
Ingress-nginx 1.1.1 has 3 pods without autoscaling and default workers and uses round-robin for balancing calls to Microservice1.
Microservice1 has 8 pods that can scale up to 10 and calls Microservice2 inside the cluster.
Microservice2 has 20 pods and returns a mocked response after a 200 ms delay.
All the calls are REST over HTTPS with a small JSON response (about 1 kB).

We have a bunch of VMs located beside the cluster where we run JMeter. From those VMs we call Microservice1 through ingress-nginx.
Everything is fine until 240 tps. At that point Microservice1 scales up, spawning a new pod. From that point the new pod seems to monopolize the traffic (you can see it in the graph), in some cases exceeding 100% of CPU usage.
Since Microservice2 has mocked responses, all the HTTPS calls are absolutely identical in terms of type and network delay.

In the same situation EWMA seems to behave correctly, apart from the instantaneous CPU usage swinging up and down (but I think that is normal). Overall the pods are balanced.
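
(For completeness, a sketch of selecting EWMA for just this backend via the per-Ingress annotation rather than the global ConfigMap key; the Ingress name and namespace are placeholders:)

```
# Sketch: enable EWMA only for Microservice1's Ingress via annotation
# (Ingress name and namespace are placeholders).
kubectl -n <app-namespace> annotate ingress <microservice1-ingress> \
  nginx.ingress.kubernetes.io/load-balance=ewma --overwrite
```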

@github-actions

github-actions bot commented Jul 9, 2023

This is stale, but we won't close it automatically; just bear in mind the maintainers may be busy with other tasks and will get to your issue ASAP. If you have any question or want to ask for this to be prioritized, please reach out in #ingress-nginx-dev on Kubernetes Slack.

@github-actions github-actions bot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jul 9, 2023
@longwuyuan
Contributor

@simonemilano sorry for the lack of action from the project on this for so long. We just did not have enough resources to research and experiment with such a complex issue.

First, the kubectl describe of the ingress is needed, to know the routing rules.
Second, `kubectl logs` of the controller pod is needed, showing an ample set of messages from before as well as after the 240 tps mark.
Third, the metrics of the controller are needed, also from before and after the 240 tps mark.
Fourth, the exact and complete request is needed, as well as its response (a curl -v sample). A command sketch for these four points follows below.
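
A sketch of the commands for the four points above; all resource names, namespaces, and the hostname are placeholders:

```
# 1. Routing rules of the ingress.
kubectl -n <app-namespace> describe ingress <microservice1-ingress>

# 2. Controller logs, capturing messages from before and after the 240 tps mark.
kubectl -n ingress-nginx logs <controller-pod> --timestamps

# 3. Controller metrics (the metrics endpoint listens on port 10254 by default).
kubectl -n ingress-nginx port-forward <controller-pod> 10254:10254 &
curl -s http://127.0.0.1:10254/metrics

# 4. One exact and complete request, with its response.
curl -v https://<ingress-hostname>/<path>
```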

Now, after so long, the first thought I have is that more unconventional data gathering is needed to even think about possibilities. Test cases like:

  • Does a non-JMeter stress test do the same thing (e.g. locust or another tool)?
  • Does a simple request to / of Microservice1 do the same thing (obviously Microservice2 does not get engaged)? See the sketch after this list.
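
A minimal sketch of that second test case, assuming Microservice1 is exposed through the ingress at a hostname of your choosing:

```
# Sketch: hit / of Microservice1 through the ingress with plain curl, bypassing
# JMeter entirely (hostname is a placeholder; add -k only for self-signed certs).
for i in $(seq 1 1000); do
  curl -s -o /dev/null -w "%{http_code} %{time_total}\n" https://<ingress-hostname>/
done
```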

If this is not going to be worked on, then please close the issue. This update comes in light of the fact that the project had to make some hard decisions owing to a shortage of resources. We even had to deprecate popular features because we cannot support/maintain them (needless to say, the load-balancing algorithm is not in that category, as load balancing is a direct implication of the K8S Ingress API).

@longwuyuan
Contributor

/remove-kind support
/kind bug
/remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. kind/support Categorizes issue or PR as a support question. labels Sep 12, 2024