
Ingress-nginx unbalanced traffic #10061

Open
simonemilano opened this issue Jun 8, 2023 · 8 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-priority
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
triage/needs-information: Indicates an issue needs more information in order to work on it.

Comments

@simonemilano

Hi,
we are experiencing unbalanced traffic using ingress-nginx on Google Kubernetes Engine. We are using ingress-nginx v1.1.1 to expose a deployment that, at the moment, makes an HTTPS call to an external service and then returns the answer to the caller.
With round robin (the default for ingress-nginx) the traffic is very unbalanced. We observe that the load is initially symmetrical across the pods, but when the deployment scales up, ingress-nginx sends more and more traffic to the new pod. In some cases the new pod is at 90% CPU usage while the others are at 30%.
Switching to EWMA seems to fix the imbalance, although the CPU usage swings up and down when observed on a scale of seconds.
Any idea why round robin behaves like that?

[Attachment: Screenshot 2023-06-05 at 16:48:01]
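
For context, the algorithm in question is selected via the `load-balance` key in the controller ConfigMap. A minimal sketch of switching it, assuming the commonly used `ingress-nginx` namespace and `ingress-nginx-controller` ConfigMap names (adjust for your installation):

```
# Minimal sketch, assuming the ConfigMap is named "ingress-nginx-controller"
# in the "ingress-nginx" namespace (adjust for your installation).
# Valid values for load-balance are "round_robin" (the default) and "ewma".
kubectl -n ingress-nginx patch configmap ingress-nginx-controller \
  --type merge -p '{"data":{"load-balance":"ewma"}}'

# Check the value currently set (empty output means the default, round_robin).
kubectl -n ingress-nginx get configmap ingress-nginx-controller \
  -o jsonpath='{.data.load-balance}'
```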
@simonemilano simonemilano added the kind/bug Categorizes issue or PR as related to a bug. label Jun 8, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Jun 8, 2023
@bmv126

bmv126 commented Jun 8, 2023

@simonemilano
How many worker processes do you have in nginx.conf?
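
(A minimal way to check this on a running controller; the pod name and namespace below are placeholders:)

```
# Sketch: inspect worker_processes in the rendered nginx.conf of a running
# controller pod (pod name and namespace are placeholders).
kubectl -n ingress-nginx exec <ingress-nginx-controller-pod> -- \
  grep worker_processes /etc/nginx/nginx.conf
```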

@longwuyuan
Contributor

@simonemilano Please provide answers to the questions asked in the new-issue template. You have not even copy/pasted the output of the kubectl commands that describe the controller, ingress, service, the curl request, etc., so any discussion here is going to be based on guesswork.

If there is a problem in the code, a developer needs something actionable: some way to reproduce the break in round-robin load balancing, or at least a deep understanding of your requests and ingress, together with the added complexity that the response time depends on a call made by the backend pod to some internet endpoint.

/remove-kind bug
/kind support

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jun 8, 2023
@strongjz
Member

strongjz commented Jun 8, 2023

Can you provide the exact network setup, the external cluster policy, and the workers? Are they the defaults?

Can you explain this a little more to understand the traffic routing?

deployment that at the moment is making an https call to an external service and then returns the answer to the caller.
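
(Assuming "the external cluster policy" above refers to the controller Service's externalTrafficPolicy, a sketch of how to check it; the Service name and namespace are placeholders:)

```
# Sketch: check whether the controller Service uses the default "Cluster"
# externalTrafficPolicy or "Local" (Service name and namespace are placeholders).
kubectl -n ingress-nginx get svc ingress-nginx-controller \
  -o jsonpath='{.spec.externalTrafficPolicy}'
```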

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Jun 8, 2023
@simonemilano
Author

Hi,
@longwuyuan I will provide it ASAP. Unfortunately it is a client's cluster and I have to wait for authorization. Is it also possible to send it directly to you without publishing it publicly?

The problem seems very similar to the one described here https://technology.lastminute.com/ingress-nginx-bug-makes-comeback/.

@strongjz
the GKE setup is the following:
Ingress-nginx -> Microservice1 -> Microservice2 (with mocked answers for testing)

Ingress nginx is in a dedicated namespace. Microservice1 and Microservice2 are in the same namespace.
Ingress-nginx 1.1.1 has 3 pods without autoscaling and default workers and uses round-robin for balancing calls to Microservice1.
Microservice1 has 8 pods that can scale up to 10 and calls Microservice2 inside the cluster.
Microservice2 has 20 pods and returns a mocked response after a 200 ms delay.
All the calls are REST over HTTPS with a small JSON response (about 1 kB).

We have a bunch of VMs located beside the cluster where we run JMeter. From those VMs we call Microservice1 through ingress-nginx.
Everything is fine until 240 tps. At that point Microservice1 scales up, spawning a new pod. From that point the new pod seems to monopolize the traffic (you can see it in the graph), in some cases exceeding 100% of CPU usage.
Since Microservice2 has mocked responses, all the HTTPS calls are absolutely identical in terms of type and network delay.

In the same situation EWMA seems to behave correctly, apart from the instantaneous CPU usage swinging up and down (but I think that is normal). Overall the pods are balanced.
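
(For completeness, a sketch of selecting EWMA for just this backend via the per-Ingress annotation rather than the global ConfigMap key; the Ingress name and namespace are placeholders:)

```
# Sketch: enable EWMA only for Microservice1's Ingress via annotation
# (Ingress name and namespace are placeholders).
kubectl -n <app-namespace> annotate ingress <microservice1-ingress> \
  nginx.ingress.kubernetes.io/load-balance=ewma --overwrite
```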

@github-actions

github-actions bot commented Jul 9, 2023

This is stale, but we won't close it automatically; just bear in mind the maintainers may be busy with other tasks and will get to your issue ASAP. If you have any question or want to ask for this to be prioritized, please reach out in #ingress-nginx-dev on Kubernetes Slack.

@github-actions github-actions bot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jul 9, 2023
@longwuyuan
Contributor

@simonemilano sorry for the lack of action from the project on this for so long. We just did not have enough resources to research and experiment with such a complex issue.

First, the kubectl describe of the ingress is needed, to know the routing rules.
Second, `kubectl logs` of the controller pod is needed, showing an ample set of messages from before as well as after the 240 tps mark.
Third, the metrics of the controller are needed, also from before and after the 240 tps mark.
Fourth, the exact and complete request is needed, as well as its response (a curl -v sample). A command sketch for these four points follows below.
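
A sketch of the commands for the four points above; all resource names, namespaces, and the hostname are placeholders:

```
# 1. Routing rules of the ingress.
kubectl -n <app-namespace> describe ingress <microservice1-ingress>

# 2. Controller logs, capturing messages from before and after the 240 tps mark.
kubectl -n ingress-nginx logs <controller-pod> --timestamps

# 3. Controller metrics (the metrics endpoint listens on port 10254 by default).
kubectl -n ingress-nginx port-forward <controller-pod> 10254:10254 &
curl -s http://127.0.0.1:10254/metrics

# 4. One exact and complete request, with its response.
curl -v https://<ingress-hostname>/<path>
```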

Now, after so long, the first thought I have is that more unconventional data gathering is needed to even think about possibilities. Test cases like:

  • Does a non-JMeter stress test do the same thing (e.g. locust or another tool)?
  • Does a simple request to / of Microservice1 do the same thing (obviously Microservice2 does not get engaged)? See the sketch after this list.
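
A minimal sketch of that second test case, assuming Microservice1 is exposed through the ingress at a hostname of your choosing:

```
# Sketch: hit / of Microservice1 through the ingress with plain curl, bypassing
# JMeter entirely (hostname is a placeholder; add -k only for self-signed certs).
for i in $(seq 1 1000); do
  curl -s -o /dev/null -w "%{http_code} %{time_total}\n" https://<ingress-hostname>/
done
```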

If this is not going to be worked on, then please close the issue. This update comes in light of the fact that the project had to make some hard decisions owing to a shortage of resources. We even had to deprecate popular features because we cannot support/maintain them (needless to say, the load-balancing algorithm is not in that category, as load balancing is a direct implication of the K8S Ingress API).

@longwuyuan
Contributor

/remove-kind support
/kind bug
/remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. kind/support Categorizes issue or PR as a support question. labels Sep 12, 2024