bug: request failed with 502 when upstream is rolling update #632

Open
suninuni opened this issue Aug 17, 2021 · 9 comments

@suninuni
Contributor

Issue description

When the upstream service is doing a rolling update, APISIX cannot update the upstream nodes immediately, which causes requests to fail with 502. This does not happen with ingress-nginx-controller (version 0.46, which no longer reloads NGINX when upstream nodes change).

[screenshot]

Environment

  • your apisix-ingress-controller version (output of apisix-ingress-controller version --long); 0.6.0
  • your Kubernetes cluster version (output of kubectl version); 1.21.0

Minimal test code / Steps to reproduce the issue

  1. Install with the default values.yml.
  2. Deploy a service with a simple configuration (an illustrative sketch follows this list).
  3. Trigger a rolling update while running ab at the same time.
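
For reference, since the actual manifests are not posted in this issue, an illustrative sketch of the setup: a 2-replica backend plus an ApisixRoute pointing at it. All names, the image, the apiVersion, and the ApisixRoute backend fields are placeholders and vary between controller versions.

```yaml
# Illustrative only -- the actual manifests were not posted in this issue.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: httpbin
spec:
  replicas: 2
  selector:
    matchLabels:
      app: httpbin
  template:
    metadata:
      labels:
        app: httpbin
    spec:
      containers:
        - name: httpbin
          image: kennethreitz/httpbin
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: httpbin
spec:
  selector:
    app: httpbin
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: apisix.apache.org/v2beta1   # use the group version your controller serves
kind: ApisixRoute
metadata:
  name: httpbin-route
spec:
  http:
    - name: rule1
      match:
        hosts:
          - httpbin.example.com
        paths:
          - /*
      backends:
        - serviceName: httpbin
          servicePort: 80
```

The rolling update can then be triggered with something like `kubectl rollout restart deployment/httpbin` while `ab -n 10000 -c 20 -H "Host: httpbin.example.com" http://<apisix-gateway>/` keeps running against the route (command shapes are illustrative).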

What's the actual result? (including assertion message & call stack if applicable)

For APISIX, the number of non-2xx responses is around 1000, but for ingress-nginx-controller it is 0.

What's the expected result?

No failed requests when upstream nodes change.

@tokers
Contributor

tokers commented Aug 17, 2021

@suninuni You may try to:

  1. add the retry mechanism
  2. add active and passive health checks

We look forward to knowing whether this eliminates or mitigates the issue. We will also try to optimize the process inside apisix and apisix-ingress-controller.
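
For the health checks, something along these lines could be declared through ApisixUpstream (a hedged sketch, not from this issue; the exact field names and apiVersion vary between apisix-ingress-controller versions, so check the CRD reference for the version in use):

```yaml
# Hedged sketch: active + passive health checks declared via ApisixUpstream.
apiVersion: apisix.apache.org/v1
kind: ApisixUpstream
metadata:
  name: httpbin              # must match the backend Service name
spec:
  healthCheck:
    active:
      type: http
      httpPath: /status/200  # placeholder health endpoint
      healthy:
        successes: 2
        interval: 2s
      unhealthy:
        httpFailures: 2
        interval: 1s
    passive:
      unhealthy:
        httpCodes:
          - 502
          - 503
        httpFailures: 3
```

The ApisixUpstream has to carry the same name as the backend Service so the controller can associate the two.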

@tao12345666333
Member

tao12345666333 commented Aug 18, 2021

Can you paste your ApisixRoute / ApisixUpstream CRs?

@tao12345666333
Member

Or you can add readiness configurations
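
For example, a readinessProbe on the backend Deployment so that a new Pod is only added to the endpoints once it can actually serve traffic (a hedged sketch; the path and port are placeholders):

```yaml
# Hypothetical snippet for the backend Deployment's Pod template.
# Path and port are placeholders; point them at the application's real
# health endpoint.
spec:
  containers:
    - name: httpbin
      image: kennethreitz/httpbin
      ports:
        - containerPort: 80
      readinessProbe:
        httpGet:
          path: /status/200
          port: 80
        initialDelaySeconds: 2
        periodSeconds: 5
        failureThreshold: 2
```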

@Donghui0
Contributor

@suninuni You may try to:

  1. add the retry mechanism
  2. add active and passive health checks

We look forward to knowing whether this eliminates or mitigates the issue. We will also try to optimize the process inside apisix and apisix-ingress-controller.

In APISIX, the retry mechanism is enabled by default, and the number of retries is set according to the number of available backend nodes.
Did the retry mechanism not take effect?
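
For reference, a hedged sketch of setting the retry count explicitly through ApisixUpstream instead of relying on that default (whether the retries field is available depends on the controller version):

```yaml
# Hedged sketch: explicit retry count; the retries field may not exist in
# every controller version.
apiVersion: apisix.apache.org/v1
kind: ApisixUpstream
metadata:
  name: httpbin        # assumed backend Service name
spec:
  retries: 2           # allow up to 2 retries on other nodes when a request fails
```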

@tokers
Contributor

tokers commented Aug 18, 2021

Let's wait for @suninuni's configuration snippets.

@suninuni
Contributor Author

Or you can add readiness configurations

@tao12345666333 If you mean the readiness of the Pod, yes, I have it configured.

And @tokers @Donghui0, thanks for your replies.

From the screenshot of APISIX's access log, I think the retry mechanism worked. You can see that some requests returned 200 after one retry (because I only set 2 replicas for the test service).

[screenshot: APISIX access log]

So I know that if I have enough pods in the upstream, there will be no failed requests after enough retries. As for the active and passive health checks, they will reduce the number of failures but not make them disappear.

We will also try to optimize the process inside apisix and apisix-ingress-controller.

Yes, this is what I want.

For ingress-nginx-controller, it watches for changes to the upstream nodes and updates them in memory. But for apisix-ingress-controller and apisix, the update process is (in my opinion; if there is anything wrong, please help me point it out):

apisix-ingress-controller watches the changes -> calls the APISIX Admin API to update -> APISIX saves to etcd -> other APISIX instances get the new upstreams from etcd.

Compared to ingress-nginx-controller, this process is indeed a lot longer.

@tokers
Contributor

tokers commented Aug 19, 2021

That's right, we need further discussion about it.

@Donghui0
Contributor

Donghui0 commented Aug 19, 2021

Looking at the third access log line in the picture, it seems that the retry mechanism did not take effect. It only requested "10.32.176.134:80" once and did not go on to request "10.32.137.94:80".

Passive-only health checks are not supported in the current APISIX version. The balancer.create_server_picker method uses an LRU cache; if the checker.status_ver field is not updated by an active health check, the cached picker is not refreshed for a long time (300s). A failed node then cannot be requested again for a long period, even if it has already recovered to a healthy state.

@Donghui0
Contributor

Donghui0 commented Sep 1, 2021

I guess @suninuni's suggestion is to allow the health check plugin to support passive checks alone, like NGINX does.
