Description
Hi,
I tested with the canary configuration below. The canary analysis and deployment worked as expected, but a number of 503 errors came from the canary service. The higher the maxWeight, the more 503 errors I observed.
With the config below the 503 errors lasted only about 1 second, but I think this could be improved.
I wonder whether Flagger tears the canary stack down too early, before existing requests on the canary stack have drained. I don't know the project well, but it might be improved by delaying termination of the canary stack until in-flight requests have been answered.
I have uploaded an Istio Prometheus screenshot and the Vegeta plot.html. Let me know if you need more information from me.
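
To make the draining idea above concrete, here is a minimal sketch of a workload-side delay using a standard Kubernetes `preStop` hook on the target Deployment. It is an illustration only, not something measured in this test: the container name `app`, the 10 second sleep and the 30 second grace period are placeholder assumptions, and the fragment assumes the image ships a `sleep` binary. It also does not address how the Istio sidecar itself shuts down.

```yaml
# Hypothetical fragment of the target Deployment's pod template.
# All names and durations are placeholders, not tuned values.
spec:
  template:
    spec:
      # Should exceed the preStop sleep plus the slowest expected request.
      terminationGracePeriodSeconds: 30
      containers:
      - name: app  # placeholder container name
        lifecycle:
          preStop:
            exec:
              # Runs before SIGTERM is sent to the container, so requests
              # already routed to this pod get time to complete.
              command: ["sleep", "10"]
```

Whether such a delay is enough, or whether Flagger itself would need to wait before scaling the canary down, is exactly the open question of this report.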
Vegeta Test Load
150 qps of GET requests against a web page smaller than 200 bytes.
Vegeta report for this run (54 responses with status 503):
Requests [total, rate, throughput] 249974, 150.00, 149.96
Duration [total, attack, wait] 27m46.5873736s, 27m46.486746486s, 100.627114ms
Latencies [mean, 50, 95, 99, max] 110.489665ms, 100.59509ms, 115.350849ms, 288.51525ms, 5.100550775s
Bytes In [total, mean] 60027002, 240.13
Bytes Out [total, mean] 0, 0.00
Success [ratio] 99.98%
Status Codes [code:count] 200:249920 503:54
Error Set:
503 Service Unavailable
Vegeta plot from the same test result:
[plot_origin_25p_canary.html.zip](https://github.com/weaveworks/flagger/files/3441024/plot_origin_25p_canary.html.zip)
Prometheus request count rate metrics on Istio gateway:
Canary template
---
apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
  name: {{.Chart.Name}}
  namespace: {{.Values.namespace}}
spec:
  provider: istio
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{.Chart.Name}} # this will generate a service with the same name
  progressDeadlineSeconds: 3600
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: {{.Chart.Name}}
  service:
    portDiscovery: true
    port: 8080
    gateways:
    - {{.Chart.Name}}
    hosts:
    - {{.Chart.Name}}.mytesthostname.com
  skipAnalysis: false
  canaryAnalysis:
    interval: {{.Values.releaseApprovalPollingInterval}}
    threshold: {{.Values.releaseApprovalPollingCount}}
    # max weight should be bigger than step weight to prevent releasing the new version before release-approval-check returns.
    maxWeight: 50
    stepWeight: 25
    webhooks:
    - name: release-approval-check
      type: rollout
      url: http://webhook.kube-system/release-approvals/check
      timeout: 10s
      metadata:
        token: {{.Values.version}}
    - name: log-release-version
      type: post-rollout
      url: http://webhook.kube-system/released-apps
      timeout: 10s
      metadata:
        token: {{.Values.version}}