
503 errors from canary service before rolling update #256

Closed
@cmoonExpedia

Hi,

I tested with the canary configuration below. The canary analysis and the deployment worked as expected, but the canary service returned a number of 503 errors. The higher the maxWeight, the more 503 errors I observed. With the config below the 503 errors lasted only about 1 second, but I think this could be improved.

I suspect that Flagger tears the canary stack down too early, before in-flight requests to the canary have drained. I don't know the project well, but it could probably be improved by delaying termination of the canary stack until the existing requests have been answered.
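
To illustrate what "delaying termination until existing requests are answered" could look like, here is a minimal sketch of a workload-side mitigation using a Kubernetes preStop hook. This is an assumption for illustration, not something Flagger does and not a confirmed fix for this issue: the container name `app`, the 10 second sleep, and the 30 second grace period are placeholder values, and the container image would need a `sleep` binary.

```yaml
# Hypothetical fragment of the target Deployment's pod template.
# All names and durations are illustrative assumptions.
spec:
  template:
    spec:
      # Must be longer than the preStop sleep, otherwise the pod is
      # killed before the sleep finishes.
      terminationGracePeriodSeconds: 30
      containers:
        - name: app  # placeholder container name
          lifecycle:
            preStop:
              exec:
                # Keep the container serving for a few seconds after
                # termination starts, so in-flight requests can finish
                # while the endpoint is removed from routing.
                command: ["sleep", "10"]
```

This only postpones SIGTERM for the application container. Whether it removes the 503s in this setup is untested.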

I've uploaded an Istio Prometheus screenshot and the Vegeta plot.html. Let me know if you need more information from me.

Vegeta Test Load

150 qps of GET requests against a web page smaller than 200 bytes.

Vegeta report for this run (54 responses with status 503):

Requests      [total, rate, throughput]  249974, 150.00, 149.96
Duration      [total, attack, wait]      27m46.5873736s, 27m46.486746486s, 100.627114ms
Latencies     [mean, 50, 95, 99, max]    110.489665ms, 100.59509ms, 115.350849ms, 288.51525ms, 5.100550775s
Bytes In      [total, mean]              60027002, 240.13
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    99.98%
Status Codes  [code:count]               200:249920  503:54  
Error Set:
503 Service Unavailable

Vegeta plot from the same test result:

[plot_origin_25p_canary.html.zip](https://github.com/weaveworks/flagger/files/3441024/plot_origin_25p_canary.html.zip)

Prometheus request count rate metrics on Istio gateway:

[image: Istio Prometheus screenshot]

Canary template

---
apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
  name: {{.Chart.Name}}
  namespace: {{.Values.namespace}}
spec:
  provider: istio
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{.Chart.Name}} # this will generate a service with the same name
  progressDeadlineSeconds: 3600
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: {{.Chart.Name}}
  service:
    portDiscovery: true
    port: 8080
    gateways:
      - {{.Chart.Name}}
    hosts:
      - {{.Chart.Name}}.mytesthostname.com
  skipAnalysis: false
  canaryAnalysis:
    interval: {{.Values.releaseApprovalPollingInterval}}
    threshold: {{.Values.releaseApprovalPollingCount}}
    # maxWeight must be larger than stepWeight so that the new version is not released before release-approval-check returns.
    maxWeight: 50
    stepWeight: 25
    webhooks:
      - name: release-approval-check
        type: rollout
        url: http://webhook.kube-system/release-approvals/check
        timeout: 10s
        metadata:
          token: {{.Values.version}}
      - name: log-release-version
        type: post-rollout
        url: http://webhook.kube-system/released-apps
        timeout: 10s
        metadata:
          token: {{.Values.version}}

Labels: kind/enhancement (Improvement request for an existing feature)
