
Canary deployment getting failed #1720

Open
infrawizard opened this issue Nov 4, 2024 · 3 comments

Comments

@infrawizard

I'm implementing a canary deployment using Flagger to monitor my application. However, despite configuring the request-success-rate metric, Flagger isn't getting any metric values back for the endpoint. I am using the traefik provider.

I am installing Flagger with the following HelmRelease:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: flagger
  namespace: kube-system
spec:
  releaseName: flagger
  chart:
    spec:
      chart: flagger
      version: 1.36.0
      interval: 6h
      sourceRef:
        kind: HelmRepository
        name: flagger
        namespace: flux-system
      verify:
        provider: cosign 
  values:
    meshProvider: traefik
    prometheus:
      install: true    
    nodeSelector:
      kubernetes.io/os: linux
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace
  interval: 1h

And the canary with the manifest below:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: test-service
  namespace: test
spec:
  provider: traefik
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-service
  progressDeadlineSeconds: 600
  service:
    port: 3000
    targetPort: 3000
  analysis:
    interval: 10s
    threshold: 10
    maxWeight: 50
    stepWeight: 5
    metrics:
      - name: request-success-rate
        interval: 1m
        thresholdRange:
          min: 99
      - name: request-duration
        interval: 1m
        thresholdRange:
          max: 500
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 10s
        metadata:
          type: bash
          cmd: "curl -X GET http://test-service:3000/ping"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          type: cmd
          cmd: "hey -z 10s -q 10 -c 2 http://test-service:3000/ping"
          logCmdOutput: "true"

The canary succeeds without the metrics field, but fails when it is present:

Events:
Type Reason Age From Message

Warning Synced 4m19s flagger test-service-primary.test not ready: waiting for rollout to finish: observed deployment generation less than desired generation
Warning Synced 3m29s (x5 over 4m9s) flagger test-service-primary.test not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available
Normal Synced 3m19s (x7 over 4m19s) flagger all the metrics providers are available!
Normal Synced 3m19s flagger Initialization done! test-service.test
Normal Synced 2m49s flagger New revision detected! Scaling up test-service.test
Warning Synced 119s (x5 over 2m39s) flagger canary deployment test-service.test not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available
Normal Synced 109s flagger Starting canary analysis for test-service.test
Normal Synced 109s flagger Pre-rollout check acceptance-test passed
Normal Synced 109s flagger Advance test-service.test canary weight 5
Warning Synced 89s (x2 over 99s) flagger Halt advancement no values found for traefik metric request-success-rate probably test-service.test is not receiving traffic: running query failed: no values found

Below is my traefik config:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: traefik
  namespace: kube-system
spec:
  chart:
    spec:
      chart: traefik
      sourceRef:
        kind: HelmRepository
        name: traefik
        namespace: flux-system
      version: '23.0.1'
  values:
    additionalArguments:
      - "--entryPoints.web.forwardedHeaders.trustedIPs=10.0.0.0/16,10.5.0.0/16,10.21.0.0/16"
    updateStrategy:
      rollingUpdate:
        maxUnavailable: 1
        
    providers:
      kubernetesCRD:
        enabled: true
        allowCrossNamespace: true
        allowExternalNameServices: true

      kubernetesIngress:
        enabled: true
        allowExternalNameServices: true

    ports:
      web:
        nodePort: 32080
      websecure:
        nodePort: 32443

    service:
      type: NodePort

  interval: 10m0s

I am installing Prometheus together with Flagger. The setup works without the metrics but fails when they are added. I'm not sure if I am missing anything in the setup; I do see the flagger-prometheus pod running. Do I need to install anything else for the built-in metrics to work, or is something else missing in the setup?
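One thing worth checking: for the built-in traefik metrics to return values, Flagger's bundled Prometheus has to actually scrape Traefik. The Flagger Traefik tutorial installs Traefik with Prometheus metrics exposed and scrape annotations on the pods, which the traefik values above don't set. A sketch of what that might look like added to the traefik HelmRelease values (the port and entryPoint names are assumptions based on the chart defaults, not taken from this setup):

```yaml
# Hypothetical addition to the traefik HelmRelease values: expose Traefik's
# Prometheus metrics and annotate the pods so flagger-prometheus scrapes them.
deployment:
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9100"
    prometheus.io/path: "/metrics"
metrics:
  prometheus:
    entryPoint: metrics
```

If the Traefik service counters never reach Prometheus, every built-in query returns "no values found", which matches the halt message in the events above.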

@hrvatskibogmars

hrvatskibogmars commented Nov 12, 2024

I am having the same issue with Istio.

I can see that Flagger is hitting Prometheus, and I see the query, but for some reason unknown to me the new pod is just not getting any traffic. The canary deployment has 0 or 1 values when I query this metric. Traffic to the old pod works and shows up in Prometheus.

@infrawizard
Author

@stefanprodan would really appreciate your input here.

@hrvatskibogmars

hrvatskibogmars commented Nov 13, 2024

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: request-duration
  namespace: flagger
spec:
  provider:
    type: prometheus
    address: http://mimir-distributed-gateway.observability:8080/prometheus
  query: |
    histogram_quantile(0.99,
      sum(
        irate(
          istio_request_duration_milliseconds_bucket{
            reporter="destination",
            destination_workload=~"{{ target }}",
            destination_workload_namespace=~"{{ namespace }}"
          }[{{ interval }}]
        )
      ) by (le)
    )

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: request-success-rate
  namespace: flagger
spec:
  provider:
    type: prometheus
    address: http://mimir-distributed-gateway.observability:8080/prometheus
  query: |
    sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace=~"{{ namespace }}",
              destination_workload=~"{{ target }}",
              response_code!~"5.*"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace=~"{{ namespace }}",
              destination_workload=~"{{ target }}"
            }[{{ interval }}]
        )
    )
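The success-rate template above is a ratio of two rates. Purely as an illustration (made-up counter increases, not values from this cluster), the number Flagger compares against the threshold works out like this:

```python
# Illustration of the success-rate query above: per-second rate of non-5xx
# requests divided by the rate of all requests, over the template window.
def rate(counter_increase: float, window_s: float) -> float:
    """PromQL-style rate(): per-second counter increase over the window."""
    return counter_increase / window_s

window_s = 60.0        # {{ interval }} = 1m
all_requests = 1200.0  # istio_requests_total increase, all response codes
non_5xx = 1188.0       # istio_requests_total increase, response_code!~"5.*"

success_ratio = rate(non_5xx, window_s) / rate(all_requests, window_s)
print(success_ratio * 100)  # 99.0
```

If there is no canary traffic inside the window, both sums are empty and the query returns no value at all rather than zero, which is what the "no values found" halt reports.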

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: echo-server-cannary
  namespace: debug
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: echo-server
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rollback (default 600s)
  progressDeadlineSeconds: 600
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: echo-server
  service:
    # service port number
    port: 80
    # container port number or name (optional)
    targetPort: 80
    # Istio gateways (optional)
    gateways:
    - default/gw-dev-imba-com
    # Istio virtual service host names (optional)
    hosts:
    - imba.com
    match:
      - uri:
          prefix: /api/echo
    # Istio traffic policy (optional)
    trafficPolicy:
      tls:
        # use ISTIO_MUTUAL when mTLS is enabled
        mode: ISTIO_MUTUAL
    # Istio retry policy (optional)
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: "gateway-error,connect-failure,refused-stream"
  analysis:
    # schedule interval (default 60s)
    interval: 1m
    # max number of failed metric checks before rollback
    threshold: 10
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 10
    metrics:
      - name: request-success-rate
        templateRef:
          name: request-success-rate
          namespace: flagger
        thresholdRange:
          min: 99
        interval: 5m
      - name: request-duration
        templateRef:
          name: request-duration
          namespace: flagger
        thresholdRange:
          max: 500
        interval: 5m
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: https://imba.com/api/echo
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' https://imba.com/api/echo | grep token"
      - name: load-test
        url: https://imba.com/api/echo
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://imba.com/api/echo"

I found that the problem is with the metrics: the load test is not generating enough traffic to produce any values for the given metric, which results in a failed rollout.
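That conclusion can be roughly quantified from the load-test webhook above (`hey -z 1m -q 10 -c 2`) and the first weight step of the spec:

```python
# Back-of-the-envelope canary sample volume for: hey -z 1m -q 10 -c 2
workers = 2          # -c 2 concurrent workers
qps_per_worker = 10  # -q 10 rate limit per worker
duration_s = 60      # -z 1m

canary_weight = 0.10  # first step: stepWeight 10 (percent)

total_requests = workers * qps_per_worker * duration_s
canary_requests = total_requests * canary_weight
print(total_requests, int(canary_requests))  # 1200 120
```

Roughly 2 requests/s reach the canary at the first step, so with a 5m metric interval the early query windows contain very few canary samples, consistent with the metric checks coming back empty.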

$ istioctl version
client version: 1.24.0
control plane version: 1.21.0
data plane version: 1.21.0 (61 proxies)
