Reconciling TargetGroup Removes ECS targets causing downtime #2372

@jaguer0

Describe the bug
I'm using Flux for GitOps, with the ACK elbv2 controller to create the load balancer, target group, and listeners, and the ACK ECS controller for the task definition and service.

When the target group is reconciled, it removes the targets that ECS registered via the targetGroupRef. This leaves the target group with 0 targets, so it is down. Eventually it comes back online.

Steps to reproduce
A target group is created, for example:

apiVersion: elbv2.services.k8s.aws/v1alpha1
kind: TargetGroup
metadata:
  name: foo-bar-tg-staging
spec:
  name: foo-bar-tg-staging
  healthCheckEnabled: true
  healthCheckIntervalSeconds: 30
  healthCheckPath: /
  healthCheckPort: traffic-port
  healthCheckProtocol: HTTP
  healthCheckTimeoutSeconds: 5
  healthyThresholdCount: 5
  ipAddressType: ipv4
  matcher:
    httpCode: "200"
  port: 8080
  protocol: HTTP
  protocolVersion: HTTP1
  targetType: ip
  unhealthyThresholdCount: 2
  vpcID: xxxxx

It is created and works fine. It is referenced in the ECS service using targetGroupRef:

apiVersion: ecs.services.k8s.aws/v1alpha1
kind: Service
metadata:
  name: foo-bar
spec:
  name: foo-bar
  capacityProviderStrategy:
  - base: 0
    capacityProvider: FARGATE
    weight: 1
  cluster: staging
  deploymentConfiguration:
    alarms:
      alarmNames:
      - none
      enable: false
      rollback: false
    deploymentCircuitBreaker:
      enable: true
      rollback: true
    maximumPercent: 200
    minimumHealthyPercent: 100
  deploymentController:
    type: ECS
  desiredCount: 1
  enableECSManagedTags: true
  enableExecuteCommand: false
  healthCheckGracePeriodSeconds: 0
  loadBalancers:
  - containerName: foo-bar
    containerPort: 8080
    targetGroupRef:
      from:
        name: foo-bar-tg-staging
  networkConfiguration:
    awsVPCConfiguration:
      assignPublicIP: DISABLED
      securityGroups:
      - sg-xxxxx
      subnets:
      - subnet-xxxxxx
      - subnet-xxxxxx
  platformVersion: 1.4.0
  propagateTags: NONE
  schedulingStrategy: REPLICA
  taskDefinitionRef:
    from:
      name: foo-bar-staging

When the ELBv2 controller reconciles the target group, it removes the targets completely, causing downtime. About 15 minutes later ECS re-adds them and the service is back online. This happens daily, because the reconciler is set to run every 10 hours, and each run leads to downtime.
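To line up the target removal with the reconcile time, the target group can be sampled while waiting for the next reconcile. The following is a minimal sketch, not part of the original report: `watch_target_states` is a hypothetical helper, and it assumes a boto3 `elbv2` client (whose `describe_target_health` call returns `TargetHealthDescriptions`) is passed in.

```python
import time
from collections import Counter


def target_states(elbv2_client, target_group_arn):
    """Count the targets in a target group, grouped by health state."""
    resp = elbv2_client.describe_target_health(TargetGroupArn=target_group_arn)
    states = Counter(
        desc["TargetHealth"]["State"] for desc in resp["TargetHealthDescriptions"]
    )
    return dict(states)


def watch_target_states(elbv2_client, target_group_arn, samples,
                        interval_seconds=30, sleep=time.sleep):
    """Sample the target group `samples` times and return one dict per sample.

    An empty dict means the target group had no registered targets at that moment.
    """
    history = []
    for i in range(samples):
        states = target_states(elbv2_client, target_group_arn)
        print(time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), states)
        history.append(states)
        if i < samples - 1:
            sleep(interval_seconds)
    return history


# Usage (assumes boto3 is installed and credentials are configured; the ARN is a placeholder):
#   import boto3
#   client = boto3.client("elbv2", region_name="us-west-2")
#   watch_target_states(client, "<target-group-arn>", samples=120)
```

A sample that drops to an empty dict (or shows only `draining`) at the timestamp of the controller's reconcile log would point to the deregistration being triggered by the reconcile.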

I do see an ECS event noting that a task remained in the deregistered state for too long.

Log from the elbv2 controller:

2025-03-11T10:45:53.047941104Z stderr F 
{"level":"info",
"ts":"2025-03-11T10:45:53.047Z",
"logger":"ackrt",
"msg":"desired resource state has changed",
"kind":"TargetGroup",
"namespace":"foo-bar",
"name":"foo-bar-tg-staging",
"account":"xxxx",
"role":"",
"region":"us-west-2",
"is_adopted":false,
"generation":1,
"diff":[{"Path":{"Parts":["Spec","Targets"]},"A":null,"B":[{"availabilityZone":"us-west-2a","id":"xx.x.xx.xx","port":8080}]}]
}
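The key part of that entry is the `diff` field. As a small sketch (the helper name `parse_ack_diff` is made up for illustration), the log line can be parsed to show which spec path the controller considers changed and what the two sides of the diff are:

```python
import json

# The log entry from the report, joined back into a single line.
RAW = (
    '2025-03-11T10:45:53.047941104Z stderr F '
    '{"level":"info","ts":"2025-03-11T10:45:53.047Z","logger":"ackrt",'
    '"msg":"desired resource state has changed","kind":"TargetGroup",'
    '"namespace":"foo-bar","name":"foo-bar-tg-staging","account":"xxxx",'
    '"role":"","region":"us-west-2","is_adopted":false,"generation":1,'
    '"diff":[{"Path":{"Parts":["Spec","Targets"]},"A":null,'
    '"B":[{"availabilityZone":"us-west-2a","id":"xx.x.xx.xx","port":8080}]}]}'
)


def parse_ack_diff(line):
    """Return (kind, name, diffs); diffs is a list of (dotted_path, A, B) tuples."""
    # Drop the container-runtime prefix (timestamp, stream, flag) before the JSON body.
    entry = json.loads(line[line.index("{"):])
    diffs = [
        (".".join(d["Path"]["Parts"]), d["A"], d["B"])
        for d in entry.get("diff", [])
    ]
    return entry["kind"], entry["name"], diffs


kind, name, diffs = parse_ack_diff(RAW)
for path, a, b in diffs:
    print(f"{kind}/{name}: {path} A={a!r} B={b!r}")
```

The only differing path is `Spec.Targets`, with `A` null and `B` holding the one registered IP target. If `A` is the desired spec and `B` is the observed state (an assumption about the controller's diff convention), this reads as the controller treating the ECS-registered target as drift from a manifest that declares no targets.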

You can see that the metrics show an unhealthy state at the same time as that log entry, when the target group reconciles.

Image

Expected outcome
The target group should keep its targets when using ECS and targetGroupRef.
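The expectation can be stated as a toy model. This is only an illustration of the requested behaviour, not the controller's actual code: `targets_to_deregister` is a hypothetical function, and the IP addresses are placeholders. The idea is that an unset `targets` field in the manifest means "targets are managed elsewhere", while an explicit list means "manage exactly these".

```python
def targets_to_deregister(desired, observed):
    """Decide which observed targets a reconcile should remove.

    desired:  the manifest's targets list, or None when the field is not set.
    observed: the targets currently registered in the target group.
    """
    if desired is None:
        # Field not set in the manifest: leave externally registered targets
        # (for example, those added by an ECS service) untouched.
        return []
    wanted = {(t["id"], t["port"]) for t in desired}
    return [t for t in observed if (t["id"], t["port"]) not in wanted]


# Placeholder target, standing in for the task IP that ECS registered.
observed = [{"id": "10.0.0.1", "port": 8080}]

print(targets_to_deregister(None, observed))  # unset field: nothing is removed
print(targets_to_deregister([], observed))    # explicit empty list: everything is removed
```

Under this model, the manifest in the steps above, which sets no targets, would never cause a deregistration.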

Environment

  • Using EKS: Yes v1.29.13-eks-8cce635
  • AWS service targeted: ECS, ELBV2

Labels: service/elbv2 (indicates issues or PRs that are related to elbv2-controller)