Description
Describe the bug
I'm using Flux for GitOps, the ACK elbv2 controller to create the load balancer, target group, and listeners, and the ACK ECS controller for the task definition and service.
When the target group is reconciled, the controller removes the targets that ECS registered via targetGroupRef. This leaves the target group with 0 targets, so the service is down. It eventually comes back online.
Steps to reproduce
The target group is created, for example:
apiVersion: elbv2.services.k8s.aws/v1alpha1
kind: TargetGroup
metadata:
  name: foo-bar-tg-staging
spec:
  name: foo-bar-tg-staging
  healthCheckEnabled: true
  healthCheckIntervalSeconds: 30
  healthCheckPath: /
  healthCheckPort: traffic-port
  healthCheckProtocol: HTTP
  healthCheckTimeoutSeconds: 5
  healthyThresholdCount: 5
  ipAddressType: ipv4
  matcher:
    httpCode: "200"
  port: 8080
  protocol: HTTP
  protocolVersion: HTTP1
  targetType: ip
  unhealthyThresholdCount: 2
  vpcID: xxxxx

It is created and works fine. It is referenced in the ECS service using targetGroupRef:

apiVersion: ecs.services.k8s.aws/v1alpha1
kind: Service
metadata:
  name: foo-bar
spec:
  name: foo-bar
  capacityProviderStrategy:
    - base: 0
      capacityProvider: FARGATE
      weight: 1
  cluster: staging
  deploymentConfiguration:
    alarms:
      alarmNames:
        - none
      enable: false
      rollback: false
    deploymentCircuitBreaker:
      enable: true
      rollback: true
    maximumPercent: 200
    minimumHealthyPercent: 100
  deploymentController:
    type: ECS
  desiredCount: 1
  enableECSManagedTags: true
  enableExecuteCommand: false
  healthCheckGracePeriodSeconds: 0
  loadBalancers:
    - containerName: foo-bar
      containerPort: 8080
      targetGroupRef:
        from:
          name: foo-bar-tg-staging
  networkConfiguration:
    awsVPCConfiguration:
      assignPublicIP: DISABLED
      securityGroups:
        - sg-xxxxx
      subnets:
        - sg-xxxxxx
        - sg-xxxxxx
  platformVersion: 1.4.0
  propagateTags: NONE
  schedulingStrategy: REPLICA
  taskDefinitionRef:
    from:
      name: foo-bar-staging

When the ELBv2 controller reconciles the target group, it removes all of the targets, causing downtime. About 15 minutes later ECS re-adds them and the service is back online. This happens daily, because the reconciler is set to run every 10 hours, and each occurrence leads to downtime.
I do see an ECS event noting that the task remained in the deregistered state for too long.
Log from elbv2 controller
2025-03-11T10:45:53.047941104Z stderr F
{"level":"info",
"ts":"2025-03-11T10:45:53.047Z",
"logger":"ackrt",
"msg":"desired resource state has changed",
"kind":"TargetGroup",
"namespace":"foo-bar",
"name":"foo-bar-tg-staging",
"account":"xxxx",
"role":"",
"region":"us-west-2",
"is_adopted":false,
"generation":1,
"diff":[{"Path":{"Parts":["Spec","Targets"]},"A":null,"B":[{"availabilityZone":"us-west-2a","id":"xx.x.xx.xx","port":8080}]}]
}

The metrics show the target group going unhealthy at the same time as this log entry, i.e. when the controller reconciles.
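
To illustrate how I read the diff in that log entry, here is a toy model in Python. It is not the controller's actual code, and `targets_to_deregister` is a hypothetical helper. It only assumes that "A" is the desired state from the manifest (where Spec.Targets is unset), that "B" is what AWS currently reports, and that a sync removes anything present in B but not in A. The IP address is a placeholder, since the real one is redacted above.

```python
from typing import List, Optional, Tuple

Target = Tuple[str, int]  # (IP address, port)


def targets_to_deregister(desired: Optional[List[Target]],
                          latest: List[Target]) -> List[Target]:
    """Return targets that exist in AWS but are absent from the desired spec.

    Hypothetical helper: models a sync that treats the manifest as the
    only source of truth for target membership.
    """
    wanted = set(desired or [])
    return [t for t in latest if t not in wanted]


# Spec.Targets is not set in the TargetGroup manifest ("A": null in the log).
desired = None
# ECS registered one task IP on port 8080 ("B" in the log); placeholder address.
latest = [("10.0.0.12", 8080)]

print(targets_to_deregister(desired, latest))
# → [('10.0.0.12', 8080)]
```

Under that assumption, every target registered by ECS is a candidate for removal on each reconcile, which would match the downtime I'm seeing.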
Expected outcome
The target group should keep its targets when they are registered by ECS via targetGroupRef.
Environment
- Using EKS: Yes v1.29.13-eks-8cce635
- AWS service targeted: ECS, ELBV2
