Description
Describe the bug
When a controller starts with CARM enabled, particularly when the pod has a CPU limit (~250m), the account ConfigMap is sometimes read after reconciliation has already started. When this happens, the controller does not assume the appropriate cross-account role, and requests fail with AccessDenied errors because the IRSA role does not have permissions on resources in other accounts.
The problem seems to be getting worse over time and may be related to the number of namespaces.
Even after the ack-role-account-map is read, the controller does not use the correct cross-account role.
Debug logs (with some redactions):
{"level":"info","ts":"2024-02-09T19:54:29.148Z","logger":"setup","msg":"initializing service controller","aws.service":"dynamodb"}
{"level":"debug","ts":"2024-02-09T19:54:29.148Z","logger":"cache.namespace","msg":"Starting namespace cache","watchScope":[],"ignored":["kube-system","kube-public","kube-node-lease"]}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-1"}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-2"}
...
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-39"}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-40"}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Backup","resync period seconds":36000}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"GlobalTable","resync period seconds":36000}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Table","resync period seconds":36000}
{"level":"info","ts":"2024-02-09T19:54:29.171Z","logger":"setup","msg":"starting manager","aws.service":"dynamodb"}
{"level":"info","ts":"2024-02-09T19:54:29.171Z","logger":"controller-runtime.metrics","msg":"Starting metrics server"}
...
{"level":"info","ts":"2024-02-09T19:54:29.171Z","msg":"Starting Controller","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table"}
{"level":"info","ts":"2024-02-09T19:54:29.273Z","msg":"Starting workers","controller":"adoptedresource","controllerGroup":"services.k8s.aws","controllerKind":"AdoptedResource","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"globaltable","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"GlobalTable","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"globaltable","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"GlobalTable","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"fieldexport","controllerGroup":"services.k8s.aws","controllerKind":"FieldExport","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"backup","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Backup","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"backup","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Backup","worker count":1}
{"level":"debug","ts":"2024-02-09T19:54:29.277Z","logger":"exporter.field-export-reconciler","msg":"error did not need requeue","error":"the source resource is not synced yet"}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"> r.Sync","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> r.resetConditions","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< r.resetConditions","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.ResolveReferences","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< rm.ResolveReferences","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.EnsureTags","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< rm.EnsureTags","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.ReadOne","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">>> rm.sdkFind","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.311Z","logger":"cache.account","msg":"created account config map","name":"ack-role-account-map"}
{"level":"debug","ts":"2024-02-09T19:54:29.349Z","logger":"ackrt","msg":"<<< rm.sdkFind","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3,"error":"AccessDeniedException: User: arn:aws:sts::111111111111:assumed-role/IRSARole/1707508469279464158 is not authorized to perform: dynamodb:DescribeTable on resource: arn:aws:dynamodb:us-east-1:111111111111:table/myapp because no identity-based policy allows the dynamodb:DescribeTable action\n\tstatus code: 400, request id: MH8E64L3D7BR9OJRI11BQ75FBNVV4KQNSO5AEMVJF66Q9ASUAAJG"}
{"level":"debug","ts":"2024-02-09T19:54:29.349Z","logger":"ackrt","msg":"<< rm.ReadOne","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3,"error":"AccessDeniedException: User: arn:aws:sts::111111111111:assumed-role/IRSARole/1707508469279464158 is not authorized to perform: dynamodb:DescribeTable on resource: arn:aws:dynamodb:us-east-1:111111111111:table/myapp because no identity-based policy allows the dynamodb:DescribeTable action\n\tstatus code: 400, request id: MH8E64L3D7BR9OJRI11BQ75FBNVV4KQNSO5AEMVJF66Q9ASUAAJG"}
Steps to reproduce
- Configure CARM with a large number of namespaces (~30-40) and a CPU limit on the controller pod
- Create an ACK resource
- Restart the controller pod to trigger reloading of the config map and reconciliation
Expected outcome
Resource resyncing waits until the ack-role-account-map ConfigMap has been read, or future resyncs pivot to the cross-account role if the ConfigMap is read after resyncing starts.
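To make the expected behavior concrete, here is a rough sketch in Go using only the standard library. It is not the ACK runtime code: `accountCache`, `Load`, `WaitForSync`, `resolveRole`, and the role ARN are all hypothetical names made up for illustration. It combines both options: the reconcile path waits (with a timeout) until the account map has been read at least once, and it looks the role up on every pass instead of keeping an earlier empty result.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// accountCache is a hypothetical stand-in for the controller's cache of the
// ack-role-account-map ConfigMap (account ID -> role ARN).
type accountCache struct {
	mu     sync.RWMutex
	roles  map[string]string
	synced chan struct{} // closed once the ConfigMap has been read at least once
	once   sync.Once
}

func newAccountCache() *accountCache {
	return &accountCache{roles: map[string]string{}, synced: make(chan struct{})}
}

// Load simulates the informer delivering the ConfigMap contents.
func (c *accountCache) Load(data map[string]string) {
	c.mu.Lock()
	c.roles = data
	c.mu.Unlock()
	c.once.Do(func() { close(c.synced) })
}

// WaitForSync blocks until the first Load or until ctx is done.
func (c *accountCache) WaitForSync(ctx context.Context) error {
	select {
	case <-c.synced:
		return nil
	case <-ctx.Done():
		return errors.New("account map not synced yet")
	}
}

// resolveRole is what each reconcile pass would call (hypothetical helper).
// It waits up to waitMillis for the first sync and then reads the current
// mapping, so a ConfigMap that arrives late is still picked up.
func resolveRole(c *accountCache, account string, waitMillis int) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(waitMillis)*time.Millisecond)
	defer cancel()
	if err := c.WaitForSync(ctx); err != nil {
		return "", err // the caller should requeue rather than fall back to the IRSA role
	}
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.roles[account], nil
}

func main() {
	cache := newAccountCache()

	// Simulate the ConfigMap being read shortly after reconciliation starts.
	go func() {
		time.Sleep(30 * time.Millisecond)
		cache.Load(map[string]string{
			// Hypothetical role ARN, for illustration only.
			"222222222222": "arn:aws:iam::222222222222:role/example-ack-role",
		})
	}()

	role, err := resolveRole(cache, "222222222222", 1000)
	fmt.Println("role:", role, "err:", err)
}
```

With something along these lines, a reconcile that starts before the ConfigMap is read would either block briefly or return an error and requeue, instead of calling AWS with an empty role as in the logs above.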
Environment
- Kubernetes version: 1.27
- Using EKS (yes/no), if so version? Yes, 1.27
- AWS service targeted (S3, RDS, etc.): Multiple (observed with DynamoDB, IAM, S3, SQS)