
Race condition between reading the ack-role-account-map ConfigMap and resyncing breaks CARM #2011

Closed
@nampnguyen

Describe the bug
When a controller starts with CARM enabled, particularly if the pod has a CPU limit specified (~250m), the account ConfigMap is sometimes read only after reconciliation has started. When this happens, the controller does not assume the appropriate cross-account role, and requests fail with AccessDenied errors because the IRSA role does not have permissions on resources in other accounts.

The problem seems to be getting worse over time and may be related to the number of namespaces.

Even after the ack-role-account-map ConfigMap has been read, the controller still does not use the correct cross-account role.

Debug logs (with some redactions):

{"level":"info","ts":"2024-02-09T19:54:29.148Z","logger":"setup","msg":"initializing service controller","aws.service":"dynamodb"}
{"level":"debug","ts":"2024-02-09T19:54:29.148Z","logger":"cache.namespace","msg":"Starting namespace cache","watchScope":[],"ignored":["kube-system","kube-public","kube-node-lease"]}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-1"}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-2"}
...
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-39"}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-40"}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Backup","resync period seconds":36000}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"GlobalTable","resync period seconds":36000}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Table","resync period seconds":36000}
{"level":"info","ts":"2024-02-09T19:54:29.171Z","logger":"setup","msg":"starting manager","aws.service":"dynamodb"}
{"level":"info","ts":"2024-02-09T19:54:29.171Z","logger":"controller-runtime.metrics","msg":"Starting metrics server"}
...
{"level":"info","ts":"2024-02-09T19:54:29.171Z","msg":"Starting Controller","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table"}
{"level":"info","ts":"2024-02-09T19:54:29.273Z","msg":"Starting workers","controller":"adoptedresource","controllerGroup":"services.k8s.aws","controllerKind":"AdoptedResource","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"globaltable","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"GlobalTable","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"globaltable","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"GlobalTable","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"fieldexport","controllerGroup":"services.k8s.aws","controllerKind":"FieldExport","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"backup","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Backup","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"backup","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Backup","worker count":1}
{"level":"debug","ts":"2024-02-09T19:54:29.277Z","logger":"exporter.field-export-reconciler","msg":"error did not need requeue","error":"the source resource is not synced yet"}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"> r.Sync","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> r.resetConditions","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< r.resetConditions","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.ResolveReferences","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< rm.ResolveReferences","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.EnsureTags","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< rm.EnsureTags","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.ReadOne","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">>> rm.sdkFind","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.311Z","logger":"cache.account","msg":"created account config map","name":"ack-role-account-map"}
{"level":"debug","ts":"2024-02-09T19:54:29.349Z","logger":"ackrt","msg":"<<< rm.sdkFind","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3,"error":"AccessDeniedException: User: arn:aws:sts::111111111111:assumed-role/IRSARole/1707508469279464158 is not authorized to perform: dynamodb:DescribeTable on resource: arn:aws:dynamodb:us-east-1:111111111111:table/myapp because no identity-based policy allows the dynamodb:DescribeTable action\n\tstatus code: 400, request id: MH8E64L3D7BR9OJRI11BQ75FBNVV4KQNSO5AEMVJF66Q9ASUAAJG"}
{"level":"debug","ts":"2024-02-09T19:54:29.349Z","logger":"ackrt","msg":"<< rm.ReadOne","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3,"error":"AccessDeniedException: User: arn:aws:sts::111111111111:assumed-role/IRSARole/1707508469279464158 is not authorized to perform: dynamodb:DescribeTable on resource: arn:aws:dynamodb:us-east-1:111111111111:table/myapp because no identity-based policy allows the dynamodb:DescribeTable action\n\tstatus code: 400, request id: MH8E64L3D7BR9OJRI11BQ75FBNVV4KQNSO5AEMVJF66Q9ASUAAJG"}

Steps to reproduce

  1. Configure CARM with a large number of namespaces (~30-40) and a CPU limit on the controller pod
  2. Create an ACK resource
  3. Restart the controller pod to trigger reloading of the config map and reconciliation

Expected outcome
Resource resyncing should wait until the ack-role-account-map ConfigMap has been read, or future resyncs should pivot to the cross-account role if the ConfigMap is read after resyncing starts.

Environment

  • Kubernetes version: 1.27
  • Using EKS (yes/no), if so version? Yes, 1.27
  • AWS service targeted (S3, RDS, etc.): Multiple (observed with DynamoDB, IAM, S3, SQS)

Labels

  • area/carm: Issues or PRs related to CARM (Cross Account Resource Management)
  • area/runtime: Issues or PRs as related to controller runtime, common reconciliation logic, etc
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
