Scale up from 0 does not work with existing AWS EBS CSI PersistentVolume #3845

Closed
Xyaren opened this issue Jan 25, 2021 · 26 comments · Fixed by #6090
Labels
area/cluster-autoscaler, kind/bug (Categorizes issue or PR as related to a bug.), lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

Comments

@Xyaren
Xyaren commented Jan 25, 2021

Which component are you using?:

  • cluster-autoscaler

What version of the component are you using?:

  • v1.18.3 (also happened with v1.18.2)
Cluster-Autoscaler Deployment YAML
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::AWS_ACCOUNT_ID_OMMITTED:role/mycompany-iam-k8s-cluster-autoscaler-test
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources:
      - "pods"
      - "services"
      - "replicationcontrollers"
      - "persistentvolumeclaims"
      - "persistentvolumes"
    verbs: ["watch", "list", "get"]
  - apiGroups: ["extensions"]
    resources: ["replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "csinodes"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["batch", "extensions"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "patch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
  - apiGroups: ["coordination.k8s.io"]
    resourceNames: ["cluster-autoscaler"]
    resources: ["leases"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
    verbs: ["delete", "get", "update", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8085'
    spec:
      serviceAccountName: cluster-autoscaler
      priorityClassName: cluster-critical
      containers:
        - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.18.3 #Major & Minor should match cluster version: https://docs.aws.amazon.com/de_de/eks/latest/userguide/cluster-autoscaler.html#ca-deploy
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/mycompany-test-eks
            - --ignore-daemonsets-utilization=true
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --balance-similar-node-groups=false
            - --min-replica-count=0
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-bundle.crt"

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

What did you expect to happen?:
I have an ASG dedicated to a single CronJob that gets triggered 6 times a day.
That ASG is pinned to a specific AWS AZ by its assigned subnet.
The CronJob is pinned to that specific ASG by affinity and toleration.
The job uses a PV that is provisioned (AWS EBS) on the very first run and then reused on each subsequent run.
I expect the ASG to be scaled up to 1 after the Pod is created, and scaled back down shortly after the Pod/Job has finished.

What happened instead?:

The ASG will not be scaled up by the cluster-autoscaler.

cluster-autoscaler log output after the Job is created and the Pod is pending
2021-01-25T05:19:22.523Z : Starting main loop			
2021-01-25T05:19:22.524Z : "Found multiple availability zones for ASG "mycompany-test-eks-myapp-elastic-group-1-20210108154118845300000003"	 using eu-central-1a"		
2021-01-25T05:19:22.525Z : "Found multiple availability zones for ASG "mycompany-test-eks-myapp-worker-group-2-20201029130225136800000004"	 using eu-central-1a"		
2021-01-25T05:19:22.525Z : "Found multiple availability zones for ASG "mycompany-test-eks-worker-group-1-20201029130715836900000005"	 using eu-central-1a"		
2021-01-25T05:19:22.526Z : Filtering out schedulables			
2021-01-25T05:19:22.526Z : 0 pods marked as unschedulable can be scheduled.			
2021-01-25T05:19:22.526Z : No schedulable pods			
2021-01-25T05:19:22.526Z : Pod myapp-masterdata/masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw is unschedulable			
2021-01-25T05:19:22.526Z : Upcoming 0 nodes			
2021-01-25T05:19:22.526Z : Skipping node group mycompany-test-eks-myapp-elastic-group-1-20210108154118845300000003 - max size reached			
2021-01-25T05:19:22.526Z : "Pod masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw can't be scheduled on mycompany-test-eks-myapp-elastic-group-2-20201029130715759300000004, predicate checking error: node(s) didn't match node selector	 predicateName=NodeAffinity	 reasons: node(s) didn't match node selector	 debugInfo="
2021-01-25T05:19:22.526Z : No pod can fit to mycompany-test-eks-myapp-elastic-group-2-20201029130715759300000004			
2021-01-25T05:19:22.526Z : "Could not get a CSINode object for the node "template-node-for-mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003-8426967936887117836": csinode.storage.k8s.io "template-node-for-mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003-8426967936887117836" not found"			
2021-01-25T05:19:22.527Z : "PersistentVolume "pvc-ef85dcce-e63e-42da-b869-c3389bbd948d", Node "template-node-for-mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003-8426967936887117836" mismatch for Pod "myapp-masterdata/masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw": No matching NodeSelectorTerms"			
2021-01-25T05:19:22.527Z : "Pod masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw can't be scheduled on mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003, predicate checking error: node(s) had volume node affinity conflict	 predicateName=VolumeBinding	 reasons: node(s) had volume node affinity conflict	 debugInfo="
2021-01-25T05:19:22.527Z : No pod can fit to mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003			
2021-01-25T05:19:22.527Z : "Pod masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw can't be scheduled on mycompany-test-eks-myapp-worker-group-120200916154409048800000006, predicate checking error: node(s) didn't match node selector	 predicateName=NodeAffinity	 reasons: node(s) didn't match node selector	 debugInfo="
2021-01-25T05:19:22.527Z : No pod can fit to mycompany-test-eks-myapp-worker-group-120200916154409048800000006			
2021-01-25T05:19:22.527Z : Skipping node group mycompany-test-eks-myapp-worker-group-2-20201029130225136800000004 - max size reached			
2021-01-25T05:19:22.527Z : Skipping node group mycompany-test-eks-worker-group-1-20201029130715836900000005 - max size reached			
2021-01-25T05:19:22.527Z : "Pod masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw can't be scheduled on mycompany-test-eks-worker-group-220200916162252020100000006, predicate checking error: node(s) didn't match node selector	 predicateName=NodeAffinity	 reasons: node(s) didn't match node selector	 debugInfo="
2021-01-25T05:19:22.527Z : No pod can fit to mycompany-test-eks-worker-group-220200916162252020100000006			
2021-01-25T05:19:22.527Z : No expansion options			
2021-01-25T05:19:22.527Z : Calculating unneeded nodes			
[...]
2021-01-25T05:19:22.528Z : Scale-down calculation: ignoring 2 nodes unremovable in the last 5m0s			
2021-01-25T05:19:22.528Z : Scale down status: unneededOnly=false lastScaleUpTime=2021-01-25 05:00:14.980160831 +0000 UTC m=+6970.760701246 lastScaleDownDeleteTime=2021-01-25 03:04:22.928996296 +0000 UTC m=+18.709536671 lastScaleDownFailTime=2021-01-25 03:04:22.928996376 +0000 UTC m=+18.709536751 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false			
2021-01-25T05:19:22.528Z : Starting scale down			
2021-01-25T05:19:22.528Z : No candidates for scale down			
2021-01-25T05:19:22.528Z : "Event(v1.ObjectReference{Kind:"Pod", Namespace:"myapp-masterdata", Name:"masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw", UID:"97956c38-55f3-4749-ab74-7e7fc674e832", APIVersion:"v1", ResourceVersion:"217276797", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 max node group size reached, 3 node(s) didn't match node selector, 1 node(s) had volume node affinity conflict"			
2021-01-25T05:19:22.946Z : k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:309: Watch close - *v1beta1.PodDisruptionBudget total 0 items received			
2021-01-25T05:19:32.542Z : Starting main loop			

Anything else we need to know?:
Basically, this works fine without the volume.
With the volume, it works when the volume has not been provisioned yet, but fails once it has been provisioned.
The job also gets scheduled right away when I manually scale up the ASG.

I noticed the node affinity on the PV:

Node Affinity:
  Required Terms:
    Term 0:        topology.ebs.csi.aws.com/zone in [eu-central-1b]

That label is probably set on the node by the "ebs-csi-node" DaemonSet and is therefore unknown to the cluster-autoscaler.

Am I expected to tag the ASG with k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone?
If so, how am I supposed to set it on a multi-AZ ASG?
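
To make the suspected failure mode concrete, here is a minimal sketch (not taken from the autoscaler source) of how a required node affinity term is matched against node labels. The function name and the template-node label sets are hypothetical; only the PV term is from the report above. It assumes the usual Kubernetes semantics: terms are ORed, and the expressions inside one term are ANDed.

```python
from typing import Dict, List

ZONE_LABEL = "topology.ebs.csi.aws.com/zone"


def matches_required_terms(node_labels: Dict[str, str],
                           terms: List[Dict[str, List[str]]]) -> bool:
    """Return True if the node labels satisfy at least one term.

    Each term maps a label key to its allowed values (an `In` match).
    Terms are ORed; the keys inside a single term are ANDed.
    """
    return any(
        all(node_labels.get(key) in allowed for key, allowed in term.items())
        for term in terms
    )


# Node affinity of the PV from the report above.
pv_terms = [{ZONE_LABEL: ["eu-central-1b"]}]

# Hypothetical template node for an ASG scaled to 0: the CSI topology
# label is absent because no ebs-csi-node pod has ever run on such a node.
template_without_label = {"topology.kubernetes.io/zone": "eu-central-1b"}

# The same hypothetical template once the label is supplied some other way.
template_with_label = {**template_without_label, ZONE_LABEL: "eu-central-1b"}

print(matches_required_terms(template_without_label, pv_terms))  # False
print(matches_required_terms(template_with_label, pv_terms))     # True
```

If the template node really lacks the label, the first case would correspond to the "volume node affinity conflict" reported in the log above.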

Possibly related: #3230

@Xyaren Xyaren added the kind/bug Categorizes issue or PR as related to a bug. label Jan 25, 2021
@Xyaren Xyaren changed the title from "Scale from 0 does not work with existing AWS EBS CSI PersistentVolume" to "Scale up from 0 does not work with existing AWS EBS CSI PersistentVolume" Jan 25, 2021
@westernspion
westernspion commented Feb 4, 2021

Same problem here (edited after realizing there is no relevant difference between my previous post and what you wrote).

After doing some spelunking, I think you are correct: it has something to do with scaling from 0, the use of the topology.ebs.csi.aws.com/zone label, and the autoscaler's ability to recognize it. Some experimentation corroborates this.

@westernspion
westernspion commented Feb 5, 2021

Tagging the ASG with k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone is the approach I am taking, and it works like a charm.

I can do some footwork in Terraform to get the tags set up. Not sure what you're using to provision your cluster.

Still, it would be nice to have the labels generated from the list of AZs assigned to an ASG.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 6, 2021
@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2021
@mparikhcloudbeds

How can this issue be resolved for StatefulSet deployments that use custom storage classes on EKS?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 14, 2021
@FarhanSajid1
FarhanSajid1 commented Jan 4, 2022

How can this issue be resolved for StatefulSet deployments that use custom storage classes on EKS?

So just set

k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone: "us-east-2a"

for example? As the OP asks, how are we supposed to do this for multiple AZs?

@iomarcovalente
iomarcovalente commented Mar 8, 2022

I have this exact problem too. For further info, the error I get on the pod that cannot scale up from zero is:
pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 node(s) had volume node affinity conflict

@jbg
jbg commented Mar 29, 2022

@FarhanSajid1 you should have one node group (and thus one ASG) for each AZ. The above tag needs to be applied to the ASG.

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 27, 2022
@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jul 27, 2022
@decipher27
decipher27 commented Sep 20, 2022

Hi folks! Facing the same issue:
CA version: v1.21.1
aws-ebs-csi-driver version: v1.10.0-eksbuild.1

Cluster-autoscaler logs:

I0920 17:30:00.585954       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-173-251.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-173-251.ap-south-1.compute.internal" not found
I0920 17:30:00.586008       1 scheduler_binder.go:823] PersistentVolume "pvc-50c002d3-a5cc-4143-adf2-1362d18fc40e", Node "ip-10-121-173-251.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586074       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-68-79.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-68-79.ap-south-1.compute.internal" not found
I0920 17:30:00.586107       1 scheduler_binder.go:823] PersistentVolume "pvc-31af46c4-0d27-4eea-8ef6-148bbb2b4f0b", Node "ip-10-121-68-79.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586149       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-162-179.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-162-179.ap-south-1.compute.internal" not found
I0920 17:30:00.586172       1 scheduler_binder.go:823] PersistentVolume "pvc-50c002d3-a5cc-4143-adf2-1362d18fc40e", Node "ip-10-121-162-179.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586247       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-241-242.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-241-242.ap-south-1.compute.internal" not found
I0920 17:30:00.586275       1 scheduler_binder.go:823] PersistentVolume "pvc-50c002d3-a5cc-4143-adf2-1362d18fc40e", Node "ip-10-121-241-242.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586328       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-5-204.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-5-204.ap-south-1.compute.internal" not found
I0920 17:30:00.586350       1 scheduler_binder.go:823] PersistentVolume "pvc-31af46c4-0d27-4eea-8ef6-148bbb2b4f0b", Node "ip-10-121-5-204.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586533       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-173-251.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-173-251.ap-south-1.compute.internal" not found
I0920 17:30:00.586572       1 scheduler_binder.go:823] PersistentVolume "pvc-0c9887c2-eea3-4ef7-baae-c4c0aca78699", Node "ip-10-121-173-251.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586622       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-68-79.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-68-79.ap-south-1.compute.internal" not found
I0920 17:30:00.586663       1 scheduler_binder.go:823] PersistentVolume "pvc-df590cf4-a584-4842-9842-9629312c0e45", Node "ip-10-121-68-79.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586711       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-162-179.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-162-179.ap-south-1.compute.internal" not found
I0920 17:30:00.586737       1 scheduler_binder.go:823] PersistentVolume "pvc-0c9887c2-eea3-4ef7-baae-c4c0aca78699", Node "ip-10-121-162-179.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586802       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-241-242.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-241-242.ap-south-1.compute.internal" not found
I0920 17:30:00.586827       1 scheduler_binder.go:823] PersistentVolume "pvc-0c9887c2-eea3-4ef7-baae-c4c0aca78699", Node "ip-10-121-241-242.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586869       1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-5-204.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-5-204.ap-south-1.compute.internal" not found
I0920 17:30:00.586907       1 scheduler_binder.go:823] PersistentVolume "pvc-df590cf4-a584-4842-9842-9629312c0e45", Node "ip-10-121-5-204.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586929       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0920 17:30:00.586938       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0920 17:30:00.586952       1 filter_out_schedulable.go:82] No schedulable pods
I0920 17:30:00.586966       1 klogx.go:86] Pod kafka/kafka-0 is unschedulable
I0920 17:30:00.586972       1 klogx.go:86] Pod kafka/kafka-1 is unschedulable
I0920 17:30:00.587014       1 scale_up.go:376] Upcoming 0 nodes
I0920 17:30:00.587153       1 scheduler_binder.go:803] Could not get a CSINode object for the node "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083": csinode.storage.k8s.io "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083" not found
I0920 17:30:00.587188       1 scheduler_binder.go:823] PersistentVolume "pvc-31af46c4-0d27-4eea-8ef6-148bbb2b4f0b", Node "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.587210       1 scale_up.go:300] Pod kafka-0 can't be scheduled on eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I0920 17:30:00.587316       1 scheduler_binder.go:803] Could not get a CSINode object for the node "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083": csinode.storage.k8s.io "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083" not found
I0920 17:30:00.587361       1 scheduler_binder.go:823] PersistentVolume "pvc-df590cf4-a584-4842-9842-9629312c0e45", Node "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.587386       1 scale_up.go:300] Pod kafka-1 can't be scheduled on eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I0920 17:30:00.587417       1 scale_up.go:449] No pod can fit to eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09

Our pods are in pending state due to volume node affinity conflict.

Describe kafka-1 pod

LAST SEEN   TYPE      REASON              OBJECT        MESSAGE
6m52s       Warning   FailedScheduling    pod/kafka-0   0/5 nodes are available: 5 node(s) had volume node affinity conflict.
6m52s       Warning   FailedScheduling    pod/kafka-1   0/5 nodes are available: 5 node(s) had volume node affinity conflict.
73s         Normal    NotTriggerScaleUp   pod/kafka-0   pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict
73s         Normal    NotTriggerScaleUp   pod/kafka-1   pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict

@JBOClara
JBOClara commented Sep 21, 2022

Hi @decipher27 ,

Could you show us the tags on your AWS ASG (from aws autoscaling describe-auto-scaling-groups)?

My understanding of this issue is that you need the topology tags:

                {
                    "ResourceId": "eks-spot-2-XXXX",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone",
                    "Value": "us-east-1c",
                    "PropagateAtLaunch": false
                },

I've also added

                {
                    "ResourceId": "eks-spot-2-5xxxx",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone",
                    "Value": "us-east-1c",
                    "PropagateAtLaunch": false
                },

When your ASG is at 0, there is no node to retrieve the topology from. You must have the topology labels on the ASG itself to allow CA and the CSI driver to retrieve the topology.
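
As a rough sketch of how the two tags above could be generated per ASG, assuming each ASG lives in exactly one zone: the helper name `zone_tags` and the ASG name in the example are hypothetical, while the tag keys and the dictionary shape mirror the entries shown above.

```python
import json
from typing import Dict, List

TAG_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"
ZONE_LABELS = ("topology.ebs.csi.aws.com/zone", "topology.kubernetes.io/zone")


def zone_tags(asg_name: str, zone: str) -> List[Dict[str, object]]:
    """Build one node-template tag per zone label for a single-zone ASG."""
    return [
        {
            "ResourceId": asg_name,
            "ResourceType": "auto-scaling-group",
            "Key": TAG_PREFIX + label,
            "Value": zone,
            # The tag is read from the ASG itself; nothing in this sketch
            # relies on it being copied to instances.
            "PropagateAtLaunch": False,
        }
        for label in ZONE_LABELS
    ]


# Hypothetical ASG name; the zone matches the example above.
print(json.dumps(zone_tags("eks-spot-example", "us-east-1c"), indent=4))
```

The resulting list has the same fields as the `Tags` entries above, so it should be usable as input to whatever tooling applies ASG tags (for example Terraform or the AWS SDK), though that step is not shown here.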

@decipher27
decipher27 commented Sep 29, 2022

We don't have the tags mentioned above, and it was working earlier. However, we found that the issue was with the scheduler: we are using a custom scheduler.
Our vendor made some tweaks and it's fixed. Thank you @JBOClara

@decipher27
decipher27 commented Sep 29, 2022

Also, from your comment, what do you mean by "When your ASG is at 0"? Do you mean when I set the desired count to 0?

@JBOClara
JBOClara commented Sep 30, 2022

Also, from your comment, what do you mean by "When your ASG is at 0"? Do you mean when I set the desired count to 0?
@decipher27

Exactly: when an ASG's desired capacity is set to 0 (for instance, after kube-downscaler has scaled down all replicas except those of CA itself), CA cannot read node labels, because there is no node.

@debu99
debu99 commented Oct 19, 2022

I have the same issue. A PVC and pod were created, then I suspended the ASG and scaled it down to 0 to save cost over the weekend. On Monday the pod could not start from 0, while other, stateless pods were fine.

@JBOClara

@debu99
See my earlier reply to @decipher27 above: you need the topology tag k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone on the ASG itself (I have also added k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone). When your ASG is at 0, there is no node to retrieve the topology from.

@debu99
debu99 commented Oct 19, 2022

My PV requires:

Node Affinity:
  Required Terms:
    Term 0:        topology.ebs.csi.aws.com/zone in [ap-southeast-1a]

But I believe this label is added automatically to all nodes? I didn't add it to the ASG tags, yet all my nodes have it:

ip-10-40-44-63.ap-southeast-1.compute.internal    Ready    <none>   5h3m    v1.21.14-eks-ba74326   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3a.large,beta.kubernetes.io/os=linux,dedicated=redis,failure-domain.beta.kubernetes.io/region=ap-southeast-1,failure-domain.beta.kubernetes.io/zone=ap-southeast-1b,k8s-node-lifecycle=on-demand,k8s-node-role/on-demand-worker=true,k8s-node-role/type=none,k8s-node/instance-level=large,k8s-node/worker-type=t-type,k8s.io/cloud-provider-aws=be298adc77b66eafc3745cf0a9c131e0,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-40-44-63.ap-southeast-1.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3a.large,sb-subnet/type=primary,sb-subnet/zone-id=1,topology.ebs.csi.aws.com/zone=ap-southeast-1b,topology.kubernetes.io/region=ap-southeast-1,topology.kubernetes.io/zone=ap-southeast-1b
ip-10-40-7-219.ap-southeast-1.compute.internal    Ready    <none>   25m     v1.21.14-eks-ba74326   beta.kubernetes.io/arch=arm64,beta.kubernetes.io/instance-type=r6g.large,beta.kubernetes.io/os=linux,dedicated=prometheus-operator,failure-domain.beta.kubernetes.io/region=ap-southeast-1,failure-domain.beta.kubernetes.io/zone=ap-southeast-1a,k8s-node-lifecycle=on-demand,k8s-node-role/on-demand-worker=true,k8s-node-role/type=none,k8s-node/instance-level=large,k8s-node/worker-type=r-type,k8s.io/cloud-provider-aws=be298adc77b66eafc3745cf0a9c131e0,kubernetes.io/arch=arm64,kubernetes.io/hostname=ip-10-40-7-219.ap-southeast-1.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=r6g.large,sb-subnet/type=primary,sb-subnet/zone-id=0,topology.ebs.csi.aws.com/zone=ap-southeast-1a,topology.kubernetes.io/region=ap-southeast-1,topology.kubernetes.io/zone=ap-southeast-1a

@jbg

jbg commented Oct 19, 2022

Yes, but when the ASG is at 0, there are no nodes. cluster-autoscaler needs the labels tagged on the ASG to know what labels a node would have if it scaled the ASG up from 0.
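
To make the point above concrete, here is a minimal sketch of how node labels could be derived from ASG tags when no real node exists to inspect. It is a simplified Python model, not the autoscaler's own Go code. The tag prefix is the one used by the node-template tags quoted elsewhere in this thread; the function name `template_labels_from_asg_tags` and the sample tag values are assumptions made for illustration.

```python
# Hypothetical model of the idea: with zero nodes in the ASG, the only
# source of label information is the set of tags on the ASG itself.
NODE_TEMPLATE_LABEL_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"


def template_labels_from_asg_tags(asg_tags):
    """Return the node labels implied by node-template tags on an ASG."""
    labels = {}
    for key, value in asg_tags.items():
        if key.startswith(NODE_TEMPLATE_LABEL_PREFIX):
            label = key[len(NODE_TEMPLATE_LABEL_PREFIX):]
            if label:  # ignore a tag that consists of the bare prefix
                labels[label] = value
    return labels


# Sample tags (illustrative values only).
asg_tags = {
    "k8s.io/cluster-autoscaler/enabled": "true",
    NODE_TEMPLATE_LABEL_PREFIX + "topology.ebs.csi.aws.com/zone": "ap-southeast-1a",
}
template_labels = template_labels_from_asg_tags(asg_tags)
untagged_labels = template_labels_from_asg_tags(
    {"k8s.io/cluster-autoscaler/enabled": "true"}
)
print(template_labels)
print(untagged_labels)
```

In this model an ASG without the node-template tag yields no `topology.ebs.csi.aws.com/zone` label at all, which is why labels seen on running nodes do not help once the group is at 0.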

@KiranReddy230

We are facing the same issue with a VolumeNodeAffinity error, and our ASG has nodes spread across AZs. What is the best way to get CA to spin up nodes in the right AZ? We use the priority expander.
CA also throws the following error:

I0103 17:43:29.663090       1 scale_up.go:449] No pod can fit to eks-atlan-node-spot-c2c299ee-8af5-1b60-2ce3-2e4dc50b5484
I0103 17:43:29.663106       1 scale_up.go:453] No expansion options

The above error appears even though there is enough room for CA to spin up new nodes in the node group, and there is one more node group where CA could launch nodes, but CA is not functioning as expected. CA version: 1.21

@jbg

jbg commented Mar 6, 2023

@KiranReddy230 if you read the comments above yours, the question has been answered three times already. You need to add the tags mentioned above to your ASG. In order for this to work properly, each node group (and thus each ASG) should have only one zone (this is the recommended architecture anyway).
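
As a rough illustration of why the tag matters, the sketch below models how a PV's required node affinity could be checked against the labels of a hypothetical template node. It is a simplified Python model that handles only the `in` operator; the function name `volume_fits_node` and the sample label sets are assumptions, not autoscaler or scheduler code.

```python
def volume_fits_node(required_terms, node_labels):
    """Check a PV's required node affinity against a node's labels.

    Each term is a list of (label_key, allowed_values) pairs. Terms are
    ORed together, and the pairs inside one term are ANDed. An empty term
    matches nothing.
    """
    return any(
        bool(term)
        and all(node_labels.get(key) in allowed for key, allowed in term)
        for term in required_terms
    )


# Affinity shaped like the PV shown earlier in the thread.
pv_terms = [[("topology.ebs.csi.aws.com/zone", ["ap-southeast-1a"])]]

# Hypothetical template node for an ASG without the node-template tag.
untagged_template = {"topology.kubernetes.io/zone": "ap-southeast-1a"}

# Hypothetical template node for the same ASG with the tag added.
tagged_template = {
    "topology.kubernetes.io/zone": "ap-southeast-1a",
    "topology.ebs.csi.aws.com/zone": "ap-southeast-1a",
}

fits_untagged = volume_fits_node(pv_terms, untagged_template)
fits_tagged = volume_fits_node(pv_terms, tagged_template)
print(fits_untagged, fits_tagged)
```

A template carries a single value per label, so in this model an ASG spanning several zones cannot state which zone a new node would land in. That is one way to see why a single zone per node group is recommended.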

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 4, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 4, 2023
@michalschott

michalschott commented Jul 13, 2023

I have this issue despite (I believe) having everything set up correctly.

EKS - 1.25

CA - 1.25.2:

      - command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --namespace=kube-system
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/XXX
        - --balance-similar-node-groups=true
        - --emit-per-nodegroup-metrics=true
        - --expander=most-pods,least-waste
        - --ignore-taint=node.cilium.io/agent-not-ready
        - --logtostderr=true
        - --namespace=kube-system
        - --regional=true
        - --scan-interval=1m
        - --skip-nodes-with-local-storage=false
        - --skip-nodes-with-system-pods=false
        - --stderrthreshold=error
        - --v=0
        env:
        - name: AWS_REGION
          value: eu-west-1

My 3 ASGs are tagged as follows (each of them covers a single availability zone: a, b, or c):

k8s.io/cluster-autoscaler/node-template/label/failure-domain.beta.kubernetes.io/zone   eu-west-1a / eu-west-1b / eu-west-1c
k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type         m5.2xlarge
k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone            eu-west-1a / eu-west-1b / eu-west-1c
k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/region            eu-west-1
k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone              eu-west-1a / eu-west-1b / eu-west-1c
k8s.io/cluster-autoscaler/node-template/taint/node.cilium.io/agent-not-ready           true:NO_EXECUTE   Yes

I'm running Prometheus as a StatefulSet with a PVC (affinity rules are set to ensure replicas are spread across AZs and hosts):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    polaris.fairwinds.com/automountServiceAccountToken-exempt: "true"
    prometheus-operator-input-hash: "4772490143308579296"
  creationTimestamp: "2023-03-03T20:52:48Z"
  generation: 56
  labels:
    app: kube-prometheus-stack-prometheus
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 47.0.0
    argocd.argoproj.io/instance: xxx-prometheus
    chart: kube-prometheus-stack-47.0.0
    heritage: Helm
    operator.prometheus.io/mode: server
    operator.prometheus.io/name: prometheus-prometheus
    operator.prometheus.io/shard: "0"
    release: prometheus
  name: prometheus-prometheus-prometheus
  namespace: prometheus
  ownerReferences:
  - apiVersion: monitoring.coreos.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: Prometheus
    name: prometheus-prometheus
    uid: ce818fdf-02b4-4718-a430-f4ff4c5acbc5
  resourceVersion: "342440131"
  uid: 662e082a-af26-40e4-b39e-d354a023fe0a
spec:
  podManagementPolicy: Parallel
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: prometheus-prometheus
      app.kubernetes.io/managed-by: prometheus-operator
      app.kubernetes.io/name: prometheus
      operator.prometheus.io/name: prometheus-prometheus
      operator.prometheus.io/shard: "0"
      prometheus: prometheus-prometheus
  serviceName: prometheus-operated
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        kubectl.kubernetes.io/default-container: prometheus
        linkerd.io/inject: enabled
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: prometheus-prometheus
        app.kubernetes.io/managed-by: prometheus-operator
        app.kubernetes.io/name: prometheus
        app.kubernetes.io/version: 2.44.0
        operator.prometheus.io/name: prometheus-prometheus
        operator.prometheus.io/shard: "0"
        prometheus: prometheus-prometheus
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/instance: prometheus-prometheus
                app.kubernetes.io/name: prometheus
                prometheus: prometheus-prometheus
            topologyKey: topology.kubernetes.io/zone
          - labelSelector:
              matchLabels:
                app.kubernetes.io/instance: prometheus-prometheus
                app.kubernetes.io/name: prometheus
                prometheus: prometheus-prometheus
            topologyKey: kubernetes.io/hostname
      automountServiceAccountToken: true
      containers:
      - args:
        - --web.console.templates=/etc/prometheus/consoles
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --config.file=/etc/prometheus/config_out/prometheus.env.yaml
        - --web.enable-lifecycle
        - --web.external-url=https://prometheus.xxx.xxx
        - --web.route-prefix=/
        - --log.level=error
        - --log.format=json
        - --storage.tsdb.retention.time=3h
        - --storage.tsdb.path=/prometheus
        - --storage.tsdb.wal-compression
        - --web.config.file=/etc/prometheus/web_config/web-config.yaml
        - --storage.tsdb.max-block-duration=2h
        - --storage.tsdb.min-block-duration=2h
        image: XXX.dkr.ecr.eu-west-1.amazonaws.com/quay.io/prometheus/prometheus:v2.44.0
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /-/healthy
            port: http-web
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        name: prometheus
        ports:
        - containerPort: 9090
          name: http-web
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/ready
            port: http-web
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        resources:
          limits:
            memory: 20Gi
          requests:
            cpu: 300m
            memory: 20Gi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
        startupProbe:
          failureThreshold: 60
          httpGet:
            path: /-/ready
            port: http-web
            scheme: HTTP
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 3
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/config_out
          name: config-out
          readOnly: true
        - mountPath: /etc/prometheus/certs
          name: tls-assets
          readOnly: true
        - mountPath: /prometheus
          name: prometheus-prometheus-prometheus-db
          subPath: prometheus-db
        - mountPath: /etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
          name: prometheus-prometheus-prometheus-rulefiles-0
        - mountPath: /etc/prometheus/web_config/web-config.yaml
          name: web-config
          readOnly: true
          subPath: web-config.yaml
      - args:
        - --listen-address=:8080
        - --reload-url=http://127.0.0.1:9090/-/reload
        - --config-file=/etc/prometheus/config/prometheus.yaml.gz
        - --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
        - --watched-dir=/etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
        - --log-level=error
        - --log-format=json
        command:
        - /bin/prometheus-config-reloader
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SHARD
          value: "0"
        image: XXX.dkr.ecr.eu-west-1.amazonaws.com/quay.io/prometheus-operator/prometheus-config-reloader:v0.66.0
        imagePullPolicy: Always
        name: config-reloader
        ports:
        - containerPort: 8080
          name: reloader-web
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 50Mi
          requests:
            cpu: 50m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/config
          name: config
        - mountPath: /etc/prometheus/config_out
          name: config-out
        - mountPath: /etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
          name: prometheus-prometheus-prometheus-rulefiles-0
      - args:
        - sidecar
        - --prometheus.url=http://127.0.0.1:9090/
        - '--prometheus.http-client={"tls_config": {"insecure_skip_verify":true}}'
        - --grpc-address=:10901
        - --http-address=:10902
        - --objstore.config=$(OBJSTORE_CONFIG)
        - --tsdb.path=/prometheus
        - --log.level=error
        - --log.format=json
        env:
        - name: OBJSTORE_CONFIG
          valueFrom:
            secretKeyRef:
              key: config
              name: thanos-config
        image: XXX.dkr.ecr.eu-west-1.amazonaws.com/bitnami/thanos:0.31.0
        imagePullPolicy: Always
        name: thanos-sidecar
        ports:
        - containerPort: 10902
          name: http
          protocol: TCP
        - containerPort: 10901
          name: grpc
          protocol: TCP
        resources:
          limits:
            memory: 256Mi
          requests:
            cpu: 10m
            memory: 256Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /prometheus
          name: prometheus-prometheus-prometheus-db
          subPath: prometheus-db
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - --watch-interval=0
        - --listen-address=:8080
        - --config-file=/etc/prometheus/config/prometheus.yaml.gz
        - --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
        - --watched-dir=/etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
        - --log-level=error
        - --log-format=json
        command:
        - /bin/prometheus-config-reloader
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: SHARD
          value: "0"
        image: XXX.dkr.ecr.eu-west-1.amazonaws.com/quay.io/prometheus-operator/prometheus-config-reloader:v0.66.0
        imagePullPolicy: Always
        name: init-config-reloader
        ports:
        - containerPort: 8080
          name: reloader-web
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 50Mi
          requests:
            cpu: 50m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/prometheus/config
          name: config
        - mountPath: /etc/prometheus/config_out
          name: config-out
        - mountPath: /etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
          name: prometheus-prometheus-prometheus-rulefiles-0
      nodeSelector:
        node.kubernetes.io/instance-type: m5.2xlarge
      priorityClassName: prometheus
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 2000
        runAsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: prometheus-prometheus
      serviceAccountName: prometheus-prometheus
      terminationGracePeriodSeconds: 600
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/instance: prometheus-prometheus
            app.kubernetes.io/name: prometheus
            prometheus: prometheus-prometheus
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      volumes:
      - name: config
        secret:
          defaultMode: 420
          secretName: prometheus-prometheus-prometheus
      - name: tls-assets
        projected:
          defaultMode: 420
          sources:
          - secret:
              name: prometheus-prometheus-prometheus-tls-assets-0
      - emptyDir:
          medium: Memory
        name: config-out
      - configMap:
          defaultMode: 420
          name: prometheus-prometheus-prometheus-rulefiles-0
        name: prometheus-prometheus-prometheus-rulefiles-0
      - name: web-config
        secret:
          defaultMode: 420
          secretName: prometheus-prometheus-prometheus-web-config
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: prometheus-prometheus-prometheus-db
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: ebs-sc-preserve
      volumeMode: Filesystem
    status:
      phase: Pending
~ k get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                                               STORAGECLASS      REASON   AGE
pvc-e6df1f14-4f62-41ce-8f21-97b73b0c055f   10Gi       RWO            Retain           Bound    prometheus/prometheus-prometheus-prometheus-db-prometheus-prometheus-prometheus-0   ebs-sc-preserve            61d
pvc-f40a6589-6fcf-4419-9486-70e5efa43575   10Gi       RWO            Retain           Bound    prometheus/prometheus-prometheus-prometheus-db-prometheus-prometheus-prometheus-1   ebs-sc-preserve            9d

~ k describe pv pvc-e6df1f14-4f62-41ce-8f21-97b73b0c055f pvc-f40a6589-6fcf-4419-9486-70e5efa43575
Name:              pvc-e6df1f14-4f62-41ce-8f21-97b73b0c055f
Labels:            <none>
Annotations:       pv.kubernetes.io/provisioned-by: ebs.csi.aws.com
                   volume.kubernetes.io/provisioner-deletion-secret-name:
                   volume.kubernetes.io/provisioner-deletion-secret-namespace:
Finalizers:        [kubernetes.io/pv-protection external-attacher/ebs-csi-aws-com]
StorageClass:      ebs-sc-preserve
Status:            Bound
Claim:             prometheus/prometheus-prometheus-prometheus-db-prometheus-prometheus-prometheus-0
Reclaim Policy:    Retain
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          10Gi
Node Affinity:
  Required Terms:
    Term 0:        topology.ebs.csi.aws.com/zone in [eu-west-1c]
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            ebs.csi.aws.com
    FSType:            ext4
    VolumeHandle:      vol-08b0f4a31f192dad7
    ReadOnly:          false
    VolumeAttributes:      storage.kubernetes.io/csiProvisionerIdentity=1683859406228-8081-ebs.csi.aws.com
Events:                <none>


Name:              pvc-f40a6589-6fcf-4419-9486-70e5efa43575
Labels:            <none>
Annotations:       pv.kubernetes.io/provisioned-by: ebs.csi.aws.com
                   volume.kubernetes.io/provisioner-deletion-secret-name:
                   volume.kubernetes.io/provisioner-deletion-secret-namespace:
Finalizers:        [kubernetes.io/pv-protection external-attacher/ebs-csi-aws-com]
StorageClass:      ebs-sc-preserve
Status:            Bound
Claim:             prometheus/prometheus-prometheus-prometheus-db-prometheus-prometheus-prometheus-1
Reclaim Policy:    Retain
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          10Gi
Node Affinity:
  Required Terms:
    Term 0:        topology.ebs.csi.aws.com/zone in [eu-west-1b]
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            ebs.csi.aws.com
    FSType:            ext4
    VolumeHandle:      vol-07d31d533b2e01a4b
    ReadOnly:          false
    VolumeAttributes:      storage.kubernetes.io/csiProvisionerIdentity=1687797020030-8081-ebs.csi.aws.com
Events:                <none>

Every night between 00:00 and 06:00 (I believe this is when AWS rebalancing happens), at least one of the Prometheus replicas gets stuck in the Pending state. Once cluster-autoscaler is restarted (k -n kube-system rollout restart deploy cluster-autoscaler), the ASG is scaled up properly.

For now I had to set minCapacity = 1 for these ASGs to prevent such situations.

@mmerrill3

This is closely related to issue #4739, which was fixed in cluster autoscaler version 1.22 onward. If you look at the function that generates a hypothetical new node to satisfy the pending pod, the new label that is needed to satisfy volumes created by the EBS CSI driver is not part of that function. It will not scale up unless you add the tag to the ASG manually.

Current function:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L409

The next function is why adding the labels to the ASG makes this work:

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L423

Since this label is widely used now, maybe we should update the buildGenericLabels function to also set the label topology.ebs.csi.aws.com/zone on the new node when it's hypothetically being built.
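
A minimal sketch of that idea, written as a Python model rather than the actual Go function: the name `build_generic_labels`, its parameters, and the `include_ebs_csi_zone` switch are assumptions made for illustration, and the label set is limited to labels that appear in this thread.

```python
def build_generic_labels(instance_type, region, zone, include_ebs_csi_zone=True):
    """Build labels for a hypothetical node that does not exist yet.

    Simplified model: the real function sets more labels than shown here.
    """
    labels = {
        "node.kubernetes.io/instance-type": instance_type,
        "topology.kubernetes.io/region": region,
        "topology.kubernetes.io/zone": zone,
    }
    if include_ebs_csi_zone:
        # Proposed addition: mirror the zone into the EBS CSI topology label,
        # so volumes pinned by that label can match the hypothetical node.
        labels["topology.ebs.csi.aws.com/zone"] = zone
    return labels


before = build_generic_labels(
    "m5.2xlarge", "eu-west-1", "eu-west-1c", include_ebs_csi_zone=False
)
after = build_generic_labels("m5.2xlarge", "eu-west-1", "eu-west-1c")
print(sorted(set(after) - set(before)))
```

With the extra label set by default, a PV whose node affinity requires `topology.ebs.csi.aws.com/zone in [eu-west-1c]` could match the hypothetical node without a manual ASG tag, assuming the node group covers a single zone.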

@msvticket
Contributor

I can take a stab at providing a PR with a fix.
