
Autoscaler doesn't recognize nvidia.com/gpu when scaling up from 0 to n nodes on AWS. #929

Closed · alexnederlof opened this issue Jun 7, 2018 · 1 comment

alexnederlof commented Jun 7, 2018

I set up a GPU pool, and the autoscaler scales it up fine from 1 to n nodes, but not from 0 to n nodes. The error message is:

I0605 11:27:29.865576       1 scale_up.go:54] Pod default/simple-gpu-test-6f48d9555d-l9822 is unschedulable
I0605 11:27:29.961051       1 scale_up.go:86] Upcoming 0 nodes
I0605 11:27:30.005163       1 scale_up.go:146] Scale-up predicate failed: PodFitsResources predicate mismatch, cannot put default/simple-gpu-test-6f48d9555d-l9822 on template-node-for-gpus.ci.k8s.local-5829202798403814789, reason: Insufficient nvidia.com/gpu
I0605 11:27:30.005262       1 scale_up.go:175] No pod can fit to gpus.ci.k8s.local
I0605 11:27:30.005324       1 scale_up.go:180] No expansion options
I0605 11:27:30.005393       1 static_autoscaler.go:299] Calculating unneeded nodes
I0605 11:27:30.008919       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"simple-gpu-test-6f48d9555d-l9822", UID:"3416d787-68b3-11e8-8e8f-0639a6e973b0", APIVersion:"v1", ResourceVersion:"12429157", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added)
I0605 11:27:30.031707       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler

This is on Kubernetes 1.9.6 with autoscaler 1.1.2.
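For context, the autoscaler is told it may scale this group down to zero. A minimal sketch of the relevant fragment of the cluster-autoscaler pod spec is below; the ASG name comes from the log above, while the image tag, the max size of 10, and the exact flag set are assumptions rather than my literal manifest:

# Fragment of the cluster-autoscaler Deployment's pod spec (sketch, details assumed).
containers:
- name: cluster-autoscaler
  image: k8s.gcr.io/cluster-autoscaler:v1.1.2   # image/tag assumed
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=0:10:gpus.ci.k8s.local   # min size 0 is what exercises the scale-from-zero path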

The nodes carry the label kops.k8s.io/instancegroup=gpus, which is also present as a node-template tag on the Auto Scaling group on AWS:

{
    "ResourceType": "auto-scaling-group",
    "ResourceId": "gpus.ci.k8s.local",
    "PropagateAtLaunch": true,
    "Value": "gpus",
    "Key": "k8s.io/cluster-autoscaler/node-template/label/kops.k8s.io/instancegroup"
},
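That tag is produced by the kops InstanceGroup for this pool. Roughly, the InstanceGroup looks like the sketch below; the instance type and sizes are placeholders rather than my exact values, the labels are the part that matters:

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  name: gpus
  labels:
    kops.k8s.io/cluster: ci.k8s.local
spec:
  role: Node
  machineType: p2.xlarge   # assumed GPU instance type
  minSize: 0
  maxSize: 10              # assumed
  nodeLabels:
    kops.k8s.io/instancegroup: gpus   # label seen on the running nodes
  cloudLabels:
    # becomes the ASG tag shown above, so the autoscaler's template node carries the label
    k8s.io/cluster-autoscaler/node-template/label/kops.k8s.io/instancegroup: gpus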

If I start a node, I see it has the required capacity:

Capacity:
 cpu:             4
 memory:          62884036Ki
 nvidia.com/gpu:  1
 pods:            110

This is the simple deployment I use to test it:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: simple-gpu-test
spec: 
  replicas: 1
  template:
    metadata:
      labels:
        app: "simplegputest"
    spec:
      containers: 
      - name: "nvidia-smi-gpu"
        image: "nvidia/cuda:8.0-cudnn5-runtime"
        resources: 
          limits: 
             nvidia.com/gpu: 1 # requesting 1 GPU
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do nvidia-smi; sleep 5; done;" ]
      volumes:
      - hostPath:
          path: /usr/local/nvidia
        name: nvidia

Related to #321, where I reported this earlier.

@alexnederlof (Author) commented:

Sorry, this is a duplicate of #903.
