Scaling up from 0 nodes on AWS, CA not aware of custom resources #321
Comments
Is the only way to do this to use labels and nodeSelectors?
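(For reference, a minimal sketch of what that label/nodeSelector approach could look like; the pod name is illustrative and the label key is borrowed from the kops configs later in this thread, so treat it as an assumption rather than a confirmed recommendation.)

```yaml
# Hypothetical pod pinned to the GPU instance group via a nodeSelector.
# The label key matches the kops nodeLabels shown later in this thread.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example                     # illustrative name
spec:
  nodeSelector:
    kops.k8s.io/instancegroup: gpus     # assumed node label
  containers:
  - name: cuda
    image: nvidia/cuda:8.0-cudnn5-runtime
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
```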
GPU information is not extracted. cc: @sethpollack
Any idea how to pull that info from the ASG?
Actually you can add that info to the
I can push up a fix soon. @7chenko would you be able to test it?
Yup, I will test!
Thanks!
Confirmed this works, scaling up from 0 triggered. Thanks!
Weirdly, this works when the nodes are g2.2xlarge instances, but not when they are p2.xlarge instances. Same error as before:
What could cause this difference in behavior?
Yes, we pull that data from https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/ec2_instance_types.go#L444, which is generated by https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/ec2_instance_types/gen.go. For some reason that info is not getting parsed correctly.
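(For context: the generated table maps each instance type to its vCPU, memory, and GPU counts, and gen.go derives the GPU count from the per-instance attributes in the AWS pricing offer file linked below. Roughly the relevant fragment looks like the sketch that follows; it is shown as YAML for readability, the real offer file is JSON, and the values are assumptions for illustration rather than copied from the file.)

```yaml
# Sketch of the per-instance attributes gen.go reads from the offer file
# (illustrative values, not copied from the actual file).
attributes:
  instanceType: p2.xlarge
  vcpu: "4"
  memory: "61 GiB"
  gpu: "1"   # if this attribute is missing or "0", the generated table reports no GPUs,
             # and a scale-up from 0 nodes fails with an "Insufficient ... gpu" error
```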
Ok, so it is parsing correctly. AWS just isn't providing the GPU data for the p2 instances in https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/us-west-2/index.json
Gotcha, thanks for that. Will contact AWS with that info.
Ok thanks
Confirmed that AWS has now fixed the gpu data for p2.xxx: https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/us-west-2/index.json
Thanks! I'll push an update.
Hmm, doesn't work for me. I get the same error. I have two instance groups, one with CPUs, and a new one called gpus that I want to be able to scale down to 0 nodes:

```yaml
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-05-31T09:27:31Z
  labels:
    kops.k8s.io/cluster: ci.k8s.local
  name: gpus
spec:
  cloudLabels:
    instancegroup: gpus
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/node-template/label: ""
  image: ami-4450543d
  kubelet:
    featureGates:
      DevicePlugins: "true"
  machineType: p2.xlarge
  maxPrice: "0.5"
  maxSize: 3
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: gpus
    spot: "true"
  role: Node
  rootVolumeOptimization: true
  subnets:
  - eu-west-1c
```

And the autoscaler deployment has:

```yaml
spec:
  containers:
  - command:
    - ./cluster-autoscaler
    - --v=4
    - --stderrthreshold=info
    - --cloud-provider=aws
    - --skip-nodes-with-local-storage=false
    - --nodes=0:3:gpus.ci.k8s.local
    env:
    - name: AWS_REGION
      value: eu-west-1
    image: k8s.gcr.io/cluster-autoscaler:v1.1.2
```

Now I try to deploy a simple GPU test:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: simple-gpu-test
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: "simplegputest"
    spec:
      containers:
      - name: "nvidia-smi-gpu"
        image: "nvidia/cuda:8.0-cudnn5-runtime"
        resources:
          limits:
            nvidia.com/gpu: 1 # requesting 1 GPU
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do nvidia-smi; sleep 5; done;" ]
      volumes:
      - hostPath:
          path: /usr/local/nvidia
        name: nvidia
```

I expect the instance group to go from 0 to 1, but the autoscaler logs show:
When I start a node by setting the minimum to 1, I see that it has the capacity:
Finally, when I set the min pool size to 1, it can scale from 1 to 3 automatically. It just doesn't do 0 to 1.
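(A side note on the cloudLabels in the InstanceGroup above: `k8s.io/cluster-autoscaler/node-template/label: ""` carries no label key or value, so it cannot advertise any node labels to the autoscaler. If the intent was to expose labels for scale-from-zero, the AWS provider's tag scheme is, to the best of my knowledge, one tag per label, roughly as sketched below; treat the exact tag form as an assumption and check the cluster-autoscaler cloudprovider/aws README for the version in use.)

```yaml
# Hypothetical per-label node-template tags (tag form assumed; verify
# against the cluster-autoscaler AWS cloud provider documentation).
cloudLabels:
  k8s.io/cluster-autoscaler/enabled: ""
  k8s.io/cluster-autoscaler/node-template/label/kops.k8s.io/instancegroup: gpus
```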
This has broken for me from CA 1.1.0 to 1.2.2. Same configuration now fails to scale up from 0 nodes with "Insufficient nvidia.com/gpu". Reverting back to 1.1.0 fixes it. (Kubernetes 1.10.0)
When scaling up from 0 nodes on AWS, how can I make cluster-autoscaler aware of custom resources on the nodes, such as "alpha.kubernetes.io/nvidia-gpu"?
Using kops 1.7.0, Kubernetes 1.7.5, and cluster-autoscaler 0.6.1, when I have 0 nodes running, starting a job with "resources: limits: alpha.kubernetes.io/nvidia-gpu: 1" results in CA taking no action, due to the following (note "Insufficient alpha.kubernetes.io/nvidia-gpu"):
It looks like the "template-node-for-nodes" doesn't have the resources listed. However, if I start a job without the GPU requirement, a node is spun up, and then I can start the original GPU job and it gets scheduled on that node! The node looks like this (kubectl describe nodes) (note "alpha.kubernetes.io/nvidia-gpu: 1"):
New nodes are also spun up correctly as long as there is already at least 1 node running. Any idea how to make the "template" for nodes list the correct resources? Thanks!
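(For concreteness, a minimal sketch of the kind of job described above; the job name is illustrative, the image is the one used elsewhere in this thread, and the resource limit is the one quoted in the report.)

```yaml
# Minimal GPU job of the kind described above (name and image illustrative).
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-smoke-test
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: nvidia-smi
        image: nvidia/cuda:8.0-cudnn5-runtime
        command: ["nvidia-smi"]
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1   # the custom resource the 0-node template needs to advertise
```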