Install the following tools:
- awscli - 2.7.4
- kubectl - 1.22
- kuberay - 0.2.0
- helm - 3.9.0
- karpenter - 0.10.1
- kubernetes - 1.22
- python - 3.8
- ray - 1.12.1
- terraform - 1.2.1
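Before continuing, it can help to confirm that the command-line tools from the list are actually on your PATH. The following is a minimal sketch, not part of the original setup: the helper name `missing_clis` is hypothetical, and the executable names (`aws` for awscli, plus `kubectl`, `helm`, and `terraform`) are assumptions about how the tools are installed. It checks presence only, not versions.

```python
import shutil

# Executable names assumed for the CLI tools in the list above.
# kuberay and karpenter are deployed into the cluster, so they are not checked here.
REQUIRED_CLIS = ["aws", "kubectl", "helm", "terraform"]


def missing_clis(names, which=shutil.which):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_clis(REQUIRED_CLIS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required CLIs are on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out; by default it uses `shutil.which` from the standard library.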
To create this infrastructure, you need to run:
terraform init
terraform apply -auto-approve
Note: Deploying this infrastructure can take more than 15 minutes.
Note: This creates AWS resources that incur charges.
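Update your kubeconfig for the new cluster and set the default namespace of the current context to karpenter:
aws eks --region us-east-1 update-kubeconfig --name mlops-world-22-demo
kubectl config set-context --current --namespace=karpenter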
Run:
kubectl get po
You should see something like:
NAME                            READY   STATUS    RESTARTS   AGE
karpenter-6df4bc5d97-n9v9p      2/2     Running   0          26h
ray-operator-8445c997c8-lm8dj   1/1     Running   0          23h
ray-ray-head-type-nkdxm         1/1     Running   0          23h
ray-ray-worker-type-6kqwl       1/1     Running   0          8m11s
ray-ray-worker-type-96smc       1/1     Running   0          8m32s
Forward port 10001 of the Ray head service to your local machine:
kubectl -n karpenter port-forward service/ray-ray-head 10001:10001
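With the port-forward running, a local Python process can reach the cluster through Ray Client. The following is a minimal sketch, assuming that the forwarded port 10001 is the head node's Ray Client server port and that the locally installed Ray and Python versions match those on the cluster. The helper `client_address` and the `square` task are illustrative names, not part of this project.

```python
def client_address(host="localhost", port=10001):
    """Build a Ray Client address for the forwarded head service port."""
    return f"ray://{host}:{port}"


def main():
    # Imported here so the address helper can be used without Ray installed.
    import ray

    # Connect to the remote cluster through the local port-forward.
    ray.init(address=client_address())

    @ray.remote
    def square(x):
        return x * x

    # Run a few trivial tasks on the cluster and collect the results.
    print(ray.get([square.remote(i) for i in range(4)]))

    ray.shutdown()


if __name__ == "__main__":
    main()
```

If the connection fails, check that the port-forward command is still running in another terminal.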
Cleanly remove a node (set NODE_NAME to a name listed by kubectl get nodes):
kubectl drain "${NODE_NAME}" --delete-emptydir-data --ignore-daemonsets
Because Karpenter manages node resources outside of Terraform, Karpenter-created resources must be de-provisioned before the remaining resources are removed with Terraform.
kubectl patch rayclusters/ray -p '{"metadata":{"finalizers":[]}}' --type=merge
helm uninstall ray
helm uninstall karpenter
kubectl delete po -n karpenter -l ray-cluster-name=ray
kubectl delete node -l 'karpenter.sh/provisioner-name in (cpu-on-demand,cpu-spot,gpu)'
If deleting the nodes hangs, you can terminate the instances directly with:
aws --region us-east-1 ec2 describe-instances --filters Name=tag:karpenter.sh/discovery,Values=mlops-world-22-demo --query 'Reservations[*].Instances[*].InstanceId' --output text | xargs aws --region us-east-1 ec2 terminate-instances --instance-ids
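When no instances carry the discovery tag, the pipeline above may still invoke terminate-instances with an empty ID list, depending on the xargs implementation. The following is a hedged sketch of the same lookup with an explicit empty check. It assumes boto3 is installed and AWS credentials are configured, neither of which is part of the tool list; the function names `instance_ids` and `terminate_discovered` are hypothetical.

```python
def instance_ids(pages):
    """Flatten describe_instances response pages into a list of instance IDs."""
    return [
        instance["InstanceId"]
        for page in pages
        for reservation in page.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


def terminate_discovered(region="us-east-1", cluster="mlops-world-22-demo"):
    # boto3 is an added dependency (an assumption of this sketch), imported lazily.
    import boto3

    ec2 = boto3.client("ec2", region_name=region)
    # Same tag filter as the aws CLI command above.
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "tag:karpenter.sh/discovery", "Values": [cluster]}]
    )
    ids = instance_ids(pages)
    if not ids:
        print("No matching instances found; nothing to terminate.")
        return ids
    ec2.terminate_instances(InstanceIds=ids)
    print("Requested termination of: " + ", ".join(ids))
    return ids


if __name__ == "__main__":
    terminate_discovered()
```

Like the CLI command, this does not filter on instance state, so instances that are already shutting down may be listed again.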
Remove the resources created by Terraform:
terraform destroy -auto-approve