Description
In this freshdesk ticket, Julius from LEAP asks for help debugging why SIGKILL (signal 9) was sent to a dask-worker. This github issue is scoped to help us rule out one specific cause of future failures - that a cheaper spot instance has been pre-empted - by providing a) monitoring for pod evictions and b) a documented way to check whether such events have occurred on AWS/EKS and GCP/GKE.
Background
Spot instances, also known as pre-emptible instances, differ from "on demand" instances in that you aren't guaranteed to be able to request them or to keep them running. For this reason, they are significantly cheaper.
Two features
Monitoring
I've opened jupyterhub/grafana-dashboards#65 to help us work towards monitoring pod evictions, and I believe the termination of pods on a spot instance node is performed via pod evictions.
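As a sketch of what such a panel could query: assuming kube-state-metrics is scraped by a Prometheus server (here port-forwarded to localhost:9090 - a hypothetical setup), its `kube_pod_status_reason` metric exposes an `Evicted` reason that could back an eviction counter.

```shell
# Hypothetical ad-hoc query against a port-forwarded Prometheus server,
# counting pods currently reported with status reason "Evicted" via the
# kube-state-metrics metric kube_pod_status_reason.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(kube_pod_status_reason{reason="Evicted"})'
```

Whether this metric captures spot-preemption-driven evictions specifically is part of what needs verifying.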
Documentation to get more details
I suspect it's relevant to see more details about such an event than just a blip in grafana indicating a pre-emption - details such as a message on why it happened. So if for example grafana provides a counter for how many evictions take place, we may still want to learn more about them when they are observed to happen. I suspect there will be information in the k8s Event resources, which only stay around for about 60 minutes in a k8s cluster, but there may also be logs or notices made in some way outside k8s. Either capturing the k8s Events or the cloud provider details would be fine.
If we can learn how to retroactively inspect k8s Events related to pod evictions, that is a more general benefit, as pods can be evicted due to manual drains, memory pressure, running out of ephemeral storage, etc.
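For the short window while Events are still retained, something like the following could surface eviction details (a sketch - the exact `reason` values worth filtering on need verifying):

```shell
# List recent pod eviction events across all namespaces, newest last.
# Note: k8s Events are retained for roughly an hour by default, so this
# only works shortly after the incident.
kubectl get events --all-namespaces \
  --field-selector reason=Evicted \
  --sort-by=.lastTimestamp
```

Capturing these into long-term storage (e.g. via an event exporter) would be needed to inspect them retroactively beyond that window.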
Two cloud providers to focus on
GCP's GKE
Compute Engine gives you 30 seconds to shut down when you're preempted, letting you save your work in progress for later.
In practice, a non-system Pod on GKE, such as a dask-gateway cluster's worker pod, gets 15 seconds rather than the 30 seconds a standalone VM has.
A SIGTERM / 15 signal is sent to the pod's containers, and after 15 seconds a SIGKILL / 9 signal is sent, forcefully stopping them. Ideally, the dask-worker being terminated would let the dask-scheduler know about the situation and terminate gracefully, but I'm not sure how it works.
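GCP's docs describe checking for preemption by listing Compute Engine operations of the preemption type, so something along these lines should show whether any VM in the project was preempted (output columns are an assumption on my part):

```shell
# List recent preemption operations for the project; each entry names the
# affected VM (targetLink) and when the preemption happened (insertTime).
gcloud compute operations list \
  --filter="operationType=compute.instances.preempted" \
  --format="table(targetLink, insertTime)"
```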
AWS's EKS
TODO: Provide initial research and background here (anyone is welcome to update this issue!)
Action points
I'm not sure - there is a lot of investigative work to do initially. Here are some ideas for action points.
For monitoring:
- Verify the belief that a spot instance shutdown implies pod eviction with SIGTERM being sent out
- Work to resolve Dashboard panel for pod evictions (out of memory, out of ephemeral space, manual node drains) jupyterhub/grafana-dashboards#65
For documentation:
- Read up on docs, search the internet, etc. for ways to learn if a spot instance VM has been removed
Related
- GCP's general docs about spot instances
- About the preemption process
You can simulate the preemption of a VM by stopping the VM or deleting the VM accordingly.
- GCP's GKE docs about spot instances
- AWS's docs on determining if AWS terminated a spot instance
- AWS's general docs about spot instances
- AWS's EKS best practices of using spot instances
- AWS's docs about terminating spot instances