Description
In this freshdesk ticket, Julius from LEAP asks for help debugging why SIGKILL (signal 9) was sent to a dask-worker. This github issue is scoped to help us rule out one specific cause of future failures - that a cheaper spot instance has been pre-empted - by providing a) monitoring for pod evictions and b) a documented way to check whether such events have occurred on AWS/EKS and GCP/GKE.
Background
Spot instances, also known as pre-emptible instances, differ from "on demand" instances in that you aren't guaranteed to be able to request them or to keep them running. For this reason, they are significantly cheaper.
Two features
Monitoring
I've opened jupyterhub/grafana-dashboards#65 to help us work towards monitoring pod evictions, and I believe the termination of pods on a spot instance node is performed via pod evictions.
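As a sketch of what such a panel could query: assuming kube-state-metrics is scraped by a Prometheus server (here port-forwarded to localhost:9090 - a hypothetical setup), its `kube_pod_status_reason` metric exposes an `Evicted` reason that could back an eviction counter.

```shell
# Hypothetical ad-hoc query against a port-forwarded Prometheus server,
# counting pods currently reported with status reason "Evicted" via the
# kube-state-metrics metric kube_pod_status_reason.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(kube_pod_status_reason{reason="Evicted"})'
```

Whether this metric captures spot-preemption-driven evictions specifically is part of what needs verifying.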
Documentation to get more details
I suspect it's relevant to see more details about such an event than just a blip in grafana indicating a pre-emption - details such as a message on why it happened. So if for example grafana provides a counter for how many evictions take place, we may still want to learn more about them when they are observed to happen. I suspect there will be information in the k8s Event resources, which only stay around for about 60 minutes in a k8s cluster, but there may also be logs or notices made in some way outside k8s. Either capturing the k8s Events or the cloud provider details would be fine.
If we can learn how to retroactively inspect k8s Events related to pod evictions, that is a more general benefit, as pods can be evicted due to manual drains, memory pressure, running out of ephemeral storage, etc.
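For the short window while Events are still retained, something like the following could surface eviction details (a sketch - the exact `reason` values worth filtering on need verifying):

```shell
# List recent pod eviction events across all namespaces, newest last.
# Note: k8s Events are retained for roughly an hour by default, so this
# only works shortly after the incident.
kubectl get events --all-namespaces \
  --field-selector reason=Evicted \
  --sort-by=.lastTimestamp
```

Capturing these into long-term storage (e.g. via an event exporter) would be needed to inspect them retroactively beyond that window.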
Two cloud providers to focus on
GCP's GKE
Compute Engine gives you 30 seconds to shut down when you're preempted, letting you save your work in progress for later.
In practice, a non-system Pod on GKE, such as a dask-gateway cluster's worker pod, gets 15 seconds rather than the 30 seconds a standalone VM has.
A SIGTERM / 15 signal is sent to the pod's containers, and after 15 seconds a SIGKILL / 9 signal is sent, forcefully stopping them. Ideally, the dask-worker being terminated would let the dask-scheduler know about the situation and terminate gracefully, but I'm not sure how it works.
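GCP's docs describe checking for preemption by listing Compute Engine operations of the preemption type, so something along these lines should show whether any VM in the project was preempted (output columns are an assumption on my part):

```shell
# List recent preemption operations for the project; each entry names the
# affected VM (targetLink) and when the preemption happened (insertTime).
gcloud compute operations list \
  --filter="operationType=compute.instances.preempted" \
  --format="table(targetLink, insertTime)"
```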
AWS's EKS
TODO: Provide initial research and background here (anyone is welcome to update this issue!)
Action points
I'm not sure - there is a lot of investigative work to do initially. Here are some ideas for action points.
For monitoring:
- Verify the belief that a spot instance shutdown implies pod eviction with SIGTERM being sent out
- Work to resolve Dashboard panel for pod evictions (out of memory, out of ephemeral space, manual node drains) jupyterhub/grafana-dashboards#65
For documentation:
- Read up on docs, search the internet, etc. for ways to learn if a spot instance VM has been removed
Related
- GCP's general docs about spot instances
- About the preemption process
You can simulate the preemption of a VM by stopping the VM or deleting the VM accordingly.
- GCP's GKE docs about spot instances
- AWS's docs on determining if AWS terminated a spot instance
- AWS's general docs about spot instances
- AWS's EKS best practices of using spot instances
- AWS's docs about terminating spot instances