
Monitoring EKS/GKE spot instance pre-emption events #2369


Description

In this freshdesk ticket, Julius with LEAP asks for help debugging why a SIGKILL (signal 9) was sent to a dask-worker. This github issue is scoped to help us rule out one specific reason for future failures - that a cheaper spot instance has been pre-empted - by providing a) monitoring for pod evictions and b) a documented way to see if such events have occurred on AWS/EKS and GCP/GKE.

Background

Spot instances, also known as pre-emptible instances, differ from "on demand" instances in that you aren't guaranteed to be able to request them or to keep them running. For this reason, they are significantly cheaper.

Two features

Monitoring

I've opened jupyterhub/grafana-dashboards#65 to help us work towards monitoring pod evictions, and I believe the termination of pods on a pre-empted spot instance node is done via pod evictions.

Documentation to get more details

I suspect it's relevant to see more details about such an event than just a blip in grafana indicating a pre-emption - details such as a message on why it happened. So even if grafana provides a counter for how many evictions take place, we may still want to learn more about them when they are observed to happen. I suspect there will be information in the k8s Event resources, which only stay around for 60 minutes in a k8s cluster, but there may also be logs or notices available outside k8s as well. Either capturing the k8s Events or the cloud provider details would be fine.

If we can learn how to retroactively inspect k8s Events related to pod evictions, that would be a more general benefit, as pods can also be evicted due to manual node drains, memory pressure, running out of ephemeral storage, etc.
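As a starting point, here is a minimal sketch of listing eviction-related k8s Events with the official kubernetes Python client. It assumes the Events haven't yet expired and that eviction Events carry the reason "Evicted" - both assumptions worth verifying as part of this issue.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig
# (use config.load_incluster_config() when running inside the cluster).
config.load_kube_config()

v1 = client.CoreV1Api()

# List Events across all namespaces whose reason is "Evicted".
# Note: k8s Events are short-lived (roughly an hour by default), so this
# only works shortly after the eviction happened.
events = v1.list_event_for_all_namespaces(field_selector="reason=Evicted")

for ev in events.items:
    obj = ev.involved_object
    print(f"{ev.last_timestamp}  {obj.kind}/{obj.name} in {obj.namespace}: {ev.message}")
```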

Two cloud providers to focus on

GCP's GKE

Compute Engine gives you 30 seconds to shut down when you're preempted, letting you save your work in progress for later.

In practice, a non-system Pod on a GKE based k8s cluster, like a dask-gateway cluster's worker pod, will get 15 seconds rather than the 30 seconds a standalone VM gets.

A SIGTERM / 15 signal will be sent to the pod's containers, and after 15 seconds a SIGKILL / 9 signal is sent, which forcefully stops them. Ideally, the dask-worker being terminated would let the dask-scheduler know about the situation and then terminate, but I'm not sure how that works.
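For reference, this is a minimal sketch of how a plain Python process can react to SIGTERM within the grace period - not a description of what dask-worker actually does, which remains to be verified.

```python
import signal
import sys
import time

def handle_sigterm(signum, frame):
    # This is where cleanup would happen within the ~15 second grace period,
    # e.g. a dask-worker notifying its scheduler before SIGKILL arrives.
    print("Received SIGTERM, shutting down gracefully", flush=True)
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulated long-running work loop.
while True:
    time.sleep(1)
```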

AWS's EKS

TODO: Provide initial research and background here (anyone is welcome to update this issue!)

Action points

I'm not sure yet, there is a lot of investigative work to do initially. Here are some ideas for action points.

For monitoring:

  • Follow up on jupyterhub/grafana-dashboards#65 so pod evictions show up in grafana

For documentation:

  • Read up on cloud provider docs, search the internet, etc. for ways to learn whether a spot instance VM has been pre-empted/removed (see the sketch below for checking this from within the VM itself)
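As one possible starting point, both clouds expose instance metadata endpoints that can be polled from within the VM itself: AWS serves a spot interruption notice at `latest/meta-data/spot/instance-action`, and GCE exposes `instance/preempted`. A minimal Python sketch follows - note this is live detection from inside the node, not retroactive inspection, and the AWS part assumes IMDSv1 is reachable.

```python
import requests

def aws_spot_interruption_notice():
    """Return the spot interruption notice JSON if one has been issued, else None.

    Assumes IMDSv1 is reachable; IMDSv2-only instances need a session token first.
    """
    try:
        r = requests.get(
            "http://169.254.169.254/latest/meta-data/spot/instance-action",
            timeout=1,
        )
    except requests.RequestException:
        return None
    return r.json() if r.status_code == 200 else None

def gce_preempted():
    """Return True if GCE metadata reports the VM has been pre-empted."""
    try:
        r = requests.get(
            "http://metadata.google.internal/computeMetadata/v1/instance/preempted",
            headers={"Metadata-Flavor": "Google"},
            timeout=1,
        )
    except requests.RequestException:
        return False
    return r.ok and r.text.strip() == "TRUE"

if __name__ == "__main__":
    print("AWS spot interruption notice:", aws_spot_interruption_notice())
    print("GCE pre-empted:", gce_preempted())
```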

