FYI - Simple remedy system designed for use with NPD #199

Closed
@negz

Hello,

I wanted to bring Draino to your attention in case it's useful to others. Draino is a very simple 'remedy' system for permanent problems detected by the Node Problem Detector: it cordons and drains nodes that exhibit configurable Node Conditions.

At Planet we run a small handful of Kubernetes clusters on GCE (not GKE). We have a particular analytics workload that is really good at killing GCE persistent volumes. Without going into too much detail, we see persistent-volume-related processes (mkfs.ext4, mount, etc.) hanging forever in uninterruptible sleep, which prevents the pods that need those volumes from running. We're working with GCP to resolve this issue, but in the meantime we got tired of manually cordoning and draining affected nodes, so we wrote Draino.

Our remedy system looks like:

  1. Detect permanent node problems and set Node Conditions using the Node Problem Detector.
  2. Configure Draino to cordon and drain nodes when they exhibit the NPD's KernelDeadlock condition, or a variant of KernelDeadlock we call VolumeTaskHung.
  3. Let the Cluster Autoscaler scale down underutilised nodes, including the nodes Draino has drained.
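
To make step 2 concrete, here is a minimal in-memory sketch of the decision it describes. This is not Draino's actual code: the `Node` class, `should_drain`, and `remedy` are hypothetical names for illustration, and "drain" here just empties a list rather than evicting pods through the Kubernetes API. The only things taken from the description above are the two condition types, and the assumption that a condition counts as exhibited when its status is "True".

```python
from dataclasses import dataclass, field

# Condition types named in this issue. Treating exactly these two as the
# configured set is an assumption made for the example.
DRAIN_CONDITIONS = frozenset({"KernelDeadlock", "VolumeTaskHung"})


@dataclass
class Node:
    """Hypothetical stand-in for a Kubernetes Node object."""
    name: str
    # Maps condition type -> status string ("True", "False" or "Unknown").
    conditions: dict = field(default_factory=dict)
    unschedulable: bool = False
    pods: list = field(default_factory=list)


def should_drain(node, drain_conditions=DRAIN_CONDITIONS):
    """Return True if the node exhibits any configured condition."""
    return any(node.conditions.get(c) == "True" for c in drain_conditions)


def remedy(node, drain_conditions=DRAIN_CONDITIONS):
    """Cordon and drain the node if needed; return the pods removed from it."""
    if not should_drain(node, drain_conditions):
        return []
    node.unschedulable = True           # cordon: no new pods land here
    evicted, node.pods = node.pods, []  # drain: in-memory stand-in for eviction
    return evicted


if __name__ == "__main__":
    bad = Node("node-a", {"Ready": "True", "VolumeTaskHung": "True"}, pods=["p1", "p2"])
    print(remedy(bad), bad.unschedulable)
```

A real implementation would watch Node objects through the API server and evict pods rather than mutate local state. The sketch only shows the condition check that gates the cordon and drain.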

It's worth noting that once the Descheduler supports descheduling pods based on taints, Draino could be replaced by the Descheduler running in combination with the scheduler's TaintNodesByCondition functionality.
