
Rearchitecture: Kubernetes Backend #198

@jcrist


In #195 the server architecture was redone to improve resiliency/performance and make the kubernetes backend more kubernetes-native. In that PR we dropped Kubernetes support entirely; support will be added back in a subsequent PR. Here I'll outline how I think the new Kubernetes backend should be architected.


When running on Kubernetes, dask-gateway will be composed of the following components:

  • A dask-gateway CRD object to describe specification/state about a single dask cluster
  • One or more instances of the dask-gateway api server, running behind a ClusterIP Service. These provide the internal api routing, but users won't (usually) connect directly to the api server.
  • One or more instances of traefik configured to use the Kubernetes CRD provider (https://docs.traefik.io/providers/kubernetes-crd/). These will handle proxying both the scheduler and dashboard routes.
  • One instance of a controller, managing dask-gateway CRD objects.

When initially installed, only a single IngressRoute object will be created to expose the dask-gateway api server through the proxy. As dask clusters come and go, additional routes will be added/removed. Users will communicate directly only with the traefik proxy service; the remaining network traffic stays internal and doesn't need to be exposed.

Walking through the components individually:

Dask-Gateway CRD

An instance of a Dask-Gateway CRD object needs to contain information on how to run a single dask cluster. Here's a sketch of a proposed schema (in pseudocode):

spec:
    schedulerTemplate: ...  # A pod template for what the scheduler pod should look like
    workerTemplate: ...  # A pod template for what a single worker pod should look like
    ingressRoutes: ...  # A list of IngressRoute objects to create after the scheduler pod is running
    replicas: 0  # The number of workers to create, initially 0
status:
    # Status tracking things go here, not sure what we'd need to track

Each dask cluster created by the gateway will be backed by one of these objects.
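To make the schema sketch concrete, here's a minimal helper that composes one of these objects as the kubernetes Python client would expect it. The group/version/kind names (`gateway.dask.org/v1alpha1`, `DaskCluster`) are placeholders I made up for illustration, not a settled naming decision:

```python
def make_cluster_manifest(name, namespace, scheduler_template, worker_template,
                          ingress_routes=None, replicas=0):
    """Compose a dask-gateway CRD object for a single dask cluster.

    Follows the proposed spec above; apiVersion/kind are hypothetical.
    """
    return {
        "apiVersion": "gateway.dask.org/v1alpha1",  # placeholder group/version
        "kind": "DaskCluster",                      # placeholder kind
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "schedulerTemplate": scheduler_template,
            "workerTemplate": worker_template,
            "ingressRoutes": ingress_routes or [],
            "replicas": replicas,  # number of workers, initially 0
        },
    }
```

A manifest like this would be submitted with the kubernetes client's `CustomObjectsApi.create_namespaced_custom_object`.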

Backend Class

To plug in to the dask-gateway api server, we'd need to write a kubernetes backend. We may need to tweak the interface exposed by the dask_gateway_server.backends.base.Backend class, but I think the existing interface is pretty close to what we'll want.

Walking through the methods:

start_cluster

This is called when a new cluster is requested for a user. It should validate the request, then compose and create the dask-gateway CRD object and a corresponding Secret (storing the TLS credentials and api token). It should return the object name (or some proxy for it) so that the cluster can be looked up again. Note that when this method returns the cluster doesn't need to be running, only submitted.
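The Secret created alongside the CRD could be composed like the sketch below. The key names (`tls.crt`, `tls.key`, `api-token`) and the label key for the username are illustrative conventions, not decided:

```python
import base64

def make_secret_manifest(cluster_name, namespace, username,
                         tls_cert, tls_key, api_token):
    """Compose the Secret holding a cluster's TLS credentials and api token."""
    b64 = lambda b: base64.b64encode(b).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": cluster_name,
            "namespace": namespace,
            # Store the username in a label so clusters can be queried per-user
            "labels": {"gateway.dask.org/user": username},  # hypothetical label key
        },
        "data": {
            # Secret data values are base64-encoded bytes
            "tls.crt": b64(tls_cert),
            "tls.key": b64(tls_key),
            "api-token": b64(api_token),
        },
    }
```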

stop_cluster

This is called when a cluster is requested to stop. It should do whatever is needed to stop the CRD object, and should be a no-op if the cluster is already stopped. As above, this doesn't need to wait for the cluster to stop, only for it to be stopping.

If we want to keep some job history then stop_cluster can't delete the CRD object, since that's the only record that the job happened. If we don't care about this then deleting the CRD is fine. Other backends keep around a short job history, but that may not matter here.
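If we go the keep-history route, stop_cluster reduces to patching the CRD's status rather than deleting it. A sketch of the idempotent version, with a hypothetical `phase` status field:

```python
def stop_patch_if_needed(crd):
    """Return a status patch marking the cluster as stopping, or None.

    No-op (returns None) if the cluster is already stopping/stopped.
    The `phase` field name is illustrative.
    """
    phase = crd.get("status", {}).get("phase")
    if phase in ("STOPPING", "STOPPED"):
        return None  # already stopped or on its way down
    # Applied with e.g. CustomObjectsApi.patch_namespaced_custom_object_status
    return {"status": {"phase": "STOPPING"}}
```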

get_cluster

This is called when a user wants to look up or connect to a cluster. It is provided the cluster name (as returned by start_cluster). It should return a dask_gateway_server.models.Cluster object.

I think what we'll want is for each instance of the api server to watch for objects required to compose the Cluster model (dask-gateway CRD objects, secrets with a label selector, etc...). If the needed object is already reflected locally, we can use that. If it's not, we can try querying the k8s server in case the object was created and hasn't propagated to us yet (and then store it in the reflector locally). It should be fine to return slightly outdated information from this call.
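The cache-then-query fallback described above is simple; a sketch, with `query_api_server` standing in for a real k8s client call:

```python
def get_cluster_object(name, cache, query_api_server):
    """Look up a cluster object, preferring the local reflected cache.

    Falls back to querying the API server (the object may exist but not
    have propagated to this reflector yet), caching any hit.
    """
    obj = cache.get(name)
    if obj is not None:
        return obj  # slightly stale information is acceptable here
    obj = query_api_server(name)
    if obj is not None:
        cache[name] = obj  # store it in the local reflector
    return obj
```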

list_clusters

This is used to query clusters, optionally filtering on username and statuses. Like get_cluster above, this can be slightly out of date, so I think returning only the reflected view is fine. If not, this should map pretty easily onto a k8s query with a label selector (a cluster's username will be stored in a label).
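Filtering the reflected view would look roughly like this; the label key and `phase` status field are the same hypothetical conventions as above:

```python
def list_reflected_clusters(cache, username=None, statuses=None):
    """Filter the locally reflected cluster objects by user and status."""
    out = []
    for obj in cache.values():
        labels = obj["metadata"].get("labels", {})
        if username is not None and labels.get("gateway.dask.org/user") != username:
            continue
        if statuses is not None and obj.get("status", {}).get("phase") not in statuses:
            continue
        out.append(obj)
    return out
```

Hitting the API server instead would use the equivalent label selector, e.g. `gateway.dask.org/user=alice`.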

on_cluster_heartbeat

This responds to messages from the cluster to the gateway. For the k8s backend, we only really need to handle messages that contain changes to the cluster worker count. These will translate to patch updates on the CRD objects, updating the replicas field.
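Translating a heartbeat into a CRD patch is then a one-liner; the `count` message key below is illustrative:

```python
def heartbeat_to_patch(msg):
    """Build a merge patch for the CRD from a cluster heartbeat message.

    Only worker-count changes need handling; other message fields are
    ignored. Returns None when there's nothing to update.
    """
    count = msg.get("count")  # hypothetical key for the desired worker count
    if count is None:
        return None
    return {"spec": {"replicas": count}}
```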

Controller

The controller handles the bulk of the logic around when/how to run pods. It should watch for updates to the CRD objects, and update the child resources (pods, ...) accordingly. Walking through the logic:

  • When a new CRD is created, the controller will create a scheduler pod from the scheduler template, along with any IngressRoute objects specified in the CRD. The scheduler pod should have the CRD as an owner reference so it'll be automatically cleaned up on CRD deletion. Likewise the IngressRoute objects should have the scheduler pod as an owner reference.

  • When the scheduler pod is running, it will transition the CRD status to RUNNING. This signals to the API server that the cluster is now available.

  • When the replicas value is higher than the current number of child workers (scale up event), additional worker pods will be created. The worker pods should be created with the scheduler pod as an owner reference so that they'll be automatically cleaned up.

  • When the replicas value is lower than the current number of child workers (scale down event), the controller should do nothing. All worker pods should be configured with restartPolicy: OnFailure. When a cluster is scaling down, the workers will exit successfully and the pods will stop themselves.

  • When the CRD is signaled to shutdown, the scheduler pod should be deleted, which will delete the worker pods as well. We'll also need to delete the corresponding Secret. Perhaps start_cluster sets the CRD as the owner reference of the secret, and the controller is robust to CRDs being created before their corresponding Secret? Not sure.
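The decision logic above can be sketched as a pure reconcile function; a real controller would translate the returned actions into k8s API calls. This is a simplified sketch under the assumptions already noted (hypothetical `phase`/`replicas` fields):

```python
def reconcile(crd, scheduler_running, current_workers):
    """Decide controller actions for one pass over a DaskCluster object.

    Returns a list of (action, arg) tuples. Scale-down produces no action:
    workers exit on their own and their pods stop themselves.
    """
    actions = []
    if not scheduler_running:
        # Create the scheduler (and its IngressRoutes) before any workers
        actions.append(("create_scheduler", None))
        return actions
    want = crd["spec"].get("replicas", 0)
    missing = want - current_workers
    if missing > 0:
        # Scale up: create only the difference, owned by the scheduler pod
        actions.append(("create_workers", missing))
    return actions
```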

We want to support running clusters in multiple namespaces in the same gateway. I'm not sure what the performance implications of watching all namespaces are vs a single namespace - if it's significant we may want to make this configurable, otherwise we may watch all namespaces always even if we only deploy in a single namespace.

We'll also want to be resilient to pod creation being denied due to a Resource Quota.

I've written a controller with the Python k8s api and it was pretty simple, but I'd be happy writing this using operator-sdk or kubebuilder as well. No strong opinions.


Since this splits up the kubernetes implementation into a couple of smaller services, any meaningful testing will likely require all of them running. I'm not familiar with the tooling people use for this (see #196), advice here would be useful. We can definitely write unit tests for the smaller components, but an integration test will require everything running.
