@jcrist jcrist commented Feb 4, 2020

Fixes #186.

This rearranges our backend abstraction (as well as some of the control flow) to make it easier to make the kubernetes backend (and other backends) more native/performant.

  • [x] Redo authentication to remove need for database
  • [x] Abstract out backend class
  • [x] Move database to be backend-specific, rather than required by the gateway app itself
  • [x] Switch from tornado to aiohttp/asyncio everywhere
  • [x] Local process backend
  • [x] YARN backend
  • [x] PBS backend
  • [x] Slurm backend
  • [ ] User limits
  • [ ] Kubernetes backend
  • [x] Tests
  • [ ] Docs

For the most part this is fully functional already, and should just work. The only breaking changes here are on the administrator side; the user API remains the same.

I might drop the kubernetes support on master for a bit before we work things back up. The kubernetes backend will be split between the server and a controller, and I don't really want to rewrite the kubernetes backend to use the db only to remove it in a subsequent PR. I'm hesitant to continue this work on a feature branch since I don't want to deal with huge merge conflicts down the line, and this PR is already big enough. Either way, tests and docs will be updated accordingly before merging.

I'll write up a design overview of what this PR brings tomorrow, it's late here.

jcrist added 30 commits January 21, 2020 11:52
Slowly migrating off tornado.
- Pare back functionality to allow rebuilding incrementally.
- Move from tornado to aiohttp
- Introduce backend abstraction
- Add user cache
- Move all auth handling to authenticator class
Fully remove tornado from the codebase; rely on asyncio and aiohttp
instead. Also updates tests (still many tests to go).
Some configurables should be global (e.g. timeouts, etc...) while others
are cluster specific (e.g. ``worker_cores``). We now split out
cluster-specific things into a separate object, which can be used by the
backend.
Now supports full start/connect/stop roundtrip.
Supports the full scale protocol. Things may still change as we
implement more features, but things kinda work now.
Workers that die unexpectedly are automatically restarted by the gateway
server.
Adds timeout support for cluster/worker startup, and heartbeat timeouts.
Helps catch failures between submission and heartbeats starting up.
Allows for long polling when querying for a cluster status.
Also a few general fixes
- Properly clean up periodic tasks when shutting down the scheduler service
- Bulk updates require updating the same columns for all rows in SQLAlchemy; update the code accordingly.
Improve logging of events.
Add an option to clean up all active clusters when the application
shuts down. Default is True.
- Cleanup the backend base class definition
- Delete a few unnecessary things
Also update setup.py

[ci skip]
- Move everything into single process for ease of deployment
- Use jsonrpc over tcp for configuration instead of a REST api
- Support bulk operations in api
Switch communication direction so that the proxy polls the api server
to maintain its routing table.

mmccarty commented Feb 7, 2020

Thanks @jcrist, I agree on dropping kubernetes support in master while this work is in flight rather than a long-running feature branch. When we tag up later today, let's talk about how my folks can help with kubernetes.

jcrist added 14 commits February 7, 2020 15:51
- Fix bug in router implementation when serving on the root path
- Integrate new proxy design with the gateway server
- Drop `/gateway/` prefix, serving on root prefix by default now
- General cleanups.
Main gateway tests now pass. Also cleaned up and standardized some
interface method names.
Previously any existing tasks in the reconciliation queue would be
immediately cancelled. We now cancel all pending tasks, but tasks that
are currently being worked on are allowed to complete. This makes
shutdown cleaner.
Getting closer to having all old tests converted over.

[ci skip]
Update backend tests for new functionality. Tests now work for

- local
- yarn
- slurm
- pbs

backends.
A few tweaks to the test suite
  - `asyncio.run` is Python 3.7+ only; include an implementation in
  `compat.py` for now (a sketch of such a shim follows below).
- `ssl_context` is removed in new aiohttp, use `ssl` instead
- Drop kube and docs in travis yml for now

[test-all]
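
For reference, a backport shim for `asyncio.run` typically looks something like the sketch below. This is an illustration of the pattern, not necessarily the exact contents of `compat.py`:

```python
# Sketch of an asyncio.run backport for Python 3.6 (asyncio.run is 3.7+ only).
# Illustrative only; not necessarily the exact code added to compat.py.
import asyncio
import sys

if sys.version_info >= (3, 7):
    run = asyncio.run
else:
    def run(coro):
        # Create a fresh event loop, run the coroutine to completion, then
        # close the loop so repeated calls behave like asyncio.run.
        loop = asyncio.new_event_loop()
        try:
            return loop.run_until_complete(coro)
        finally:
            asyncio.set_event_loop(None)
            loop.close()
```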
@jcrist jcrist changed the title from "[WIP] Rearchitecture" to "Rearchitecture" on Feb 11, 2020

jcrist commented Feb 11, 2020

Ok, I think this is good to merge now. I've been heads-down getting this refactor done (apologies for ignoring other things for a bit). Refactoring the world is never fun, and this PR had to touch almost everything.

For users this should have no visible changes, but this does completely change the configuration/deployment story for admins. As such the docs will have to be updated. I've turned off doc builds on master for now, since we don't have the ability to publish multiple docs versions. I think when everything is up to date we can re-enable things.

A few features that existed in the previous release have been dropped in this refactor. They can be added in subsequent PRs:

  • Kubernetes support (large amount of work, will file a separate issue detailing the design)
  • User resource limits (need to change implementation for new design)

A summary of changes included in this PR:

Dependencies

We switched from using tornado to using aiohttp for our web server/client library. Benchmarks show this to be a mild performance improvement on the server side, and a moderate one on the client side (note: this covers HTTP client requests made by the server only; updating the dask-gateway client library to use aiohttp is a separate issue). The aiohttp client also doesn't depend on pycurl, which had been problematic to install in the past.
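
For illustration, the flavor of server and client code this switch implies looks roughly like the following (a generic aiohttp sketch, not actual gateway code; the route and payload are made up):

```python
# Generic aiohttp sketch showing the server + client style the gateway now
# uses instead of tornado handlers. Route and payload here are illustrative.
import aiohttp
from aiohttp import web


async def health(request):
    # aiohttp handlers are plain coroutines returning a Response
    return web.json_response({"status": "ok"})


def make_app():
    app = web.Application()
    app.add_routes([web.get("/api/health", health)])
    return app


async def check(url):
    # Outgoing requests use aiohttp's client session -- no pycurl needed
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.json()
```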

Authentication

Authentication has been rewritten to remove the need for a backing database. External authenticator classes will likely need to be changed, but the changes should be minimal (function signature, switch to aiohttp handlers).

Clusters now have a username associated with them, but a "user" isn't a persistent object in the gateway data model. This removes any inconsistency issues between dask-gateway's user table and the upstream authentication system (e.g. JupyterHub). To reduce load on the upstream system, the server will cache authenticated users for a configurable period of time (5 min by default). Since authentication is transparent to the user, forcing periodic reauthentication shouldn't cause any issues (thanks @yuvipanda for the idea!).
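
The caching behavior is conceptually simple; something along the lines of the sketch below, which is illustrative only and not the actual implementation or its class names:

```python
# Illustrative TTL cache for authenticated users; names and structure are
# assumptions, not the actual dask-gateway implementation.
import time


class UserCache:
    def __init__(self, max_age=300):  # cache entries for 5 minutes by default
        self.max_age = max_age
        self._cache = {}  # auth token/cookie -> (expires_at, user)

    def get(self, key):
        record = self._cache.get(key)
        if record is None:
            return None
        expires_at, user = record
        if time.monotonic() >= expires_at:
            # Entry expired; drop it so the next request re-authenticates
            # against the upstream system (e.g. JupyterHub).
            del self._cache[key]
            return None
        return user

    def put(self, key, user):
        self._cache[key] = (time.monotonic() + self.max_age, user)
```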

User groups have also been added to the authentication model, but we're not doing anything interesting with them yet. This could be used to provide view/modify access, shared clusters, etc...

Backends

Previously we abstracted on a ClusterManager interface, which provided methods for managing an individual dask cluster (each new dask cluster would get a new ClusterManager instance). The DaskGateway application contained methods for querying/managing many clusters across users.

We now extract both of these out into a new Backend class. A single Backend is created when dask-gateway starts up. The backend contains methods for starting/stopping/querying clusters in bulk (see https://github.com/dask/dask-gateway/pull/195/files#diff-a469bacde3a6b41a2553ba319fb23f91R127-R206). Note that it does not require a database; implementations are free to store information in whatever way they see fit.
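
To give a feel for the shape of this interface, here is a condensed sketch. The method names loosely mirror the linked diff, but the signatures and details here are simplified and shouldn't be read as the definitive API:

```python
# Condensed sketch of the Backend abstraction: one instance per gateway
# process, bulk methods for managing clusters, no database required.
# Simplified for illustration; see the linked diff for the real interface.
from traitlets.config import LoggingConfigurable


class Backend(LoggingConfigurable):
    async def setup(self, app):
        """Called once at startup; create any resources the backend needs."""

    async def cleanup(self):
        """Called once at shutdown; release any resources."""

    async def list_clusters(self, username=None, statuses=None):
        """Bulk query: return all clusters matching the given filters."""
        raise NotImplementedError

    async def get_cluster(self, cluster_name, wait=False):
        """Return one cluster, optionally long-polling until it is running."""
        raise NotImplementedError

    async def start_cluster(self, user, cluster_options):
        """Submit a new cluster for `user` and return its name immediately."""
        raise NotImplementedError

    async def stop_cluster(self, cluster_name, failed=False):
        """Request shutdown of a cluster."""
        raise NotImplementedError
```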

We also provide an implementation of the base Backend interface that stores state in a database. This is used to implement the existing backends (local processes, YARN, SLURM, PBS, etc...). More on this in the next section.

Support for user-configurable cluster options has also been expanded to support per-user or per-group options.
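
As a hypothetical example of what per-group options could look like in a `dask_gateway_config.py` file (the callable hook and the `user.groups` attribute here are assumptions for illustration, not a documented API):

```python
# Hypothetical per-group cluster options; the exact hooks are assumptions.
from dask_gateway_server.options import Options, Integer


def options_for_user(user):
    # Give an assumed "power-users" group a wider range of worker cores
    if "power-users" in user.groups:
        return Options(Integer("worker_cores", default=2, min=1, max=16))
    return Options(Integer("worker_cores", default=1, min=1, max=4))


c.Backend.cluster_options = options_for_user
```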

Database Backend

The current non-kubernetes backends all subclass from dask_gateway_server.backends.db_base.DatabaseBackend, which uses a configurable database to manage cluster state. This is similar to the previous implementation, but has a few changes to make things more robust/scalable.

  • State transitions are managed using the reconciler pattern (as done in kubernetes controllers). Cluster and worker records each store their current status (e.g. SUBMITTED) along with a target (e.g. RUNNING). As events fire on the server (whether from periodic tasks or incoming requests), these records may be updated and the object (cluster or worker) queued for processing. The processing queues deduplicate queued tasks, so if many events happen for cluster X, only the final state will be processed when it's popped off the queue (a minimal sketch of this queue pattern follows the list).
  • Failed workers are now restarted, meaning each individual dask cluster is more resilient to failure.
  • Interactions between the gateway server and the individual dask clusters have been rearranged to minimize outgoing requests from the gateway server to an individual dask cluster. Most communication is now initiated by the individual clusters (as heartbeats) rather than the gateway (as polling).
  • Previously the gateway server could be killed and restarted without losing active clusters: on restart it would check the status of each cluster that was running before, and update the database state accordingly. This worked fine, but could be slow for a large number of active clusters. This failover model is still supported, but is now much quicker: a newly started gateway process is ready to respond to requests immediately, and failures that happened during the downtime are discovered and cleaned up lazily.
  • There is a limiting parallelism factor c.DatabaseBackend.parallelism that sets the number of processing workers in the gateway. This limits the number of cluster/worker start and stop operations that can be processing at any time. Previously we didn't limit this; requests to start a worker were acted on immediately and could fail under high load. This limit may increase latency under high load (as things back up), but should reduce the cases of failure.
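
The sketch below shows the deduplicating queue plus fixed worker pool idea; the names are illustrative, not the actual classes added in this PR:

```python
# Minimal sketch of a deduplicating work queue plus a fixed pool of
# reconciler workers, as described above. Illustrative only.
import asyncio


class WorkQueue:
    def __init__(self):
        self._queue = asyncio.Queue()
        self._pending = set()  # keys already queued, for deduplication

    def put(self, key):
        # Enqueue each cluster/worker key at most once; repeated events for
        # the same object collapse into a single pending work item.
        if key not in self._pending:
            self._pending.add(key)
            self._queue.put_nowait(key)

    async def get(self):
        key = await self._queue.get()
        self._pending.discard(key)
        return key


async def worker(queue, reconcile_one):
    # One of `parallelism` workers: read the object's *current* and *target*
    # state, then take whatever action moves it toward the target.
    while True:
        key = await queue.get()
        await reconcile_one(key)


def start_workers(queue, reconcile_one, parallelism=20):
    return [
        asyncio.ensure_future(worker(queue, reconcile_one))
        for _ in range(parallelism)
    ]
```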

Proxy

A dask-gateway instance no longer has a Proxy object as a top-level configurable thing. Rather, the proxy belongs to the Backend, and may or may not be managed by it.

For non-kubernetes backends, the DatabaseBackend class manages a refactored version of our custom proxy implementation. A few changes have been made to make this more robust and scalable.

  • Both the HTTP proxy and TLS proxy have been merged into a single process, rather than 2 separate processes.
  • Instead of having the gateway server push route updates to the proxy server, the proxy server now pulls updates from the gateway server. When a proxy instance starts up, it pings a configured URL on the gateway server to request the routing table. Once received, it opens a streaming request to get updates to this table. All route change events are versioned, so if a connection fails the proxy can restart from the last change it saw (this protocol is borrowed and modified from how kubernetes handles watch streams). One perk of this design is that it allows for running multiple instances of the proxy transparently to the gateway server, since the gateway server just sees each proxy as another client. A rough client-side sketch of this protocol is shown below.
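
In this sketch the endpoint path, query parameters, and payload fields are invented for illustration and don't reflect the actual wire format:

```python
# Rough sketch of a proxy syncing its routing table from the gateway API:
# fetch the full table, then stream versioned changes, resuming from the
# last seen version on reconnect. Paths and payload fields are made up.
import asyncio
import json

import aiohttp


async def sync_routes(api_url, routes):
    async with aiohttp.ClientSession() as session:
        # 1. Fetch the full routing table and its current version.
        async with session.get(f"{api_url}/api/routes") as resp:
            data = await resp.json()
        routes.clear()
        routes.update(data["routes"])
        version = data["version"]

        # 2. Stream changes after that version; on failure, reconnect and
        #    resume from the last version seen.
        while True:
            try:
                params = {"watch": "true", "after": str(version)}
                async with session.get(f"{api_url}/api/routes", params=params) as resp:
                    async for line in resp.content:
                        event = json.loads(line)
                        if event["type"] == "delete":
                            routes.pop(event["route"], None)
                        else:
                            routes[event["route"]] = event["target"]
                        version = event["version"]
            except aiohttp.ClientError:
                await asyncio.sleep(1)  # brief backoff before reconnecting
```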


jcrist commented Feb 11, 2020

I don't expect anyone to be able to review a +6,608/−8,575 line PR, so I'm going to merge this for now. The master branch will be less featureful than the previous release until we re-add these features, but I'd rather work on the master branch than a feature branch.

@jcrist jcrist merged commit ec3c344 into dask:master Feb 11, 2020
@jcrist jcrist deleted the major-refactor branch February 11, 2020 19:16