@jcrist jcrist commented Feb 4, 2020

Fixes #186.

This rearranges our backend abstraction (as well as some of the control flow) to make it easier to make the kubernetes backend (and other backends) more native/performant.

  • [x] Redo authentication to remove need for database
  • [x] Abstract out backend class
  • [x] Move database to be backend-specific, rather than required by the gateway app itself
  • [x] Switch from tornado to aiohttp/asyncio everywhere
  • [x] Local process backend
  • [x] YARN backend
  • [x] PBS backend
  • [x] Slurm backend
  • [ ] User limits
  • [ ] Kubernetes backend
  • [x] Tests
  • [ ] Docs

For the most part this is fully functional already, and should just work. The only breaking changes here are on the administrator side; the user API remains the same.

I might drop the kubernetes support on master for a bit before we work things back up. The kubernetes backend will be split between the server and a controller, and I don't really want to rewrite the kubernetes backend to use the db only to remove it in a subsequent PR. I'm hesitant to continue this work on a feature branch since I don't want to deal with huge merge conflicts down the line, and this PR is already big enough. Either way, tests and docs will be updated accordingly before merging.

I'll write up a design overview of what this PR brings tomorrow, it's late here.

jcrist added 30 commits January 21, 2020 11:52
Slowly migrating off tornado.
- Pare back functionality to allow rebuilding incrementally.
- Move from tornado to aiohttp
- Introduce backend abstraction
- Add user cache
- Move all auth handling to authenticator class
Fully remove tornado from the codebase; rely on asyncio and aiohttp
instead. Also updates tests (still many tests to go).
Some configurables should be global (e.g. timeouts, etc...) while others
are cluster specific (e.g. ``worker_cores``). We now split out
cluster-specific things into a separate object, which can be used by the
backend.
Now supports full start/connect/stop roundtrip.
Supports the full scale protocol. Things may still change as we
implement more features, but things kinda work now.
Workers that die unexpectedly are automatically restarted by the gateway
server.
Adds timeout support for cluster/worker startup, and heartbeat timeouts.
Helps catch failures between submission and heartbeats starting up.
Allows for long polling when querying for a cluster status.
Also a few general fixes
- Properly clean up periodic tasks when shutting down the scheduler service
- Bulk updates require updating the same columns for all rows in SQLAlchemy; update the code accordingly.
Improve logging of events.
Add an option to clean up all active clusters when the application
shuts down. Default is True.
- Cleanup the backend base class definition
- Delete a few unnecessary things
Also update setup.py

[ci skip]
- Move everything into single process for ease of deployment
- Use jsonrpc over tcp for configuration instead of a REST api
- Support bulk operations in api
Switch communication direction so that the proxy polls the api server
to maintain its routing table.

mmccarty commented Feb 7, 2020

Thanks @jcrist, I agree on dropping kubernetes support in master while this work is in flight rather than a long-running feature branch. When we tag up later today, let's talk about how my folks can help with kubernetes.

jcrist added 14 commits February 7, 2020 15:51
- Fix bug in router implementation when serving on the root path
- Integrate new proxy design with the gateway server
- Drop `/gateway/` prefix, serving on root prefix by default now
- General cleanups.
Main gateway tests now pass. Also cleaned up and standardized some
interface method names.
Previously any existing tasks in the reconciliation queue would be
immediately cancelled. We now cancel all pending tasks, but tasks that
are currently being worked on are allowed to complete. This makes
shutdown cleaner.
Getting closer to having all old tests converted over.

[ci skip]
Update backend tests for new functionality. Tests now work for

- local
- yarn
- slurm
- pbs

backends.
A few tweaks to the test suite
  - `asyncio.run` is Python 3.7+ only; include an implementation in
  `compat.py` for now (a sketch of such a shim follows below).
- `ssl_context` is removed in new aiohttp, use `ssl` instead
- Drop kube and docs in travis yml for now

[test-all]
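
For reference, a backport shim for `asyncio.run` typically looks something like the sketch below. This is an illustration of the pattern, not necessarily the exact contents of `compat.py`:

```python
# Sketch of an asyncio.run backport for Python 3.6 (asyncio.run is 3.7+ only).
# Illustrative only; not necessarily the exact code added to compat.py.
import asyncio
import sys

if sys.version_info >= (3, 7):
    run = asyncio.run
else:
    def run(coro):
        # Create a fresh event loop, run the coroutine to completion, then
        # close the loop so repeated calls behave like asyncio.run.
        loop = asyncio.new_event_loop()
        try:
            return loop.run_until_complete(coro)
        finally:
            asyncio.set_event_loop(None)
            loop.close()
```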
@jcrist jcrist changed the title from "[WIP] Rearchitecture" to "Rearchitecture" on Feb 11, 2020

jcrist commented Feb 11, 2020

Ok, I think this is good to merge now. I've been heads-down getting this refactor done (apologies for ignoring other things for a bit). Refactoring the world is never fun, and this PR had to touch almost everything.

For users this should have no visible changes, but this does completely change the configuration/deployment story for admins. As such the docs will have to be updated. I've turned off doc builds on master for now, since we don't have the ability to publish multiple docs versions. I think when everything is up to date we can re-enable things.

A few features that existed in the previous release have been dropped in this refactor. They can be added in subsequent PRs:

  • Kubernetes support (large amount of work, will file a separate issue detailing the design)
  • User resource limits (need to change implementation for new design)

A summary of changes included in this PR:

Dependencies

We switched from using tornado to using aiohttp for our web server/client library. Benchmarks show this to be a mild performance improvement on the server side, and a moderate one on the client side (note: this covers HTTP client requests made by the server only; updating the dask-gateway client library to use aiohttp is a separate issue). The aiohttp client also doesn't depend on pycurl, which had been problematic to install in the past.
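
For illustration, the flavor of server and client code this switch implies looks roughly like the following (a generic aiohttp sketch, not actual gateway code; the route and payload are made up):

```python
# Generic aiohttp sketch showing the server + client style the gateway now
# uses instead of tornado handlers. Route and payload here are illustrative.
import aiohttp
from aiohttp import web


async def health(request):
    # aiohttp handlers are plain coroutines returning a Response
    return web.json_response({"status": "ok"})


def make_app():
    app = web.Application()
    app.add_routes([web.get("/api/health", health)])
    return app


async def check(url):
    # Outgoing requests use aiohttp's client session -- no pycurl needed
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.json()
```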

Authentication

Authentication has been rewritten to remove the need for a backing database. External authenticator classes will likely need to be changed, but the changes should be minimal (function signature, switch to aiohttp handlers).

Clusters now have a username associated with them, but a "user" isn't a persistent object in the gateway data model. This removes any inconsistency issues between dask-gateway's user table and the upstream authentication system (e.g. JupyterHub). To reduce load on the upstream system, the server will cache authenticated users for a configurable period of time (5 min by default). Since authentication is transparent to the user, forcing periodic reauthentication shouldn't cause any issues (thanks @yuvipanda for the idea!).
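
The caching behavior is conceptually simple; something along the lines of the sketch below, which is illustrative only and not the actual implementation or its class names:

```python
# Illustrative TTL cache for authenticated users; names and structure are
# assumptions, not the actual dask-gateway implementation.
import time


class UserCache:
    def __init__(self, max_age=300):  # cache entries for 5 minutes by default
        self.max_age = max_age
        self._cache = {}  # auth token/cookie -> (expires_at, user)

    def get(self, key):
        record = self._cache.get(key)
        if record is None:
            return None
        expires_at, user = record
        if time.monotonic() >= expires_at:
            # Entry expired; drop it so the next request re-authenticates
            # against the upstream system (e.g. JupyterHub).
            del self._cache[key]
            return None
        return user

    def put(self, key, user):
        self._cache[key] = (time.monotonic() + self.max_age, user)
```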

User groups have also been added to the authentication model, but we're not doing anything interesting with them yet. This could be used to provide view/modify access, shared clusters, etc...

Backends

Previously we abstracted on a ClusterManager interface, which provided methods for managing an individual dask cluster (each new dask cluster would get a new ClusterManager instance). The DaskGateway application contained methods for querying/managing many clusters across users.

We now extract both of these out into a new Backend class. A single Backend is created when dask-gateway starts up. The backend contains methods for starting/stopping/querying clusters in bulk (see https://github.com/dask/dask-gateway/pull/195/files#diff-a469bacde3a6b41a2553ba319fb23f91R127-R206). Note that it does not require a database; implementations are free to store information in whatever way they see fit.
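
To give a feel for the shape of this interface, here is a condensed sketch. The method names loosely mirror the linked diff, but the signatures and details here are simplified and shouldn't be read as the definitive API:

```python
# Condensed sketch of the Backend abstraction: one instance per gateway
# process, bulk methods for managing clusters, no database required.
# Simplified for illustration; see the linked diff for the real interface.
from traitlets.config import LoggingConfigurable


class Backend(LoggingConfigurable):
    async def setup(self, app):
        """Called once at startup; create any resources the backend needs."""

    async def cleanup(self):
        """Called once at shutdown; release any resources."""

    async def list_clusters(self, username=None, statuses=None):
        """Bulk query: return all clusters matching the given filters."""
        raise NotImplementedError

    async def get_cluster(self, cluster_name, wait=False):
        """Return one cluster, optionally long-polling until it is running."""
        raise NotImplementedError

    async def start_cluster(self, user, cluster_options):
        """Submit a new cluster for `user` and return its name immediately."""
        raise NotImplementedError

    async def stop_cluster(self, cluster_name, failed=False):
        """Request shutdown of a cluster."""
        raise NotImplementedError
```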

We also provide an implementation of the base Backend interface that stores state in a database. This is used to implement the existing backends (local processes, YARN, SLURM, PBS, etc...). More on this in the next section.

Support for user-configurable cluster options has also been expanded to support per-user or per-group options.
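
As a hypothetical example of what per-group options could look like in a `dask_gateway_config.py` file (the callable hook and the `user.groups` attribute here are assumptions for illustration, not a documented API):

```python
# Hypothetical per-group cluster options; the exact hooks are assumptions.
from dask_gateway_server.options import Options, Integer


def options_for_user(user):
    # Give an assumed "power-users" group a wider range of worker cores
    if "power-users" in user.groups:
        return Options(Integer("worker_cores", default=2, min=1, max=16))
    return Options(Integer("worker_cores", default=1, min=1, max=4))


c.Backend.cluster_options = options_for_user
```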

Database Backend

The current non-kubernetes backends all subclass from dask_gateway_server.backends.db_base.DatabaseBackend, which uses a configurable database to manage cluster state. This is similar to the previous implementation, but has a few changes to make things more robust/scalable.

  • State transitions are managed using the reconciler pattern (as done in kubernetes controllers). Cluster and worker records each store their current status (e.g. SUBMITTED) along with a target (e.g. RUNNING). As events fire on the server (whether from periodic tasks or incoming requests), these records may be updated and the object (cluster or worker) queued for processing. The processing queues deduplicate queued tasks, so if many events happen for cluster X, only the final state will be processed when it's popped off the queue (a minimal sketch of this queue pattern follows the list).
  • Failed workers are now restarted, meaning each individual dask cluster is more resilient to failure.
  • Interactions between the gateway server and the individual dask clusters have been rearranged to minimize outgoing requests from the gateway server to an individual dask cluster. Most communication is now initiated by the individual clusters (as heartbeats) rather than the gateway (as polling).
  • Previously the gateway server could be killed and restarted without losing active clusters: on restart it would check the status of each cluster that was running before, and update the database state accordingly. This worked fine, but could be slow for a large number of active clusters. This failover model is still supported, but is now much quicker: a newly started gateway process is ready to respond to requests immediately, and failures that happened during the downtime are discovered and cleaned up lazily.
  • There is a limiting parallelism factor c.DatabaseBackend.parallelism that sets the number of processing workers in the gateway. This limits the number of cluster/worker start and stop operations that can be processing at any time. Previously we didn't limit this; requests to start a worker were acted on immediately and could fail under high load. This limit may increase latency under high load (as things back up), but should reduce the cases of failure.
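
The sketch below shows the deduplicating queue plus fixed worker pool idea; the names are illustrative, not the actual classes added in this PR:

```python
# Minimal sketch of a deduplicating work queue plus a fixed pool of
# reconciler workers, as described above. Illustrative only.
import asyncio


class WorkQueue:
    def __init__(self):
        self._queue = asyncio.Queue()
        self._pending = set()  # keys already queued, for deduplication

    def put(self, key):
        # Enqueue each cluster/worker key at most once; repeated events for
        # the same object collapse into a single pending work item.
        if key not in self._pending:
            self._pending.add(key)
            self._queue.put_nowait(key)

    async def get(self):
        key = await self._queue.get()
        self._pending.discard(key)
        return key


async def worker(queue, reconcile_one):
    # One of `parallelism` workers: read the object's *current* and *target*
    # state, then take whatever action moves it toward the target.
    while True:
        key = await queue.get()
        await reconcile_one(key)


def start_workers(queue, reconcile_one, parallelism=20):
    return [
        asyncio.ensure_future(worker(queue, reconcile_one))
        for _ in range(parallelism)
    ]
```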

Proxy

A dask-gateway instance no longer has a Proxy object as a top-level configurable thing. Rather, the proxy belongs to the Backend, and may or may not be managed by it.

For non-kubernetes backends, the DatabaseBackend class manages a refactored version of our custom proxy implementation. A few changes have been made to make this more robust and scalable.

  • Both the HTTP proxy and TLS proxy have been merged into a single process, rather than 2 separate processes.
  • Instead of having the gateway server push route updates to the proxy server, the proxy server now pulls updates from the gateway server. When a proxy instance starts up, it pings a configured URL on the gateway server to request the routing table. Once received, it opens a streaming request to get updates to this table. All route change events are versioned, so if a connection fails the proxy can restart from the last change it saw (this protocol is borrowed and modified from how kubernetes handles watch streams). One perk of this design is that it allows for running multiple instances of the proxy transparently to the gateway server, since the gateway server just sees each proxy as another client. A rough client-side sketch of this protocol is shown below.
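
In this sketch the endpoint path, query parameters, and payload fields are invented for illustration and don't reflect the actual wire format:

```python
# Rough sketch of a proxy syncing its routing table from the gateway API:
# fetch the full table, then stream versioned changes, resuming from the
# last seen version on reconnect. Paths and payload fields are made up.
import asyncio
import json

import aiohttp


async def sync_routes(api_url, routes):
    async with aiohttp.ClientSession() as session:
        # 1. Fetch the full routing table and its current version.
        async with session.get(f"{api_url}/api/routes") as resp:
            data = await resp.json()
        routes.clear()
        routes.update(data["routes"])
        version = data["version"]

        # 2. Stream changes after that version; on failure, reconnect and
        #    resume from the last version seen.
        while True:
            try:
                params = {"watch": "true", "after": str(version)}
                async with session.get(f"{api_url}/api/routes", params=params) as resp:
                    async for line in resp.content:
                        event = json.loads(line)
                        if event["type"] == "delete":
                            routes.pop(event["route"], None)
                        else:
                            routes[event["route"]] = event["target"]
                        version = event["version"]
            except aiohttp.ClientError:
                await asyncio.sleep(1)  # brief backoff before reconnecting
```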


jcrist commented Feb 11, 2020

I don't expect anyone to be able to review a +6,608/−8,575 line PR, so I'm going to merge this for now. The master branch will be less featureful than the previous release until we re-add these features, but I'd rather work on the master branch than a feature branch.

@jcrist jcrist merged commit ec3c344 into dask:master Feb 11, 2020
@jcrist jcrist deleted the major-refactor branch February 11, 2020 19:16