Description
Overview
We started Wadm a long time ago, but have mostly let it sit since then as we've worked on polishing the host. However, the time has come to get this all working and productionized. This document is a proposal for rewriting and releasing wadm as a fully supported and featured part of the wasmCloud ecosystem, complete with waterslide!
Goals and Non-goals
This section covers the goals and non-goals of this scope of work. Please note that some of the items in non-goals are possible future work for wadm, but are not in scope for this work
Goals
- Make it easy to run a wasmCloud application in a lattice, including redistributing and running an application as hosts join or leave a lattice
- Give users a declarative way to run their applications
- wasmCloud operators should have a clear guide on how to run and deploy wadm
- This also includes new users who just want to run it, meaning this should be included in `wash up`
- Be the canonical scheduling implementation for wasmCloud without precluding the use of other custom built schedulers
- Be a project that people want to contribute to
- I know this one sounds all vision-y and corporate-y, but it is a key point. If we want to make something robust that will serve the needs of many people, we need contributors
- Have planned extension points for custom scalers (implementation is not required)
Non-goals
- Make wadm a full control plane for all things
- HTTP API for wadm
- Changing the scope of features that already exist in the current version of wadm
Key technical decisions
Language choice
For this work, I propose we rewrite Wadm in Rust. This decision was made for the following reasons, in order of importance. As part of making this decision, two other languages (Elixir and Go) were considered; the reasons for rejecting them are described in the last two sections.
Need for contributors
Schedulers and application lifecycle management are topics that many people in the cloud native space have deep knowledge of. If we are going to be writing something that does those things for wasmCloud, then we need as many eyes on it as possible. Based on current metrics of wasmCloud repos, we have very few contributors to our Elixir code and a lot more to our Rust repos. Other projects in, or consumed by, the wasm ecosystem are in Rust and also have higher numbers of contributors. Go would have also been an excellent choice here, but the other reasons listed here precluded it. We also have multiple contributors in the wasmCloud community right now who already know Rust.
The tl;dr is that we need contributors to be successful and the current language does not attract enough people.
Static Typing and Generics
One problem we've run into consistently in our Elixir projects is issues with dynamic typing. Although this can be mitigated somewhat by tools like Dialyzer, it requires programmer and maintainer discipline and still doesn't catch everything. Having a static type for each kind of event that will drive a system like Wadm is critical for ensuring correct behavior and avoiding panics.
In addition to the need for static typing is the preference for having generics. In my experience with writing large applications for Kubernetes in both Rust and Go, a generic type system makes interacting with the API much easier. There is less generated code and less need to roll your own type system, as happens in many large Go projects. Go has added generics, but its system is nowhere near as strong as that of other statically typed languages such as Rust.
Support for wasm and wasmCloud libraries
To support custom scalers, we will likely be supporting at least an actor based extension point and possibly a bare wasm module. Either way, most of the wasm libraries/runtimes out in the wild are written in Rust or have first-class bindings for Rust. Also, many of our wasmCloud libraries are written in Rust, which will allow for reuse.
Static binaries and small runtime size
This is the lowest priority reason why I am suggesting Rust, but it is still an important one. The current implementation requires bundling an Erlang VM along with the compiled Elixir. That means someone who runs Wadm as it currently stands will likely need to tune a VM. It is also larger, which leads to more space requirements on disk and longer download times.
Rust (and Go even more so) has great support for static binaries, and both run lighter than a VM without much additional tuning (if any).
Disadvantages of Rust
As with any tool choice, there are tradeoffs that occur. Below is a list of disadvantages I think will be most likely to cause friction
- Rust async
- This isn't as bad of a problem as some people in the Rust community say, but we will likely need to implement things/workaround some of these rough edges
- Dealing with Rustls
- Rust still has a steeper learning curve than Go, so there will need to be some handholding on PRs from new contributors
Why not Elixir?
One of the biggest questions here is why not continue with Elixir. By far the biggest thing we are giving up is the code around the lattice observer. However, writing this in Rust gives us the advantage of creating something that we could eventually make bindings for in any language (this also helps enable the reusability described below), although that isn't a goal here.
With that said, the previous sections cover in depth the advantage of using Rust over Elixir in this case
Why not Go?
In my comparisons, I was looking for languages that would fit the requirements above. Due to the overlap of languages used for wasm as well as languages familiar to those in the cloud native space, that whittled things down to Go and Rust. Go excels at many of these requirements: it is much more popular than Rust and Elixir (probably combined) and has great support for statically compiling binaries. Also, things like NATS are native to Go.
It came down to a few main concerns of why Rust would be better:
- Generics + Code cleanliness
- Most of Wasm and wasmCloud are in Rust
- Familiarity and preference of core maintainers
To be clear, there are other smaller reasons, but those could be considered nitpicky.
State machine vs event-driven "filtering"
One of the items I thought about most when drafting this was whether or not we should implement wadm as a true state machine. Given the simplicity of what it is trying to do, I propose we focus more on implementing an event-driven filtering approach. Essentially, a state machine approach is going to be overkill for this stage of the project and the near future.
Loosely, I am calling these "Scalers" (name subject to change). Every scaler takes a list of events (which may or may not be pre-filtered) and returns a list of actions to take.
This does not mean we can't iterate toward a state machine style in the future (if you are curious, see Krator for an example of how this could be done in code), nor does it mean a scaler implementation can't use a state machine internally. It only means that for this first approach, we'll filter events into actions.
I have purposefully not gone into high levels of detail about what this looks like in code, as it will probably be best just to try it and see what it looks like as we begin to implement it. What we currently have in wadm is probably a good way of going about this (i.e. Scalers output commands)
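To make the idea concrete, here is a minimal sketch of what an event-filtering Scaler could look like. All names here (`Event`, `Command`, `Scaler`, `ReplicaScaler`) are hypothetical illustrations, not the actual wadm API:

```rust
// Hypothetical event and command types; the real wadm types would be
// richer and driven by actual lattice events.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    HostStarted { host_id: String },
    HostStopped { host_id: String },
}

#[derive(Debug, Clone, PartialEq)]
enum Command {
    StartActor { actor_ref: String, host_id: String },
}

/// A Scaler filters a batch of events into a list of commands to run.
trait Scaler {
    fn handle_events(&self, events: &[Event]) -> Vec<Command>;
}

/// Toy scaler: start a fixed actor on every host that comes up.
struct ReplicaScaler {
    actor_ref: String,
}

impl Scaler for ReplicaScaler {
    fn handle_events(&self, events: &[Event]) -> Vec<Command> {
        events
            .iter()
            .filter_map(|e| match e {
                Event::HostStarted { host_id } => Some(Command::StartActor {
                    actor_ref: self.actor_ref.clone(),
                    host_id: host_id.clone(),
                }),
                _ => None,
            })
            .collect()
    }
}

fn main() {
    let scaler = ReplicaScaler { actor_ref: "echo:0.1.0".into() };
    let events = vec![
        Event::HostStarted { host_id: "host-a".into() },
        Event::HostStopped { host_id: "host-b".into() },
    ];
    // Only the HostStarted event produces a command.
    let commands = scaler.handle_events(&events);
    println!("{} command(s)", commands.len());
}
```

The key property is that a Scaler is a pure events-in, commands-out function, which keeps each implementation easy to test in isolation.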
Scalers are commutative
One important point is that these "Scalers" should be commutative (i.e. if a+b=c, then b+a=c; the order of operations doesn't matter). That means when a manifest is sent to wadm, it can run through the list of supported Scalers in any order and it will return the same output.
API is NATS-based
For this first production version, we will only be supporting a NATS API. This is because pretty much all wasmCloud tooling already uses NATS, mostly transparently to everyday users. We can take advantage of that same tooling to keep things simple this time around. If we were to add an HTTP API right now, we'd have to figure out authn/authz and how we want to handle issuing tokens. So to keep it simple, we'll focus only on NATS to start.
One very important note here is that we definitely do want an HTTP API in the future. We know that many people will want to integrate with or extend WADM and an HTTP API is the easiest way to do that. But not for this first go around (well, second, but you get my point)
Data is stored in NATS KV
This is fairly self-explanatory: now that NATS has KV support, we want to store everything in one place so we don't need any additional databases. Note that only the manifest data is stored in NATS KV; lattice state is built up by querying all hosts on startup and then responding to events
High availability
A key requirement is that wadm can be run in high availability scenarios, which at its most basic means that multiple copies can be running.
I propose that this be done with leader election. Only one wadm process will ever be performing actions. All processes can gather the state of the lattice for purposes of fast failover, but only one performs actions. This is the simplest way to gain basic HA support
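One way to implement this is a lease on a well-known key, acquired via the compare-and-swap semantics that NATS KV's revision-based updates provide. Below is an in-memory sketch of that idea; the `LeaderKey` type is a stand-in for the KV bucket, not a real wadm or NATS API, and a production version would also need lease expiry so a crashed leader's claim times out:

```rust
use std::sync::Mutex;

/// Stand-in for a KV key guarded by compare-and-swap. In the real
/// system this would be a key in a NATS KV bucket updated with an
/// expected revision; here a Mutex plays that role for illustration.
struct LeaderKey {
    holder: Mutex<Option<String>>,
}

impl LeaderKey {
    fn new() -> Self {
        LeaderKey { holder: Mutex::new(None) }
    }

    /// Try to become (or remain) leader. Succeeds only if the key is
    /// unclaimed or already held by this process id.
    fn try_acquire(&self, id: &str) -> bool {
        let mut holder = self.holder.lock().unwrap();
        if holder.as_deref() == Some(id) {
            return true; // already the leader; renew the claim
        }
        if holder.is_none() {
            *holder = Some(id.to_string());
            return true;
        }
        false // someone else is leader; stand by for failover
    }
}

fn main() {
    let key = LeaderKey::new();
    assert!(key.try_acquire("wadm-1"));  // first process wins
    assert!(!key.try_acquire("wadm-2")); // second stands by, still watching state
    assert!(key.try_acquire("wadm-1"));  // leader renews its claim
    println!("leader elected");
}
```

Standby processes keep their lattice state warm and retry acquisition, so failover is just the next successful acquire.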
Custom scheduler support
This is purely here as a design note and is not required for completing the work, but based on experiences with tools like Kubernetes and Nomad, extending with a custom scheduler is a common ask for large deployments. In code, adding a scaler will be as simple as implementing a `Scaler` trait.
For most people, however, I propose that custom "Scalers" be added via a wadm manifest. The application provider must have an actor that implements a new `wasmcloud:wadm-scaler` interface, but it can be as arbitrarily complex as desired. This manifest will have 2 special requirements
- It must have an annotation like `wasmcloud.dev/scaler: <scaler-name>`
- It is only allowed to use the built-in Scaler types to avoid chicken-and-egg problems
Once again, this is not going to be implemented here, and will likely be a separate, smaller proposal than this one
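For illustration, a custom scaler manifest might look something like the sketch below. Only the `wasmcloud.dev/scaler` annotation and the `wasmcloud:wadm-scaler` interface come from this proposal; every other field name and value is a hypothetical placeholder:

```yaml
# Hypothetical manifest for a custom scaler. Structure follows the
# OAM-style manifests wadm already consumes; details are illustrative.
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: my-custom-scaler
  annotations:
    wasmcloud.dev/scaler: my-scaler   # marks this app as a scaler
spec:
  components:
    - name: scaler-actor
      type: actor
      properties:
        # This actor would implement wasmcloud:wadm-scaler
        image: registry.example.com/my-scaler:0.1.0
```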
Reusability and a canonical scheduler
One key point to stress here is that wadm is meant to be the canonical scheduler for wasmCloud. This means that it is the general purpose scheduler that most people use when running wasmCloud, but no one is forced to use it. You can choose not to use it at all, or to write your own entirely custom scheduler.
To that end, I propose we publish the key functionality as a Rust crate. Much of the functionality could be used in many other applications besides a scheduler, but it can also be used to build your own if so desired. Basically we want to avoid some of the problems of what occurred in Kubernetes where everything must go through the built-in scheduler
Basic roadmap of work
Whew, we made it to the actual work! As part of thinking through these ideas, I started a branch that has implemented some of the basic building blocks like streaming events and leader election. When we actually begin work, it will be against a new `wadm_0.4` branch in the main repo until we have completed the work. Please note that this is a general roadmap; I didn't want to try to give minute details here. Below is the basic overview of the needed work.
Stage 1
All of these items can be worked on in parallel. This stage is a bit shorter because we are about 40-50% of the way there with the branch I started work on
- Implement receiving and parsing wadm/OAM manifests, then storing them in NATS.
- Build up lattice state from queries (reconciliation loop) and events
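The event half of Stage 1 can be sketched as a state store that applies lattice events as they arrive (with the startup query seeding the same structure). The types below are hypothetical illustrations, not wadm's real data model:

```rust
use std::collections::HashMap;

/// Hypothetical lattice events; the real set would mirror the
/// wasmCloud host's published event types.
#[derive(Debug, Clone)]
enum LatticeEvent {
    HostHeartbeat { host_id: String, actors: Vec<String> },
    HostStopped { host_id: String },
}

/// In-memory view of the lattice, rebuilt from queries and events.
#[derive(Default, Debug)]
struct LatticeState {
    // host id -> actor refs currently running on that host
    hosts: HashMap<String, Vec<String>>,
}

impl LatticeState {
    fn apply(&mut self, event: &LatticeEvent) {
        match event {
            LatticeEvent::HostHeartbeat { host_id, actors } => {
                // Heartbeats are authoritative for that host's inventory
                self.hosts.insert(host_id.clone(), actors.clone());
            }
            LatticeEvent::HostStopped { host_id } => {
                self.hosts.remove(host_id);
            }
        }
    }
}

fn main() {
    let mut state = LatticeState::default();
    state.apply(&LatticeEvent::HostHeartbeat {
        host_id: "host-a".into(),
        actors: vec!["echo".into()],
    });
    state.apply(&LatticeEvent::HostStopped { host_id: "host-a".into() });
    println!("hosts tracked: {}", state.hosts.len());
}
```

Because heartbeats carry full inventory, a periodic reconciliation pass can correct any drift from missed events.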
Stage 2
This is a bit more difficult as these things must be worked on roughly in order. This work is more spike-like, as it spikes out the design of the Scaler
- Define the `Scaler` trait and implement the spreadscaler type (at least the number-of-replicas functionality). Scalers will need to handle manifests and state changes given by events (such as a host stopping). We want to start with implementing so we can see what kind of info is needed
- Create a work loop that takes incoming manifest changes and events and gives the list of scalers all it needs to process each event
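The work loop, together with the commutativity requirement described earlier, can be sketched as follows. For brevity, scalers here are plain functions over string events and the names are hypothetical; the real design would use the `Scaler` trait:

```rust
// Simplified stand-ins: real events and commands would be typed enums.
type Event = String;
type Command = String;
type Scaler = fn(&[Event]) -> Vec<Command>;

/// Hand each batch of events to every scaler and collect the commands.
/// Sorting makes the output comparable regardless of scaler ordering,
/// which is one way to make the commutativity requirement observable.
fn work_loop(scalers: &[Scaler], events: &[Event]) -> Vec<Command> {
    let mut commands: Vec<Command> = scalers
        .iter()
        .flat_map(|scaler| scaler(events))
        .collect();
    commands.sort();
    commands
}

/// Toy stand-in for the spreadscaler.
fn spread(events: &[Event]) -> Vec<Command> {
    events.iter().map(|e| format!("spread:{e}")).collect()
}

/// Toy stand-in for the linkdef handler.
fn linkdef(events: &[Event]) -> Vec<Command> {
    events.iter().map(|e| format!("linkdef:{e}")).collect()
}

fn main() {
    let events = vec!["host_started".to_string()];
    // Running the scalers in either order yields the same command set.
    let a = work_loop(&[spread, linkdef], &events);
    let b = work_loop(&[linkdef, spread], &events);
    assert_eq!(a, b);
    println!("{a:?}");
}
```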
Stage 3
- Fully implement the spreadscaler type
- Implement linkdef handler
- Run e2e tests with multiple wasmCloud hosts
- Fully functional pipelines and alpha deploy candidate (crate, plain binary, and docker container)
Stage 4
This is the "tying a bow on it" stage of work
- Add wadm support to `wash up` by default, with an optional flag to not use it
- All flags and configuration documented
- Create deploy guides for getting started and for "production level" deployments with multiple wadm instances running with TLS against NATS. These should also show how to use the docker container.