Boltzmann is a distributed, lightweight task orchestrator.
Based on the Scheduler Agent Supervisor cloud pattern, Boltzmann is a masterless service that schedules a batch of tasks in a parallel and distributed way.
Depending on the configuration, a Boltzmann node may be stateless or stateful, because task state may be stored in an embedded or an external database (e.g. Redis).
A worker pool (i.e. a Boltzmann node) stays correct even in a distributed environment by using leases (i.e. distributed mutex locks) and a small leader-election consensus algorithm.
Leases are implemented using either the Redlock algorithm or the storage engine's built-in data structures (e.g. etcd leases).
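To illustrate the idea of a lease as a mutex that expires on its own, here is a minimal single-process sketch. It is not the Redlock algorithm, not the etcd lease API, and not Boltzmann's actual code; the names `InMemoryLeaseStore`, `acquire`, and `release` are hypothetical and exist only for this example.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Lease:
    owner: str         # unique token identifying the current holder
    expires_at: float  # clock reading after which the lease is void


class InMemoryLeaseStore:
    """Single-process stand-in for a lease backend (hypothetical)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._leases: Dict[str, Lease] = {}

    def acquire(self, key: str, ttl: float) -> Optional[str]:
        """Return an owner token if the lease is free or expired, else None."""
        now = self._clock()
        current = self._leases.get(key)
        if current is not None and current.expires_at > now:
            return None  # another holder still has a live lease
        token = uuid.uuid4().hex
        self._leases[key] = Lease(owner=token, expires_at=now + ttl)
        return token

    def release(self, key: str, token: str) -> bool:
        """Release the lease only if the caller still owns it."""
        current = self._leases.get(key)
        if current is None or current.owner != token:
            return False
        del self._leases[key]
        return True
```

The expiry is what separates a lease from a plain lock: if a node crashes while holding the lease, another node can take over once the time-to-live has passed. The owner token stops a slow former holder from releasing a lease that has since been granted to someone else.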
The Scheduler arranges for the steps that make up the task to be executed and orchestrates their operation. These steps can be combined into a pipeline or workflow. The Scheduler is responsible for ensuring that the steps in this workflow are performed in the right order.
As each step is performed, the Scheduler records the state of the workflow, such as "step not yet started," "step running," or "step completed." The state information should also include an upper limit of the time allowed for the step to finish, called the complete-by time.
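The bookkeeping described above can be sketched as follows. This is an illustrative model under stated assumptions, not Boltzmann's implementation: `StepStatus`, `StepState`, and `run_workflow` are hypothetical names, a single `time_limit` is applied to every step, and the `FAILED` status is an assumption added so that errors can be recorded.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple


class StepStatus(Enum):
    NOT_STARTED = "step not yet started"
    RUNNING = "step running"
    COMPLETED = "step completed"
    FAILED = "step failed"  # assumption: not one of the states quoted above


@dataclass
class StepState:
    name: str
    status: StepStatus = StepStatus.NOT_STARTED
    complete_by: Optional[float] = None  # deadline, set when the step starts


def run_workflow(
    steps: List[Tuple[str, Callable[[], None]]],
    state: Dict[str, StepState],
    now: Callable[[], float],
    time_limit: float,
) -> bool:
    """Run steps in list order, recording state; stop at the first failure."""
    for name, action in steps:
        record = state.setdefault(name, StepState(name))
        record.status = StepStatus.RUNNING
        record.complete_by = now() + time_limit  # the complete-by time
        try:
            action()
        except Exception:
            record.status = StepStatus.FAILED
            return False
        record.status = StepStatus.COMPLETED
    return True
```

Steps that were never reached have no entry in `state` here. A real scheduler would more likely persist a "not yet started" record for every step up front, so that the whole workflow is visible to a supervisor from the beginning.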
If a step requires access to a remote service or resource, the Scheduler invokes the appropriate Agent, passing it the details of the work to be performed. The Scheduler typically communicates with an Agent using asynchronous request/response messaging.
The Agent contains logic that encapsulates a call to a remote service, or access to a remote resource referenced by a step in a task. Each Agent typically wraps calls to a single service or resource, implementing the appropriate error handling and retry logic (subject to a timeout constraint, described later).
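One way an Agent could combine retries with a timeout constraint is sketched below. It is a hedged example rather than Boltzmann's actual retry policy: `call_with_retry`, `AgentTimeout`, the exponential backoff, and the `base_delay` parameter are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class AgentTimeout(Exception):
    """Raised when the step cannot finish before its deadline."""


def call_with_retry(
    operation: Callable[[], T],
    complete_by: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    base_delay: float = 0.1,
) -> T:
    """Retry `operation` with exponential backoff until `complete_by`."""
    attempt = 0
    while True:
        if clock() >= complete_by:
            raise AgentTimeout("deadline reached before the call could be made")
        try:
            return operation()
        except Exception as exc:
            attempt += 1
            delay = base_delay * (2 ** (attempt - 1))
            # Give up if waiting for the next attempt would pass the deadline.
            if clock() + delay >= complete_by:
                raise AgentTimeout("retries exhausted before deadline") from exc
            sleep(delay)
```

This sketch retries on any exception. A real Agent would normally retry only on errors it knows to be transient, such as connection failures, and let permanent errors propagate immediately.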
The Supervisor monitors the status of the steps in the task being performed by the Scheduler. It runs periodically (the frequency will be system-specific), and examines the status of steps maintained by the Scheduler. If it detects any that have timed out or failed, it arranges for the appropriate Agent to recover the step or execute the appropriate remedial action (this might involve modifying the status of a step).
Note that the recovery or remedial actions are implemented by the Scheduler and Agents. The Supervisor should simply request that these actions be performed.
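A single Supervisor pass could look roughly like the sketch below. The record layout, the status strings, and the `request_recovery` callback are hypothetical. The callback stands in for whatever mechanism asks the Scheduler and Agents to act, because the Supervisor only requests recovery and does not perform it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class StepRecord:
    name: str
    status: str         # "not_started", "running", "completed" or "failed"
    complete_by: float  # deadline recorded by the Scheduler


def find_steps_needing_recovery(
    records: Iterable[StepRecord], now: float
) -> List[StepRecord]:
    """Select steps that failed, or are still running past their deadline."""
    return [
        r
        for r in records
        if r.status == "failed" or (r.status == "running" and now > r.complete_by)
    ]


def supervise_once(
    records: Iterable[StepRecord],
    now: float,
    request_recovery: Callable[[StepRecord], None],
) -> List[str]:
    """One periodic pass: request recovery for each problematic step."""
    stale = find_steps_needing_recovery(records, now)
    for record in stale:
        request_recovery(record)  # recovery itself is done elsewhere
    return [record.name for record in stale]
```

In a running system, `supervise_once` would be called on a timer at a system-specific interval, with `records` read from the same state store the Scheduler writes to.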
Currently, there are two ways to use Boltzmann (they are not mutually exclusive):
- An HTTP REST API (HTTP/1.1).
- A gRPC streaming API (HTTP/2, multiplexed).