Home
Viet Anh Khoa Tran edited this page Jun 17, 2022
Computation and workflow management system for time-constrained cluster environments. This system is aimed at compute clusters on which users are charged for the runtime of an entire node or of a minimum resource allocation unit (e.g. at the Jülich Supercomputing Centre (JSC)).
Work in progress and potentially unstable.
- Runs
  - Defines the command and its corresponding parameters.
  - Defines an Executor, which determines environment variables, virtual environments, etc.
  - Commands should be robust to termination, i.e.
    - Should resume from previous computation if terminated.
    - If the Node shuts down/fails, the Run will be requeued.
    - Upon failure, must return a non-zero status code. [will not be requeued]
    - Must return status code 0 if completed. [will not be requeued]
- Experiment
  - A logical group of Runs.
- Clusters
  - Each Cluster (currently `local` and `slurm`) defines a group of nodes.
  - A ClusterManager manages NodeManagers on computation nodes (e.g. via SLURM jobs).
  - Each NodeManager specifies a certain number of Slots and manages the execution of Runs in Python subprocesses.
  - As Runs are queued to or removed from the Cluster, or complete or fail, the number of nodes is rescaled as necessary.
  - For now, the system is aggressive in minimizing the number of nodes, e.g.
    - Assume 4 nodes (each with 4 slots), each executing a single Run.
    - Then 3 nodes are cancelled (along with their Runs), and those Runs are rescheduled to the remaining node.
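
The requirements on Run commands listed above (resume after termination, exit code 0 on completion, non-zero on failure) can be illustrated with a minimal sketch. This is not part of JuQueue; the function name `run`, the `checkpoint.json` file, and the step-counter format are hypothetical and only show one way a command could satisfy that contract.

```python
import json
import sys
from pathlib import Path


def run(total_steps: int, checkpoint: Path) -> int:
    """Do `total_steps` units of work, resuming from `checkpoint` if present.

    Returns the exit code the process should use:
    0 when all steps completed, 1 when the run failed.
    """
    try:
        # Resume from the last completed step if a checkpoint exists.
        start = 0
        if checkpoint.exists():
            start = json.loads(checkpoint.read_text())["step"]

        for step in range(start, total_steps):
            # ... one unit of real work would go here ...

            # Write to a temporary file and rename it, so that being
            # killed mid-write never leaves a corrupt checkpoint behind.
            tmp = checkpoint.with_suffix(".tmp")
            tmp.write_text(json.dumps({"step": step + 1}))
            tmp.replace(checkpoint)
        return 0
    except Exception as exc:
        # A genuine failure is reported with a non-zero status code.
        print(f"run failed: {exc}", file=sys.stderr)
        return 1


# A script used as a Run command would end with something like:
#     sys.exit(run(total_steps=10, checkpoint=Path("checkpoint.json")))
```

If the process is terminated between two steps, the next invocation reads the checkpoint and continues from the stored step instead of starting over.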
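
The node-minimization example above comes down to simple arithmetic, sketched here. This is not JuQueue's scheduling code; `nodes_needed` and `pack_runs` are hypothetical helpers that assume every node has the same number of slots and every Run occupies exactly one slot.

```python
import math


def nodes_needed(num_runs: int, slots_per_node: int) -> int:
    """Smallest number of nodes that can hold `num_runs` Runs."""
    if slots_per_node <= 0:
        raise ValueError("slots_per_node must be positive")
    return math.ceil(num_runs / slots_per_node)


def pack_runs(run_ids: list, slots_per_node: int) -> list:
    """Assign Runs to as few nodes as possible, filling each node in turn."""
    return [
        run_ids[i:i + slots_per_node]
        for i in range(0, len(run_ids), slots_per_node)
    ]


# Four Runs spread over four 4-slot nodes fit on a single node,
# so three of the four nodes can be released.
runs = ["run-a", "run-b", "run-c", "run-d"]
surplus = 4 - nodes_needed(len(runs), slots_per_node=4)
```

Packing this tightly keeps the billed node time low, at the cost of interrupting Runs on the released nodes, which is why commands need to be able to resume.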
Install from PyPI:

    pip install juqueue

Install from source:

    git clone https://github.com/tran-khoa/JuQueue juqueue
    cd juqueue
    pip install -e .
    # (optional) Start with example definitions
    cp -r example_defs ~/defs

Run:

    juqueue --def-dir [PATH] --work-dir [PATH]

JuQueue can be controlled by opening localhost:51234 in your browser (The JURECA Guide provides an example on how to forward the port from a login node to your local computer).
For more advanced usage, JuQueue exposes an OpenAPI interface via FastAPI; localhost:51234/docs and localhost:51234/redoc provide further documentation.
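
As a rough sketch of scripted access, the OpenAPI schema can be downloaded and its operations listed. This assumes JuQueue keeps FastAPI's default schema location, `/openapi.json`, which is not confirmed here; `fetch_schema` and `list_operations` are hypothetical helper names, and no specific JuQueue endpoint is assumed.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:51234"  # address of the running JuQueue instance

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def fetch_schema(base_url: str = BASE_URL) -> dict:
    """Download the OpenAPI schema.

    Assumption: the schema is served at FastAPI's default /openapi.json path.
    """
    with urlopen(f"{base_url}/openapi.json", timeout=5) as resp:
        return json.load(resp)


def list_operations(schema: dict) -> list:
    """Return sorted (METHOD, path) pairs for every operation in the schema."""
    ops = []
    for path, item in sorted(schema.get("paths", {}).items()):
        for method in sorted(item):
            # Path items may also hold non-method keys such as "parameters".
            if method in HTTP_METHODS:
                ops.append((method.upper(), path))
    return ops
```

With a running instance, `list_operations(fetch_schema())` gives an overview of the available endpoints; the pages under /docs and /redoc remain the reference for their parameters.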